PHP Cookbook/Internationalization and Localization

Revision as of 13:36, 7 March 2008 by Docbook2Wiki (Talk)

Introduction

While everyone who programs in PHP has to learn some English eventually to get a handle on its function names and language constructs, PHP can create applications that speak just about any language. Some applications need to be used by speakers of many different languages. Taking an application written for French speakers and making it useful for German speakers is made easier by PHP's support for internationalization and localization.

Internationalization (often abbreviated I18N[1]) is the process of taking an application designed for just one locale and restructuring it so that it can be used in many different locales. Localization (often abbreviated L10N[2]) is the process of adding support for a new locale to an internationalized application.

A locale is a group of settings that describe text formatting and language customs in a particular area of the world. The settings are divided into six categories:

LC_COLLATE
These settings control text sorting: which letters go before and after others in alphabetical order.
LC_CTYPE
These settings control mapping between uppercase and lowercase letters as well as which characters fall into the different character classes, such as alphanumeric characters.
LC_MONETARY
These settings describe the preferred format of currency information, such as what character to use as a decimal point and how to indicate negative amounts.
LC_NUMERIC
These settings describe the preferred format of numeric information, such as how to group numbers and what character is used as a thousands separator.
LC_TIME
These settings describe the preferred format of time and date information, such as names of months and days and whether to use 24- or 12-hour time.
LC_MESSAGES
This category contains text messages used by applications that need to display information in multiple languages.

There is also a metacategory, LC_ALL, that encompasses all the categories.

A locale name generally has three components. The first, an abbreviation that indicates a language, is mandatory: for example, "en" for English or "pt" for Portuguese. Next, after an underscore, comes an optional country specifier, to distinguish between countries that speak different versions of the same language: for example, "en_US" for U.S. English and "en_GB" for British English, or "pt_BR" for Brazilian Portuguese and "pt_PT" for European Portuguese. Last, after a period, comes an optional character-set specifier: for example, "zh_TW.Big5" for Taiwanese Chinese using the Big5 character set. While most locale names follow these conventions, some don't. One difficulty in using locales is that they can be arbitrarily named. Finding and setting a locale is discussed in Section 16.2 through Section 16.4.
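
The naming convention above can be sketched in code. The following is a minimal illustration, not part of PHP itself: parse_locale_name( ) is a hypothetical helper that splits a conventionally named locale into its three components. Names that don't follow the convention (aliases, or names with extra modifiers) are not split and come back with every component set to null.

```php
<?php
// parse_locale_name() is a hypothetical helper, not a PHP built-in.
// It splits a name such as "zh_TW.Big5" into language, country, and charset.
function parse_locale_name($name) {
    $parts = array('language' => null, 'country' => null, 'charset' => null);
    // language, then optional _COUNTRY, then optional .charset
    $pattern = '/^([A-Za-z]+)(?:_([A-Za-z]+))?(?:\.([A-Za-z0-9-]+))?$/';
    if (preg_match($pattern, $name, $m)) {
        $parts['language'] = $m[1];
        if (isset($m[2]) && $m[2] !== '') {
            $parts['country'] = $m[2];
        }
        if (isset($m[3]) && $m[3] !== '') {
            $parts['charset'] = $m[3];
        }
    }
    return $parts;
}

print_r(parse_locale_name('zh_TW.Big5'));
print_r(parse_locale_name('pt'));
```

Only the language component is filled in for a name like "pt"; the optional parts stay null.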

Different techniques are necessary for correct localization of plain text, dates and times, and currency. Localization can also be applied to external entities your program uses, such as images and included files. Localizing these kinds of content is covered in Section 16.5 through Section 16.9.

Systems for dealing with large amounts of localization data are discussed in Section 16.10 and Section 16.11. Section 16.10 shows some simple ways to manage the data, and Section 16.11 introduces GNU gettext, a full-featured set of tools that provide localization support.

PHP also has limited support for Unicode. Converting data to and from the Unicode UTF-8 encoding is addressed in Section 16.12.

Listing Available Locales

Problem

You want to know what locales your system supports.

Solution

Use the locale program to list available locales; locale -a prints the locales your system supports.

Discussion

On Linux and Solaris systems, you can find locale at /usr/bin/locale. On Windows, locales are listed in the Regional Options section of the Control Panel.

Your mileage may vary on other operating systems. BSD, for example, includes locale support but has no locale program to list locales. BSD locales are often stored in /usr/share/locale, so looking in that directory may yield a list of usable locales.

While the locale system helps with many localization tasks, its lack of standardization can be frustrating. Systems aren't guaranteed to have the same locales or even use the same names for equivalent locales.

See Also

Your system's locale(1) manpage.

Using a Particular Locale

Problem

You want to tell PHP to use the settings of a particular locale.

Solution

Call setlocale( ) with the appropriate category and locale. Here's how to use the es_US (U.S. Spanish) locale for all categories:

setlocale(LC_ALL,'es_US');

Here's how to use the de_AT (Austrian German) locale for time and date formatting:

setlocale(LC_TIME,'de_AT');

Discussion

To find the current locale without changing it, call setlocale( ) with the string "0" as the locale (passing NULL or an empty string instead sets the locale from environment variables):

print setlocale(LC_ALL,'0');
en_US

Many systems also support a set of aliases for common locales, listed in a file such as /usr/share/locale/locale.alias. This file is a series of lines including:

russian         ru_RU.ISO-8859-5
slovak          sk_SK.ISO-8859-2
slovene         sl_SI.ISO-8859-2
slovenian       sl_SI.ISO-8859-2
spanish         es_ES.ISO-8859-1
swedish         sv_SE.ISO-8859-1

The first column of each line is an alias; the second column shows the locale and character set the alias points to. You can use the alias in calls to setlocale( ) instead of the corresponding string the alias points to. For example, you can do:

setlocale(LC_ALL,'swedish');

instead of:

setlocale(LC_ALL,'sv_SE.ISO-8859-1');

On Windows, to change the locale, visit the Control Panel. In the Regional Options section, you can pick a new locale and customize its settings.
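
Because locale names differ between systems, it can help to offer setlocale( ) several candidates at once. The sketch below passes an array of names; setlocale( ) applies the first one the system accepts and returns its name, or returns false if none of them are available. The particular candidate names are assumptions chosen for illustration and may not exist on your system.

```php
<?php
// Candidate spellings of a Swedish locale, most specific first.
// These names are illustrative; check your system's list of locales.
$candidates = array('sv_SE.UTF-8', 'sv_SE.ISO-8859-1', 'sv_SE', 'swedish');

// setlocale() tries each array element in turn
$chosen = setlocale(LC_ALL, $candidates);

if ($chosen === false) {
    // none of the candidates exist; the locale is left unchanged
    print "No Swedish locale available\n";
} else {
    print "Using locale: $chosen\n";
}
```

Checking the return value matters: a failed call leaves the previous locale in place without raising an error.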

See Also

Section 16.4 shows how to set a default locale; documentation on setlocale( ) at http://www.php.net/setlocale.

Setting the Default Locale

Problem

You want to set a locale that all your PHP programs can use.

Solution

At the beginning of a file loaded by the auto_prepend_file configuration directive, call setlocale( ) to set your desired locale:

setlocale(LC_ALL,'es_US');

Discussion

Even if you set up appropriate environment variables before you start your web server or PHP binary, PHP doesn't change its locale until you call setlocale( ). After setting environment variable LC_ALL to es_US, for example, PHP still runs in the default C locale.

See Also

Section 16.3 shows how to use a particular locale; documentation on setlocale( ) at http://www.php.net/setlocale and auto_prepend_file at http://www.php.net/manual/en/configuration.directives.php#ini.auto-prepend-file.

Localizing Text Messages

Problem

You want to display text messages in a locale-appropriate language.

Solution

Maintain a message catalog of words and phrases and retrieve the appropriate string from the message catalog before printing it. Here's a simple message catalog with some foods in American and British English and a function to retrieve words from the catalog:

$messages = array ('en_US' => 
             array(
              'My favorite foods are' => 'My favorite foods are',
              'french fries' => 'french fries',
              'biscuit'      => 'biscuit',
              'candy'        => 'candy',
              'potato chips' => 'potato chips',
              'cookie'       => 'cookie',
              'corn'         => 'corn',
              'eggplant'     => 'eggplant'
             ),
           'en_GB' => 
             array(
              'My favorite foods are' => 'My favourite foods are',
              'french fries' => 'chips',
              'biscuit'      => 'scone',
              'candy'        => 'sweets',
              'potato chips' => 'crisps',
              'cookie'       => 'biscuit',
              'corn'         => 'maize',
              'eggplant'     => 'aubergine'
             )
            );

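function msg($s) {
  global $LANG;
  global $messages;
  // look up the phrase in the catalog for the current language
  if (isset($messages[$LANG][$s])) {
    return $messages[$LANG][$s];
  } else {
    // no translation found: log it so the catalog can be fixed
    error_log("l10n error: LANG: $LANG, message: '$s'");
  }
}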

Discussion

This short program uses the message catalog to print out a list of foods:

$LANG = 'en_GB';
print msg('My favorite foods are').":\n";
print msg('french fries')."\n";
print msg('potato chips')."\n";
print msg('corn')."\n";
print msg('candy')."\n";
My favourite foods are:
chips
crisps
maize
sweets

To have the program output in American English instead of British English, just set $LANG to en_US.

You can combine the msg( ) message retrieval function with sprintf( ) to store phrases that require values to be substituted into them. For example, consider the English sentence "I am 12 years old." In Spanish, the corresponding phrase is "Tengo 12 años." The Spanish phrase can't be built by stitching together translations of "I am," the numeral 12, and "years old." Instead, store them in the message catalogs as sprintf( )-style format strings:

$messages = array ('en_US' => array('I am X years old.' => 'I am %d years old.'),
                   'es_US' => array('I am X years old.' => 'Tengo %d años.')
            );

You can then pass the results of msg( ) to sprintf( ) as a format string:

$LANG = 'es_US';
print sprintf(msg('I am X years old.'),12);
Tengo 12 años.

For phrases that require the substituted values to be in a different order in different languages, sprintf( ) supports changing the order of the arguments:

$messages = array ('en_US' => 
                    array('I am X years and Y months old.' => 
                          'I am %d years and %d months old.'),
                   'es_US' =>
                    array('I am X years and Y months old.' => 
                          'Tengo %2$d meses y %1$d años.')
            );

With either language, call sprintf( ) with the same order of arguments (i.e., first years, then months):

$LANG = 'es_US';
print sprintf(msg('I am X years and Y months old.'),12,7);
Tengo 7 meses y 12 años.

In the format string, %2$ tells sprintf( ) to use the second argument, and %1$ tells it to use the first.

These phrases can also be stored as a function's return value instead of as a string in an array. Storing the phrases as functions removes the need to use sprintf( ). Functions that return a sentence look like this:

// English version
function i_am_X_years_old($age) {
 return "I am $age years old.";
}

// Spanish version
function i_am_X_years_old($age) {
 return "Tengo $age años.";
}

If some parts of the message catalog belong in an array, and some parts belong in functions, an object is a helpful container for a language's message catalog. A base object and two simple message catalogs look like this:

class pc_MC_Base {
  var $messages;
  var $lang;

  function msg($s) {
    if (isset($this->messages[$s])) {
      return $this->messages[$s];
    } else {
      error_log("l10n error: LANG: $this->lang, message: '$s'");
    }
  }

}

class pc_MC_es_US extends pc_MC_Base {

  // __construct( ) is used because PHP 8 no longer treats a method
  // named after its class as a constructor
  function __construct() {
    $this->lang = 'es_US';
    $this->messages = array ('chicken' => 'pollo',
                 'cow'     => 'vaca',
                 'horse'   => 'caballo'
                 );
  }

  function i_am_X_years_old($age) {
    return "Tengo $age años.";
  }
}

class pc_MC_en_US extends pc_MC_Base {

  function __construct() {
    $this->lang = 'en_US';
    $this->messages = array ('chicken' => 'chicken',
                 'cow'     => 'cow',
                 'horse'   => 'horse'
                 );
  }

  function i_am_X_years_old($age) {
    return "I am $age years old.";
  }
}

Each message catalog object extends the pc_MC_Base class to get the msg( ) method, and then defines its own messages (in its constructor) and its own functions that return phrases. Here's how to print text in Spanish:

$MC = new pc_MC_es_US;

print $MC->msg('cow');
print $MC->i_am_X_years_old(15);

To print the same text in English, $MC just needs to be instantiated as a pc_MC_en_US object instead of a pc_MC_es_US object. The rest of the code remains unchanged.

See Also

The introduction to Chapter 7 discusses object inheritance; documentation on sprintf( ) at http://www.php.net/sprintf.

Localizing Dates and Times

Problem

You want to display dates and times in a locale-specific manner.

Solution

Use strftime( )'s %c format string:

print strftime('%c');

You can also store strftime( ) format strings as messages in your message catalog:

$MC = new pc_MC_es_US;
print strftime($MC->msg('%Y-%m-%d'));

Discussion

The %c format string tells strftime( ) to return the preferred date and time representation for the current locale. Here's the quickest way to a locale-appropriate formatted time string:

print strftime('%c');

This code produces a variety of results:

Tue Aug 13 18:37:11 2002     // in the default C locale
mar 13 ago 2002 18:37:11 EDT // in the es_US locale
mar 13 aoû 2002 18:37:11 EDT // in the fr_FR locale

The formatted time string that %c produces, while locale-appropriate, isn't very flexible. If you just want the time, for example, you must pass a different format string to strftime( ). But these format strings themselves vary in different locales. In some locales, displaying an hour from 1 to 12 with an A.M./P.M. designation may be appropriate, while in others the hour should range from 0 to 23. To display appropriate time strings for a locale, add elements to the locale's $messages array for each time format you want. The key for a particular time format, such as %H:%M, is always the same in each locale. The value, however, can vary, such as %H:%M for 24-hour locales or %I:%M %P for 12-hour locales. Then, look up the appropriate format string and pass it to strftime( ):

$MC = new pc_MC_es_US;

print strftime($MC->msg('%H:%M'));

Changing the locale doesn't change the time zone; it changes only the formatting of the displayed result.
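
The idea of keeping time formats in the message catalog can be sketched as follows. The $time_formats array and the time_format( ) helper are hypothetical names used only for this illustration; the key is identical in every locale, and the value is the format string that locale prefers. The example uses %p (the POSIX uppercase AM/PM designator) for the 12-hour format, and it only looks up the format string, which you would then pass to your date-formatting function.

```php
<?php
// Hypothetical per-locale catalogs of time formats.
// Same key in every locale, locale-specific value.
$time_formats = array(
    'en_US' => array('%H:%M' => '%I:%M %p'),  // 12-hour clock with AM/PM
    'de_AT' => array('%H:%M' => '%H:%M'),     // 24-hour clock
);

// time_format() is a hypothetical helper: it returns the catalog's
// format for $key, or $key itself when the catalog has no entry
// (the key is already a usable format string).
function time_format(array $catalog, $key) {
    return isset($catalog[$key]) ? $catalog[$key] : $key;
}

print time_format($time_formats['en_US'], '%H:%M') . "\n";
print time_format($time_formats['de_AT'], '%H:%M') . "\n";
```

Falling back to the key keeps the page usable when a translator has not yet supplied a format for a new locale.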

See Also

Section 3.5 discusses the format strings that strftime( ) accepts; Section 3.12 covers changing time zones in your program; documentation on strftime( ) at http://www.php.net/strftime.

Localizing Currency Values

Problem

You want to display currency amounts in a locale-specific format.

Solution

Use the pc_format_currency( ) function, shown in Example 16-1, to produce an appropriately formatted string. For example:

setlocale(LC_ALL,'fr_CA');
print pc_format_currency(-12345678.45);
(12 345 678,45 $)
            

Discussion

The pc_format_currency( ) function, shown in Example 16-1, gets the currency formatting information from localeconv( ) and then uses number_format( ) and some logic to construct the correct string.

Example 16-1. pc_format_currency

function pc_format_currency($amt) {
    // get locale-specific currency formatting information 
    $a = localeconv();
    
    // compute sign of $amt and then remove it
    if ($amt < 0) { $sign = -1; } else { $sign = 1; }
    $amt = abs($amt);
    // format $amt with appropriate grouping, decimal point, and fractional digits 
    $amt = number_format($amt,$a['frac_digits'],$a['mon_decimal_point'],
                         $a['mon_thousands_sep']);
    
    // figure out where to put the currency symbol and positive or negative signs
    $currency_symbol = $a['currency_symbol'];
    // is $amt >= 0 ? 
    if (1 == $sign) {
        $sign_symbol  = 'positive_sign';
        $cs_precedes  = 'p_cs_precedes';
        $sign_posn    = 'p_sign_posn';
        $sep_by_space = 'p_sep_by_space';
    } else {
        $sign_symbol  = 'negative_sign';
        $cs_precedes  = 'n_cs_precedes';
        $sign_posn    = 'n_sign_posn';
        $sep_by_space = 'n_sep_by_space';
    }
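    // sign positions 3 and 4 attach the sign directly to the currency
    // symbol, whichever side of the amount the symbol ends up on
    if (3 == $a[$sign_posn]) {
        $currency_symbol = $a[$sign_symbol].$currency_symbol;
    } elseif (4 == $a[$sign_posn]) {
        $currency_symbol .= $a[$sign_symbol];
    }
    if ($a[$cs_precedes]) {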
        // currency symbol in front 
        if ($a[$sep_by_space]) {
            $amt = $currency_symbol.' '.$amt;
        } else {
            $amt = $currency_symbol.$amt;
        }
    } else {
        // currency symbol after amount 
        if ($a[$sep_by_space]) {
            $amt .= ' '.$currency_symbol;
        } else {
            $amt .= $currency_symbol;
        }
    }
    if (0 == $a[$sign_posn]) {
        $amt = "($amt)";
    } elseif (1 == $a[$sign_posn]) {
        $amt = $a[$sign_symbol].$amt;
    } elseif (2 == $a[$sign_posn]) {
        $amt .= $a[$sign_symbol];
    }
    return $amt;
}

The code in pc_format_currency( ) that puts the currency symbol and sign in the correct place is almost identical for positive and negative amounts; it just uses different elements of the array returned by localeconv( ). The relevant elements of localeconv( )'s returned array are shown in Table 16-1.

Table 16-1. Currency-related information from localeconv( )

Array element       Description
currency_symbol     Local currency symbol
mon_decimal_point   Monetary decimal point character
mon_thousands_sep   Monetary thousands separator
positive_sign       Sign for positive values
negative_sign       Sign for negative values
frac_digits         Number of fractional digits
p_cs_precedes       1 if currency_symbol should precede a positive value, 0 if it should follow
p_sep_by_space      1 if a space should separate currency_symbol from a positive value, 0 if not
n_cs_precedes       1 if currency_symbol should precede a negative value, 0 if it should follow
n_sep_by_space      1 if a space should separate currency_symbol from a negative value, 0 if not
p_sign_posn         Positive sign position:
                    0 if parentheses should surround the quantity and currency_symbol
                    1 if the sign string should precede the quantity and currency_symbol
                    2 if the sign string should follow the quantity and currency_symbol
                    3 if the sign string should immediately precede currency_symbol
                    4 if the sign string should immediately follow currency_symbol
n_sign_posn         Negative sign position: same possible values as p_sign_posn


There is a function in the C library called strfmon( ) that does for currency what strftime( ) does for dates and times; however, it isn't implemented in PHP. The pc_format_currency( ) function provides most of the same capabilities.

See Also

Section 2.10 also discusses number_format( ); documentation on localeconv( ) at http://www.php.net/localeconv and number_format( ) at http://www.php.net/number-format.

Localizing Images

Problem

You want to display images that have text in them and have that text in a locale-appropriate language.

Solution

Make an image directory for each locale you want to support, as well as a global image directory for images that have no locale-specific information in them. Create copies of each locale-specific image in the appropriate locale-specific directory. Make sure that the images have the same filename in the different directories. Instead of printing out image URLs directly, use a wrapper function similar to the msg( ) function in Section 16.5 that prints out locale-specific text.

Discussion

The img( ) wrapper function looks for a locale-specific version of an image first, then a global one. If neither is present, it prints a message to the error log:

$image_base_path = '/usr/local/www/images';
$image_base_url  = '/images';

function img($f) {
    global $LANG;
    global $image_base_path;
    global $image_base_url;

    // return the URL rather than printing it, so that the result
    // can be passed to printf( ) as an argument
    if (is_readable("$image_base_path/$LANG/$f")) {
        return "$image_base_url/$LANG/$f";
    } elseif (is_readable("$image_base_path/global/$f")) {
        return "$image_base_url/global/$f";
    } else {
        error_log("l10n error: LANG: $LANG, image: '$f'");
    }
}

This function needs to know both the path to the image file in the filesystem ($image_base_path) and the path to the image from the base URL of your site ($image_base_url). It uses the first to test if the file can be read and the second to construct an appropriate URL for the image.

A localized image must have the same filename in each localization directory. For example, an image that says "New!" on a yellow starburst should be called new.gif in both the images/en_US directory and the images/es_US directory, even though the file images/es_US/new.gif is a picture of a yellow starburst with "¡Nuevo!" on it.

Don't forget that the alt text you display in your image tags also needs to be localized. A complete localized <img> tag looks like:

printf('<img src="%s" alt="%s">',img('cancel.png'),msg('Cancel'));

If the localized versions of a particular image have varied dimensions, store image height and width in the message catalog as well:

printf('<img src="%s" alt="%s" height="%d" width="%d">',
       img('cancel.png'),msg('Cancel'),
       msg('img-cancel-height'),msg('img-cancel-width'));

The localized messages for img-cancel-height and img-cancel-width are not text strings, but integers that describe the dimensions of the cancel.png image in each locale.

See Also

Section 16.5 discusses locale-specific message catalogs.

Localizing Included Files

Problem

You want to include locale-specific files in your pages.

Solution

Dynamically modify the include_path once you've determined the appropriate locale:

$base = '/usr/local/php-include';
$LANG = 'en_US';

$include_path = ini_get('include_path');
// PATH_SEPARATOR is ":" on Unix and ";" on Windows
ini_set('include_path',
        "$base/$LANG".PATH_SEPARATOR."$base/global".PATH_SEPARATOR.$include_path);

Discussion

The $base variable holds the name of the base directory for your included localized files. Files that are not locale-specific go in the global subdirectory of $base, and locale-specific files go in a subdirectory named after their locale (e.g., en_US). Prepending the locale-specific directory and then the global directory to the include path makes them the first two places PHP looks when you include a file. Putting the locale-specific directory first ensures that nonlocalized information is loaded only if localized information isn't available.

This technique is similar to what the img( ) function does in Section 16.8. Here, however, you can take advantage of PHP's include_path feature to have the directory searching happen automatically. For maximum utility, reset include_path as early as possible in your code, preferably at the top of a file loaded via auto_prepend_file on every request.
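
To see the lookup order in action, the following sketch builds a throwaway directory tree and includes two files from it. The directory layout, file names, and file contents are all invented for this demonstration; only the include_path handling mirrors the technique described above.

```php
<?php
// Build a temporary tree: one locale-specific directory and one global one.
// All names and contents here are made up for the demonstration.
$base = sys_get_temp_dir() . '/l10n-demo-' . uniqid();
mkdir("$base/en_US", 0777, true);
mkdir("$base/global", 0777, true);
file_put_contents("$base/en_US/greeting.php", '<?php return "Howdy";');
file_put_contents("$base/global/greeting.php", '<?php return "Hello";');
file_put_contents("$base/global/footer.php", '<?php return "Generic footer";');

// Put the locale directory first, then the global directory
$LANG = 'en_US';
$old_path = ini_get('include_path');
ini_set('include_path',
        "$base/$LANG" . PATH_SEPARATOR . "$base/global" . PATH_SEPARATOR . $old_path);

$greeting = include 'greeting.php';  // exists in both; en_US wins
$footer   = include 'footer.php';    // exists only in global

print "$greeting\n$footer\n";

// Restore the original setting and remove the temporary files
ini_set('include_path', $old_path);
foreach (array('en_US/greeting.php', 'global/greeting.php', 'global/footer.php') as $f) {
    unlink("$base/$f");
}
rmdir("$base/en_US");
rmdir("$base/global");
rmdir($base);
```

The calling code never mentions a locale when it includes a file; the search order alone decides which version is loaded.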

See Also

Documentation on include_path at http://www.php.net/manual/en/configuration.directives.php#ini.include-path and auto_prepend_file at http://www.php.net/manual/en/configuration.directives.php#ini.auto-prepend-file.

Managing Localization Resources

Problem

You need to keep track of your various message catalogs and images.

Solution

Two techniques simplify the management of your localization resources. The first is making a new language's object, for example Canadian English, extend from a similar existing language, such as American English. You only have to change the words and phrases in the new object that differ from the original language.

The second technique: to track what phrases still need to be translated in new languages, put stubs in the new language object that have the same value as in your base language. By finding which values are the same in the base language and the new language, you can then generate a list of words and phrases to translate.
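
The first technique might look like the sketch below. The class names follow the chapter's pc_MC_ convention, but pc_MC_en_CA and the sample words are hypothetical, and the American English class is reduced to a minimal stand-in so the example runs on its own.

```php
<?php
// Minimal stand-in for an American English catalog
class pc_MC_en_US {
    public $lang = 'en_US';
    public $messages = array('color' => 'color',
                             'candy' => 'candy',
                             'truck' => 'truck');

    function msg($s) {
        return isset($this->messages[$s]) ? $this->messages[$s] : null;
    }
}

// Hypothetical Canadian English catalog: inherits everything from
// American English and overrides only the entries that differ
class pc_MC_en_CA extends pc_MC_en_US {
    function __construct() {
        $this->lang = 'en_CA';
        $this->messages = array_merge($this->messages,
                                      array('color' => 'colour'));
    }
}

$MC = new pc_MC_en_CA;
print $MC->msg('color') . "\n";  // overridden entry
print $MC->msg('truck') . "\n";  // inherited entry
```

Because array_merge( ) lets later string keys replace earlier ones, the child class only has to list its differences.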

Discussion

The catalog-compare.php program shown in Example 16-2 prints out messages that are the same in two catalogs, as well as messages that are missing from one catalog but present in another.

Example 16-2. catalog-compare.php

$base = 'pc_MC_'.$_SERVER['argv'][1];
$other  = 'pc_MC_'.$_SERVER['argv'][2];

require 'pc_MC_Base.php';
require "$base.php";
require "$other.php";

$base_obj = new $base;
$other_obj = new $other;

/* Check for messages in the other class that
 * are the same as the base class or are in
 * the base class but missing from the other class */ 
foreach ($base_obj->messages as $k => $v) {
    if (isset($other_obj->messages[$k])) {
        if ($v == $other_obj->messages[$k]) {
            print "SAME: $k\n";
        }
    } else {
        print "MISSING: $k\n";
    }
}

/* Check for messages in the other class but missing
 * from the base class */
foreach ($other_obj->messages as $k => $v) {
    if (! isset($base_obj->messages[$k])) {
        print "MISSING (BASE): $k\n";
    }
}

To use this program, put each message catalog object in a file with the same name as the object (e.g., the pc_MC_en_US class should be in a file named pc_MC_en_US.php, and the pc_MC_es_US class should be in a file named pc_MC_es_US.php). You then call the program with the two locale names as arguments on the command line:

% php catalog-compare.php en_US es_US
            

In a web context, it can be useful to use a different locale and message catalog on a per-request basis. The locale to use may come from the browser (in an Accept-Language header), or it may be explicitly set by the server (different virtual hosts may be set up to display the same content in different languages). If the same code needs to select a message catalog on a per-request basis, the message catalog class can be instantiated like this:

$classname = "pc_MC_$locale";

require 'pc_MC_Base.php';
require $classname.'.php';

$MC = new $classname;
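
Since $locale ends up in a filename and a class name, it is safer to accept only locales you actually support. The sketch below shows one way to choose a locale from an Accept-Language header. pick_locale( ) is a hypothetical helper, and it makes simplifying assumptions: it ignores the q= quality values and treats the order of entries in the header as the order of preference, and it matches names exactly (so "en-US" matches en_US but "en-us" does not).

```php
<?php
// pick_locale() is a hypothetical helper, not a PHP built-in.
// It returns the first supported locale named in the header,
// or $default if none match.
function pick_locale($accept_language, array $supported, $default) {
    foreach (explode(',', $accept_language) as $entry) {
        // drop any ";q=0.8" suffix and surrounding whitespace
        $parts = explode(';', $entry);
        $tag = trim($parts[0]);
        // browsers send "en-US"; the catalogs are named "en_US"
        $locale = str_replace('-', '_', $tag);
        if (in_array($locale, $supported, true)) {
            return $locale;
        }
    }
    return $default;
}

$supported = array('en_US', 'es_US');
$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';

$locale = pick_locale($header, $supported, 'en_US');
$classname = "pc_MC_$locale";
print "$classname\n";
```

Anything not in the $supported list, including malformed or malicious header values, falls through to the default.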

See Also

Section 16.5 discusses message catalogs; Section 7.11 has information on finding the methods and properties of an object.

Using gettext

Problem

You want a comprehensive system to create, manage, and deploy message catalogs.

Solution

Use PHP's gettext extension, which allows you to use GNU's gettext utilities:

bindtextdomain('gnumeric','/usr/share/locale');
textdomain('gnumeric');

$languages = array('en_CA','da_DK','de_AT','fr_FR');
foreach ($languages as $language) {
  setlocale(LC_ALL, $language);
  print gettext(" Unknown formula")."\n";
}

Discussion

gettext is a set of tools that makes it easier for your application to produce multilingual messages. Compiling PHP with the --with-gettext option enables functions to retrieve the appropriate text from gettext-format message catalogs, and there are a number of external tools to edit the message catalogs.

With gettext, messages are divided into domains, and all messages for a particular domain are stored in the same file. bindtextdomain( ) tells gettext where to find the message catalog for a particular domain. A call to:

bindtextdomain('gnumeric','/usr/share/locale') 

indicates that the message catalog for the gnumeric domain in the en_CA locale is in the file /usr/share/locale/en_CA/LC_MESSAGES/gnumeric.mo.

The textdomain('gnumeric') function sets the default domain to gnumeric. Calling gettext( ) retrieves a message from the default domain. There are other functions, such as dgettext( ), that let you retrieve a message from a different domain. When gettext( ) (or dgettext( )) is called, it returns the appropriate message for the current locale. If there's no message in the catalog for the current locale that corresponds to the argument passed to it, gettext( ) (or dgettext( )) returns just its argument. As a result, if you haven't translated all your messages, your code prints out English (or whatever your base language is) for those untranslated messages.

Setting the default domain with textdomain( ) makes each subsequent retrieval of a message from that domain more concise, because you just have to call gettext('Good morning') instead of dgettext('domain','Good morning'). However, if even gettext('Good morning') is too much typing, you can take advantage of a function alias: _( ) for gettext( ). Instead of gettext('Good morning'), use _('Good morning').

The gettext web site has helpful and detailed information for managing the information flow between programmers and translators and how to efficiently use gettext. It also includes information on other tools you can use to manage your message catalogs, such as a special GNU Emacs mode.

See Also

Documentation on gettext at http://www.php.net/gettext; the gettext library at http://www.gnu.org/software/gettext/gettext.html.

Reading or Writing Unicode Characters

Problem

You want to read Unicode-encoded characters from a file, database, or form; or, you want to write Unicode-encoded characters.

Solution

Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:

print utf8_encode('Kurt Gödel is swell.');

Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:

print utf8_decode("Kurt G\xc3\xb6del is swell.");

Discussion

A single byte can represent 256 different values. The characters with codes 0 through 127 are standardized as ASCII: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.

This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.

Table 16-2. UTF-8 byte representation

Character code range      Bytes used  Byte 1    Byte 2    Byte 3    Byte 4
0x00000000 - 0x0000007F   1           0xxxxxxx
0x00000080 - 0x000007FF   2           110xxxxx  10xxxxxx
0x00000800 - 0x0000FFFF   3           1110xxxx  10xxxxxx  10xxxxxx
0x00010000 - 0x001FFFFF   4           11110xxx  10xxxxxx  10xxxxxx  10xxxxxx


In Table 16-2, the x positions represent bits used for actual character data. The least significant bit is the rightmost bit in the rightmost byte. In multibyte characters, the number of leading 1 bits in the leftmost byte is the same as the number of bytes in the character.
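
The bit layout in Table 16-2 can be turned into code directly. The sketch below is an illustration only: utf8_from_codepoint( ) is a hypothetical helper that builds the UTF-8 byte sequence for a character code by hand, and it assumes its argument is a valid code between 0 and 0x1FFFFF (it does no range checking).

```php
<?php
// utf8_from_codepoint() is a hypothetical helper that follows the
// byte layouts in Table 16-2. It does not validate its input.
function utf8_from_codepoint($cp) {
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    } elseif ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    } elseif ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
}

// ö is character code 0xF6, which falls in the two-byte range
print bin2hex(utf8_from_codepoint(0xF6)) . "\n";  // c3b6
```

The two bytes produced for ö, 0xC3 and 0xB6, are the same ones that appear in the utf8_decode( ) example in the Solution.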

See Also

Documentation on utf8_encode( ) at http://www.php.net/utf8-encode and utf8_decode( ) at http://www.php.net/utf8-decode; more information on Unicode is available at the Unicode Consortium's home page, http://www.unicode.org; the UTF-8 and Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html is also helpful.

Notes

  1. The word "internationalization" has 18 letters between the first "i" and the last "n."
  2. The word "localization" has 10 letters between the first "l" and the "n."