Handling UTF-8 with PHP - lsstarboy - ChinaUnix blog

lsstarboy's study diary, lsstarboy.blog.chinaunix.net

2009-12-23 08:20:37

This page is intended as a reference for functionality and pages.

Note that this page applies to

The following functions / functionality in PHP may pose issues when used in conjunction with UTF-8, depending on what you’re doing.

Important note: the words “it depends” are critical to bear in mind here. Do not blindly replace all use of these functions without understanding why you’re doing so. Remember that ASCII (aka US-ASCII or ASCII7) is a subset of UTF-8, and that UTF-8 has been designed so that no character sequence in a well formed UTF-8 string can be mistaken for a sub-sequence of another, longer character. These two facts will often mean you can survive with PHP’s own string functions, depending on the exact nature of what you are doing with them; see the strpos discussion below. Blindly replacing all uses “just in case” is likely to lead to apps which run like lame dogs.

Note on locales: the discussion below could be read to suggest “locales are evil”, which would be to misunderstand the problem.

If you’re writing code for yourself, to be used on a server you control, locales could be made to work if your server has locales installed which support UTF-8. That would mean functions like strtolower() behave correctly.

But this is no use if you’re writing applications which will be installed by third parties (like for example) because it’s system specific (it’s not even just but in practice that requires two things; that there is a locale available on the system which supports UTF-8 (not guaranteed) and that the correct locale identifier string can be found (there are definitely differences between Windows and *Nix locale identifiers and even amongst the Unixes I believe there are variations e.g. ). What’s more, you can’t rely on users to be able to change the locale correctly to suit your application’s needs: on a shared host they probably won’t be able to change the locale for the user that Apache is running as. Bottom line: locales are not the way to go for applications intended to be “write once, run anywhere”.

Update: You can downgrade your character type locale to the POSIX (C) locale via setlocale(), like

setlocale( LC_CTYPE, 'C' );

This should work on all platforms. It would mean functions like strtolower() are only considering characters in the for details of how to check for well formedness. The point there is you should check UTF-8 strings for well formedness when using functions like (see below) which will work with UTF-8 so long as it is well formed.
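A minimal sketch of the setlocale() downgrade mentioned above. The sample string is an illustrative assumption. Note also that from PHP 8.2 strtolower() only touches ASCII letters regardless of locale, so the explicit call mainly matters on older versions.

```php
<?php
// Force the character-type locale to POSIX "C" so that case conversion
// only considers the ASCII letters A-Z.
setlocale(LC_CTYPE, 'C');

// "ÀBC" written as explicit bytes so the example does not depend on the
// encoding of this source file: À is the two bytes C3 80 in UTF-8.
$utf8 = "\xC3\x80BC";

$lower = strtolower($utf8);

// The multibyte À is left untouched; only "BC" is lowercased.
echo bin2hex($lower), "\n"; // c3806263
```

The result is still well formed UTF-8, but À has not been lowercased, which is the trade-off of restricting case conversion to ASCII.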

Note that you can find “UTF-8 aware” implementations of many of these functions under CVS here.

Official docs at .

Official documentation:
Risk: high
Impact: could corrupt a UTF-8 string

Unless the /u modifier is used as well, picks up its understanding of upper and lowercase from the server’s locale. Depending on what you’re doing, this may result in false matches which in turn lead to corrupt UTF-8 strings.

Official documentation:
Risk: low
Impact: matches 5 and 6 byte sequences which are not Unicode

UTF-8 allows for 5 and 6 byte character sequences but these have no meaning in Unicode (i.e. there are no displayable characters for these sequences). This might lead to “junk” in a web page (browsers would display a ?). See - you should filter for 5/6 byte sequences
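One way such a filter could look is sketched below. The helper name utf8_strip_5_6_byte is hypothetical and the choice of "?" as the replacement is an assumption; the pattern deliberately runs in byte mode (no /u modifier) so that it can see the raw lead bytes F8-FB (5 byte sequences) and FC-FD (6 byte sequences).

```php
<?php
/**
 * Replace 5 and 6 byte sequences with "?".
 * Hypothetical helper, shown only as a sketch.
 */
function utf8_strip_5_6_byte(string $str): string
{
    // Lead byte F8-FB is followed by 4 continuation bytes,
    // lead byte FC-FD by 5 continuation bytes.
    $pattern = '/[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5}/';

    // preg_replace() returns null on failure; fall back to the input.
    return preg_replace($pattern, '?', $str) ?? $str;
}

$input = "a\xF8\x88\x80\x80\x80b"; // "a", one 5 byte sequence, "b"
echo utf8_strip_5_6_byte($input), "\n"; // a?b
```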

Official documentation:
Risk: high
Impact: could result in corrupt UTF-8

The \w means “word character”, the meaning of which is loaded from the server’s current locale. From the manual;

A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a (available since
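The difference can be seen with a small sketch. The sample text is an assumption, and the behaviour with /u shown here relies on PHP builds where the modifier also enables Unicode character properties (PHP 5.3.2 and later, as far as I know).

```php
<?php
$text = "na\xC3\xAFve caf\xC3\xA9"; // "naïve café" as explicit UTF-8 bytes

// With /u the subject is treated as UTF-8 and \w matches whole letters.
preg_match_all('/\w+/u', $text, $withU);

// Without /u the pattern works byte by byte using locale tables, so the
// multibyte letters may be split or matched incorrectly.
preg_match_all('/\w+/', $text, $withoutU);

print_r($withU[0]);    // the two complete words
print_r($withoutU[0]); // fragments; the exact result depends on the locale
```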

Official docs at .

Official documentation:

Risk: high

Impact: could corrupt a UTF-8 string

Rumour - although this function (claims) to have UTF-8 support, bug reports claim it’s broken at least until .

Official documentation:

Risk: high

Impact: could corrupt a UTF-8 string

Highly suspect - see comments on htmlentities above - this does the reverse.

Official documentation:

Risk: low

Impact: in theory (not confirmed) should not damage a UTF-8 string

htmlspecialchars should (not confirmed) do the right thing by default (without the third argument specifying UTF-8) if it is given a well formed UTF-8 string because the characters it replaces are all within the ASCII range. The explicit form is htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
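A short sketch of the explicit call; the sample string is an assumption. It also shows that, with the charset given and no ENT_SUBSTITUTE or ENT_IGNORE flag, a badly formed input yields an empty string rather than corrupt output.

```php
<?php
$utf8_string = "<b>I\xC3\xB1t\xC3\xABr</b> & \"more\""; // contains "Iñtër"

// Only the ASCII characters <, >, & and " are replaced; the multibyte
// sequences pass through unchanged.
$escaped = htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
echo $escaped, "\n";

// A lone lead byte is not well formed UTF-8, so the result is ''.
$broken = htmlspecialchars("\xC3", ENT_COMPAT, 'UTF-8');
var_dump($broken); // string(0) ""
```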

Official documentation:

Risk: medium

Impact: The problematic features are probably rare.

TODO

Yet to investigate in detail. The x and X type specifiers are probably an issue. See also - have not verified .

The string padding functionality assumes single-byte input, so it won’t pad correctly if there are multi-byte utf8 characters in the arguments.
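The padding issue can be illustrated as follows. The helper utf8_pad_right is hypothetical and assumes well formed UTF-8 input; it counts characters with a /u pattern rather than relying on any optional extension.

```php
<?php
/**
 * Right-pad a UTF-8 string to $width characters (not bytes).
 * Hypothetical helper; assumes $str is well formed UTF-8.
 */
function utf8_pad_right(string $str, int $width): string
{
    // Count characters by matching each one with the /u modifier.
    $chars = (int) preg_match_all('/./us', $str);

    return $str . str_repeat(' ', max(0, $width - $chars));
}

$n = "\xC3\xB1"; // "ñ": one character, two bytes

$bytePadded = sprintf('%-5s', $n);   // pads to 5 bytes: ñ + 3 spaces
$charPadded = utf8_pad_right($n, 5); // pads to 5 characters: ñ + 4 spaces

echo strlen($bytePadded), ' ', strlen($charPadded), "\n"; // 5 6
```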

TODO , , and - same arguments probably apply
Official documentation:

Risk: high

Impact: could corrupt a UTF-8 string

str_ireplace() relies on the server’s locale setting to convert all characters to lower case. If the locale setting is something other than

Official documentation:

Risk: high

Impact: could corrupt a UTF-8 string

str_split() breaks up a string given a length argument. The length is a length in bytes not characters. That means it could break a multibyte UTF-8 sequence into invalid parts.

That said, if you know for sure that a given UTF-8 string contains, say, only 2 byte sequences, you might reasonably want to use str_split() to break it up into single character sequences.
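A sketch of both behaviours, with an assumed sample string. Splitting on an empty /u pattern is one possible character-aware alternative; it is not something the page above prescribes.

```php
<?php
$s = "a\xC3\xB1b"; // "añb": 3 characters, 4 bytes

// str_split() cuts every 2 bytes, separating the two bytes of ñ.
$byteChunks = str_split($s, 2);

// An empty pattern with /u splits between characters instead.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo count($byteChunks), ' ', count($chars), "\n"; // 2 3
echo bin2hex($byteChunks[0]), "\n";                // 61c3 (a + half of ñ)
```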
Official documentation:

Risk: medium

Impact: results cannot be trusted

strcasecmp() internally converts the two strings it is comparing to lowercase, based on the server locale settings. As such, it cannot be relied upon to be able to convert appropriate multibyte characters in UTF-8 to lowercase and, depending on the actual locale, may have internally corrupted the UTF-8 strings it is comparing, having falsely matched byte sequences. It won’t actually damage the UTF-8 string but the result of the comparison cannot be trusted.

That said, if two given UTF-8 strings are known to contain only characters in the

Official documentation:

Risk: medium

Impact: results cannot be trusted

strcspn() will return a length in bytes not characters, which may not always be what you require.

Also if the mask you provide it contains multibyte characters, these will be split, internally, into their component bytes, perhaps meaning results which are not semantically true: 10xxxxxx bytes in a sequence could be matched.
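The byte-level matching described above can be reproduced with two characters that share a lead byte. The strings are assumptions chosen for illustration.

```php
<?php
$subject = "a\xC3\xB3"; // "aó"
$mask    = "\xC3\xB1";  // "ñ", which does not occur in $subject

// ó (C3 B3) and ñ (C3 B1) share the lead byte C3, so strcspn() stops
// at the first byte of ó even though no ñ is present.
$len = strcspn($subject, $mask);
echo $len, "\n"; // 1 (3, the full byte length, would be semantically right)

// strpos() compares the complete byte sequence and finds nothing.
$pos = strpos($subject, $mask);
var_dump($pos); // bool(false)
```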

Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

stristr internally converts characters to lower case using the server’s locale when determining the substring to return; the result may be a corrupted UTF-8 string and the matching will be unpredictable (locale dependent).

Official documentation:

Risk: low

Impact: results in bytes not characters

strlen simply counts the number of bytes in a string, not the number of characters. This means that for UTF-8 the integer it returns can be larger than the number of characters in the string.

Note that this may not always be a problem - see the strpos discussion below for an example where working in bytes not characters produces expected results.
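A small sketch of the difference between the two counts. Counting characters with a /u pattern is just one option and assumes the string is well formed UTF-8.

```php
<?php
$s = "I\xC3\xB1t\xC3\xABr"; // "Iñtër"

$bytes = strlen($s);                  // counts bytes
$chars = preg_match_all('/./us', $s); // counts UTF-8 characters

echo "$bytes bytes, $chars characters\n"; // 7 bytes, 5 characters
```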
Official documentation:

Risk: low

Impact: results in bytes not characters

strpos will behave correctly with well formed UTF-8 but the result it returns will be in bytes not characters, which may or may not be what you desire, depending on what you want to do with that result.

You would be able to use the result in conjunction with substr() for example (remember each UTF-8 sequence is unique) but if you want to validate a string in some manner, based on character length not byte length, strpos may not be semantically correct.

Consider the following example;
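<?php
header('Content-type: text/html; charset=utf-8');
$haystack = 'Iñtërnâtiônàlizætiøn';
$needle = 'ô';

// strpos() returns a byte offset, not a character offset
$pos = strpos($haystack, $needle);

print "Position in bytes is $pos\n";

// Cutting at that byte offset lands on a character boundary
$substr = substr($haystack, 0, $pos);

print "Substr: $substr\n";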
This will display;
  Position in bytes is 12
  Substr: Iñtërnâti
The point being it “works” despite the fact the string is UTF-8: there’s no need to replace the use of strpos or substr in this case.

By contrast, pulling out an arbitrary substring which happens to cut a 2 byte UTF-8 sequence breaks the string;
<?php
header('Content-type: text/html; charset=utf-8');

$haystack = 'Iñtërnâtiônàlizætiøn';

$substr = substr($haystack, 0, 13); // Byte offset 13 is in the middle of the ô char

print "Substr: $substr\n";

$substr now contains badly formed UTF-8 and your browser should display something weird as a result (probably a ?)
Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

strrev first has to split a string into an array of bytes then reverse their order - this would corrupt multibyte characters in a UTF-8 string.
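The corruption is easy to reproduce, and a character-aware reversal can be sketched with preg_split. The helper utf8_strrev is hypothetical and rejects input that is not well formed UTF-8.

```php
<?php
/**
 * Reverse a UTF-8 string character by character.
 * Hypothetical helper, shown only as a sketch.
 */
function utf8_strrev(string $str): string
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not well formed UTF-8');
    }

    return implode('', array_reverse($chars));
}

$s = "a\xC3\xB1"; // "añ"

echo bin2hex(strrev($s)), "\n";      // b1c361: the two bytes of ñ are swapped
echo bin2hex(utf8_strrev($s)), "\n"; // c3b161: "ña"
```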

Note you could still use strrev() if you know that a given UTF-8 string only contains characters in the

Official documentation:

Risk: low

Impact: results in bytes not characters

strrpos will return an answer in bytes not characters. See strpos above for more info.

Official documentation:

Risk: low

Impact: results in bytes not characters

strspn will return an answer in bytes not characters - See strpos above for more info - similar arguments apply
Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

strtolower uses the server’s locale setting to understand the meaning of “uppercase” and “lowercase”. Depending on the locale character set, this could mean it falsely matches parts of a UTF-8 string with sequences in the character set it thinks it’s using; the result would be “corrupt” UTF-8.
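If lowercasing only the ASCII letters is acceptable, a locale-independent variant can be sketched with strtr. The helper ascii_strtolower is hypothetical; it is safe on UTF-8 because the bytes for A-Z never occur inside a multibyte sequence, but it leaves non-ASCII capitals unchanged.

```php
<?php
/**
 * Lowercase only the ASCII letters A-Z, ignoring the server locale.
 * Hypothetical helper, shown only as a sketch.
 */
function ascii_strtolower(string $str): string
{
    return strtr(
        $str,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

$s = "D\xC3\x89J\xC3\x80 VU"; // "DÉJÀ VU"

// É and À keep their original bytes; D, J, V and U are lowercased.
echo bin2hex(ascii_strtolower($s)), "\n"; // 64c3896ac380207675
```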

Otherwise strtolower would fail to be able to understand the meaning of “uppercase” and “lowercase” in UTF-8 if the locale does not support UTF-8 (your locale might be US-

Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

See notes on strtolower above.

Official documentation:

Risk: medium to high

Impact: accepts arguments in byte positions not characters - could corrupt a UTF-8 string

If used in an arbitrary manner to chop off part of a string, it could potentially split UTF-8 sequences resulting in corruption. At the same time if used in conjunction with functions like strpos (see notes above), would be able to extract a portion of a UTF-8 string without corrupting it, although you’ll be passing it arguments in terms of byte positions not character positions.
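When a string must be chopped at an arbitrary byte limit, one possible approach is to back up to the nearest character boundary first. The helper utf8_safe_truncate below is hypothetical and assumes well formed UTF-8 input; it relies on continuation bytes always having the bit pattern 10xxxxxx.

```php
<?php
/**
 * Truncate to at most $maxBytes bytes without splitting a UTF-8 sequence.
 * Hypothetical helper; assumes $str is well formed UTF-8.
 */
function utf8_safe_truncate(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str;
    }

    // $str[$cut] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx) the cut falls inside a character,
    // so move back until it is a lead byte or an ASCII byte.
    $cut = $maxBytes;
    while ($cut > 0 && (ord($str[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }

    return substr($str, 0, $cut);
}

$s = "ab\xC3\xB4c"; // "abôc"

echo bin2hex(utf8_safe_truncate($s, 3)), "\n"; // 6162 ("ab", ô dropped whole)
echo bin2hex(utf8_safe_truncate($s, 4)), "\n"; // 6162c3b4 ("abô")
```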

Official documentation:

Risk: medium to high

Impact: accepts arguments in byte positions not characters - could corrupt a UTF-8 string

If arbitrary start and length arguments are supplied, could corrupt a UTF-8 string. Otherwise could be used in some instances when working with relative UTF-8 character positions - see notes on substr above.

Official documentation: , ,

Risk: low

Impact: could corrupt a UTF-8 string if second (optional) charlist arg is used

Used in the “default” manner (without the second charlist argument) these functions are safe to use on a UTF-8 string, because the whitespace characters they are searching for are all in the could be trimmed from other multibyte sequences in the subject string. Probably (unconfirmed) this can only happen when trimming from the right hand side of the string, so this problem may only affect and .
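The charlist problem can be reproduced with two characters that happen to end in the same byte. The strings are assumptions chosen for illustration, and the preg_replace call is only one possible character-aware alternative.

```php
<?php
$subject  = "a\xC4\xB1"; // "aı" (dotless i, bytes C4 B1)
$charlist = "\xC3\xB1";  // "ñ" (bytes C3 B1)

// rtrim() treats the charlist as the single bytes C3 and B1, so it
// strips the trailing B1 that belongs to ı and leaves a lone C4.
$trimmed = rtrim($subject, $charlist);
echo bin2hex($trimmed), "\n"; // 61c4

// Matching the whole character U+00F1 with /u leaves ı intact.
$safe = preg_replace('/\x{F1}+\z/u', '', $subject);
echo bin2hex($safe), "\n"; // 61c4b1
```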

Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

See notes to strtolower above

Official documentation:

Risk: high

Impact: could return a corrupt UTF-8 string

See notes to strtolower above

Official documentation:

Risk: medium to high

Impact: could return a corrupt UTF-8 string

If the fourth “cut” argument is used, could split a UTF-8 sequence, resulting in corruption.

To be confirmed - what is the meaning of a “word” to this function. Is it the same as ;

The definition of a word is any string of characters that is immediately after a whitespace (These are: space, form-feed, newline, carriage return, horizontal tab, and vertical tab).

If that is correct, wordwrap will only be dangerous if the cut argument is used.
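A sketch of the cut case, together with a simple well-formedness check. The helper utf8_is_well_formed is hypothetical; it relies on preg_match() returning false when a /u pattern is given an invalid UTF-8 subject.

```php
<?php
/**
 * Check whether a string is well formed UTF-8.
 * Hypothetical helper, shown only as a sketch.
 */
function utf8_is_well_formed(string $str): bool
{
    // The empty pattern always matches a valid subject (returns 1);
    // an invalid UTF-8 subject makes preg_match() return false.
    return preg_match('//u', $str) === 1;
}

$word = str_repeat("\xC3\xB1", 4); // "ññññ": 4 characters, 8 bytes

// Width 3 with cut enabled forces breaks at byte offsets, which fall
// in the middle of the 2 byte sequences.
$wrapped = wordwrap($word, 3, "\n", true);

var_dump(utf8_is_well_formed($word));    // bool(true)
var_dump(utf8_is_well_formed($wrapped)); // bool(false)
```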
Official docs at .

- needs to become an explicit list of functions. Just a description right now.

The main issue related to arrays is sorting and (thankfully) this will be non-critical to most applications.

Functions like , when sorting alphanumerically, will lack the knowledge to know how to sort multi byte UTF-8 characters in a manner which is semantically correct. will still sort
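A short sketch of what byte-wise sorting does to UTF-8 strings; the word list is an assumption. The order is stable and predictable (it follows the byte values), but it is not what a human reader would call alphabetical. Where the intl extension is available, its Collator class may be an option for locale-aware ordering, though that is outside what this page covers.

```php
<?php
// "éclair" starts with the bytes C3 A9, which sort after every ASCII letter.
$words = ['zebra', "\xC3\xA9clair", 'apple'];

sort($words, SORT_STRING);

print_r($words); // apple, zebra, éclair
```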
- and UTF-8 - content type headers? base64 encoding?

As mentioned at (compared to UTF-7);
> UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using quoted printable or base64.
Some links;

utf-mail - download utf-mail.zip - plugin for Wordpress but shows one way to do it (without using mb_string).

- how to do it with mb_string

There seem to be two approaches (at least specific to the body of the email, ignoring subject / headers): if you want to send plain text you have to encode that body with something like . Alternatively you could “attach” an
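The base64 route for a plain text body might be sketched as follows. The helper utf8_mail_parts is hypothetical; it only prepares the strings and sends nothing, and it assumes a subject short enough to fit in a single encoded-word.

```php
<?php
/**
 * Build the subject, headers and body for a base64 encoded UTF-8 email.
 * Hypothetical helper; it prepares strings only and sends nothing.
 */
function utf8_mail_parts(string $subject, string $body): array
{
    return [
        // Encoded-word form for non-ASCII header text
        'subject' => '=?UTF-8?B?' . base64_encode($subject) . '?=',
        'headers' => "MIME-Version: 1.0\r\n"
            . "Content-Type: text/plain; charset=UTF-8\r\n"
            . "Content-Transfer-Encoding: base64",
        // chunk_split() wraps the base64 text at 76 characters per line
        'body' => chunk_split(base64_encode($body)),
    ];
}

$parts = utf8_mail_parts("I\xC3\xB1t\xC3\xABr", "Body with \xC3\xB4");

echo $parts['subject'], "\n";
echo $parts['headers'], "\n";
echo $parts['body'];
```

The three parts could then be handed to mail() as its subject, message and additional headers arguments.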

See official docs at .

Official documentation: ,

Risk: low

Impact: problem when using these for stuff like

Official documentation: ,

Risk: low

Impact: lengths reported on strings will be in bytes

Just a potential debugging “gotcha”: if the web page is encoded as UTF-8, you may only see 3 characters, for example, while these functions report, say, 5 as the string length

Official docs at .

The SAX parser (officially) supports three encodings. It distinguishes between source encoding (the encoding of an XML document it is parsing) and target encoding - the encoding of strings passed to your SAX callback functions.

The source encoding is either passed explicitly to xml_parser_create or (since PHP 5) determined automatically from the charset declaration in the XML document. If no source encoding is specified, PHP defaults to ISO-8859-1 (perhaps a design flaw - would have been smarter to default to UTF-8). If the source encoding contains byte sequences PHP doesn’t understand, it will raise an error e.g. the XML_ERROR_UNKNOWN_ENCODING or XML_ERROR_INCORRECT_ENCODING error codes.

The target encoding can be controlled with the xml_parser_set_option function. Any incoming characters outside the range of the target encoding are replaced with a question mark. That means if the source encoding is UTF-8 and the target encoding is US-ASCII, multibyte UTF-8 characters will be replaced with a question mark.
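The source/target distinction can be sketched like this. It assumes the xml extension is available; the sample document and the handler are illustrative only.

```php
<?php
$text = '';

// Source encoding: the document being parsed is UTF-8.
$parser = xml_parser_create('UTF-8');

// Target encoding: strings handed to the callbacks are US-ASCII.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'US-ASCII');

// Collect character data; the handler may be called more than once.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$text) {
        $text .= $data;
    }
);

xml_parse($parser, "<a>I\xC3\xB1t</a>", true); // contains "Iñt"

echo $text, "\n"; // I?t (ñ is outside US-ASCII and becomes a question mark)
```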

Note that the XML SAX extension should (not confirmed) spot badly formed UTF-8 in the source encoding. Also its definition of what is UTF-8 is only those within the Unicode range (unlike the PCRE extension) - i.e. doesn’t regard 5 and 6 byte sequences as being UTF-8.

See PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss. See also which implements a work around for detecting / converting other character sets (currently in the rss.parse.inc file).

Both the PHP4 and PHP5 xml-dom extensions use UTF-8 as their internal encoding. This means that they mostly get it right; however, there is one major GOTCHA, since they expect input strings to be UTF-8 encoded. If you use ISO-8859-1 as your internal encoding (which you most likely do), this means that each and every string that you input to the DOM API should be encoded with utf8_encode. It’s important to realize that you have to do this regardless of which encoding the document is output in. Annoying to say the least, but at least it’s consistent.

utf8_encode and utf8_decode

Official documentation: ,

Risk: medium

Impact: will result in corrupt UTF-8 if used incorrectly - they are used to convert only between UTF-8 and ISO-8859-1 - use on any other charset (excepting ASCII-7) would result in junk / lost characters

These functions are designed to convert between ISO-8859-1 and UTF-8 (nothing more, nothing less). In particular older versions of IE / Win98 used CP1252 (a Windows encoding similar to but not the same as ISO-8859-1). See this manual entry.
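The ISO-8859-1 to UTF-8 mapping is simple enough to sketch by hand, which also makes the CP1252 caveat visible. The helper latin1_to_utf8 is hypothetical; it mirrors the byte mapping described above, and may be useful to compare against since utf8_encode() itself is deprecated as of PHP 8.2.

```php
<?php
/**
 * Convert an ISO-8859-1 string to UTF-8, byte by byte.
 * Hypothetical helper, shown only as a sketch.
 */
function latin1_to_utf8(string $latin1): string
{
    $out = '';
    $len = strlen($latin1);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            // ASCII bytes are identical in UTF-8
            $out .= $latin1[$i];
        } else {
            // 0x80-0xFF become a two byte sequence: 110000xx 10xxxxxx
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }

    return $out;
}

echo bin2hex(latin1_to_utf8("\xF1")), "\n"; // c3b1 (ñ)

// In CP1252 the byte 0x80 is the euro sign, but treated as ISO-8859-1
// it maps to the control character U+0080, not to € (e2 82 ac).
echo bin2hex(latin1_to_utf8("\x80")), "\n"; // c280
```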

Some links

utf-8 encodes decodes utf-8 encoded strings into

Is it a good idea to use UTF-8 in URLs (security issues / mapping to filesystem / DB primary keys etc.)?

Official documentation: ,

Risk: low

Impact: encoding a string that has previously been utf-8 encoded is generally safe - it’ll appear as a multibyte sequence compliant with . The multibyte sequence will present correctly on a page declared to be encoded with the UTF-8 charset. However, a utf-8 encoder other than should be used to convert unicode entities to a utf-8 encoded string.

Official documentation: ,

Risk: medium

Impact: incoming compliant strings will be correctly decoded as valid utf-8. Note however that these functions operate on bytes rather than characters, and thus encoded strings that do not represent valid utf-8 (e.g. “%80%80%80” or “%c0%bc”) will be decoded without error. ECMAScript %uNNNN-style encodings are not supported.
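Because decoding happens at the byte level, it can be worth checking the decoded result for well-formedness. A minimal sketch, using rawurldecode() and the same two invalid inputs mentioned above; the /u based check is one possible approach, not the only one.

```php
<?php
$good     = rawurldecode('%C3%B1');    // bytes C3 B1, "ñ"
$stray    = rawurldecode('%80%80%80'); // three continuation bytes, no lead byte
$overlong = rawurldecode('%c0%bc');    // overlong encoding of "<"

// preg_match() with /u returns false for subjects that are not valid UTF-8.
var_dump(preg_match('//u', $good) === 1);     // bool(true)
var_dump(preg_match('//u', $stray) === 1);    // bool(false)
var_dump(preg_match('//u', $overlong) === 1); // bool(false)
```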

Some links

(which needs ) can safely replace php’s built-in (, ) decoding functions.

decodes utf-8 encoded strings into HTML unicode entities (&#NNNN;) or javascript ones (%uNNNN) .

uriescape

Official docs at .

Stuff todo here. In particular functions like imagettftext. Guessing it will depend largely on what the GD font you are using is able to support.

Some links;

- slightly suspect (e.g. interchange between use of

Otherwise suspect Gallery v2 has this nailed these days - need to look

Official docs at .

Stuff to research here - what are the issues in reading exif data - are exotic charsets used? etc.

Some links;

- aside from having built in UTF-8 support, very cool library

Special mentions for stuff which may be “surprisingly” safe with UTF-8. Note if “well formedness” is mentioned, it may mean you should be checking the strings for well formedness before using these functions.

Official documentation:

Risk: none

So long as all arguments used are well formed UTF-8, no problems.

This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence)

Official documentation:

Risk: none

So long as all arguments used are well formed UTF-8, no problems.

This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence).

see table here

referring to the table here

PHP5 uses libxml2 which supports more encodings - rumour has it (not confirmed) that creating the parser like xml_parser_create(""); will allow it to support more than just the three official character sets, auto-detecting from the charset declaration

Note that this is not compatible with the ECMAScript %uNNNN-style encoding used by the escape() and unescape() functions, but is compatible with the new encodeURIComponent() and friends.
