2009-12-23 08:20:37
This page is intended as a reference for functionality and pages.
Note that this page applies to
The following functions / functionality in PHP may pose issues when used in conjunction with UTF-8, depending on what you’re doing.
Important: Note the words “it depends” are critical to bear in mind here - do not blindly replace all use of these functions without understanding why you’re doing so. Remember ASCII (aka US-ASCII or ASCII7) is a subset of UTF-8 and that UTF-8 has been designed so that no character sequence in a well formed UTF-8 string can be mistaken as a sub-sequence of another, longer character. These two facts will often mean you can survive with PHP’s own string functions depending on the exact nature of what you are doing with them - see the strpos discussion below. Blindly replacing all uses “just in case” is likely to lead to apps which run like lame dogs.
Note on Locales: the discussion below could be read to suggest “locales are evil”, which would be to misunderstand the problem.
If you’re writing code for yourself, to be used on a server you control, locales could be made to work if your server has locales installed which support UTF-8. That would mean functions like behave correctly.
But this is no use if you’re writing applications which will be installed by third parties, because it’s system specific. In practice it requires two things: that there is a locale available on the system which supports UTF-8 (not guaranteed), and that the correct locale identifier string can be found (there are definitely differences between Windows and *Nix locale identifiers, and even amongst the Unixes I believe there are variations). What’s more, you can’t rely on users to be able to change the locale correctly to suit your application’s needs - on a shared host they probably won’t be able to change the locale for the user that Apache is running with. Bottom line - locales are not the way to go for applications intended to be “write once, run anywhere”.
Update: You can downgrade your character type locale to the POSIX (C) locale via setlocale(), like
setlocale(LC_CTYPE, 'C');
This should work on all platforms. It would mean functions like strtolower() are only considering characters in the
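A minimal sketch of the locale downgrade described above. The sample string and variable names are assumptions for illustration, not part of the original page. Under the C locale, strtolower() only changes the bytes for A-Z, so the bytes of a multibyte sequence pass through untouched (and non-ASCII letters are simply not lowercased).

```php
<?php
// Switch only the character-type category to the POSIX "C" locale.
setlocale(LC_CTYPE, 'C');

// "\xC3\x80" is the UTF-8 encoding of "À" (U+00C0), written as escapes so the
// example does not depend on the encoding of this source file.
$mixed = "\xC3\x80BC";

// Only the ASCII letters are lowercased; the two bytes of "À" are left alone,
// so the result is still well formed UTF-8.
$lower = strtolower($mixed);
echo bin2hex($lower), "\n";
```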
for details of how to check for well formedness. The point there is you
should check UTF-8 strings for well formedness when using functions
like (see below) which will work with UTF-8 so long as it is well formed.
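One possible way to check well formedness, as a rough sketch only: it relies on PCRE refusing to process an invalid subject when the /u modifier is used. The helper name is_well_formed_utf8 is hypothetical and not taken from the page or from any library.

```php
<?php
/**
 * Hypothetical helper: true if $str is well formed UTF-8.
 * Assumes PCRE was built with UTF-8 support (the usual case).
 */
function is_well_formed_utf8($str)
{
    // An empty pattern matches any valid subject, so a return value of 1
    // means "valid". On badly formed UTF-8 preg_match() does not return 1.
    return preg_match('//u', $str) === 1;
}

$good = "I\xC3\xB1t\xC3\xABrn"; // "Iñtërn"
$bad  = "I\xC3";                // truncated 2 byte sequence

var_dump(is_well_formed_utf8($good));
var_dump(is_well_formed_utf8($bad));
```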
Note that you can find “UTF-8 aware” implementations of many of these functions under CVS here.
Official docs at .
Unless the /u modifier is used as well, picks up its understanding of upper and lowercase from the server’s locale. Depending on what you’re doing, this may result in false matches which in turn lead to corrupt UTF-8 strings.
UTF-8 allows for 5 and 6 byte character sequences but these have no meaning in Unicode (i.e. there are no displayable characters for these sequences). This might lead to “junk” in a web page (browsers would display a ?). See - you should filter for 5/6 byte sequences
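A sketch of one way such a filter could look, assuming you want each 5 or 6 byte sequence replaced by a single "?". The helper name is hypothetical. The pattern deliberately runs in byte mode (no /u modifier) so that it can see the raw lead and continuation bytes.

```php
<?php
/**
 * Hypothetical helper: replace 5 and 6 byte sequences with "?".
 */
function strip_5_and_6_byte_sequences($str)
{
    // 0xF8-0xFB lead a 5 byte sequence, 0xFC-0xFD lead a 6 byte sequence;
    // each lead byte is followed by continuation bytes in 0x80-0xBF.
    $pattern = '/[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5}/';
    return preg_replace($pattern, '?', $str);
}

$input = "a\xF8\x88\x80\x80\x80b"; // "a", one 5 byte sequence, "b"
echo strip_5_and_6_byte_sequences($input), "\n";
```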
The \w means “word character”, the meaning of which is loaded from the server’s current locale. From the manual;
A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a (available since
Official docs at .
Official documentation: Risk: high Impact: could corrupt a UTF-8 string
Rumour - although this function claims to have UTF-8 support, bug reports claim it’s broken at least until .
Official documentation: Risk: high Impact: could corrupt a UTF-8 string
Highly suspect - see comments on htmlentities above - this does the reverse.
Official documentation: Risk: low Impact: in theory (not confirmed) should not damage a UTF-8 string
htmlspecialchars should (not confirmed) do the right thing by default (without the third argument specifying UTF-8) if it is given a well formed UTF-8 string because the characters it replaces are all within the ASCII range. To specify UTF-8 explicitly:
htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
Official documentation: Risk: medium Impact: The problematic features are probably rare.
TODO: Yet to investigate in detail. The x and X type specifiers are probably an issue. See also - have not verified .
The string padding functionality assumes single-byte input, so it won’t pad correctly if there are multi-byte UTF-8 characters in the arguments.
TODO , , and - same arguments probably apply
Official documentation: Risk: high Impact: could corrupt a UTF-8 string
str_ireplace() relies on the server’s locale setting to convert all characters to lower case. If the locale setting is something other than
Official documentation: Risk: high Impact: could corrupt a UTF-8 string
str_split() breaks up a string given a length argument. The length is a length in bytes not characters. That means it could break a multibyte UTF-8 sequence into invalid parts.
That said, if you know for sure that a given UTF-8 string contains, say, only 2 byte sequences, you might reasonably want to use it to break the string up into single character sequences.
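For strings that mix sequence lengths, a character-aware split is needed instead. The following is a sketch of one common approach using preg_split() with the /u modifier. The helper name utf8_chars is hypothetical, and the input is assumed to be well formed UTF-8 (preg_split() returns false otherwise).

```php
<?php
/**
 * Hypothetical helper: split a well formed UTF-8 string into characters.
 */
function utf8_chars($str)
{
    // With /u the empty pattern matches between characters, not bytes.
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$str = "\xC3\xB1\xC3\xB4a"; // "ñôa": two 2 byte sequences plus one ASCII byte

echo count(str_split($str)), "\n"; // number of bytes
echo count(utf8_chars($str)), "\n"; // number of characters
```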
Official documentation: Risk: medium Impact: results cannot be trusted
strcasecmp() internally converts the two strings it is comparing to lowercase, based on the server locale settings. As such, it cannot be relied upon to be able to convert appropriate multibyte characters in UTF-8 to lowercase and, depending on the actual locale, may have internally corrupted the UTF-8 strings it is comparing, having falsely matched byte sequences. It won’t actually damage the UTF-8 string but the result of the comparison cannot be trusted.
That said, if two given UTF-8 strings are known to contain only characters in the
Official documentation: Risk: medium Impact: results cannot be trusted
strcspn() will return a length in bytes not characters, which may not always be what you require.
Also if the mask you provide it contains multibyte characters, these will be split, internally, into their component bytes, perhaps meaning results which are not semantically true - 10xxxxxx bytes in a sequence could be matched.
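A small illustration of the mask problem, using sample strings chosen for this sketch (they are not from the original page). "é" and "©" are different characters but share the same 10xxxxxx continuation byte, so strcspn() stops in the middle of "é".

```php
<?php
$subject = "a\xC3\xA9"; // "aé" ("é" is the bytes C3 A9)
$mask    = "\xC2\xA9";  // "©"  ("©" is the bytes C2 A9)

// "©" does not occur in "aé", so a character-aware search would scan the
// whole string (3 bytes). strcspn() treats the mask as the single bytes C2
// and A9, and stops at the A9 continuation byte inside "é".
$len = strcspn($subject, $mask);
echo $len, "\n";
```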
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
stristr internally converts characters to lower case using the server’s locale. In determining the substring to return, the result may be a corrupted UTF-8 string, and the matching will be unpredictable (locale dependent).
Official documentation: Risk: low Impact: results in bytes not characters
strlen simply counts the number of bytes in a string, not the number of characters. This means for UTF-8 the integer it returns may be larger than the number of characters in the string (the two are equal only when every character is ASCII).
Note that this may not always be a problem - see the strpos discussion below for an example where working in bytes not characters produces expected results.
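When a character count really is required, one option is to count matches of "." under the /u modifier. This is only a sketch: the helper name utf8_strlen is hypothetical and the input is assumed to be well formed UTF-8 (preg_match_all() returns false otherwise).

```php
<?php
/**
 * Hypothetical helper: number of characters in a well formed UTF-8 string.
 */
function utf8_strlen($str)
{
    // /u makes "." match one whole character; /s lets it match newlines too.
    return preg_match_all('/./su', $str, $matches);
}

$str = 'Iñtërnâtiônàlizætiøn';
echo strlen($str), "\n";      // length in bytes
echo utf8_strlen($str), "\n"; // length in characters
```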
Official documentation: Risk: low Impact: results in bytes not characters
strpos will behave correctly with well formed UTF-8 but the result it returns will be in bytes not characters, which may or may not be what you desire, depending on what you want to do with that result.
You would be able to use the result in conjunction with substr, for example (remember each UTF-8 sequence is unique), but if you want to validate a string in some manner, based on character length not byte length, strpos may not be semantically correct.
Consider the following example;
<?php
header('Content-type: text/html; charset=utf-8');
$haystack = 'Iñtërnâtiônàlizætiøn';
$needle = 'ô';
// strpos() returns a byte offset, not a character offset
$pos = strpos($haystack, $needle);
echo "Position in bytes is $pos<br />";
// The offset of a whole character always falls on a character boundary
$substr = substr($haystack, 0, $pos);
echo "Substr: $substr<br />";
This will display;
Position in bytes is 12
Substr: Iñtërnâti
The point being it “works” despite the fact the string is UTF-8 - there’s no need to replace the use of strpos or substr in this case.
By contrast, pulling out an arbitrary substring which happens to cut a 2 byte UTF-8 sequence breaks the string;
<?php
header('Content-type: text/html; charset=utf-8');
$haystack = 'Iñtërnâtiônàlizætiøn';
$substr = substr($haystack, 0, 13); // Byte position 13 is in the middle of the ô char
echo "Substr: $substr<br />";
$substr now contains badly formed UTF-8 and your browser should display something weird as a result (probably a ?)
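If a string must be cut to a maximum number of bytes, one way to avoid the breakage shown above is to back up to the nearest character boundary before cutting. The sketch below assumes well formed UTF-8 input and a non-negative limit; the function name utf8_cut_bytes is hypothetical.

```php
<?php
/**
 * Hypothetical helper: cut $str to at most $maxBytes bytes without
 * splitting a multibyte sequence.
 */
function utf8_cut_bytes($str, $maxBytes)
{
    if (strlen($str) <= $maxBytes) {
        return $str;
    }
    $end = $maxBytes;
    // If the first byte being dropped is a continuation byte (10xxxxxx),
    // the cut is inside a character, so move back to its lead byte.
    while ($end > 0 && (ord($str[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($str, 0, $end);
}

$haystack = 'Iñtërnâtiônàlizætiøn';
echo utf8_cut_bytes($haystack, 13), "\n"; // stops before the ô instead of splitting it
```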
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
strrev first has to split a string into an array of bytes then reverse their order - this would corrupt multibyte characters in a UTF-8 string.
Note you could still use strrev() if you know that a given UTF-8 string only contains characters in the ASCII range.
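For strings that are not limited to that range, a character-aware reverse can be sketched by splitting into characters first. The helper name utf8_strrev is hypothetical, the input is assumed to be well formed UTF-8, and combining characters would still end up detached from their base character.

```php
<?php
/**
 * Hypothetical helper: reverse a well formed UTF-8 string by character.
 */
function utf8_strrev($str)
{
    // Split between characters (not bytes), then reverse and rejoin.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}

$str = "a\xC3\xB1b"; // "añb"
echo bin2hex(strrev($str)), "\n";      // the two bytes of ñ are swapped
echo bin2hex(utf8_strrev($str)), "\n"; // ñ survives intact
```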
Official documentation: Risk: low Impact: results in bytes not characters
strrpos will return an answer in bytes not characters. See strpos above for more info.
Official documentation: Risk: low Impact: results in bytes not characters
strspn will return an answer in bytes not characters - See strpos above for more info - similar arguments apply
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
strtolower uses the server’s locale setting to understand the meaning of “uppercase” and “lowercase”. Depending on the locale character set, this could mean it falsely matches parts of a UTF-8 string with sequences in the character set it thinks it’s using - the result would be “corrupt” UTF-8.
Otherwise strtolower would fail to be able to understand the meaning of “uppercase” and “lowercase” in UTF-8 if the locale does not support UTF-8 (your locale might be US-
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
See notes on strtolower above.
Official documentation: Risk: medium to high Impact: accepts arguments in bytes positions not characters - could corrupt a UTF-8 string
If used in an arbitrary manner to chop off part of a string, it could potentially split UTF-8 sequences resulting in corruption. At the same time if used in conjunction with functions like strpos (see notes above), would be able to extract a portion of a UTF-8 string without corrupting it, although you’ll be passing it arguments in terms of byte positions not character positions.
Official documentation: Risk: medium to high Impact: accepts arguments in bytes positions not characters - could corrupt a UTF-8 string
If arbitrary start and length arguments are supplied, could corrupt a UTF-8 string. Otherwise could be used in some instances when working with relative UTF-8 character positions - see notes on substr above.
Official documentation: , , Risk: low Impact: could corrupt a UTF-8 string if second (optional) charlist arg is used
Used in the “default” manner (without the second charlist argument) these functions are safe to use on a UTF-8 string, because the whitespace characters they are searching for are all in the ASCII range. If the charlist argument contains multibyte characters, it is treated as a set of single bytes, so individual bytes could be trimmed from other multibyte sequences in the subject string. Because both lead bytes and continuation bytes can be shared between different characters, this can happen when trimming from either end of the string.
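A short demonstration of the charlist problem, with sample characters picked for this sketch because they share bytes (they are not from the original page). Both calls remove a single byte of a two byte character and leave badly formed UTF-8 behind.

```php
<?php
// Right side: "à" is C3 A0 and "Š" is C5 A0, so they share the byte A0.
$right = rtrim("x\xC3\xA0", "\xC5\xA0"); // trim "Š" from "xà"
echo bin2hex($right), "\n"; // the A0 byte is gone, a lone C3 lead byte remains

// Left side: "à" is C3 A0 and "ñ" is C3 B1, so they share the lead byte C3.
$left = ltrim("\xC3\xA0x", "\xC3\xB1"); // trim "ñ" from "àx"
echo bin2hex($left), "\n"; // the C3 byte is gone, a lone A0 byte remains
```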
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
See notes to strtolower above
Official documentation: Risk: high Impact: could return a corrupt UTF-8 string
See notes to strtolower above
Official documentation: Risk: medium to high Impact: could return a corrupt UTF-8 string
If the fourth “cut” argument is used, could split a UTF-8 sequence, resulting in corruption.
To be confirmed - what is the meaning of a “word” to this function. Is it the same as ;
The definition of a word is any string of characters that is immediately after a whitespace (These are: space, form-feed, newline, carriage return, horizontal tab, and vertical tab).
If that is correct, wordwrap will only be dangerous if the cut argument is used.
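To illustrate the risk with the cut argument, here is a small sketch using a made-up input: a single “word” of four 2 byte characters wrapped at a width of 3. The width is counted in bytes, so the first break lands inside the second character.

```php
<?php
$word = str_repeat("\xC3\xB1", 4); // "ññññ": 8 bytes, no whitespace

// With cut = true, wordwrap() forces a break once a word reaches the width,
// and the width is measured in bytes.
$wrapped = wordwrap($word, 3, "\n", true);
$pieces  = explode("\n", $wrapped);

echo bin2hex($pieces[0]), "\n"; // one whole ñ plus the lead byte of the next
```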
Official docs at .
- needs to become an explicit list of functions. Just a description right now.
The main issue related to arrays is sorting and (thankfully) this will be non-critical to most applications.
Functions like , when sorting alphanumerically, will lack the knowledge to know how to sort multi byte UTF-8 characters in a manner which is semantically correct. will still sort
- and UTF-8 - content type headers? base64 encoding?
As mentioned at (compared to UTF-7);
> UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using quoted printable or base64.
Some links;
utf-mail - download utf-mail.zip - plugin for Wordpress but shows one way to do it (without using mb_string).
- how to do it with mb_string
There seem to be two approaches (at least specific to the body of the email - ignoring subject / headers) - if you want to send plain text you have to encode that body with something like . Alternatively you could “attach” an
See official docs at .
Official documentation: , Risk: low Impact: problem when using these for stuff like
Official documentation: , Risk: low Impact: lengths reported on strings will be in bytes
Just a potential debugging “gotcha” - if the web page is encoded as UTF-8, you may only see 3 characters, for example, while these functions report, say, 5 as the string length
Official docs at .
The SAX parser (officially) supports three encodings (ISO-8859-1, US-ASCII and UTF-8). It distinguishes between source encoding (the encoding of an XML document it is parsing) and target encoding - the encoding of strings passed to your SAX callback functions.
The source encoding is either passed explicitly to xml_parser_create or (since PHP 5) determined automatically from the charset declaration in the XML document. If no source encoding is specified, PHP defaults to ISO-8859-1 (perhaps a design flaw - would have been smarter to default to UTF-8). If the source encoding contains byte sequences PHP doesn’t understand, it will raise an error e.g. the XML_ERROR_UNKNOWN_ENCODING or XML_ERROR_INCORRECT_ENCODING error codes.
The target encoding can be controlled with the xml_parser_set_option function. Any incoming characters outside the range of the target encoding are replaced with a question mark. That means if the source encoding is UTF-8 and the target encoding is US-ASCII, multibyte UTF-8 characters will be replaced with a question mark.
Note that the XML SAX extension should (not confirmed) spot badly formed UTF-8 in the source encoding. Also its definition of what is UTF-8 is only those sequences within the Unicode range (unlike the PCRE extension) - i.e. it doesn’t regard 5 and 6 byte sequences as being UTF-8.
See PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss. See also which implements a work around for detecting / converting other character sets (currently in the rss.parse.inc file).
Both PHP4 + PHP5 xml-dom extensions use UTF-8 as internal encoding. This means that they mostly get it right, however there is one major GOTCHA, since they expect input strings to be UTF-8 encoded. If you use ISO-8859-1 as your internal encoding (which you most likely do), this means that each and every string that you input to the DOM API should be encoded with utf8_encode. It’s important to realize that you have to do this regardless of which encoding the document is output in. Annoying to say the least, but at least it’s consistent.
utf8_encode and utf8_decode
Official documentation: , Risk: medium Impact: will result in corrupt UTF-8 if used incorrectly - they are used to convert only between UTF-8 and ISO-8859-1 - use on any other charset (excepting ASCII-7) would result in junk / lost characters
These functions are designed to convert between ISO-8859-1 and UTF-8 (nothing more, nothing less). In particular older versions of IE / Win98 used CP1252 (a Windows encoding similar to but not the same as ISO-8859-1). See this manual entry.
Some links
utf-8 encodes decodes utf-8 encoded strings into
Is it a good idea to use UTF-8 in URLs (security issues / mapping to filesystem / DB primary keys etc.)?
Official documentation: , Risk: low Impact: encoding a string that has previously been utf-8 encoded is generally safe - it’ll appear as a multibyte sequence compliant with . The multibyte sequence will present correctly on a page declared to be encoded with the UTF-8 charset. However, a utf-8 encoder other than should be used to convert unicode entities to a utf-8 encoded string.
Official documentation: , Risk: medium Impact: incoming compliant strings will be correctly decoded as valid UTF-8. Note however that these functions operate on bytes rather than characters, and thus encoded strings that do not represent valid UTF-8 (e.g. “%80%80%80” or “%c0%bc”) will be decoded without error. ECMAScript %uNNNN-style encodings are not supported.
Some links
(which needs ) can safely replace PHP’s built-in (, ) decoding functions. decodes utf-8 encoded strings into HTML unicode entities (&#NNNN;) or javascript ones (%uNNNN).
Official docs at .
Stuff todo here. In particular functions like imagettftext. Guessing it will depend largely on what the GD font you are using is able to support.
Some links;
- slightly suspect (e.g. interchange between use of
Otherwise suspect Gallery v2 has this nailed these days - need to look
Official docs at .
Stuff to research here - what are the issues in reading exif data - are exotic charsets used? etc.
Some links;
- aside from having built in UTF-8 support, very cool library
Special mentions for stuff which may be “surprisingly” safe with UTF-8. Note if “well formedness” is mentioned, it may mean you should be checking the strings for well formedness before using these functions.
Official documentation: Risk: none
So long as all arguments used are well formed UTF-8, no problems.
This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence).
Official documentation: Risk: none
So long as all arguments used are well formed UTF-8, no problems.
This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence).
see table here
referring to the table here
PHP5 uses libxml2 which supports more encodings - rumour has it (not confirmed) that creating the parser like
xml_parser_create('');
will allow it to support more than just the three official character sets, auto-detecting from the charset declaration
Note that this is not compatible with the ECMAScript %uNNNN-style encoding used by the escape() and unescape() functions, but is compatible with the new encodeURIComponent() and friends.