Sweet 16

When is Unicode not Unicode? When it's UTF-16 instead of UTF-8. Both are properly Unicode character sets, but for reasons that escape me they are not fully compatible. In today's installment of "Fix Microsoft's bugs", we'll look at how to deal with that little problem.

I've been tracking the "smart quote" issue in a few past articles, and for the most part have it licked. The important thing is to standardize on a character set, which for most uses should be UTF-8, and then force every single step of the way to explicitly use UTF-8.

Sometimes, though, you can't get every piece of software (or the software's administrator) to behave. Take Microsoft SQL Server 2000. (Please!) On a recent project, we were having all sorts of extended character corruption issues, even after we forced the browser into UTF-8 mode. Eventually we were able to track the problem down to the database server.

When the site was running on Windows, IIS, and Microsoft SQL Server we hadn't had a problem. My best guess is that somewhere along the way, Microsoft was cleaning up after itself. Once we switched the web server to Linux and Apache, however, extended characters started to get corrupted left and right. In testing, we determined that with the UTF-8 fix the data was safe until it went into the database. As soon as a string with a non-ASCII extended character was stored in the database and pulled back out it was immediately corrupted.

On the old site they'd been using a little code snippet to mutate characters, but apparently it wasn't actually doing anything since smart quotes remained in the data:

<?php
function cleanup($str) {
   
$trans = get_html_translation_table(HTML_ENTITIES);
   
    unset(
$trans[">"]);
    unset(
$trans["<"]);
    unset(
$trans['"']);
    unset(
$trans[" "]);
    unset(
$trans["&"]);
   
   
$encoded = strtr($str, $trans);
    return
$encoded;
}
?>

That's the verbose way of applying htmlentities() to a string. Why the verbose way? Because we need to skip over the HTML control characters (<, >, ", &) so that users could still enter HTML. At least that was the theory. None of the extended characters were ever properly translated into HTML entities.

That still wasn't a problem until we switched away from an all-Microsoft stack. Microsoft, apparently deciding that "if UTF-8 is good, then UTF-16 must be twice as good!" set the default character set for SQL Server to be UTF-16. UTF-8 and UTF-16 are not compatible. Therefore, when a UTF-8 string with extended characters was inserted into a UTF-16 database, anything more complex than basic ASCII was garbled. Go team!

The solution might seem obvious: just change the encoding of the database from UTF-16 to UTF-8. If only life were that simple. You see, while enterprise-ready SQL databases like MySQL and Postgres have allowed user-configurable character sets and collations for years (since MySQL 4.1 in 2003 or so; not sure of Postgres' history), Microsoft SQL Server doesn't support any character sets other than UTF-16. Microsoft even acknowledges that limitation quite openly on their web site. The money quote: "UTF-8 is not valid character data to SQL Server". It's only valid to the other 6 billion people in the world. Nice.

So how do we, quickly and efficiently, convert UTF-8 extended characters into a string that a second-rate UTF-16 database can handle? We fold it to ASCII. That is, we do exactly what we were doing before, but we do it right this time.

Instead of storing extended characters like smart quotes, we'll encode them down to HTML entities, just as the original code was attempting to do. The problem is that PHP's get_html_translation_table() function only speaks ISO-8859-1 (the formal name for ASCII). You can't get back a translation table for any other character set. htmlentities(), however, does support a variety of character sets, including UTF-8. If we use that directly, however, then we end up encoding all encodable characters, including <, >, and so on, which means we can't use HTML in our text string. But since we can't get the translation table directly, how can we skip those entities?

We don't. The solution is to make a double-pass over the string:

<?php
function cleanup($str) {
   
$quoted = htmlentities($str, NULL, 'UTF-8');
   
   
$trans = get_html_translation_table(HTML_SPECIALCHARS);
   
$trans = array_flip($trans);
   
$quoted = strtr($quoted, $trans);
   
    return
$quoted;
}
?>

First, we HTML-encode the entire string, explicitly specifying UTF-8 character encoding. Then we retrieve the translation table not for all entities, but just for those that have extra-special meaning in HTML (<, >, ", &). That's the list used by htmlspecialchars(). We then flip it so that the translation goes the other way, and manually run that translation. That un-converts the special characters from HTML entities back to their native, ASCII versions. That way we preserve HTML delimters but encode everything else we can, neatly side-stepping UTF-16 issues while still preserving the visual appearance of the data when displayed.

If you're on PHP 5.1 or later, then you can also use htmlspecialchars_decode() instead of pulling the translation table directly. Then the code becomes even easier:

<?php
function cleanup($str) {
   
$quoted = htmlentities($str, NULL, 'UTF-8');
   
$quoted = htmlspecialchars_decode($quoted);
    return
$quoted;
}
?>

You could even nest that into a single line if you wanted to, but I won't here for clarity.

Once we did that, we were able to store and retrieve strings from SQL Server without extended characters and smart quotes becoming corrupted, since we weren't, technically, even using them anymore. This method does violate the principle of not altering user data on input, only on output, but unfortunately Microsoft leaves us little option here.

Just wait for PHP 6, which should support full "Unicode" natively. Which Unicode, and how will it address these sorts of conflicts? I guess we'll find out.

Happy Coding!

Comments

MS and PHP's fault

I want to point out that this is not a 'compatibility problem' between UTF-8 and UTF-16, but plain and simply shitty implementation both on PHP and MS' side.

PHP sucks for insisting on 8-bit strings for so long, as well as a host of C libraries that stick to the ancient convention or null-delimited strings (which can't take UTF-16 by definition). MS on the other hand has always used UTF-16 internally and their UTF-8 practices are quite terrible (see "UTF-8 byte order mark" on google).

My guess is that the MSSQL-PHP bridge is hardcoded to use the system's ANSI code page for the input and converts it to and from UTF-16.

Supposedly you can use the pseudo ansi-codepage CP_UTF8 for ADODB to get around this.

$oDb = new COM('ADODB.Connection', NULL, CP_UTF8);

Not mssql_*

We weren't actually using the mssql_ functions in PHP, but ODBC. All kinds messy. I suppose the problem could also be in the ODBC driver, but the UTF mismatch makes sense, too. As for that mismatch, see some of the links in the previous articles before this one. One of them talks about UTF-8 not being a strict subset of UTF-16 the way ASCII is a strict subset of both.

I certainly won't deny that MS' and PHP's character implementations both need work. :-)

Hello, Guys If you use

Hello, Guys
If you use using SQL Server, and want Unicode Support. Please ALWAYS use nvarchar type, which also is always UTF-16 Encoding.
I use the paragraph at http://www.unicode.org/faq/utf_bom.html#gen10

"UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-32 is used by various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing."

to explain it make sense to only use UTF-16 for storage processing in SQL Server. Today, Windows, .Net, Java both use UTF-16 as string format, and they both have Encoding conversion function to do the converstion between these Encodings. The really issue is that PHP does not have encoding conversion routine natively. I hope the PHP 6 will resole this issue once the native Unicode support was added. Also please use http://msdn.microsoft.com/en-us/library/cc626307(SQL.90).aspx as reference for SQL Server and PHP support.