A perennial problem for anyone in IT is the infernal beast known as "smart quotes". Smart quotes, also known as "curly quotes", are the angled apostrophe and quotation characters that are often used in print but are not found on any conventional keyboard. There are a number of problems with them. First of all, most people don't realize what they are. Then most people don't understand how they work. And finally, Microsoft broke them.
I am not an expert on the subject myself, but these are, at least, my experiences. First of all, one needs to understand character sets. In simplest terms, a character set is an agreed-upon mapping from bits to human-readable characters. The most basic character set is the well-known 7-bit ASCII, which covers American English and some supporting bits and pieces, and little else (not surprising, given that it stands for American Standard Code for Information Interchange). ASCII is also generally padded to 8 bits to make processing easier. Of course, if you want to represent a character that's not on your normal keyboard then you need a larger character space. That led, in the bad old days, to different people using that extra bit differently (since it doubled the potential number of characters), creating incompatible "ASCII+" character sets. Finally, everyone realized how silly that was and standardized on Unicode, a much larger character set that was deliberately made backward-compatible with standard ASCII. More specifically, UTF-8 is an encoding built on 8-bit units that is backward compatible with ASCII (ASCII characters keep their single-byte values, and everything else takes two to four bytes), and UTF-16 is an encoding built on 16-bit units (twice as wide) that gives each character the same number as ASCII and UTF-8 do but, because it is double-width, needs extra translation to work on systems that expect 8-bit characters. The Wikipedia articles linked above provide more details for the so-inclined.
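To make the difference between the encodings concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the rest of this post depends on). The helper name `describe` is made up for this example. It shows the number Unicode assigns to a character and the bytes that UTF-8 and UTF-16 use to store it.

```python
def describe(char):
    """Return the Unicode code point and the encoded bytes (as hex) for one character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "utf8": char.encode("utf-8").hex(),
        "utf16": char.encode("utf-16-be").hex(),  # big-endian, no byte-order mark
    }


samples = [
    ("straight quote", '"'),
    ("left curly quote", "\u201c"),
    ("em dash", "\u2014"),
]

for label, char in samples:
    print(label, describe(char))

# Plain ASCII text produces identical bytes in ASCII and in UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

The straight quote is a single byte (22) in UTF-8, exactly as in ASCII, while the curly quote needs three bytes (e2 80 9c). In UTF-16 both take two bytes, which is why UTF-16 data cannot simply be handed to software that expects ASCII.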
So what's the problem, and what does that have to do with web development? The first problem is Microsoft Office. MS Word, quite contrary to the rest of the world, still uses its own super-ASCII encoding. For the most part that doesn't cause trouble, until you run into its auto-replace functionality. By default, MS Word will replace straight quotes (the button to the left of your Enter key on US keyboards) with curly quotes (“these things”) automagically. It also replaces two hyphens with an em-dash character, apostrophes with a curly single quote, and various other auto-typesetting things. If all you're going to do is type and hit print, that doesn't cause a problem. If you then want to use that text in any sort of programmatic processing, however, that's a major problem.
The problem comes when you run such a character through a character-based system that doesn't understand Microsoft's encoding scheme. Sometimes those characters come through cleanly; often they don't. In a typical web application, data passes through several layers both inbound and outbound. That's a lot of layers at which things can get messed up. "Messed up", in this context, could mean anything from displaying ugly squares in place of the actual characters (quotes, dashes, etc.) to displaying accented foreign characters where they make no sense (such as the Japanese Yen symbol, a Euro symbol, the conjoined ae or oe used in some languages, etc.) to causing an XSLT processing script to fail completely. Yes, all of the above have happened to me.
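The garbage characters described above can be reproduced on purpose. The following is a small sketch in Python, with a hypothetical helper called `misread`, that writes a curly quote in one encoding and reads the bytes back in another. Python's name for the Windows code page is `cp1252`; that choice of encoding pair is my assumption about a typical mismatch, not the only way things go wrong.

```python
def misread(text, written_as, read_as):
    """Encode text one way, then decode the resulting bytes a different way."""
    return text.encode(written_as).decode(read_as, errors="replace")


curly = "\u201c"  # left curly double quote

# UTF-8 bytes interpreted as Windows-1252: three unrelated characters appear.
print(misread(curly, "utf-8", "cp1252"))

# Windows-1252 byte interpreted as UTF-8: the byte is invalid, so a
# replacement character (often drawn as a square or question mark) appears.
print(misread(curly, "cp1252", "utf-8"))
```

The first case yields an accented a, a Euro symbol, and a conjoined oe in place of a single quotation mark; the second yields the Unicode replacement character, which many fonts draw as a box.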
Various solutions to this problem have been proposed. (The real one, not using Microsoft Word in the first place, is sadly not widely implemented.) Most revolve around string replacement: When a string is submitted by the user to a script on a web server, do a find/replace on it to replace the Microsoft characters with their ASCII equivalents or the proper HTML entities for such characters. There's a rather nice one available in Perl playfully called "demoronizer" (which also includes a good if irreverent description of the problem), and PHP guru Chris Shiflett has a nice little PHP routine (well, a few lines) that does the same thing for a smaller set of characters. There's just one problem: They don't always work.
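For reference, the string-replacement approach looks roughly like the sketch below. This is not the demoronizer or Chris Shiflett's routine, just a minimal Python illustration of the same idea; the table `SMART_TO_ASCII`, the function `straighten`, and the particular set of characters are my own choices.

```python
# Map common "smart" typography to plain ASCII stand-ins.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def straighten(text):
    """Replace smart punctuation in an already-decoded string with ASCII."""
    return text.translate(SMART_TO_ASCII)


print(straighten("\u201cIt\u2019s here\u201d"))
```

Note the assumption baked into this: the input has already been decoded correctly, so the curly quote really is a curly quote by the time the replacement runs. If the bytes were misinterpreted on the way in, there is nothing left for the table to match.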
Why not? Well, because they're not really ASCII. They're Windows Extended ASCII (the code page formally known as Windows-1252), and what Windows does with character 147 (which Microsoft says is a left curly double quote) is not necessarily what other systems will do with character 147 if they're running base ASCII, their own extended ASCII, or Unicode. Sometimes you can get away with copying and pasting the character from Word into your replacement script, but I find that fails about as often as it works, especially if you're not on Windows.
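The ambiguity of character 147 is easy to see by decoding that one byte under several encodings. This is a hedged sketch in Python: `decode_or_none` is a hypothetical helper, and the list of encodings is just a sample I picked (`cp1252` is Python's name for the Windows code page, `mac_roman` stands in for "some other extended ASCII").

```python
RAW = bytes([147])  # the single byte 0x93


def decode_or_none(raw, encoding):
    """Decode raw bytes, returning None when the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for encoding in ("cp1252", "latin-1", "mac_roman", "ascii", "utf-8"):
    result = decode_or_none(RAW, encoding)
    shown = "invalid" if result is None else f"U+{ord(result):04X}"
    print(f"{encoding:10} {shown}")
```

Only the Windows code page turns the byte into a left curly double quote (U+201C). Latin-1 treats it as an invisible control character, Mac Roman as a different printable character, and both strict ASCII and UTF-8 reject it outright.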
So how do you fix them? The real answer is to not create them in the first place, by disabling them in MS Word or, better, writing the proper HTML entities to begin with. That doesn't work if you allow users to submit content, however, and let's face it, what worthwhile site doesn't these days?
In my experience, and although I cannot fully explain it, the best solution is to force every component in the system to Unicode UTF-8; not UTF-16, as that requires extra translation to get down to UTF-8. That includes your SQL database (if you're using a MySQL version earlier than 4.1 then you're SOL and should upgrade for this reason alone, to say nothing of the other benefits of MySQL 4.1), any ODBC drivers you may be using, and your web server. That's not enough, though, because the data may already be corrupted before it even gets to your server. Remember HTTP!
HTTP includes an optional header that tells the browser what encoding the page is using. Most modern browsers support a wide range of character sets, and can be set to a specific character set or can try to auto-detect from the page. That HTTP header is how the browser "knows" what to guess.
At least, it's supposed to. Unfortunately, once again Microsoft decided it knew better than international standards agencies and started a trend of ignoring the HTTP header and instead relying on the HTML <meta> tag. Perhaps they figured web developers were too stupid to know what an HTTP header was but could blindly copy and paste a tag in an HTML header. Many other browsers now follow suit. As a result, there are two things you have to do in your pages to force the system into UTF-8 mode:
// Do this before you generate any output
header("Content-type: text/html; charset=utf-8");
// Then include this between the <head> and </head> tags, by whatever output method you prefer:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Between the two of those, the browser can't help but think UTF-8 (you hope). Be sure to do that for all pages, including pages where users submit content as well as those where they view content.
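Since the two declarations live in different places, it is easy for them to drift apart. Below is a rough Python sketch of a check that both say the same thing. The helper names (`charset_from_header`, `charset_from_meta`, `declarations_agree`) are hypothetical, and the regular expression is a deliberately loose assumption about how the tag is written, not a real HTML parser.

```python
import re
from email.message import Message

# Loose pattern: finds charset=... anywhere inside a <meta ...> tag.
META_CHARSET = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def charset_from_header(value):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(html):
    """Extract the charset declared in the first matching <meta> tag."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def declarations_agree(header_value, html):
    """True only when both declarations exist and name the same charset."""
    header_charset = charset_from_header(header_value)
    return header_charset is not None and header_charset == charset_from_meta(html)


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'
print(declarations_agree("text/html; charset=utf-8", page))
```

A check like this could run in a test suite against rendered pages, so that a template declaring one charset while the server sends another gets caught before a browser has to guess.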
What I've found is that smart quotes (and similar Microsoft ickiness) that are submitted through a form that is set to UTF-8, stored in a database or in a flat file, and then displayed in a browser that is set to UTF-8 will pass through cleanly and end up displayed as smart quotes rather than as straight quotes or garbage characters. It may be possible to filter and translate them when set to UTF-8, but I've not tried. I've yet to find a browser where forcing every step of the way to UTF-8 doesn't at least avoid garbage characters, although that won't do anything for XSLT processing or other systems that require real UTF-8 data (that is, anything that is based on XML rather than HTML tag soup).
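The round trip described here can be simulated without a browser. The sketch below, in Python, is only a stand-in for the real pipeline: it assumes the form body is URL-encoded in the page's charset, that the server decodes it with its own charset, and that a flat file plays the role of storage. The function name `round_trip` and its parameters are invented for this example.

```python
import os
import tempfile
from urllib.parse import parse_qs, urlencode


def round_trip(text, form_charset="utf-8", server_charset="utf-8"):
    """Submit text as a form field, store it in a UTF-8 file, and read it back."""
    # The "browser" encodes the field using the page's charset.
    body = urlencode({"comment": text}, encoding=form_charset)
    # The "server" decodes the body using whatever charset it assumes.
    received = parse_qs(body, encoding=server_charset)["comment"][0]
    # Storage: a flat file written and read as UTF-8.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "comment.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(received)
        with open(path, encoding="utf-8") as handle:
            return handle.read()


quoted = "\u201cHello\u201d"
print(round_trip(quoted) == quoted)                         # UTF-8 at every step
print(round_trip(quoted, form_charset="cp1252") == quoted)  # one mismatched layer
```

With UTF-8 at every step the curly quotes come back unchanged. Switching a single layer to the Windows code page is enough to lose them, which matches the experience above that every layer has to agree.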
Until we can wean people off of archaic, pre-Unicode applications that have yet to get with the 21st century (like Microsoft Office and Microsoft Windows), these sorts of issues will continue to break people's brains. Hopefully this article will make them break a little less badly. I also highly recommend all of the links above, as they go into more technical detail as well as typographic detail.