A perennial problem for anyone in IT is the infernal beast known as "smart quotes". Smart quotes, also known as "curly quotes", are the angled apostrophe and quotation characters that are often used in print but are not found on any conventional keyboard. There are a number of problems with them. First of all, most people don't realize what they are. Then most people don't understand how they work. And finally, Microsoft broke them.
I am not an expert on the subject myself, but these are, at least, my experiences. First of all, one needs to understand character sets. In simplest terms, a character set is an agreed-upon mapping from bits to human-readable characters. The most basic character set is the well-known 7-bit ASCII, which covers American English and some supporting bits and pieces, and little else (not surprising, given that it stands for American Standard Code for Information Interchange). ASCII is also generally padded to 8 bits to make processing easier. Of course, if you want to represent a character that's not on your normal keyboard then you need a larger character space. That led, in the bad old days, to different people using that extra bit differently (since it doubled the potential number of characters), creating incompatible "ASCII+" character sets. Finally, everyone realized how silly that was and standardized on Unicode, a much larger character set whose first 128 code points were deliberately kept identical to standard ASCII. More specifically, UTF-8 is an encoding of Unicode built from 8-bit units: ASCII characters stay as the same single byte, and everything else takes two to four bytes. UTF-16 is built from 16-bit units (that's twice as wide); it uses the same character numbers as ASCII and UTF-8, but because it's double-width it needs extra translation to work on systems that expect 8-bit characters. The Wikipedia articles linked above provide more details for the so-inclined.
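To make that concrete, here is a minimal sketch in Python (my choice for illustration; nothing here is tied to a particular web stack) that prints the bytes for one plain letter and one curly quote under each encoding. The helper name `hex_bytes` is made up for this example.

```python
# Compare how one plain character and one curly quote are stored
# under ASCII, UTF-8, and UTF-16. Standard library only.

plain = "A"        # U+0041, part of 7-bit ASCII
curly = "\u201c"   # U+201C, LEFT DOUBLE QUOTATION MARK


def hex_bytes(text, encoding):
    """Encode `text` and return its bytes as a spaced hex string."""
    return text.encode(encoding).hex(" ")


ascii_a = hex_bytes(plain, "ascii")       # one byte
utf8_a = hex_bytes(plain, "utf-8")        # same single byte as ASCII
utf16_a = hex_bytes(plain, "utf-16-be")   # two bytes, same character number
utf8_q = hex_bytes(curly, "utf-8")        # three bytes
utf16_q = hex_bytes(curly, "utf-16-be")   # two bytes

for label, value in [
    ("A in ASCII", ascii_a),
    ("A in UTF-8", utf8_a),
    ("A in UTF-16", utf16_a),
    ("curly quote in UTF-8", utf8_q),
    ("curly quote in UTF-16", utf16_q),
]:
    print(f"{label:>22}: {value}")

# Plain ASCII has no slot for the curly quote at all.
try:
    curly.encode("ascii")
    ascii_can_hold_curly = True
except UnicodeEncodeError:
    ascii_can_hold_curly = False
```

The letter comes out as the same single byte in ASCII and UTF-8, which is the backward compatibility mentioned above, while the curly quote needs three bytes in UTF-8 and cannot be written in ASCII at all.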
So what's the problem, and what does that have to do with web development? The first problem is Microsoft Office. MS Word, quite contrary to the rest of the cosmos, still uses its own super-ASCII encoding. For the most part that doesn't cause trouble, until you run into its auto-replace functionality. By default, MS Word will replace straight quotes (the button to the left of your Enter key on US keyboards) with curly quotes (“these things”) automagically. It also replaces two hyphens with an em-dash character, apostrophes with a curly single quote, and various other auto-typesetting things. If all you're going to do is type and hit print, that doesn't cause a problem. If you then want to use that text in any sort of programmatic processing, however, that's a major problem.
The problem comes when you run such a character through a character-based system that doesn't understand Microsoft's encoding scheme. Sometimes those characters come through cleanly; often they don't. In a typical web application, data passes through several layers both inbound and outbound. That's a lot of layers at which things can get messed up. "Messed up", in this context, could mean anything from displaying ugly squares in place of the actual characters (quotes, dashes, etc.), to displaying accented foreign characters where they make no sense (such as the Japanese Yen symbol, a Euro symbol, the conjoined ae or oe used in some languages, etc.), to causing an XSLT processing script to fail completely. Yes, all of the above have happened to me.
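Those failure modes are easy to reproduce. The following Python sketch is only an illustration under my own assumptions (the sample text and variable names are invented); it shows what happens when one layer writes bytes in one encoding and the next layer reads them as another.

```python
# Reproduce three ways curly punctuation gets mangled between layers.

original = "\u201cIt\u2019s"   # a left curly quote, "It", a curly apostrophe, "s"

# Case 1: written as UTF-8, read back as Windows-1252.
# Each curly character turns into three unrelated characters,
# including a Euro sign and a conjoined oe.
misread_as_cp1252 = original.encode("utf-8").decode("cp1252")

# Case 2: written as Windows-1252, read back leniently as UTF-8.
# The quote bytes are not valid UTF-8, so each becomes U+FFFD,
# the replacement character many systems draw as a box or a "?".
cp1252_bytes = original.encode("cp1252")
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# Case 3: the same bytes read strictly as UTF-8.
# A strict consumer (an XML parser, for instance) gives up entirely.
try:
    cp1252_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ascii() keeps the output printable on any terminal.
print(ascii(misread_as_cp1252))
print(ascii(replaced))
print(strict_failed)
```

Case 1 is the "foreign characters where they make no sense" symptom, case 2 is the "ugly squares" symptom, and case 3 is the kind of hard failure a strict processor produces.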
Various solutions to this problem have been proposed. (The real one, not using Microsoft Word in the first place, is sadly not widely implemented.) Most revolve around string replacement: When a string is submitted by the user to a script on a web server, do a find/replace on it to replace the Microsoft characters with their ASCII equivalents or the proper HTML entities for such characters. There's a rather nice one available in Perl playfully called "demoroniser" (which also includes a good if irreverent description of the problem), and PHP guru Chris Shiflett has a nice little PHP routine (well, a few lines) that does the same thing for a smaller set of characters. There's just one problem: They don't always work.
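For a sense of what the find/replace approach looks like, here is a small Python sketch. It is not a port of either tool mentioned above; the table `SMART_TO_ASCII` and the function `straighten` are hypothetical names, and the character list is deliberately short.

```python
# A minimal find/replace table: Word-style punctuation -> plain ASCII.
# Assumes the input is already correctly decoded text, not raw bytes.

SMART_TO_ASCII = {
    "\u2018": "'",     # left single quote
    "\u2019": "'",     # right single quote / apostrophe
    "\u201c": '"',     # left double quote
    "\u201d": '"',     # right double quote
    "\u2013": "-",     # en dash
    "\u2014": "--",    # em dash
    "\u2026": "...",   # ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(SMART_TO_ASCII)


def straighten(text):
    """Return `text` with curly punctuation replaced by ASCII equivalents."""
    return text.translate(_TABLE)


sample = "\u201cIt\u2019s fine\u201d \u2014 mostly\u2026"
print(straighten(sample))
```

The catch is in the comment at the top: the replacement only works if the text was decoded correctly before it reached the function. If the bytes were misread on the way in, the characters in the table never appear and nothing gets replaced.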
Why not? Well, because they're not really ASCII. They're Windows Extended ASCII (the code page usually labeled Windows-1252), and what Windows does with character 147 (which Microsoft says is a left curly double quote) is not necessarily what another system will do with character 147 if it uses base ASCII, its own extended ASCII, or Unicode. Sometimes you can get away with copying and pasting the character from Word into your replacement script, but I find that fails about as often as it works, especially if you're not on Windows.
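Here is a quick Python check of what that one byte means under different assumptions. The helper `try_decode` is a made-up name for this sketch; the codec names are Python's standard ones.

```python
import unicodedata

raw = bytes([147])   # the single byte 0x93


def try_decode(data, encoding):
    """Decode `data`, or return None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_cp1252 = try_decode(raw, "cp1252")    # Windows-1252
as_latin1 = try_decode(raw, "latin-1")   # ISO-8859-1
as_ascii = try_decode(raw, "ascii")      # 7-bit ASCII has no character 147
as_utf8 = try_decode(raw, "utf-8")       # a lone 0x93 is not valid UTF-8

for label, value in [
    ("cp1252", as_cp1252),
    ("latin-1", as_latin1),
    ("ascii", as_ascii),
    ("utf-8", as_utf8),
]:
    if value is None:
        print(f"{label:>8}: cannot decode")
    else:
        # Control characters have no Unicode name, hence the default.
        name = unicodedata.name(value, "unnamed control character")
        print(f"{label:>8}: U+{ord(value):04X} {name}")
```

Only the Windows-1252 reading produces a quotation mark. ISO-8859-1 treats the byte as an invisible control character, and strict ASCII and UTF-8 decoders reject it outright.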
So how do you fix them? The real answer is to not create them at all, by disabling them in MS Word or, better, writing the proper HTML entities in the first place. That doesn't work if you allow users to submit content, however, and let's face it, what worthwhile site doesn't these days?
In my experience, and although I cannot fully explain it, the best solution is to force every component in the system to Unicode UTF-8; not UTF-16, as that requires extra translation to get down to UTF-8. That includes your SQL database (if you're using MySQL earlier than 4.1 then you're SOL and should upgrade for this reason alone, to say nothing of the other benefits of MySQL 4.1), any ODBC drivers you may be using, and your web server, which has to be able to handle UTF-8 as well. That's not enough, though, because the data is corrupted before it even gets to your server. Remember HTTP!
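One way to police the "everything is UTF-8" rule at a boundary is to check incoming bytes and convert anything that isn't valid UTF-8. The Python sketch below is only that, a sketch: the function name `to_utf8` is invented, and falling back to Windows-1252 is my assumption about where stray bytes most likely came from, not something a program can know for certain.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return `raw` as UTF-8 bytes.

    Bytes that are already valid UTF-8 pass through unchanged. Anything
    else is ASSUMED to be in `fallback` (Windows-1252 by default) and is
    converted; bytes undefined in the fallback become U+FFFD.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


already_utf8 = "\u201chi\u201d".encode("utf-8")
from_word = b"\x93hi\x94"   # the same text as Windows-1252 bytes

print(to_utf8(already_utf8) == already_utf8)
print(to_utf8(from_word) == already_utf8)
```

This is a safety net rather than a cure. It guesses, and a guess can be wrong, so it is still better to have every layer declare and use UTF-8 than to repair bytes after the fact.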
The HTTP protocol includes an optional header that tells the browser what encoding the page is using. Most modern browsers support a wide range of character sets, and can be set to a specific character set or can try to auto-detect from the page. That HTTP header is how the browser "knows" what to guess.
At least, it's supposed to. Unfortunately, once again Microsoft decided it knew better than international standards agencies and started a trend of ignoring the HTTP header and instead relying on the HTML <meta> tag. Perhaps they figured web developers were too stupid to know what an HTTP header was but could blindly copy and paste a tag in an HTML header. Many other browsers now follow suit. As a result, there are two (2) things you have to do in your pages to force the system into UTF-8 mode:
// Do this before you generate any output
header("Content-type: text/html; charset=utf-8");
// Then include this between the <head> and </head> tags, by whatever output method you prefer:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Between the two of those, the browser can't help but think UTF-8 (you hope). Be sure to do that for all pages, including pages where users submit content as well as those where they view content.
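The same two-step recipe carries over to other stacks. As an illustration only, here is a minimal WSGI application in Python that sends the header and the tag together; the names `PAGE` and `app` and the page content are invented for this sketch.

```python
# A minimal WSGI application that declares UTF-8 twice: once in the
# HTTP Content-Type header and once in the HTML <meta> tag.

PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 test page</title>
</head>
<body>
<p>\u201cSmart quotes\u201d should survive the trip.</p>
</body>
</html>
"""


def app(environ, start_response):
    """Serve PAGE as UTF-8 bytes with a matching charset declaration."""
    body = PAGE.encode("utf-8")   # the bytes must really be UTF-8
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it on port 8000. The declaration and the actual bytes have to agree: labeling a page UTF-8 while sending Windows-1252 bytes just moves the garbage somewhere else.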
What I've found is that smart quotes (and similar Microsoft ickiness) that are submitted through a form that is set to UTF-8, stored in a database or in a flat file, and then displayed in a browser that is set to UTF-8 will pass through cleanly and end up displayed as smart quotes rather than as straight quotes or garbage characters. It may be possible to filter and translate them when set to UTF-8, but I've not tried. I've yet to find a browser where forcing every step of the way to UTF-8 doesn't at least avoid garbage characters, although that won't do anything for XSLT processing or other systems that require real UTF-8 data (that is, anything that is based on XML rather than HTML tag soup).
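That round trip can be simulated end to end. The Python sketch below uses an in-memory SQLite database purely as a stand-in for whatever database you actually run (an assumption on my part; it is not MySQL), and the table and variable names are invented.

```python
import sqlite3

# What the user typed, curly punctuation and all.
submitted = "\u201cHello\u201d \u2014 it\u2019s fine"

# Step 1: a form on a UTF-8 page sends the text as UTF-8 bytes.
wire_bytes = submitted.encode("utf-8")

# Step 2: the server decodes those bytes as UTF-8 and stores the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
conn.execute(
    "INSERT INTO comments (body) VALUES (?)",
    (wire_bytes.decode("utf-8"),),
)

# Step 3: the text is read back and sent out again as UTF-8.
(stored,) = conn.execute("SELECT body FROM comments").fetchone()
conn.close()
response_bytes = stored.encode("utf-8")

# If every step agreed on UTF-8, the bytes are identical.
round_trip_ok = response_bytes == wire_bytes
print(round_trip_ok)
```

Swap any single step to a different encoding and the comparison fails, which is the whole argument for forcing every layer to the same one.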
Until we can wean people off of archaic, pre-Unicode applications that have yet to get with the 21st century (like Microsoft Office and Microsoft Windows), these sorts of issues will continue to break people's brains. Hopefully this article will make them break a little less badly. I also highly recommend all of the links above, as they go into more technical detail as well as typographic detail.