More on Stupid Quotes

Submitted by Larry on 8 December 2006 - 11:59pm

In an earlier entry I talked about different character encodings and how Microsoft manages to break the rest of the world with theirs. Thanks to a chance reading of a SitePoint forum post, I have a little more information on the problem. At least now it has a proper name.

For some time now I've found SitePoint to be an increasingly strong web developer resource. Their books are good and to the point, and some of their articles are really helpful. For instance, an HTML FAQ tucked into a forum post touches on the subject of character encodings. Here's what it has to say about "smart quotes":

Under Microsoft Windows, a common encoding is Windows-1252. It is very similar to ISO 8859-1, but there are differences. In ISO 8859-1, the range of code points between 128 and 159 (0x80-0x9F) is reserved for C1 control characters. In Windows-1252, that range is instead used for a number of useful characters that are missing from the ISO encoding, e.g., typographically correct quotation marks. This is not an encoding that I would recommend for use on the Web, since it's Windows specific. It is, however, the default encoding in many text editors under Windows.
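To make the difference concrete, here is a minimal Python sketch (my own illustration, not from the SitePoint FAQ) that decodes the same two bytes three ways. The sample text and variable names are made up for the example. Bytes 0x93 and 0x94 fall in the 0x80-0x9F range described above, so they come out as curly quotes under Windows-1252, as invisible C1 control characters under ISO 8859-1, and as replacement characters when a program wrongly assumes UTF-8.

```python
# Bytes 0x93 and 0x94 sit in the 0x80-0x9F range discussed above.
# The sample text is illustrative only.
raw = b"\x93smart quotes\x94"

# Windows-1252 maps them to typographic (curly) double quotes.
as_cp1252 = raw.decode("cp1252")

# ISO 8859-1 maps the same bytes to C1 control characters.
as_latin1 = raw.decode("latin-1")

# A lenient UTF-8 decoder substitutes U+FFFD for each invalid byte,
# which is roughly where the familiar "?" glyphs come from.
as_utf8_lossy = raw.decode("utf-8", errors="replace")

print(repr(as_cp1252))
print(repr(as_latin1))
print(repr(as_utf8_lossy))
```

None of the three decodes is "wrong" as far as Python is concerned. The bytes are simply ambiguous until someone declares which encoding they were written in, which is exactly the information that gets lost in a copy and paste.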

"Not an encoding that I would recommend for use on the Web" indeed. Considering how much content ends up on the web sooner or later these days, that pretty much eliminates it from the realm of usefulness. Unfortunately, that also eliminates Microsoft Office and "many text editors under Windows" from the realm of usefulness. Why, in 2006, are we still using ancient alterna-ISO character encodings in market-leading software? Well, because it's market-leading software and therefore doesn't need to update actually useful features in order to stay competitive. But I digress...

Now that we've identified the offending character set, what do we do with it? The investigation continues...

james l selden (not verified)

13 December 2006 - 3:09pm

Yes, I agree, and I can certainly relate to this, should I call it, injustice (?). The best way to illustrate this deficiency in character encoding compatibility is to simply copy and paste verbiage from MS Word into your favorite open-source CMS editor. Watch it blow up into the land of ?'s. haha~