This is not your father's Internet. When the Web was first emerging onto the scene, it was simple. Individual web pages were self-contained static blobs of text, with, if you were lucky, maybe an image or two. The HTTP protocol was designed to be "dumb". It knew nothing of the relationship between an HTML page and the images it contained. There was no need to. Every request for a URI (web page, image, download, etc.) was a completely separate request. That kept everything simple, and made it very fault tolerant. A server never sat around waiting for a browser to tell it "OK, I'm done!"
Much e-ink has been spilled (can you even do that?) already discussing the myriad of ways in which the web is different today, mostly in the context of either HTML5 or web applications (or both). Most of it is completely true, although there's plenty of hyperbole to go around. One area that has not gotten much attention at all, though, is HTTP.
Well, that's not entirely true. HTTP is actually a fairly large spec, with a lot of exciting moving parts that few people think about because browsers offer no way to use them from HTML or just implement them very very badly. (Did you know that there is a PATCH command defined in HTTP? Really.) A good web services implementation (like we're trying to bake into Drupal 8 as part of the Web Services and Context Core Initiative </shamelessplug>) should leverage those lesser-known parts, certainly, but the modern web has more challenges than just using all of a decades-old spec.
Most significantly, HTTP still treats all URIs as separate, only coincidentally-related resources.
Which brings us to an extremely important challenge of the modern web that is deceptively simple: Caching.
Caching is broken
The web naturally does a lot of caching. When you request a page from a server, rarely is it pulled directly off of the hard drive at the other end. The file, assuming it is actually a file (this is important), may get cached by the operating system's file system cache, by a reverse proxy cache such as Varnish, by a Content Delivery Network, by an intermediary server somewhere in the middle, and finally by your browser. On a subsequent request, the layer closest to you with an unexpired cache will respond with its cached version.
In concept that's great, as it means the least amount of work is done to get what you want. In practice, it doesn't work so well for a variety of reasons.
For one, that model was built on the assumption of a mostly-static web. All URIs are just physical files sitting on disk that change every once in a while. Of course, we don't live in that web anymore. Most web "pages" are dynamically generated from a content management system of some sort.
For another, that totally sucks during development. Who remembers the days of telling your client "no, really, I did upload a new version of the file. You need to clear your browser cache. Hold down shift and reload. Er, wait, that's the other browser. Hit F5 twice. No, really fast. Faster." Yeah, it sucked. There are ways to configure the HTTP headers to not cache files, but that is a pain (how many web developers know how to mess with Apache .htaccess files?), and you have to remember to turn that off for production or you totally hose performance. Even now, Drupal appends junk characters to the end of CSS URLs just to bypass this sort of caching.
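As a rough sketch of that last trick (the file name and token here are invented, not Drupal's actual output): the query string means nothing to the server; it exists only so the URL changes whenever the file does, which forces every cache along the way to treat it as a brand new resource.
<!-- Cache-busting by changing the URL: the "?m7x2kq" token is regenerated whenever the CSS changes. -->
<link href="styles.css?m7x2kq" rel="stylesheet" />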
Finally, there are the browsers. Their handling of HTTP cache headers (which are surprisingly complex) has historically not been all that good. What's more, in many cases the browser will simply bypass its own cache and still check the network for a new version.
Now, normally, that's OK. The HTTP spec says, and most browsers obey, that when a browser requests a resource it already has an older cached copy of, it should include the last-modified date of its copy in the request, saying in essence "I want file foo.png, my copy is from October 1st." The server can then respond with either a 304 Not Modified ("Yep, that's still the right one") or a 200 OK ("Dude, that's so old, here's the new one").
The 304 response saves resending the file, but doesn't help with the overhead of the HTTP request itself. That request is not cheap, especially on high-latency mobile networks, and especially when browsers refuse to have more than 4-6 requests outstanding.
Now look at WhiteHouse.Gov. 95 HTTP requests for the front page... nearly all of them 304 Not Modified (assuming you've hit the page at least once). ESPN.com, 95 requests, again mostly 304 Not Modified. Forbes.com, over 200.
These are not sites built by fly-by-night hackers. These are high-end professional sites whose teams do know how to do things "right". And the page is not actually "done" until all of those requests go out and complete, just in case something changed. The amount of sheer waste involved is utterly mindboggling. It's the same old polling problem on a distributed scale.
The underlying problem, of course, is that a web page is no longer a single resource that makes use of one or two other resources. A web page -- not a web application or anything so fancy but just an ordinary, traditional web page -- is the product of dozens of different resources at different URIs. And our caching strategies simply do not know how to handle that.
A couple of possible workarounds for this issue exist, and are used to a greater or lesser extent.
- Multi-domain image servers
- Many high-end sites that are able to afford it will put their rarely-changing resource files on a separate domain, or multiple separate domains. The idea here is to bypass the browser throttling feature that refuses to send more than a handful of HTTP requests to a given domain at the same time in an effort to not overload it. Even if the domains all point to the same server, that can help parallelize the requests far better. That helps, to be sure, but there's still a potentially huge number of "Is it new yet?" HTTP requests that don't need to happen. Especially on a high-latency mobile network, that can be a serious problem.
- Data URIs
- The HTML spec supports a mechanism called Data URIs. (It's actually been in the spec since HTML 4.01, but no one paid attention until the recent surge of interest in HTML5.) In short, a dependent resource, such as an image, is base64-encoded and sent inline as part of the HTML page. It's then decoded by the browser and read as an image. That eliminates the separate HTTP request overhead, but also completely kills caching. The inlined image has to be resent every single time with the HTML page. It can also be a pain to encode on the server side. That makes it useful in practice only for very small files. (There's a small example after this list.)
- SPDY
- Google, with their usual flair for "open source is great but we'll do it ourselves", has proposed (and implemented in Chrome) an HTTP alternative called SPDY (pronounced "speedy"). Without going into too much detail, the big feature is that a single connection can be used for many resources. That eliminates the overhead of opening and closing dozens of connections, but there's still the (now more efficient) "are we there yet?" queries. SPDY is still not widely used. Unfortunately I don't know much else about it at the moment.
- HTML5 Manifest
- I thought this was the most promising. HTML5 supports a concept called the appcache, which is a local, offline storage area for a web page to stick resources. It is controlled by a Manifest file, referenced from the HTML page, that tells the browser "I am part of a web application that includes these other files. Save us all offline and keep working if you have no connection". That's actually really really cool, and if you're building a web application, using it is a no-brainer.
There are a number of issues with the Manifest file, however, and most people acknowledge them. They mostly boil down to it being too aggressive. For instance, you cannot avoid the HTML page itself also being cached. A resource stored in the appcache will never be redownloaded from the web unless the Manifest file itself changes (and the browser redownloads a new version of it), in which case everything will be downloaded again.
I ran into this problem while trying to write a Manifest module for Drupal. The idea was to build a Manifest file on the fly that contained all of the 99% static resources (theme-level image files, UI widgets, etc.) so that those could be skipped on subsequent page loads, since they practically never change, and avoid all of that HTTP overhead. Unfortunately, as soon as you add a Manifest file to an HTML page, that page is permanently cached offline and not rechecked. Given that Drupal is by design a dynamic CMS where page content can change regularly for user messages and such, that's a rather fatal flaw that I have been unable to work around.
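For reference, the appcache approach looks roughly like this in practice. Everything here (the file names, the manifest contents, the version comment) is invented for illustration; only the manifest attribute on the html element and the CACHE MANIFEST / NETWORK structure of the file come from the spec.
<!-- The page opts in to the appcache by pointing at a manifest file. -->
<html manifest="site.appcache">
The manifest itself (served as text/cache-manifest) is just a plain-text list of what to store offline:
CACHE MANIFEST
# v1 - 2011-10-06
/themes/mytheme/logo.png
/themes/mytheme/sprites.png
/misc/ui-widgets.js

NETWORK:
*
The NETWORK: * section is what lets anything not listed still be fetched normally. And, as described above, the HTML page that referenced the manifest is cached implicitly right alongside everything else, which is exactly the problem for a dynamic CMS.
And since Data URIs came up earlier in the list: a tiny inline image looks like this (the base64 string is the commonly used 1x1 transparent GIF):
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" alt="" />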
A better solution
So what do we do? Remember that up at the top of this article we noted that most web "pages" these days (which are still the majority of the web and will remain so for a long time) are dynamically built by a CMS. CMSes these days are pretty darned smart about what it is they are serving up. If a file has changed, they either know or can easily find out by checking the file modification date locally, on the server, without any round-trip connection at all. We can and should leverage that.
I would propose instead that we allow and empower the application level on the server to take a more active and controlling role in cache management. Rather than an all-or-nothing Manifest file, which is in practice only useful for single-page full-on applications, we should allow the page to have more fine-grained control over how the browser treats resource files.
There are many forms such support could take. As a simple starting point, I will offer a reuse of the link tag:
<!-- Indicates that this image will be used by this page somewhere, and its last modified date is 1pm UTC on 6 October. If the browser has a cached version already, it knows whether or not it needs to request a new version without having to send out another HTTP request. -->
<link href="background.png" cache="last-modified:2011-10-06T13:00:00" />
<!-- It works for stylesheets, too. What's more, we can tell the browser to cache that file for a day. The value here would override the normal HTTP expires header of that file, just as a meta http-equiv tag would were it an HTML page. -->
<link href="styles.css" rel="stylesheet" cache="last-modified:2011-10-06T13:00:00; expire:2011-10-07T13:00:00" />
<!-- By specifying related pages, we can tell the browser that the user will probably go there next so go ahead and start loading that page. Paged news stories could be vastly sped up with this approach. This is not the old "web accelerator" approach, as that tried to just blanket-download everything and played havoc with web apps. -->
<link href="page2.html" rel="next" cache="last-modified:2011-10-06T13:00:00; fetch:prefetch" />
<!-- Not only do we tell the browser whether or not it needs to be cached, but we tell the browser that the file will not be used immediately when the page loads. Perhaps it's a rollover image, so it needs to be loaded before the user rolls over something, but that can happen after all of the immediately-visible images are downloaded. Alternatively this could be a numeric priority for even more fine-grained control. -->
<link href="hoverimage.png" cache="last-modified:2011-10-06T13:00:00; fetch:defer" />
<!-- If there's too many resources in use to list individually, link to a central master list. Any file listed here is treated as if it were listed individually, and should include the contents of the cache attribute. Normal caching rules apply for this file, including setting an explicit cache date for it. Naturally multiple of these files could be referenced in a single page, whereas there can be only a single Manifest file. The syntax of this file I leave for a later discussion. -->
<link href="resources.list" rel="resources" />
In practice, a CMS knows what those values should be. It can simply tell the browser, on demand, what other resources it is going to need, when they were last updated, the smartest order in which to download them, even what to prefetch based on where the user is likely to go next.
Imagine if, for instance, a Drupal site could dynamically build a resource file listing all image files used in a theme, or provided by a module. Those are usually a large number of very small images. So just build that list once and store it, then include that reference in the page header. The browser can see that, know the full list of what it will need, when they were last updated, even how soon it will need them. If one is not used on a particular page, that's OK. The browser will still load it just like with a Manifest file. On subsequent page loads, it knows it will still need those files but it also knows that its versions are already up to date, and leaves it at that. When it needs those images, it just loads them out of its local cache.
And when a resource does change, the page tells the browser about it immediately so that it doesn't have to guess if there is a new version. It already knows, and can act accordingly to download just the new files it needs.
Any CMS could do the exact same thing. A really good one could even dynamically track a user session (anonymously) to see what the most likely next pages are for a given user, and adjust its list of probable next pages over time so that the browser knows what's coming.
Naturally all of this assumes that a page is coming from a CMS or web app framework of some sort (Drupal, Symfony2, Sharepoint, Joomla, whatever). In practice, that's a pretty good assumption these days. And if not, a statically coded page just omits the cache attribute and the browser behaves normally as it does today, asking "are we there yet?" over and over again and getting told by the server "304 No, Not Yet".
There are likely many details I am missing here, but I believe the concept is sound. Modern web pages are dynamic on the server side, not just on the client side. Let the server give the browser the information it needs to be smart about caching. Don't go all-or-nothing; that is fine for a pure app but most sites are not pure apps. Server-side developers are smart cookies. Let them help the browser be faster, smarter.
I now don the obligatory flame-retardant suit. (And if you think this is actually a good idea, someone point me to where to propose it besides my blog!)