The future of caching

Submitted by Larry on 7 October 2011 - 1:36am

This is not your father's Internet. When the Web was first emerging onto the scene, it was simple. Individual web pages were self-contained static blobs of text with, if you were lucky, maybe an image or two. The HTTP protocol was designed to be "dumb". It knew nothing of the relationship between an HTML page and the images it contained. There was no need to. Every request for a URI (web page, image, download, etc.) was a completely separate request. That kept everything simple and made it very fault tolerant. A server never sat around waiting for a browser to tell it "OK, I'm done!"

Much e-ink has been spilled (can you even do that?) already discussing the myriad of ways in which the web is different today, mostly in the context of either HTML5 or web applications (or both). Most of it is completely true, although there's plenty of hyperbole to go around. One area that has not gotten much attention at all, though, is HTTP.

Well, that's not entirely true. HTTP is actually a fairly large spec, with a lot of exciting moving parts that few people think about because browsers offer no way to use them from HTML or just implement them very very badly. (Did you know that there is a PATCH command defined in HTTP? Really.) A good web services implementation (like we're trying to bake into Drupal 8 as part of the Web Services and Context Core Initiative </shamelessplug>) should leverage those lesser-known parts, certainly, but the modern web has more challenges than just using all of a decades-old spec.

Most significantly, HTTP still treats all URIs as separate, only coincidentally-related resources.

Which brings us to an extremely important challenge of the modern web that is deceptively simple: Caching.

Caching is broken

The web naturally does a lot of caching. When you request a page from a server, rarely is it pulled directly off of the hard drive at the other end. The file, assuming it is actually a file (this is important), may get cached by the operating system's file system cache, by a reverse proxy cache such as Varnish, by a Content Delivery Network, by an intermediary server somewhere in the middle, and finally by your browser. On a subsequent request, the layer closest to you with an unexpired cache will respond with its cached version.

In concept that's great, as it means the least amount of work is done to get what you want. In practice, it doesn't work so well for a variety of reasons.

For one, that model was built on the assumption of a mostly-static web. All URIs are just physical files sitting on disk that change every once in a while. Of course, we don't live in that web anymore. Most web "pages" are dynamically generated from a content management system of some sort.

For another, that totally sucks during development. Who remembers the days of telling your client "no, really, I did upload a new version of the file. You need to clear your browser cache. Hold down shift and reload. Er, wait, that's the other browser. Hit F5 twice. No, really fast. Faster." Yeah, it sucked. There are ways to configure the HTTP headers to not cache files, but that is a pain (how many web developers know how to mess with Apache .htaccess files?), and you have to remember to turn it off for production or you totally hose performance. Even now, Drupal appends junk characters to the end of CSS URLs just to bypass this sort of caching.
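(For the record, the development-time fix is something along these lines in .htaccess, assuming mod_headers is enabled; the hard part is remembering to take it back out:)

<IfModule mod_headers.c>
  # Development only: tell browsers and proxies not to cache anything. Remove for production!
  Header set Cache-Control "no-cache, no-store, must-revalidate"
  Header set Expires "0"
</IfModule>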

Finally, there's the browsers. Their handling of HTTP cache headers (which are surprisingly complex) has historically not been all that good. What's more, in many cases the browser will simply bypass its own cache and still check the network for a new version.

Now, normally, that's OK. The HTTP spec says, and most browsers obey, that when requesting a resource it already has an older cached copy of, a browser should include the last-modified date of its copy in the request, saying in essence "I want file foo.png, my copy is from October 1st." The server can then respond with either a 304 Not Modified ("Yep, that's still the right one") or a 200 OK ("Dude, that's so old, here's the new one"). The 304 response saves resending the file, but doesn't help with the overhead of the HTTP request itself. That request is not cheap, especially on high-latency mobile networks, and especially when browsers refuse to have more than 4-6 requests outstanding.
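On the wire, that conversation looks roughly like this (trimmed to the relevant headers):

GET /foo.png HTTP/1.1
Host: example.com
If-Modified-Since: Sat, 01 Oct 2011 00:00:00 GMT

HTTP/1.1 304 Not Modified
Date: Fri, 07 Oct 2011 01:36:00 GMT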

As a semi-random example, have a look at Drupal.org in Firebug. By my count there are 17 different HTTP requests involved in that page, for the page itself, CSS files, image files, Javascript files, and so forth. 11 of those return a 304 Not Modified, but still have to get sent, and still block further requests while they're active.

Now look at WhiteHouse.Gov. 95 HTTP requests for the front page... nearly all of them 304 Not Modified (assuming you've hit the page at least once). ESPN.com, 95 requests, again mostly 304 Not Modified. Forbes.com, over 200.

These are not sites built by fly-by-night hackers. These are high-end professional sites whose teams do know how to do things "right". And the page is not actually "done" until all of those requests go out and complete, just in case something changed. The amount of sheer waste involved is utterly mindboggling. It's the same old polling problem on a distributed scale.

The underlying problem, of course, is that a web page is no longer a single resource that makes use of one or two other resources. A web page -- not a web application or anything so fancy but just an ordinary, traditional web page -- is the product of dozens of different resources at different URIs. And our caching strategies simply do not know how to handle that.

Half-hearted solutions

A couple of possible workarounds for this issue exist, and are used to a greater or lesser extent.

Multi-domain image servers
Many high-end sites that are able to afford it will put their rarely-changing resource files on a separate domain, or multiple separate domains. The idea here is to bypass the browser throttling feature that refuses to send more than a handful of HTTP requests to a given domain at the same time in an effort to not overload it. Even if the domains all point to the same server, that can help parallelize the requests far better. That helps, to be sure, but there's still a potentially huge number of "Is it new yet?" HTTP requests that don't need to happen. Especially on a high-latency mobile network that can be a serious problem.
Data-URIs
The HTML spec supports a mechanism called Data-URIs. (It's actually been in the spec since HTML 4.01, but no one paid attention until the recent surge of interest in HTML5.) In short, a dependent resource, such as an image, is base64-encoded and sent inline as part of the HTML page. It's then decoded by the browser and read as an image. That eliminates the separate HTTP request overhead, but also completely kills caching. The inlined image has to be resent every single time with the HTML page. It can also be a pain to encode on the server side. That makes it useful in practice only for very small files.
SPDY
Google, with their usual flair for "open source is great but we'll do it ourselves", has proposed (and implemented in Chrome) an HTTP alternative called SPDY (which stands for "speedy"). Without going into too much detail, the big feature is that a single connection can be used for many resources. That eliminates the overhead of opening and closing dozens of connections, but there's still the (now more efficient) "are we there yet?" queries. SPDY is still not widely used. Unfortunately I don't know much else about it at the moment.
HTML5 Manifest
I thought this was the most promising. HTML5 supports a concept called the appcache, which is a local, offline storage area for a web page to stick resources. It is controlled by a Manifest file, referenced from the HTML page, that tells the browser "I am part of a web application that includes these other files. Save us all offline and keep working if you have no connection". That's actually really really cool, and if you're building a web application, using it is a no-brainer.

There are a number of issues with the Manifest file, however, something that most people acknowledge. They mostly boil down to it being too aggressive. For instance, you cannot avoid the HTML page itself also being cached. An appcache-using resource will never be redownloaded from the web unless the Manifest file itself changes (and the browser redownloads a new version of it), in which case everything will be downloaded again.

I ran into this problem while trying to write a Manifest module for Drupal. The idea was to build a Manifest file on the fly that contained all of the 99% static resources (theme-level image files, UI widgets, etc.) so that those could be skipped on subsequent page loads, since they practically never change, and avoid all of that HTTP overhead. Unfortunately, as soon as you add a Manifest file to an HTML page, that page is permanently cached offline and not rechecked. Given that Drupal is by design a dynamic CMS where page content can change regularly for user messages and such, that's a rather fatal flaw that I have been unable to work around.
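For reference, the mechanism in question is just a manifest attribute on the html element, <html manifest="offline.manifest">, pointing at a plain-text file along these lines (paths invented for illustration):

CACHE MANIFEST
# v42 -- change this comment to force the browser to re-download everything below
CACHE:
themes/mytheme/logo.png
themes/mytheme/style.css
misc/jquery.js
NETWORK:
*

The page that references the manifest is itself stored as a "master entry", which is exactly the all-or-nothing behavior described above.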

A better solution

So what do we do? Remember up at the top of this article we noted that most web "pages" these days (which are still the majority of the web and will remain so for a long time) are dynamically built by a CMS. CMSes these days are pretty darned smart about what it is they are serving up. If a file has changed, they either know or can easily find out by checking the file modification date locally, on the server, without any round-trip connection at all. We can and should leverage that.

If a web page is an amalgam of many resource URIs, we should trust the page to know more about its resources than the browser does. That doesn't mean "cache everything as one". It means that if we assume part of the page will be dynamic, the HTML itself, then we can trust it to tell the browser about its dependent resources. We already do, in fact. We trust it to specify images (via img tags or CSS references), CSS (via link and style tags), Javascript (via script tags), and so on. But we don't trust the page to tell us anything about those files beyond their address.

Perhaps we should.

I would propose instead that we allow and empower the application level on the server to take a more active and controlling role in cache management. Rather than an all-or-nothing Manifest file, which is in practice only useful for single-page full-on applications, we should allow the page to have more fine-grained control over how the browser treats resource files.

There are many forms such support could take. As a simple starting point, I will offer a reuse of the link tag:

<!-- Indicates that this image will be used by this page somewhere, and its last modified date is 1pm UTC on 6 October. If the browser has a cached version already, it knows whether or not it needs to request a new version without having to send out another HTTP request. -->
<link href="background.png" cache="last-modified:2011-10-06T13:00:00" />

<!-- It works for stylesheets, too. What's more, we can tell the browser to cache that file for a day.  The value here would override the normal HTTP expires header of that file, just as a meta http-equiv tag would were it an HTML page. -->
<link href="styles.css" rel="stylesheet" cache="last-modified:2011-10-06T13:00:00; expire:2011-10-07T13:00:00" />

<!-- By specifying related pages, we can tell the browser that the user will probably go there next so go ahead and start loading that page.  Paged news stories could be vastly sped up with this approach. This is not the old "web accelerator" approach, as that tried to just blanket-download everything and played havoc with web apps. -->
<link href="page2.html" rel="next" cache="last-modified:2011-10-06T13:00:00; fetch:prefetch" />

<!-- Not only do we tell the browser whether or not it needs to be cached, but we tell the browser that the file will not be used immediately when the page loads. Perhaps it's a rollover image, so it needs to be loaded before the user rolls over something, but that can happen after all of the immediately-visible images are downloaded. Alternatively this could be a numeric priority for even more fine-grained control -->
<link href="hoverimage.png" cache="last-modified:2011-10-06T13:00:00; fetch:defer" />

<!-- If there's too many resources in use to list individually, link to a central master list. Any file listed here is treated as if it were listed individually, and should include the contents of the cache attribute.  Normal caching rules apply for this file, including setting an explicit cache date for it. Naturally multiple of these files could be referenced in a single page, whereas there can be only a single Manifest file. The syntax of this file I leave for a later discussion.  -->
<link href="resources.list" rel="resources" />

In practice, a CMS knows what those values should be. It can simply tell the browser, on demand, what other resources it is going to need, when they were last updated, the smartest order in which to download them, even what to prefetch based on where the user is likely to go next.
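As a very rough sketch (not real Drupal code, and the cache attribute is of course my invented syntax from above), generating those hints on the server is nearly free:

<?php
// Hypothetical helper: emit one of the proposed <link> cache hints for a file,
// using the modification time the CMS can read straight off its own disk.
function cache_hint($path) {
  $last_modified = gmdate('Y-m-d\TH:i:s', filemtime($path));
  return sprintf("<link href=\"%s\" cache=\"last-modified:%s\" />\n",
    htmlspecialchars($path, ENT_QUOTES), $last_modified);
}

// Somewhere in the page template or a preprocess hook:
print cache_hint('sites/all/themes/mytheme/background.png');
print cache_hint('sites/all/themes/mytheme/styles.css');
?>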

Imagine if, for instance, a Drupal site could dynamically build a resource file listing all image files used in a theme, or provided by a module. Those are usually a large number of very small images. So just build that list once and store it, then include that reference in the page header. The browser can see that, know the full list of what it will need, when they were last updated, even how soon it will need them. If one is not used on a particular page, that's OK. The browser will still load it just like with a Manifest file. On subsequent page loads, it knows it will still need those files but it also knows that its versions are already up to date, and leaves it at that. When it needs those images, it just loads them out of its local cache.

And when a resource does change, the page tells the browser about it immediately so that it doesn't have to guess if there is a new version. It already knows, and can act accordingly to download just the new files it needs.

Any CMS could do the exact same thing. A really good one could even dynamically track a user session (anonymously) to see what the most likely next pages are for a given user, and adjust its list of probable next pages over time so that the browser knows what's coming.

Naturally all of this assumes that a page is coming from a CMS or web app framework of some sort (Drupal, Symfony2, Sharepoint, Joomla, whatever). In practice, that's a pretty good assumption these days. And if not, a statically coded page just omits the cache attribute and the browser behaves normally as it does today, asking "are we there yet?" over and over again and getting told by the server "304 No Not Yet".

Feedback

There are likely many details I am missing here, but I believe the concept is sound. Modern web pages are dynamic on the server side, not just on the client side. Let the server give the browser the information it needs to be smart about caching. Don't go all-or-nothing; that is fine for a pure app but most sites are not pure apps. Server-side developers are smart cookies. Let them help the browser be faster, smarter.

I now don the obligatory flame-retardant suit. (And if you think this is actually a good idea, someone point me to where to propose it besides my blog!)

Owen Barton (not verified)

7 October 2011 - 2:26am

I think there is some misunderstanding of how caching headers work in this post. The browser will only HEAD a resource (resulting in a 304 or 200) if the cached resource has passed its expires header timestamp. By setting future expires headers (which is in pretty much every performance best practice guide, and can be safely and reliably done with almost all css, js and uploaded files in Drupal) the browser will make no HTTP request at all for these resources until they expire.
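(Concretely, "future expires headers" means each static resource is served with something like:)

Expires: Sun, 07 Oct 2012 02:26:00 GMT
Cache-Control: public, max-age=31536000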

For drupal.org (for example), visiting my dashboard for a second time, I only get 4 HTTP requests - one for the dashboard (since I am logged in), 2 for Google Analytics and one for tipsy.gif (haven't looked into why this is not cached - it is fetched via javascript, which could be the problem). If I set up anonymous caching plus expiry on a vanilla Drupal 7 site, the second anonymous visit to a page loads with a single HTTP HEAD request for the whole page, resulting in a 304.

I suspect the reason your tests produced different results is that you were hitting refresh/F5 on the page - when you do this, most browsers will do a HEAD for all resources, even if they are not expired - this is a common source of confusion when testing caching. If you do a "full refresh" (ctrl-F5 or whatever) it will re-GET them all instead. To see the normal caching behavior you need to click on a link to the same page, or click to a different page and back again.

The kind of things you mention are useful approaches too - especially for dynamic/large/complex sites, mobile browsers and other challenging environments where you need more fine-grained control - and I don't mean to write those off at all (in fact there are plenty of useful conversations to be had... but I don't have time for a proper response now), but I think there is plenty of value in regular caching too :)

We prepared just such a mechanism as you describe in our COBAEX CMS (http://www.cobasolutions.com/business_software/en_cms). We call it a "functional cache". The main difference between a "normal" cache and ours is that all pages built with this technology are "compiled" by the CMS into HTML elements. What I mean is that once you modify a page in the CMS, the server prepares an HTML file for that page, and when a client needs it the server simply serves the pre-prepared HTML file. This basically means the CMS is not building the page on every request (and then checking whether the page has been modified, so a 200 must be sent, or has not been modified, so a 304 must be sent); instead the page rebuild is triggered by the CMS page edit functionality (hence "functional caching").

We are using 2 servers: an administration server, which is responsible for all the administration and management tasks - it has the full database, the compiling mechanisms, etc., and all page editing is done there; and a presentation server, where you have only HTML files and sometimes (depending on the implementation) a very simple database (in most cases MySQL, hence quite fast for simple queries) for searching purposes. Everyone who accesses the website uses the presentation server, which just serves the HTML files. In more sophisticated implementations it needs to "build" the page out of sets of prepared HTML elements (e.g. if you have a portal that needs to implement searches, the result list elements are separate HTML files that are put into one list based on the results from that presentation database - which is also a kind of "pre-compiled" database, as it does not use any foreign keys or such, just simple tables with the fields needed for searching plus the names of the HTML files that should be included in the search results).

This approach gives us several additional advantages - such as increased security (imagine someone breaking into the presentation server and destroying the pages there - to restore it you just publish everything from the administration server again and the current site is back, alive and kicking; in some implementations we even have a mechanism that checks the presentation server at randomly chosen intervals, and if the pages are not OK the administration server automatically republishes everything, so the site is fixed as soon as the problem is detected) or the possibility to publish selected sets of pages (not making changes directly on the live site - imagine your company releasing a new product and needing to prepare some pages about it: you do not work on the production copy, but prepare everything on the administration server and, when everything is ready, publish it all with a single click).

This technology can also be used nicely for automated webpage creation - e.g. for SEO campaigns people create many back-office pages with some text inside. Once you have a tool that lets you just drop in the text and publish many pages on the same (or several) templates, you can create such pages very quickly. Not to mention that if one of your SEO pages or domains gets filtered, you can move the content from that page to another (even using a different template) with a single click. We prepared such a tool for one project and it really works great :)

This approach can be used for simple webpages (see www.w4e.pl - a very simple page, but still using this technology), though such pages do not really show a large difference. On larger sites (where you have many subpages - see www.chodkowska.edu.pl - over 500 subpages in different subdomains, administered by different people, etc.) the speed compared to standard products is already noticeable. But the real advantage shows on a large portal - the newest implementation, www.domoklik.pl, is one we are really proud of. This portal is the fastest real-estate portal in Poland while having the largest number of offers. And the loading speed is really nice - especially since we still have some room to improve it (no caching servers implemented yet - only the described technology).

I hope the above is understandable :) If not - forgive me, English is not my first language :) But generally I think such approach is the future. Indeed, the standard caching mechanisms are nice, but not nice enough anymore :)

Caching compiled HTML pages is something any modern CMS should be doing. Drupal does so, although by default it doesn't use a separate database and the cache is usually pushed off onto memcache. However, avoiding recompiling and regenerating the HTML page on every request is not what I'm talking about here.

Whitehouse.gov's front page, I assure you, is not being built from scratch every time. But there are still dozens and dozens of HTTP requests on every page load just to verify that caches are still valid. That's the issue I'm looking at in this post.

Steffen (not verified)

7 October 2011 - 3:51am

I like the cache attribute! Currently, our CMS appends some kind of "last-changed-id" to a lot of resource URLs (stylesheets, javascripts, images, videos, ...) to prevent the browser from using a stale cached version. This has always felt like a bad workaround, for example because the browser will then unnecessarily cache all versions of a resource that it has ever seen (until it runs out of cache space).

First of all, I'm sorry about my English... not bilingual yet.

You should maybe consider that in big high-availability systems we often work with distributed filesystem backends, where obtaining the modification time of a file is not a "cheap" (in time terms) operation.

You can take as an example http://drupal.org/project/imageinfo_cache, where a local DB cache is used to avoid such checks.

Thanks for your work! Best regards

True, an fstat() is not free, especially in a highly-virtualized environment. However, I would argue that it is probably still cheaper than letting the browser make that check, which would do an fstat() anyway on subsequent requests (one per file). Plus you can, as you say, do some sort of application-level caching of that information as well if appropriate.

I'm sorry I have "my own kind of blindness"...

In most of the systems we operate there is some kind of reverse proxy involved, like Varnish (in-memory storage), so we rarely perform any fstat() for static files.

Nevertheless I think your general idea is valuable, I just wanted to point out a "black spot". Nowadays systems are quite complicated and it's easy to lose sight of such a thing; that's where the "hive mind" excels ;)

Have you considered sharing the discussion with the High Performance group on GDO?

Best regards

Varnish can be a huge help on performance, yes. Even if you're not caching HTML pages in it (authenticated users), it can help with static resources. But the browser still has to send an HTTP request that gets all the way to the Varnish server.

That said, it's a valid point that Varnish could cause issues. Imagine: the HTML page says "this image file has been updated, so you need to go get it". The browser dutifully sends a request, but hits Varnish and gets the Varnish-cached version of the file. But the version on the file system is newer than what's in Varnish. Hm, not sure what to do about that. I'm open to suggestions.

I haven't posted it over there yet, as I wanted to suss it out where more than just Drupalers would see it. (This is not a Drupal-specific problem.) If you want to post a link over there, feel free. :-)

I stated that your idea is valuable ;) and of course I know Varnish suffers from the described HTTP overhead in current scenarios.

It's true that Varnish can cause some issues (you normally deal with this with purge/ban mechanisms), and cache expiration policies are a growing pain in our customers' systems (when to expire what, with this constellation of views, blocks, comments, etc.)... but your CMS can talk to your own proxy ;) and there is always the option to rewrite resource names using a filter (Google's mod_pagespeed uses this approach) if you can accept rising (back-end) server loads. In my opinion, in an ideal world, different static resources would have different names and we would avoid all this mess.

Ok, it's up to you... I will not post a link there... let the Drupalists find this on their own ;) (IRC mention in my case)

Best regards

As stated, the HTTP protocol has a widely known cache specification, so we should try to leverage the power of our applications by embracing it, not working around it.

The cache attribute on a link is merely a rethinking of application-level caching layers, which are a bad thing in big projects, because you need to maintain your own application layer and couple your application to it.

Embracing HTTP means re-using existing software (browsers, proxies, reverse proxies) to be web-scale.

Take a look at this presentation, on why we should avoid application caching layers: http://www.slideshare.net/odino/be-lazy-be-esi-http-caching-and-symfony…

The solution you propose here is, by the way, a concept similar to ESI (Edge Side Includes), except that ESI does not apply to static assets but to webpage fragments: take a look at this specification (http://www.slideshare.net/fabpot/caching-on-the-edge); I'm pretty sure you will be surprised and happy reading about it.

Apart from these points, nice post.

I just looked through both slideshows, and there's some really good information in them. Yes, Drupal could do a much better job leveraging HTTP than it does now.

However, that's all about a single request. My point here is that modern web pages are not a single request; they're dozens of requests, and there is currently no way for the HTML page to provide information about the caching status of an image it happens to use. That means all cache invalidation is based on polling, which we all know is slow. It also means you cannot control the caching logic for resource files unless you either route them all through PHP (which would be stupid) or trick out your Apache config (which most web devs don't know how to do, nor should they).

What I am proposing is that we allow the most dynamic request, the HTML page, to provide more useful information about the resources it uses. You can still apply whatever HTTP caching logic you want to the HTML file, but provide more information along with it for the browser to smartly avoid even bothering to send a request to the caching server.

Consider a page at: /about.html, which uses 8 theme images, img1.png through img8.png.

Currently, the browser knows nothing about those images until it loads about.html, and then requests those images in totally separate HTTP requests. Those requests may cache using normal HTTP semantics.

Now, hit /about.html again. The browser doesn't know whether or not it needs to check img1.png, img2.png, etc. All it knows is when its cached version is from, and makes an educated guess as to whether it needs to re-contact the server. All data about img1.png is taken from the img1.png file's HTTP header.

What I'm proposing is that about.html should be able to tell the browser "by the way, you definitely do (or do not) want to get a new version of img1.png". Or say "I will use this file, but not immediately so you can load it last." Etc. That information cannot be derived from the img1.png header itself without re-requesting it, which is what we want to avoid.
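In the syntax proposed above, about.html would instead carry something like this in its head (dates invented), so the browser knows at parse time which images, if any, it actually needs to re-fetch:

<link href="img1.png" cache="last-modified:2011-09-01T12:00:00" />
<link href="img2.png" cache="last-modified:2011-10-06T13:00:00" />
<!-- ...and so on through img8.png -->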

Does that make more sense?

I think this is based on an inaccurate understanding of how expires headers work (based on doing a refresh rather than a normal page visit), as I described in my comment above. With caching headers that follow best practices the browser can indeed determine that it doesn't even need to check for a new version of a resource.

I do get that there is a bigger point you are making here, and I think finding ways to put resource caching and prefetch rules more directly into the hands of the CMS is an excellent goal.

I am not 100% sure the tag-based rules are the way to go (although I could be persuaded) - this seems like it would be hard to model in browsers, since potentially you could have multiple caching rules (potentially from multiple domains) applying to the same resource URI. I would think an extension of the manifest style of approach could be preferable - whilst the current "page-centricness" of Drupal (lack of information on page resources for other pages or the site as a whole) could make this harder to implement, I think this is really an issue with Drupal, and probably not something a standard needs to adapt to specifically.

Expire times set in HTTP headers are only seen when you request that particular resource.

Expire times set in a link are seen when you request the related resource.

Consider this example:

Your Drupal site has set the minimum cache time to 5 minutes, because the content is highly dynamic and people don't want to wait a long time to see their updated profile picture appear.

So you click on the front page and all of those front page image links tell your browser, "You can hold on to your cached copy of this file for another five minutes" except the one which was updated, which says "I just got updated; better invalidate your cache and request another copy."

I think this idea really is spot-on.

- Not sure I understand why you went with the approach of separate LINK tags; these 'cache' attributes could just simply be universal HTML attributes on any element; i.e., lining up with the existing 'id', 'class', etc attributes. Of course, usage of LINK tags might make sense for resources not being referenced in the actual HTML, but e.g., in the CSS instead (as you already mentioned).

- Speaking of, it would make much more sense to have a "cache-*" attribute namespace, comparable to the universal "data" attribute namespace in HTML5. Hence, instead of cramming all kind of values into a single string with wonky delimiters, we could have:

<!-- A regular image -->
<img src="/misc/duplicon.png" alt="Druplicon" cache-last-modified="2011-10-06T13:00:00" cache-expires="2012-10-06T13:00:00" />

<!-- A regular stylesheet -->
<link rel="stylesheet" href="style.css" cache-last-modified="2011-10-06T13:00:00" cache-expires="2011-10-07T13:00:00" />

<!-- An image referenced in a CSS :hover rule -->
<link href="hoverimage.png" cache-last-modified="2011-10-06T13:00:00" cache-fetch="defer" />

That said, the last sample is debatable, as it kinda crosses the line between clear separation of markup and presentation.

I considered putting the cache flags inline as you mention, but decided it was best to centralize it rather than having it scattered throughout the page. Plus, we would need the link approach anyway for CSS-based images.

Multiple cache-* properties would probably work, too. I'm easy there. :-)

It's true that this may be undesirably blurring the line between markup and presentation. I'm not sure there. There may be some other mechanism that is cleaner for providing the sort of "push invalidation" that is the actual goal.

My concerns on that LINK element enforcement vs. universal inline element attributes:

- HTML page weight: When separately referencing all external resources on a page via LINK elements, you're adding a lot of duplication/overhead to the page. Those additional elements also need to be parsed and evaluated by the browser.

- Maintenance: The system would have to track exactly which external resources are on the page and produce correct LINK elements for them. Conditionally add and remove one or more during page processing, and you quickly need a badass state tracker for external page resources. ;)

- Continuous processing/AJAX: Systems like Drupal send one big blob to the browser, but that's not always the case. In fact, if Drupal weren't modular, it wouldn't have to wait for the entire page to be built and processed before sending it to the browser. Other, less dynamic or non-dynamic systems are able to start sending their output as soon as they generate it (a commonly known performance optimization tactic). Also, in the case of AJAX, only page fragments are sent to the browser.

- User Interface Interaction: Entire sections of a page might be preloaded in a hidden/disabled state, but are contained on the page. Unless activated through user interaction, those resources may not have to be loaded at all. In particular the cache-fetch="defer" part could be taken to the next level, so as to allow for delayed fetching of resources in general; e.g., considering image slideshows. But then again, perhaps not.

Matt Farina (not verified)

7 October 2011 - 2:57pm

Sites like ESPN and the White House do pull down a lot of files initially. But on later page loads there are a lot fewer, because most of the page assets are pulled from the browser cache. For example, take espn.com minus the ads or Ooyala (the media player they use) and most of the page is pulled from cache without ever making a call to the server to see if a newer version is available. They are leveraging current caching techniques.

Digging into all the caching techniques and dealing with a primed cache is a different case altogether.

But, there are a number of problems with what you propose. I'm all for something better but this has some hurdles.

  1. While having this in the HTML has some benefits there are certain drawbacks. The caching relationship is no longer between the resource/resource server and the receiver (browser). You now have an intermediate entity. This intermediate is a case where an uneducated dev can really screw things up. Or the intermediate becomes out of sync (I could come up with cases for this).
  2. We live in an age where there's a focus on minifying what we send to browser. This includes html compression (like http://code.google.com/p/htmlcompressor/). There is quite a lot that can be removed from a page if we try and it can have an impact. This adds more to the page.
  3. A cms or application serving a page may be able to find the last-modified information when you're dealing with something on a small scale. Large scale projects are in a different place. Assets may be in a different location from the application altogether, making that information unavailable.

There may be something here with merit but I think more digging into caching, current caching techniques, and how this works as you scale up is needed.

One more thing to consider: in this age of cloud computing, getting file stat info for the last-modified time may not be so fast. What if you have your files directory (to be very Drupal-specific) in a cloud object store (like Rackspace Cloud Files)? Making a call to get the last-modified value could get a lot more expensive.

While I wholeheartedly agree that Drupal (and CMSes/frameworks in general!) needs to leverage browser caching to improve its front-end performance, I think your proposal is ill-fated. And the information you provide is either utterly wrong or lacking.

Caching is broken:
- sure, there are problematic cases, but overall it works well
- the largest problem is that regular file caching (using Expires & Cache-Control headers) is limited by a browser's disk cache. These disk caches are prone to weird browser-dependent behavior in the sense that you can't rely on them to actually cache the files as long as you ask them to. We need to work with browser vendors to actually improve their disk caches and make them work in a more reliable manner. On mobile, the disk caches' size needs to increase (they're all ±4 MB, in total).
- but it's definitely not the case that if you load e.g. whitehouse.gov multiple times, If-Modified-Since requests are sent out for all resources on every page load
- you can leverage localStorage and appCache if you want more control (neither of these are cleared out automatically by the browser, and it's actually very hard for the user to clear them manually)

Multi-domain image servers: first of all, this doesn't apply to images only, but also to CSS, JS, fonts. Everything.
Secondly, your insinuation that only high-end sites are able to afford this is completely wrong. I'm using the Amazon CloudFront CDN on http://driverpacks.net, which is a Drupal 6 site with >600K page views per month. I'll gladly publicize my annual CDN costs. Between $35 and $60 per year, or $3—$5 per month. For well over a million requests per month to their PoPs in the U.S., Europe, Tokyo and Singapore.
Plus — remember those caching headers? Well, they'll ensure that a large portion of your visitors will actually not request the data in the first place. The remainder sends 304s, which results in virtually no traffic, resulting in virtually no costs.

Data URIs: these are mostly beneficial for inlining very small resources, such as list icons and other small images — at the cost of ±30% more bytes to send. However, you have fewer RTTs and no HTTP header overhead.
Your statement that this "completely kills caching" is blatantly wrong, because you can simply include data URIs in CSS files. When the CSS file is cached, the data URI is cached.
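For example, something like this (payload truncated) is cached right along with the stylesheet that contains it:

.list-icon { background: url("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...") no-repeat; }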

SPDY: while I don't like how much grip Google is gaining on the web, fact remains that something needs to be done to improve the web. HTTP is simple. HTTP is stateless. In part thanks to the simplicity of HTTP, the web has thrived. But it's old and inefficient for the current state of the web. We have different demands: it need not just work, it needs to work fast. Before SPDY can become widespread, it needs to be shipped with every Apache installation. So, not much to say here.

HTML5 manifest files: painful indeed! However, one use case for which I think the appcache (in its current state) might be useful: font caching.

A better solution:
- first of all: you forgot about timezones. Big omission.
- breaks caching proxies and reverse proxies such as Varnish — or you'd have to send headers and set these attributes
- I agree with the remarks by Owen Barton & Matt Farina
- you can achieve all of this today with far-future Expires/Cache-Control headers and unique filenames. The problem then becomes determining when a file has changed. In big-scale set-ups you can simply revision all theme- and module-related resources, but that doesn't apply to the average Drupal site. What still does apply, however, is for example the Drupal version number. Drupal.js won't change unless Drupal itself is updated. For other files — to retain maximum flexibility for the user — we want to store some unique identifier (last modification time, hash …) in the database to prevent hits on every page load. Or simply enable page caching (core's statistics module is the only thing preventing this) and only get the last modification times once every X minutes. That works just fine for smaller sites — even under load.
For the CDN module, I've put together a patch that does all of this (minus the caching of uniqueness indicators in the DB — which mikeytown's advagg module already does!): http://drupal.org/node/974350#comment-4264828. In the future, we could make it so that every dynamic file change in Drupal (file upload or image style generation) stores the last modification time in the DB, so that we can take advantage of this in a centralized manner.

SPDY:
- Firefox has SPDY support in mozilla-central (nightly builds, off by default) and they are working on making it stable
- nginx is thinking of adding support for it, they have some test-code I believe.

If both enable it, it would mean that more than 50% of all web browsers support it in a couple of months, and there is a fast/efficient server implementation which can be added as a proxy server.

If that happens it will start to be deployed and this could happen in a time frame of a couple of months.

Lots of feedback here, I see. :-) I'll try to respond to a couple of things at once:

1) A couple of people have noted that my explanation (and therefore understanding) of HTTP caching and existing cache control is incomplete. That may well be the case. I've done a bit more testing with the pages above in different browsers and reloading the page in different ways, and to call the results I'm seeing "inconsistent" would be an understatement. It does not conform to what I would expect given what HTTP headers are being sent, to the best of my knowledge. I don't know if the browsers are acting weird, if my setup is weird, or if my testing/inspection techniques are invalid, but whatever HTTP is supposed to be doing with regards to caching and invalidation it doesn't seem to be working consistently.

That's probably something we need to work on improving in our software, as Wim notes.

2) localStorage could be used to implement this sort of forced-update logic entirely in browserspace/userspace. However, doing so would require essentially reimplementing a browser cache in Javascript. You would need to include in a non-loading part of the page a list of resource files, then implement Javascript to fetch those files, store them in localStorage, and then pull them back out and inject them into the page. That's an awful lot of work for something that IMO belongs at a lower level. I'm also not sure how that would help for CSS-based images.

3) Wim is correct that CSS-based data-uris would not break caching. I'd not thought of that.

4) I didn't leave timezones out of my code samples, actually. I specified that they were all in UTC. A real implementation (this was not intended as such) would do much better date/time handling and probably use a different format than I did. (Date/time formats across different wire formats are pathetically, almost criminally inconsistent. That's a separate matter, however.)

5) It is certainly possible that if SPDY gains traction it will resolve a lot of these issues, or at least make them less relevant. Here's hoping.

6) A 304 resulting in "virtually no traffic" is not true. It may be effectively true on a broadband connection, but most WAN networks (3G, etc.) have a much higher latency than a wireline connection. Plus, there's a separate TCP connection for each one as well, with all of its overhead. Sending a 20 byte HEAD and 20 byte 304 back simply takes longer on a mobile network. That may change with future technology, but right now it's a serious issue.

7) The Manifest file would not be useful for font caching. Right now, the death knell of the Manifest file for more robust caching is that it auto-includes the HTML file that references it. The HTML page is perhaps the only resource involved in building a page that is not going to be static for days or weeks on end, yet it's the one you cannot tell the manifest not to cache forever.

8) As Matt and others noted, an fstat() on the server may not be all that cheap depending on your server environment, so you don't necessarily save much that way. That's certainly true. However, a PHP script doing fstat() on a file is not going to be appreciably slower than Apache doing fstat() on the same file to decide if it should send back a 304 or a 200, and the PHP script has the potential to do its own internal caching, tracking, or whatever else application developers come up with to reduce that time even further.

9) Far-future expires and ever-changing file names can work, but that's frankly an ugly hack. It's also something that in theory Drupal is doing already; The default htaccess file that ships with Drupal sets: ExpiresDefault A1209600 (2 weeks), and we do tack garbage onto the end of a compressed CSS or JS file to give it uniqueness. But if that actually worked properly, why am I still seeing dozens of 304 requests in my browser, even trying to reload the page "correctly"? See point 1 above.
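(For reference, that directive just means mod_expires stamps each response with headers like these, two weeks out from the time of the request:)

Expires: Fri, 21 Oct 2011 01:36:00 GMT
Cache-Control: max-age=1209600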

I think my underlying point may have gotten lost in my verbosity, however. It wouldn't be the first time. :-) So let me try to state it more briefly:

A web "page" is, in practice, not one resource but dozens of resources linked together. HTTP has no concept of that relationship between resources. That makes browsers do very wasteful things. We want some way to tie those together so the browser can be smarter about when it does stuff with the network.

Putting essentially pre-computed 304 responses into the HTML page may not be the right solution, certainly. However, I do believe we need some improved way of providing more intelligent contextual information to a browser. Perhaps if SPDY catches on it will solve this issue for us, since it uses only a single TCP connection for all resources. I don't know SPDY well enough to say. I do believe, however, that we need a contextual way to improve resource caching.

I'm not enough of an expert to judge whether you have correctly diagnosed the problem, but I like your reasoning. To help the browser manage caching, it makes more sense to tell the browser about a change after it happens, rather than try to predict when in the future it's going to change by setting an expire time (sounds obvious, doesn't it?). Here's a brain dump:

  • The attributes don't need to be timestamps: they could be anything that the browser can compare with its version in cache to know if something has changed, like, say, an md5 hash.
  • I don't see anything that inherently requires a CMS. The web server could look for dependencies between objects and take care of adding in the metadata, even if the page is static.
  • Ideally, this information might properly belong in HTTP instead of the page markup. Throw a list of URLs and their stamps into the HTTP header for dependencies of the current URL (a hypothetical sketch follows after this list). I don't know much about HTTP -- there might already be a provision in the spec somewhere for this sort of thing that nobody's using. Is there a place in the header where you can stick application-specific data without breaking existing clients?
  • If the browser provided a javascript API to the browser cache, you could do a whole bunch of things without having to change well-known standards. Maybe try to get google interested?
  • I'm thinking about the design of HTTP; I think it was designed based on a model where caching can take place anywhere between the server and the client (such as in a proxy), with the server, client, and any intervening caches all being pretty dumb and not having to know about each other. Hence the dependency on future expiration times for things, so a caching proxy can know when to discard things without having to ask. A new caching protocol might need to consider how it affects the overall model. I'm not sure -- just thinking out loud.
  • Doing fstat()s in distributed storage environments shouldn't be a design concern, because this system would, out of necessity, be optional. The administrator could just disable it if it doesn't make sense in a particular environment.
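To illustrate the header idea above: a purely hypothetical (definitely not existing) response header on a page could enumerate its dependencies and their stamps, something like:

Resource-Cache: </img1.png>; last-modified="2011-10-06T13:00:00", </styles.css>; hash="3f2a9c1b"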

I hope something in that brain dump was useful.
I'm glad you're giving thought to this kind of stuff.

If you use some of the caching techniques Wim talked about, you can avoid an fstat() call altogether (for PHP or Apache). When I say an fstat() call is slow, I'm comparing it to nothing, because a caching method that works now is already being used.

It occurred to me that static files don't actually change all that much. Which is probably why the current, brain-dead system already works. I think this is a problem that doesn't need to be solved.

To be fair to the cache-manifest: the "CMS of today" should be able to handle the assets intelligently for the cache-manifest. Arguably, isn't a similar problem visible on the server side with, say, Edge Side Includes - where the reverse proxy cache actually knows a lot about the page but only needs little pieces updated? The all-or-nothing take on the cache-manifest seems like a great tool to focus design and architecture. You touched on it with img, script, and style tags - each of these provides the tools to manage cache invalidation granularly until we need to refresh the "app" itself (much like a release... which will need to be delivered updates). I believe thinking of web page delivery in this a/b manner is healthy - either we are delivering hypertext pages over HTTP (the "days of yore" ;) or we are delivering a web app that takes advantage of CSS for style and uses JS for app logic (including loading dynamic content onto the "page"). Additionally, if this is done "right", the same logic that exists client side could exist on edge servers to allow for delivery to "dumb" devices... back to static pages with dynamic content. I don't think constraint is the right word, but if it is a constraint let's embrace it!

Lennie (not verified)

19 December 2011 - 9:46am

We run a fairly popular site and what we do is we just add an encoded 'mtime' in the URL of each static file:

/cache234723/path/static-image.jpg

And we add the headers for 'cache public for one year'.
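In header form that is roughly (date illustrative):

Cache-Control: public, max-age=31536000
Expires: Wed, 19 Dec 2012 09:46:00 GMT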

Which is very similar to what you are doing.

It means more filesystem stats, but it is faster, especially because our server has enough memory to cache all static files.

What is annoying is that caching the filesystem stats is interesting, but it conflicts with caching the HTML. Because if one static file changed (maybe even via FTP/SCP or other means) you don't know which HTML you should remove from the output cache.

So it is a good idea, and it kind of works, but your cache has to be really well organised and maintained if you want to profit 100% from it. If you have a logo in a template, for example, you might need to remove your entire cache or cache the template as a separate item.

A suggestion: maybe you should add 'size' to the attributes as well; this would allow a browser to decide which TCP connection to use for each request to get the most benefit.