Syndication, aggregation, and HTTP caching headers

I’ve seen various people in various places lately who were very unhappy about someone requesting their RSS feed every 30 seconds, or minute, or half hour, or whatever, and re-downloading it every time at a cost of megabytes in bandwidth. I’ve also seen people growing unhappy with the Googlebot for re-downloading their entire site every day.

So, a quick heads-up: there is a way for a client to say “hey, I have an old copy of your page, do you have anything newer, or can I use this one?” and for the server to say “hey, I haven’t changed since the last time you viewed me! use the copy you downloaded then!” Total bandwidth cost: about 300 bytes per request. That’s still a bit nasty for an ‘every 30 seconds’ request, but it means you won’t get cranky at the 10-minute people anymore. Introducing Caching in HTTP (1.1)!
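
To make that concrete, here’s a minimal sketch of the handshake from the client’s side, using Python’s http.client purely for illustration; the host, path, and ETag value are made up.

    # A minimal sketch of the conditional-request handshake; the host, path,
    # and header values are hypothetical.
    import http.client

    # First poll: no validators yet, so we download the whole feed and note
    # the validators the server sends back.
    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/feed.xml")
    resp = conn.getresponse()
    feed = resp.read()                           # full download, status 200
    etag = resp.getheader("ETag")                # e.g. '"abc123"'
    last_modified = resp.getheader("Last-Modified")
    conn.close()

    # Later poll: hand the validators back as conditional request headers.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/feed.xml", headers=headers)
    resp = conn.getresponse()
    if resp.status == 304:
        print("Not modified; reuse the copy from the first poll")  # ~300 bytes
    else:
        feed = resp.read()                       # it really did change
    conn.close()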

The good news! Google’s client already does the client half of this. Many of the major RSS aggregators do the client half of this (but alas, not all; there’s a version of Feed on Feeds that re-downloads my complete feed every half hour or so). And major servers already implement this… for static pages (files on disk).

The bad news! Since dynamic pages are generated on the fly, there’s no way for the server software to tell if they’ve changed. Only the generating scripts (the PHP or Perl or ASP or whatever) have the right knowledge. Dynamic pages need to implement the appropriate headers themselves. And because this is HTTP-level (the level of client and server talking their handshake protocol to each other prior to page transmission) not HTML level (the marked-up content of the page itself), I can’t show you any magical HTML tags to put in your template. The magic has to be added to the scripts by programmers.
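
Since the post above talks about PHP or Perl or ASP generically, here’s a rough sketch of what the generating script has to do, written as a Python WSGI app purely for illustration; last_changed() is a hypothetical stand-in for however your software knows when the content (or template) last changed.

    # A sketch of the server-side half, assuming WSGI; last_changed() is a
    # made-up placeholder for your software's own notion of freshness.
    from datetime import datetime, timezone
    from email.utils import formatdate, parsedate_to_datetime

    def last_changed():
        # Hypothetical: really this would come from the database or the
        # newest post/template modification time.
        return datetime(2004, 1, 1, tzinfo=timezone.utc)

    def application(environ, start_response):
        changed = last_changed()
        http_date = formatdate(changed.timestamp(), usegmt=True)

        ims = environ.get("HTTP_IF_MODIFIED_SINCE")
        if ims:
            try:
                if parsedate_to_datetime(ims) >= changed:
                    # Nothing new since the client's copy: headers only, no body.
                    start_response("304 Not Modified",
                                   [("Last-Modified", http_date)])
                    return [b""]
            except (TypeError, ValueError):
                pass  # unparseable date: fall through and send the full page

        body = b"<html>...the full, freshly generated page...</html>"
        start_response("200 OK", [
            ("Content-Type", "text/html"),
            ("Last-Modified", http_date),
            ("Content-Length", str(len(body))),
        ])
        return [body]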

End users of blogging tools, here’s the lesson to take away: find out if your blogging software does this. If you have logs that show the status code (200 and 404 are the big ones), check for occurrences of 304 (this code means “not modified”) in your logs. If it’s there, your script is setting the right headers and negotiating with clients correctly. Whenever you see a 304, that was a page transmission saved. If you see 200, 200, 200, 200 … for requests from the same client on a page you know you weren’t changing (template changes count as changes), then you don’t have this. Nag your software developers to add it. (If you see it only for particular clients, then unfortunately it’s probably the client’s fault. The Googlebot is a good test, since it has the client side right.) An appropriate bug title would be “I don’t think your software sets the HTTP cache validator headers”; in the report, explain that the Googlebot keeps hitting unchanged pages and is getting 200 in response each time.
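
If your logs are in the usual Apache-style common or combined format, a quick and dirty tally might look like this (the filename and the log format are assumptions; adjust for your own setup):

    # Count status codes in an access log in common/combined format, where
    # the status code follows the quoted request line. "access.log" is a
    # hypothetical filename.
    import re
    from collections import Counter

    status_re = re.compile(r'" (\d{3}) ')   # ..."GET /feed.xml HTTP/1.1" 304 0...
    counts = Counter()

    with open("access.log") as log:
        for line in log:
            match = status_re.search(line)
            if match:
                counts[match.group(1)] += 1

    print("200 (full page sent):", counts.get("200", 0))
    print("304 (transmission saved):", counts.get("304", 0))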

RSS aggregator implementers, and double for robot implementers: if you’ve never heard of the If-None-Match and If-Modified-Since headers, then you’re probably re-downloading, in full, every page you repeatedly request. Your users on slow or expensive connections hate you, or would if they knew the nature of your evil. Publishers of popular feeds hate you. Have a read of the appropriate bits of the spec, start actually storing the pages you download, and stop re-downloading them when they haven’t changed! Triple for images!
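
A sketch of what the polite version looks like, again in Python and with a hypothetical cache file and feed URL: remember each feed’s ETag and Last-Modified between polls, hand them back as conditional headers, and treat a 304 as “use the stored copy”.

    # A sketch of a polite poller; the cache file, feed URL, and overall
    # structure are assumptions, and error handling is minimal.
    import http.client
    import json
    import os
    from urllib.parse import urlparse

    CACHE_FILE = "validators.json"  # feed URL -> {"etag": ..., "last_modified": ...}

    def load_cache():
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        return {}

    def fetch_feed(url, cache):
        parts = urlparse(url)
        conn = http.client.HTTPConnection(parts.netloc)
        headers = {}
        saved = cache.get(url, {})
        if saved.get("etag"):
            headers["If-None-Match"] = saved["etag"]
        if saved.get("last_modified"):
            headers["If-Modified-Since"] = saved["last_modified"]

        conn.request("GET", parts.path or "/", headers=headers)
        resp = conn.getresponse()
        if resp.status == 304:
            conn.close()
            return None                 # unchanged: reuse the copy stored last time
        body = resp.read()
        cache[url] = {
            "etag": resp.getheader("ETag"),
            "last_modified": resp.getheader("Last-Modified"),
        }
        conn.close()
        return body

    cache = load_cache()
    new_content = fetch_feed("http://example.org/feed.xml", cache)
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)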

Weblog and CMS software implementers: if you’ve never heard of the Last-Modified and/or ETag headers, learn about them, and add the ability to generate them to your software.
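For the generating side, here’s a rough sketch of where the two validators can come from, once more in Python for illustration; render_feed() and the entries list are hypothetical stand-ins for data your software already has. The ETag is just a hash of the rendered output (so it changes whenever the output does, template changes included), and Last-Modified is the newest entry’s timestamp.

    # A sketch of generating the validators for a dynamically built feed;
    # render_feed() and the entries list are made-up placeholders.
    import hashlib
    from email.utils import formatdate

    def render_feed(entries):
        # Hypothetical: whatever your software does to build the RSS/Atom XML.
        return "<rss>...</rss>"

    def validators_for(entries):
        body = render_feed(entries)
        # ETag: a hash of the rendered output.
        etag = '"%s"' % hashlib.md5(body.encode("utf-8")).hexdigest()
        # Last-Modified: the newest entry's timestamp, as an HTTP date.
        last_modified = formatdate(max(e["updated"] for e in entries), usegmt=True)
        return body, etag, last_modified

    entries = [{"updated": 1075000000.0}, {"updated": 1075100000.0}]  # Unix times
    body, etag, last_modified = validators_for(entries)
    # Then: send ETag and Last-Modified with every 200 response, and answer
    # If-None-Match / If-Modified-Since with a 304 when they still match.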