Monday 19 July 2004 – puzzling.org

I’ve been working on Backwards again: I’m trying to pull out everything that’s specific to my site so that other people can use it. I’m getting closer: Andrew is trying to deploy it now, which is helpful. I’ve finally given site creators the ability to drop in their own DocFactory easily. That’s a nice touch, because you can write the DocFactories with Stan, which is ever so much nicer than typing HTML tags, or with HTML/XML if you prefer. It’s still a programmer’s web content tool, but I don’t intend to change that.

I’ve also finally stuck in some code that adds the cache validation HTTP headers, Last-Modified and ETag, to every page response. I added them to the RSS feeds a while back after I noticed that Jeff’s aggregator and Planet SLUG both poll for updates every ten minutes. I didn’t think they would be so useful for the rest of puzzling.org, because most browser visitors come in from Google to look at my summaries of my high school texts.

I’d forgotten about the Googlebot itself though. It’s a pretty regular visitor to my site now — it seems to go through every few days. Other search robots are less frequent visitors. The cache validation headers are preventing the full transmission of an awful lot of content to robots now. Something for a lot of dynamic blog tools to consider doing — perhaps many do it already.
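
As a rough illustration of what cache validation involves, here is a minimal sketch in plain Python rather than the actual Backwards or Nevow code. The function names (`make_validators`, `is_not_modified`) and the plain-dict request headers are assumptions made for the example. It builds a Last-Modified value and a weak, timestamp-based ETag, then decides whether a conditional request can be answered with a bodiless 304 Not Modified.

```python
import calendar
from email.utils import formatdate, parsedate


def make_validators(mtime):
    """Build (Last-Modified, ETag) header values from a modification time.

    mtime is seconds since the epoch, e.g. from os.path.getmtime().
    """
    mtime = int(mtime)  # HTTP dates only have one-second resolution
    last_modified = formatdate(mtime, usegmt=True)
    etag = 'W/"%d"' % mtime  # weak ETag derived from the timestamp
    return last_modified, etag


def _opaque(tag):
    """Strip the weak marker so two tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, mtime):
    """Return True if the client's cached copy is still current (send 304)."""
    _, etag = make_validators(mtime)
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        # Splitting on commas is enough here because our tags contain none.
        candidates = [_opaque(tag) for tag in if_none_match.split(",")]
        return "*" in candidates or _opaque(etag) in candidates
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        parsed = parsedate(if_modified_since)
        if parsed is None:
            return False  # unparseable date: just send the full page
        return int(mtime) <= calendar.timegm(parsed)
    return False
```

A poller that sends back the validators it was given last time gets a 304 and no body, and that is where the bandwidth saving for aggregators and robots comes from.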

The only problem I’m having is working out what to do when the templates change because really, the validation headers should change then too. The content of the page won’t have changed, but the layout will have. There’s a couple of possibilities:

drop the Last-Modified header and base the ETag header on some kind of hash of the page content (I’m currently setting weak ETags based on the timestamp, actually), which means turning off Nevow’s incremental render; or
extend the existing “change detection” mechanisms to detect changes in the template as well as changes in the content.
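
The first possibility might look something like the sketch below. `content_etag` is a hypothetical helper, not part of Nevow or Backwards, and the choice of SHA-1 is arbitrary. The hash can only be computed once the whole page has been rendered, and the ETag header has to be sent before the body, which is why this approach rules out incremental rendering.

```python
import hashlib


def content_etag(body):
    """Return a strong ETag computed from the fully rendered page bytes."""
    digest = hashlib.sha1(body).hexdigest()
    return '"%s"' % digest


# The same content rendered through two different layouts gets two
# different ETags, so a template change invalidates cached copies.
old_layout = b"<html><body><p>Hello</p></body></html>"
new_layout = b"<html><body><div><p>Hello</p></div></body></html>"
assert content_etag(old_layout) != content_etag(new_layout)
```

The trade-off is that every conditional request still costs a full render on the server, even when the answer turns out to be 304.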

Changes in the content are currently detected in a variety of ways, but they’re all based on file timestamps. I haven’t yet come up with a way to detect a template change that doesn’t place a burden on the site maintainer to record the fact or date of the change manually. I could insist that the templates be files so that I could check the timestamp, but having them as nevow.loaders.DocFactory Python objects is desirable for other reasons. I guess I could also stick the template in some kind of database and timestamp it there. (Actually, doing the latter might be one way to avoid the “need to restart the process when the templates change” problem too. Maybe I have a winner here.)
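
One way the database idea could work is sketched below. It is very much an assumption-laden sketch: it assumes each template can be reduced to a string of source text, it uses SQLite purely for illustration, and the table layout and the function names (`open_store`, `template_changed_at`, `effective_mtime`) are all made up for the example. The store remembers a digest of each template and only moves the timestamp when the digest changes, so nobody has to record the change date by hand.

```python
import hashlib
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the hypothetical template store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        "name TEXT PRIMARY KEY, digest TEXT NOT NULL, changed_at REAL NOT NULL)"
    )
    return conn


def template_changed_at(conn, name, source, now=None):
    """Record the template and return the time its source last changed."""
    now = time.time() if now is None else now
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT digest, changed_at FROM templates WHERE name = ?", (name,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return row[1]  # unchanged: keep the old timestamp
    conn.execute(
        "INSERT OR REPLACE INTO templates (name, digest, changed_at) "
        "VALUES (?, ?, ?)",
        (name, digest, now),
    )
    conn.commit()
    return now


def effective_mtime(content_mtime, template_mtime):
    """The page counts as modified when either content or template changed."""
    return max(content_mtime, template_mtime)
```

The result of `effective_mtime` could then feed the existing timestamp-based Last-Modified and weak ETag values in place of the content timestamp alone.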

There’s a few other things I want to sort out before resting for a while (by which I mean “making a numbered release which probably no one will use anyway”):

Documentation. I actually loathe documenting my own projects as much as anyone; it’s only documenting other people’s that I don’t mind.
URL generation. Unfortunately, the fact that I’m using old twisted.web with my personal Backwards site means that the URL generation code is an immense mess. Because old web only talks to it through a proxy, and the proxy code doesn’t set the forwarding headers, I can’t use Nevow’s URL generation mechanisms without ending up with a bunch of http://localhost:8080/ URLs. So I have my own clunky hard-coded base URLs because there’s no way to get them from the request object when it’s behind a twisted.web.proxy. I keep wanting to have a weekend-long hackfest on new twisted.web to get it deployable, but I’m not a “twisted developer” in that sense, and the way they use sandboxes has always said “my rewrite, no you touchie!” to me.
Persistence. The amount of data stored in memory is really too large and should be much lower. So I need to stick it somewhere. Which means choosing between persistence systems on what I currently feel is way too little information. I currently have a bastardised mixture of shelves which really should be one file.
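
On the URL generation item above: if the proxy could be made to set forwarding headers, the public base URL could be recovered per request instead of being hard-coded. The sketch below assumes the de facto `X-Forwarded-Host` and `X-Forwarded-Proto` headers and a plain dict of request headers. `base_url` and its `configured_base` fallback are hypothetical names, not Nevow or twisted.web API.

```python
def base_url(headers, configured_base="http://localhost:8080/"):
    """Work out the public base URL for a request that came via a proxy.

    Falls back to a configured base URL when the proxy sets no
    forwarding headers, which is the situation described above.
    """
    forwarded_host = headers.get("X-Forwarded-Host")
    if not forwarded_host:
        return configured_base
    # A chain of proxies may append hosts; the first entry is the one
    # the client originally asked for.
    host = forwarded_host.split(",")[0].strip()
    scheme = headers.get("X-Forwarded-Proto", "http")
    return "%s://%s/" % (scheme, host)
```

With no forwarding headers this returns the configured fallback, so the hard-coded base URL remains the behaviour of last resort rather than the only option.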