Wednesday 10 November 2004

Using my previous entry as a jumping off point, I’m here to tell you a long and convoluted story about why, as a bug reporter (among other things), I will with but the flimsiest justification tend to regard myself as either put-upon or an idiot. In the spirit of Mark Pilgrim’s morons and assholes division, I bring you two user archetypes: the Thug and the Wimp.

I’d present Thea the Thug to you, but I think you’ve already met her. She emailed you one time while you were sleeping. The subject of the email was ‘PLEASE HELP — URGENT’. It’s possible the body described a genuine bug in your project. It’s just as possible that she was asking something unanswerable like "how can I get a list of email addresses?" It doesn’t matter anyway, because an hour later she sent another mail or two denouncing you for not answering her first. She’s got a range of tactics. She might have claimed to be representing the silent majority of users against the tyranny of the developers. She might have asked you how you expect to make your operating system work for ordinary people if you don’t even answer their emails. She might have just randomly continued asserting vague things about how your program should work, or about design principles, or security. She feels very put upon that you didn’t reply, especially since she’s smarter than you, and also she’s doing you a favour by using your program. You wouldn’t hurt a user now would you?

In fact, you’ve met Thea and other Thugs many times. On mailing lists. In bug reports. On IRC. If you’re particularly unlucky, in the street. You know her and her ilk backwards and forwards. One day, or many days, you snapped, and wrote anything from a blog entry to a howto lambasting users for their ill-placed sense of entitlement, their rudeness, and their over-use of capital letters. In some cases you edited it down into a neutral piece on how to get replies on mailing lists. In other cases, you posted the rant.

Now, meet Willy the Wimp. Willy’s just read your piece on how to get replies on mailing lists and he thinks you’re talking to him. And what’s more, you’re saying the same thing that everyone says to him. Well, you’re mainly saying what he says to himself. Bug reports, he reads in your piece, are hard to get right. If you get them wrong, which you probably will since they’re hard, you will annoy very smart people who have a lot of demands on their time. And if he doesn’t get it right he will look like an idiot. And you’ll never see him again, because he won’t do the report now.

You’ve probably met Willy too, but a bit less often. In their most common variety, Wimps are your granddad, the one who’s just bought a very expensive computer in the fear that he’s getting behind. And further, his next door neighbour has a lot of computer trouble. One time a letter to her niece just disappeared. And she was doing what she always did, she swears! And sure enough, when Granddad Willy uses the computer, he finds out that his very expensive new possession moves things around (possibly because his mousing skills aren’t good, but he doesn’t know that). They are tricksy. They are uncanny. They are not for the likes of him.

However, that’s only one variety of Wimp, the archetypal archetype, if you will. And he’s probably not reading your piece. But there’s a Wimp who might be.

This new Wimp is a contradiction in terms. This Wimp actually uses computers a lot. He not only holds a computer science degree, he holds one with quite good results. He has a computer related job. He’s got geeky interests. And he has a dirty little secret: when the computer doesn’t work it’s his fault. It could walk like a bug, talk like a bug, quack like a bug, and carry a big "Hello I’m Buggy the Buggy Bug!" sign, and his first thought on meeting it would be "I suck at computers. I’m a dumb person."

It’s quite likely this Wimp can even program, although three things are likely to be true. The first is that it isn’t his hobby, because if you think other people’s programs prove you’re a dumdum, your own programs square the effect. The second is that he learnt to program only at university or on the job. The third is that he’s spent a lot of time with people who are better programmers than he is.

These Wimps are shy. They’re often in the closet, confessing their hopeless incompetence only to their bewildered partners (who generally perceive them as perfectly competent and often try and explain this), who are the only people you can say things like "but I’m stupid! It’s my fault!" to more than once (and not too many more times than once, mind you, either).

Usability testers are the general exception. They see Wimps all the time. They have tapes of them crying and saying things like "Please, don’t worry about me. Your program is great. I’m just dumb. I’m so sorry." (Really, they do.) Every so often, they write textbooks and try and jerk Wimps into the light by saying things like "users will tend to blame software failures on lack of skills on their part," but their gentle warnings scatter like ashes in the face of the Thugs who blame software failures on the world’s failure to listen to their prophetic tones.

Even so, Wimpishness is more common than you think. Academia, for example, is absolutely infested with people who are just waiting for someone to knock and say "Oh hello, did I mention? We’ve noticed that you’re the dumbest person in the department. What on earth did you think you were doing? Pack up your stuff and be gone by noon." And there are an awful lot of computer users who think exactly like this, all the time.

Which is not to say that Wimps are great people, mind you. There are some places where the cultural norm is to lead with "I, a lowly worm, grovel at the feet of my betters and humbly present them with this small critique of their program and beg them to use the whip sparingly," but in most of the circles I move in this kind of thing annoys people and it’s seen as a roundabout way of asking for compliments in the reply and an especially tacky one at that. (And let’s face it, compliments are exactly what the Wimps want, again and again and again. Well that, and for their computers to work.)

No, the trouble with Wimps is that it’s almost impossible to speak softly to them and fend off the Thugs at the same time. I understand this. I really don’t expect people to reply "yes, this is definitely a bug, don’t worry, I think you’re very smart" to bugs I file. I really don’t. But at the same time, every time the program crashes, I think "I’m a dumb person, what am I?" automatically. And in situations like that, hearing about how you don’t want ‘useless’ bug reports is so easy for me to translate into ‘your bug reports are useless’.

OK, so I just have to buck up. I know this. (When I forget, I ask my Dad.) But in the general case, this is a real dilemma. When you write something, or say something, and have a Thug in mind as your audience, or even yourself the perfectly competent user who can flame with the best of them but only when there’s need, there are these people in the background, the ones who are thinking "I’m a dumb person, what am I?" and hearing you say that too.

I would love to know why it is that computers in particular inspire this reaction in people, especially people who’ve had all kinds of external validation about their abilities, and who are being presented with a crash dialog that reads "the program has quit unexpectedly," as opposed to "you broke it, moron!" Maybe I’m just an outsider, but it seems to me that this is something more pronounced in computing (and academia) than elsewhere. How do people’s feedback loops get so skewed? Is there a magical time to grab a Wimp and give them the good swift kick in the bum that you secretly want to give them? If not, what’s the best way to improve the feedback quality?

Open Source Web Design

It’s not like I’m in the most beautiful city in the world or anything, so some of the exciting things I’ve been doing include pulling a bunch of people’s weblogs into WordPress. (OK, OK, and I did take the night mode on Andrew’s digicam for a spin.)

Then this evening I was looking at my website and thinking “time for a change” and decided to head over to Open Source Web Design with the idea that if I ran out of inspiration I could pinch a design and remodel it.

It turns out OSWD’s designs are… really unspeakably awful.

They have a ‘hot or not’ rating system. Here are my rating criteria:

  1. I would close the browser window immediately if presented with this design.
  2. I would wince and enlarge the font several times, as well as upping the contrast on my monitor, when presented with this design.
  3. This design looks kind of like the rest of the web, except all the really bad bits, which it only resembles in part.
  4. This design is quite good.
  5. Upon seeing this design I would immediately plot to steal it for my own site, held back only by the fear of looking like a big thief, and also someone who can’t do their own web design. (Not that I can do my own web design, it’s just that at the moment I do it anyway.)

I gave nothing a score above three. Andrew was sitting next to me, and he gave nothing above a three either, despite several designs making extensive use of techniques designed to appeal to his demographic. (A demographic consisting of people who love the colour purple.)

If there really are people out there doing free high quality web designs for the fun of it, I’d appreciate some Google terms.

Thursday 21 October 2004

It’s well past time that I stopped using Movable Type 2.6.x on my servers. Various options present themselves.

The first is upgrading to Movable Type 3.x. This is undesirable for two reasons. First, I’d be hitching my wagon to proprietary software whose price and conditions may change arbitrarily whenever a new version is released. Second, the 3.x version would cost US$99, and while I can afford this, all but one of the weblogs on this installation belong to other people, and none of them are paying me. So I don’t feel generous enough to do it.

The next is using a Free Software solution. This is appealing; it’s my default choice in all other kinds of software. However, in order to do this, the software needs to offer all the features of Movable Type that the users, or I as their administrator (captor), need:

Multiple weblog support.
I have ten weblogs. I don’t want to make ten copies of the same piece of (PHP, because everything is in PHP) code in ten different directories; make either ten different databases or fifty different database tables; and give the same users varying permissions over ten different weblogs. This one is a surprisingly rare feature.
User-based security.
This is not so much because I think the users are EVIL as because they will make mistakes. If the software gives them edit access to all the blogs, they will regularly make posts in the wrong place.
Web interface.
Most of these people don’t have their own computer, they blog from labs or the library. Uploading text files doesn’t cut it.
User editable templates.
Julia wants this, at least, and probably Mos does too.

Now, the first requirement alone knocks out a huge number of the candidates. There are a few that remain. Pivot seems buggy and … odd (probably unusable by users weaned on MT). b2evolution really seems like the only other candidate.

The last alternative is writing my own, because there does rather seem to be a hole in the (quite considerable) market. However, I’ve already done this twice, once for puzzling.org and once ages ago for eyes, and it’s getting dull. And in neither of those cases have I gotten anywhere near things like comments, trackbacks, or other basically necessary features. And while I have no doubt that the basic features of multiple weblogs, multiple users, and web editing would be developed fairly rapidly, the list of stuff I’d need to do started looking a little nasty:

  • I have to decide on a backend. Ew. And every time I change the code I have to have upgrade scripts. More ew.
  • In order to have more than one other user and any co-developers at all, there is but one choice of language. (Unless it turned out to be such a killer app that web hosts around the world started installing twisted.web — unlikely.)
  • Input validation and web based authentication are which of: horrible, easy to get wrong and boring? Answer: all of the above.
  • Then there’s that old horror: prevention of evil. Even if I trust my users to avoid malicious markup the question of fending off comment spammers, referrer spammers, trackback spammers and other jerks will arise eventually.
  • Letting users edit the templates is a huge input validation problem all on its own: how do you dig them out when they write an invalid template? (Getting them to write Nevow style templates? Well, anything’s possible.)
  • In order to handle both my own and other people’s bizarro weblog setups, it would need to work with arbitrary file systems, arbitrary vhosts, and quite likely generate URLs specified by the user.
  • In order to work with both my own and other people’s bizarro weblog setups, it would need to parse about twenty different types of export format.

Irritants; Shiny things

Irritants

  1. Our host’s wireless router, whose default DHCP lease is 30 seconds. That’s sure to fill one’s logs, asking for a new IP address every 30 seconds. I discovered recently, though, that if you specify a lease time, any lease time, it gives you about 40% of that time. There’s some amusement to be had from that if you aren’t nicely sending DHCPRELEASE, but at the moment I am settling for a lease of 852 million seconds. And counting. (See the sketch after this list.)
  2. Nautilus, which I’m trying to use regularly, not just for reasons of sympathetic magic, but because it is indeed useful to manipulate pictures by dragging and dropping thumbnails. (A future project is a fairly hard-core script for resizing pictures en masse because the g-scripts ones don’t talk gnome-vfs. Wait on this one, I need to learn pygtk.) However, the usefulness is being offset by the incredible number of bugs, mainly related to frozen redraws or uninformative and occasionally wrong error messages, that manifest themselves every time I try and use it for non-local file access.
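
On the DHCP lease irritant above: the workaround lives in dhclient’s configuration. A minimal sketch, assuming ISC dhclient; the file path and the exact value requested are my guesses rather than a record of what I actually used:

    # /etc/dhclient.conf (location varies by distribution)
    # Ask for the longest lease a signed 32-bit field allows; the router then
    # hands back roughly 40% of whatever you request.
    send dhcp-lease-time 2147483647;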

Shiny things

  1. Ubuntu’s "Human" icon theme (you may have to explicitly choose this). So many happy talking faces in one X-Chat icon, I can hardly believe it.

Hosting

About this time last year I was unhappy with my website host, and happened to be Googling for new hosts (Google Ads can be pretty useful as long as you were intending to part with money anyway) when I came across the idea of virtual servers: that is, paying someone to run a process for you that behaves just like a little Linux machine.

The concept was just great for me, because I had this enormous list of specialised hosting requirements that started accruing way back when I was hosting for free on a server tucked away at Andrew’s work place. These requirements include something that no shared hosting provider gives out (multiple shell accounts) and stupid ones like the ability to construct the infinite number of addresses with ‘-’ signs in them that Andrew and I use for various purposes, mainly for sorting mail from online companies into different folders.

Anyway, the virtual servers appeared to have all the advantages of dedicated servers that I needed (root access with the usual powers minus kernel upgrades and driver fiddling) without the hassles of dedicated servers, which basically come down to price and hardware maintenance.

Since then though, I’ve embarked on a round of server hopping the likes of which would make any Australian ADSL junkie proud.

A year in review:

Bytemark. These guys are pretty good. They have a simple little shell app where you can log in to the host server, poke at your parasite server, reboot it, access the consoles and so on. You can also overwrite the whole thing with a new clean server. These applications are really useful and many virtual server hosts don’t have them. When they don’t, if your server disappears from the ‘net, you’re at the mercy of tech support, even if the host server is up. I switched away from them because they were in the UK and the delay when typing from Australia was annoying me. In retrospect: dumb.

JVDS. Random web pundit consensus seems to be that these guys have a pretty good deal on RAM, disk space, and whatnot. They seem to have the most recommendations. They had reasonably prompt user support. Unfortunately, we really needed it, because they messed up our server’s setup. Every time the host machine rebooted, someone else’s server would come up in place of ours. And this server, being where ours should be, would start receiving all our mail and, not recognising the addresses, rejecting it all. Delayed mail not good, bounced mail bad. Further, we had no access to the host server and couldn’t check our machine’s status. So after the fourth time they promised and failed to fix the phantom server problem, we moved hosts again.

Redwood Virtual. These guys have an amazing deal on RAM which was why we went to them. Unfortunately they’ve had two major problems: consistent ongoing performance problems probably related to disk, and massive downtime. Like JVDS they don’t give you any access to the host server that’s useful when your parasite goes down, and unlike JVDS, they don’t have 24 hour support. It turns out they grew out of a bunch of friends who got a dedicated machine and some IP addresses and started playing around with UML.

Linode. I’m testing these people at the moment. While not a strikingly good deal on RAM or disk space, these guys have the most sophisticated host server management facilities I’ve seen. They’re the only host so far with an easy way to find out your bandwidth use. You can reboot and stop your parasite server. You can subdivide your disk space, reformat it, and reinstall at will. You can maintain different installations and switch between them. You can purchase extra RAM and disk space and have it added automatically. You can access your parasite host’s consoles. You can configure reverse DNS. And then, only if there’s something wrong with all of that, you can hassle tech support. Finally, although it’s possible my parasite server just is on a new machine, they seem to have good performance.

[Side note to Twisted people eager to promote one of their helpers: thanks, I’ve heard of tummy.com. However they’re relatively expensive and offer less disk space than I need.]

Linode isn’t all roses though.

First of all, they’re draconian about spam. I’m fine with “thou shalt not spam.” I’m less happy with “if you ever get us blacklisted we will charge you $250 an hour for the time it takes us to get un-blacklisted.” (Background story: I used to run a secondary mail server for twistedmatrix.com. Spammers, for various reasons, like to send spam via the secondary mail server. Hence, I was handling all of twistedmatrix.com’s spam and forwarding it to them, as secondary servers are meant to do. One of their users noticed my server’s name in all his spams, and promptly got me in trouble with my provider, who was JVDS at the time. Moral of the story: it isn’t hard to falsely look like an open relay, and never secondary for someone who may have users who can read email headers but don’t know DNS.)

Second, as mentioned, they’re not the best deal on RAM and disk space. In particular, I probably am really pushing it trying to run a server under my current demand with 64MB RAM, especially as either Nevow or my Nevow app is a real memory hog. And, goddamn, memory usage needs to be a priority for virus and spam checkers. Amavis doesn’t even do any actual matching for me, it just hands off to clamav, and it still eats 6-10MB of memory.

Finally, nitpicking, their default Debian images have some weird problems, most noticeably not having 127.0.0.1 localhost in /etc/hosts. I hope I’ve come across the majority of these now.

However, I am hoping that a week or two of testing (they’re already handling incoming mail for Andrew and myself) will show them to be sufficiently stable and agile for me to look at settling there for a while.

Nifty; Job

Nifty

I was trying to deal with LiveJournal’s XML-RPC interface to transmit some UTF-8 encoded text. It wasn’t working so well, so Andrew introduced me to the following Python snippet (which will test your installed fonts nicely):

# -*- coding: utf-8 -*-  (needed at the top of the file so Python accepts the non-ASCII literal)
print u'abcdefg€ñçﺥઘᚨ'.encode('ascii', 'xmlcharrefreplace')

The output is:

abcdefg&#8364;&#241;&#231;&#65189;&#2712;&#5800;

Note: you should actually save this to a file and run it, rather than just whacking the script into your Python command-line interpreter, because my console (or interpreter) didn’t like Unicode input, and mine was set up by those whacky Python nuts at Canonical.

Get that? (Pfft, don’t look at the HTML source, I had to change all of the & signs to &amp;.) It takes the nasty Unicode string "abcdefg€ñçﺥઘᚨ" and re-encodes it in ASCII, replacing all the non-ASCII characters (everything except ‘abcdefg’) with XML character references to their Unicode values. The upshot is an XML snippet that you can transmit in ASCII if you’re ever dealing with an interface that doesn’t seem to like your UTF-8 encoded strings.

Job

Not that I have a good resume online, but with half my holiday over, I’m looking for six months or so of work in Sydney when I get back. I’m available from early December. Python, Perl or Java programming, or possibly tech support, tech writing or office admin, but I’ve got a better resume for the junior programming positions. Leads appreciated!

Pretty pictures

Pretty pictures

Ubuntu changed their default theme to include a harmonious humanity image featuring three pretty young things, which is causing considerable controversy mainly because the models used in the pictures are in various states of (well and truly legal in Australia) partial nudity. Screenshots linked here unless the poster takes them down. (PNGs I ask you?)

A lot of people are making the argument that those images may be inappropriate if displayed in a corporate environment or alternatively to conservative friends or family members. I don’t think anyone’s admitted to being too conservative themself to like the image, so I’ll start.

I like portraiture and good photographs, as it happens, and it can get as naked as can be. Fetish shots are fine as long as I know roughly what to expect. These shots are good photographs and reasonable portraiture, although they’re a bit more glossy/pretty-pretty than I like to see in galleries.

But for some reason, which must be unpopular judging from every theme site I’ve ever seen, I really dislike having people prettier than me on my computer’s desktop. I don’t think I’ve ever had portraits on it at all in fact, but if I did, I would never start with models. Something in the idea leaves me very cold: I’d much rather teh-boring than teh-pretty-people. (In actual fact though, I have a pretty castle shot: not the most amazing shot ever, but a favourite amongst my own.)

(I wonder what is psychologically at the root of this? Perhaps people roughly divide into two: people who’d love to strip off a bit and be happy and playful for a camera, and the other half of people — or maybe that’s just me — whose instinctive reaction to the idea has a little bit of ew in it. It certainly messes with the intended vibe.)

Update: Andrew showed me the proposed CD cover which has similar artwork, and for some reason I have considerably less squick. Maybe I’m acclimatised to teh-pretty when shopping. On the other hand, since partially naked people are usually selling things I don’t want, I think I’d pass it by on the sales rack without a second glance.

Syndication, aggregation, and HTTP caching headers

Syndication, aggregation, and HTTP caching headers

I’ve seen various people in various places lately who were very unhappy about someone requesting their RSS feed every 30 seconds, or minute, or half hour, or whatever, and re-downloading it every time at a cost of megabytes in bandwidth. I’ve also seen people growing unhappy with the Googlebot for re-downloading their entire site every day.

So, a quick heads-up: there is a way for a client to say “hey, I have an old copy of your page, do you have anything newer, or can I use this one?” and for the server to say “hey, I haven’t changed since the last time you viewed me! use the copy you downloaded then!” Total bandwidth cost: about 300 bytes per request. That’s still a bit nasty for an ‘every 30 seconds’ request, but it means you won’t get cranky at the 10 minute people anymore. Introducing Caching in HTTP (1.1)!

The good news! Google’s client already does the client half of this. Many of the major RSS aggregators do the client half of this (but alas, not all, there’s a version of Feed on Feeds that re-downloads my complete feed every half hour or so). And major servers already implement this… for static pages (files on disk).

The bad news! Since dynamic pages are generated on the fly, there’s no way for the server software to tell if they’ve changed. Only the generating scripts (the PHP or Perl or ASP or whatever) have the right knowledge. Dynamic pages need to implement the appropriate headers themselves. And because this is HTTP-level (the level of client and server talking their handshake protocol to each other prior to page transmission) not HTML level (the marked-up content of the page itself), I can’t show you any magical HTML tags to put in your template. The magic has to be added to the scripts by programmers.

End users of blogging tools, here’s the lesson to take away: find out if your blogging software does this. If you have logs that show the status code (200 and 404 are the big ones), check for occurrences of 304 (this code means “not modified”) in your logs. If it’s there, your script is setting the right headers and negotiating with clients correctly. Whenever you see a 304, that was a page transmission saved. If you see 200, 200, 200, 200 … for requests from the same client on a page you know you weren’t changing (template changes count as changes), then you don’t have this. Nag your software developers to add it. (If you see it only for particular clients, then unfortunately it’s probably the client’s fault. The Googlebot is a good test, since it has the client side right.) An appropriate bug title would be “I don’t think your software sets the HTTP cache validator headers”, and explain that the Googlebot keeps hitting unchanged pages and is getting 200 in response each time.
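
If you’re not sure what to look for, here’s roughly how the two cases appear in an Apache-style access log (the client address, path and sizes are invented for illustration). The 200 line ships the whole feed again; the 304 line ships headers and nothing else:

    192.0.2.1 - - [21/Oct/2004:09:00:01 +0200] "GET /blog/index.rdf HTTP/1.1" 200 34215
    192.0.2.1 - - [21/Oct/2004:09:30:02 +0200] "GET /blog/index.rdf HTTP/1.1" 304 -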

RSS aggregator implementers, and double for robot implementers: if you’ve never heard of the If-None-Match and If-Modified-Since headers, then you’re probably slogging any page you repeatedly request. Your users on slow or expensive connections hate you, or would if they knew the nature of your evil. Publishers of popular feeds hate you. Have a read of the appropriate bits of the spec and start actually storing pages you download and not re-downloading them! Triple for images!
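
For the client half, here’s a minimal sketch of a conditional GET in Python (httplib, 2.x vintage); the assumption doing all the work is that you stashed the validators and the body from the previous download somewhere:

    import httplib

    def fetch_feed(host, path, etag=None, last_modified=None):
        # Offer back the validators saved from the previous download, if any.
        headers = {}
        if etag:
            headers['If-None-Match'] = etag
        if last_modified:
            headers['If-Modified-Since'] = last_modified
        conn = httplib.HTTPConnection(host)
        conn.request('GET', path, None, headers)
        response = conn.getresponse()
        if response.status == 304:
            # Nothing has changed: reuse the copy stored last time.
            return None
        # Otherwise keep the body *and* these two headers for next time.
        return (response.read(),
                response.getheader('ETag'),
                response.getheader('Last-Modified'))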

Weblog and CMS software implementers: if you’ve never heard of the Last-Modified and/or ETag headers, learn about them, and add the ability to generate them to your software.
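
To make that concrete, here’s a minimal CGI-flavoured sketch in Python of the server half. It’s not anyone’s actual implementation, and last_changed() and generate_feed() are hypothetical stand-ins for whatever your software already knows about its own entries and templates:

    import os, time

    def respond():
        mtime = last_changed()   # hypothetical: newest entry/template change, as a Unix time
        last_modified = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(mtime))
        etag = '"%d"' % int(mtime)

        if (os.environ.get('HTTP_IF_NONE_MATCH') == etag or
                os.environ.get('HTTP_IF_MODIFIED_SINCE') == last_modified):
            # The client already has this version: send headers only, no body.
            print 'Status: 304 Not Modified'
            print
            return

        print 'Status: 200 OK'
        print 'Last-Modified: %s' % last_modified
        print 'ETag: %s' % etag
        print 'Content-Type: application/rss+xml'
        print
        print generate_feed()    # hypothetical: your existing page generation

Either validator on its own is enough to start seeing 304s in the logs; sending both just gives clients more to work with.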

Plausible facts

Chris Yeoh made an attack on Wikipedia, inserting false facts into a new article and seeing if they’d be removed. See also Rusty Russell who considers this vandalism, and Martin Pool who thinks that this isn’t telling us anything we didn’t already know.

It made me think a little about defences already in place against this kind of thing. The biggest concern, I think, is not a misplaced fact somewhere, but the run that kind of fact can get through followup literature. For example, in early editions of The Beauty Myth Naomi Wolf claimed, with a citation, that 150 000 American women die of anorexia every year, which is in fact false. (150 000 was approximately the number of sufferers not deaths.) This is useful to bibliographers of a forensic inclination because tracking your mistakes is an excellent way to narrow down the number of sources you could have got your figures from, but not so useful to, say, anyone else. However, at this time, it isn’t of spectacular concern to Wikipedia: anyone who gets a fact from Wikipedia that would be useful in a publication will double-check it as they would any fact from so general a source. (Surely?)

Imagine the following attack on Wikipedia: I go to an article, say, the Canberra article, and find somewhere to insert this fact:

The number of women cigarette smokers between 15–25 in Canberra is 20%, considerably lower than the Australian national average of 35%.

Of course, I’d have to be a little bit cunning to insert this fact into the existing article, because it doesn’t contain any facts about the population, let alone such precise ones. But assuming I could do that, it would make a plausible if boring addition. It would also be, as far as I know, false. While I think I’ve got the number of female smokers in Australia in that age group right to within about 5% (it got a lot of press coverage about five years back, because other demographics aren’t smoking in anything like those numbers), I’ve got no reason to believe that Canberra deviates from the norm in any way.

What harm would this do? Well, it’s possible a bunch of kids would copy it into their school assignments on Canberra (you poor little things, Canberra?) and get caught cheating, less because the fact is wrong and more because no kid puts that kind of fact in an assignment that they aren’t copying wholesale. University students doing assignments on health statistics might get done in by Google, although who knows, if they actually cite it they might be in the clear.

So that’s fairly minor harm. The potential major harm is in reducing the reliability of Wikipedia as a source to the extent that all that work is wasted or that people write off collaborative non-fiction of this kind as impossible. I’d contribute to that harm by a very small amount in this particular case, but by quite a large amount if I had an axe to grind against Wikipedia and decided to be smart about hiding my identity and insert 1000 of the things into different articles. With 20 friends to help I could do a lot of damage.

Internet software has a particularly bad case of a similar problem: there is a large and powerful group of people who are very interested in abusing your software for a number of reasons, ranging from being able to commit fraud using your computer to attacking some IRC server whose admins kicked them off.

Wikipedia has less of a problem because false information in it has less marshallable power: you have to wait until nebulous social factors pick up the information and start wafting it around rather than being able to tell your virus bots to go out and memeify. Hence attacks on Wikipedia tend to be the insertion of spam links taking advantage of its Google juice (well, I presume they get them, Wikitravel sure does) and presumably edit wars between authors rather than determined attempts to undermine it.

The only real reason to insert subtly false information into Wikipedia is that you like being nasty or maybe, to put it a different way, you honestly believe that “insecurities” in social systems are just like insecurities in software systems, and you’re on a crusade to ‘re-write’ society so that the kiddies can’t hack it. Or to be generous, you want to give Wikipedia a chance to show that it can defend itself, although applying the “what if everyone did that?” test doesn’t make that look so good either. (Societal systems will tend to break down once a critical point of disorder is reached, and since the fix for this is hardly trivial, the “doing them a favour by demonstrating flaws” argument doesn’t hold nearly as much water as it does for attacks on software.)

Anyway, given that, I thought I would consider it in the light of other heuristics for asserting facts: print heuristics, to the limited extent that I know them.

Take for example my first year university modern history course. As a rule of thumb, you don’t assert facts without justification in a history essay. Almost every declarative sentence you write will be accompanied by footnotes showing where you got your facts and arguments from. There is the occasional judgement call to make, because a sufficiently well-known fact doesn’t need citation. (To give examples on either side of the line: the fact that the assassination of Archduke Franz Ferdinand occurred on the 28th June 1914 would not require citation, but casualty figures for the battle of the Somme would, and an argument that the alliance system in Europe made WWI inevitable would require a small army of superscript numbers.) Given that though, you exhibit your sources, you check your sources where your argument relies on them (yea, unto the tenth generation) if you don’t want to get caught out, and the worth of your argument rests on the authority of your sources.

That did actually matter in first year by the way: the most common mistake made in WWII essays was sourcing Holocaust information from the web, which apparently — no, this isn’t a story from personal experience — means you run a high risk of relying on Holocaust-denying websites. (The alternative is that those essays were all by young Holocaust deniers, but given the number of people whinging that the course was insufficiently Marxist I think my classmates’ ideologies lay elsewhere.)

Now authority gets murky and geeks want the numbers here. But secretly, as Martin Pool points out, “humans are cunningly designed to do trust calculations in firmware” (yes, even people who trust their conscious mind more than their firmware). You can also see Dorothea Salo on heuristics for evaluating source reliability.

Of course, encyclopedias have different standards, because otherwise you’d get a bibliography that stood twice as high as the encyclopedia stack. (Less of a problem on DVD or the web, mind you!) I believe the system is that they rely more directly on authority: rather than sourcing the article’s facts from authority, you get an authority to write the article. Wikipedia can’t go this way, so they are left with two choices for establishing authority: have a really good reputation and a bunch of caring people, or more citations.

Citations are my secret shame, by the way: once you get used to following them and discovering interesting background information, you get addicted. I wouldn’t say no to more citations on Wikipedia. (Take Franz Ferdinand for example — since we all knew I had to look up that date in a couple of places — “[n]o evidence has been found to support suggestions that his low-security visit to Sarajevo was arranged […] with the intention of exposing him to the risk of assassination”. Huh? Well, it would be interesting to read about who made the allegations and about the hunt for evidence, yes?)

Compare the issue of fact checking in software.

Although it has been argued (not, I think, by Eric S. Raymond, but by people extending his “many eyes” argument) that security problems in Free Software ought to be discovered very quickly because of sharp eyed people reading the code, bug reports tend to be inspired by the code not doing something you want it to do. (Except in the case of Metacity, which I think would need to display an obscene and anatomically implausible insult to me whilst running around in the background deleting all my files after emailing them one-by-one to my worst enemy before I’d believe I’d found a bug in it.)

This is a similar problem to that of finding dull smoking statistics inserted into Wikipedia by attackers or simply by authors who got their facts wrong: the less your bug displays itself to non-hostile people in the course of their usage, the less likely it is to be reported. It’s even worse, in fact, because while I can’t see any real reason for hostile people to search Wikipedia for false facts aside from saying “hah, take that insecure Internet, back to the World Book!”, there are a lot of reasons for hostile people to search for holes in your software.

I think the argument for relying on authority is less good too. Being an authority on history involves having mastery of an enormous number of facts and a considerable number of arguments, together with a nose for intellectual trends, some good mates, and a lot of cut-and-thrust. But while code authorities write a lot of code, and understand a larger amount of it, they don’t by and large earn their status in a competitive arena where their code is successful only if other people’s code is wrong, incomplete or doubtful.

This has considerable advantage in production time, as anyone who’s familiar with existing tools intended to introduce proofs of correctness knows. However, I think it has a minor cost in that there’s nowhere near the same incentive to engage with and criticise other people’s code, unless you’re a criminal. Of course I know that there are code auditors who aren’t searching for criminal opportunities; however, the professional incentive for everyone to do it is lacking.

And in some ways, Wikipedia also lacks this professional incentive. In a volunteer project rather than a cut-throat professional world where everyone is fighting for tenure, the incentives for fact checking are less. It’s essentially reducible to the old problem about getting people to write documentation: you can pay them, you can hurt them (this is how you get people to report bugs) or you can make it sexy. Ouch.

Idle notes

Idle notes

An interesting thing about time zones is that they’re stuffing up my web surfing habits. I read a lot of stuff written in Europe and the US (I was going to say ‘disproportionate amounts of stuff’, but given where the English writers of the world live, not so), and I’m used to it all being written overnight so that I can do all that reading during the morning.

Now that I’m in Spain, I’m actually 1-7 hours ahead of most of the writers, so I have to wait all day for their content to dribble in. I prefer the Australian setup.

What I’ve been working on

Precious little, much of it Wikitravel, which is sort of silly, but sort of not, since it’s still pretty hot in the middle of the day in Palma.

What I should be working on

  1. Updates to the Twisted Labs website pending the 2.0 release.
  2. A report for Diego.