Plausible facts

Chris Yeoh mounted an attack on Wikipedia, inserting false facts into a new article to see whether they’d be removed. See also Rusty Russell, who considers this vandalism, and Martin Pool, who thinks it isn’t telling us anything we didn’t already know.

It made me think a little about the defences already in place against this kind of thing. The biggest concern, I think, is not a misplaced fact somewhere, but the run that kind of fact can have through follow-up literature. For example, in early editions of The Beauty Myth Naomi Wolf claimed, with a citation, that 150 000 American women die of anorexia every year, which is in fact false. (150 000 was approximately the number of sufferers, not the number of deaths.) This is useful to bibliographers of a forensic inclination, because tracking your mistakes is an excellent way to narrow down the number of sources you could have got your figures from, but not so useful to, say, anyone else. However, at this time, it isn’t of spectacular concern to Wikipedia: anyone who gets a fact from Wikipedia that would be useful in a publication will double-check it as they would any fact from so general a source. (Surely?)

Imagine the following attack on Wikipedia: I go to an article, say, the Canberra article, and find somewhere to insert this fact:

The proportion of women aged 15–25 in Canberra who smoke cigarettes is 20%, considerably lower than the Australian national average of 35%.

Of course, I’d have to be a little bit cunning to insert this fact into the existing article, because it doesn’t contain any facts about the population, let alone such precise ones. But assuming I could do that, it would make a plausible if boring addition. It would also be, as far as I know, false. While I think I’ve got the proportion of female smokers in Australia in that age group right to within about 5% (it got a lot of press coverage about five years back, because other demographics aren’t smoking in anything like those numbers), I’ve got no reason to believe that Canberra deviates from the norm in any way.

What harm would this do? Well, it’s possible a bunch of kids would copy it into their school assignments on Canberra (you poor little things, Canberra?) and get caught cheating, less because the fact is wrong and more because no kid puts that kind of fact in an assignment that they aren’t copying wholesale. University students doing assignments on health statistics might get done in by Google, although who knows, if they actually cite it they might be in the clear.

So that’s fairly minor harm. The potential major harm is in reducing the reliability of Wikipedia as a source to the extent that all that work is wasted, or that people write off collaborative non-fiction of this kind as impossible. I contribute to that harm by a very small amount in this particular case, but would contribute quite a large amount if I had an axe to grind against Wikipedia and decided to be smart about hiding my identity and inserting 1000 of the things into different articles. With 20 friends to help, I could do a lot of damage.

Internet software has a particularly bad case of a similar problem: there is a large and powerful group of people who are very interested in abusing your software for any number of reasons, ranging from being able to commit fraud using your computer to attacking some IRC server whose admins kicked them off.

Wikipedia has less of a problem because false information in it has less marshallable power: you have to wait until nebulous social factors pick up the information and start wafting it around, rather than being able to tell your virus bots to go out and memeify. Hence attacks on Wikipedia tend to be the insertion of spam links taking advantage of its Google juice (well, I presume it has some; Wikitravel sure does) and, presumably, edit wars between authors, rather than determined attempts to undermine it.

The only real reason to insert subtly false information into Wikipedia is that you like being nasty. Or, to put it a different way, you honestly believe that “insecurities” in social systems are just like insecurities in software systems, and you’re on a crusade to ‘re-write’ society so that the kiddies can’t hack it. Or, to be generous, you want to give Wikipedia a chance to show that it can defend itself, although applying the “what if everyone did that?” test doesn’t make that look so good either. (Societal systems tend to break down once a critical point of disorder is reached, and since the fix for this is hardly trivial, the “doing them a favour by demonstrating flaws” argument doesn’t hold nearly as much water as it does for attacks on software.)

Anyway, given that, I thought I would consider it in the light of other heuristics for asserting facts: print heuristics, to the limited extent that I know them.

Take, for example, my first year university modern history course. As a rule of thumb, you don’t assert facts without justification in a history essay. Almost every declarative sentence you write will be accompanied by footnotes showing where you got your facts and arguments from. There is the occasional judgement call to make, because a sufficiently well-known fact doesn’t need citation. (To give examples on either side of the line: the fact that the assassination of Archduke Franz Ferdinand occurred on 28 June 1914 would not require citation, but casualty figures for the Battle of the Somme would, and an argument that the alliance system in Europe made WWI inevitable would require a small army of superscript numbers.) Given that, though, you exhibit your sources, you check your sources where your argument relies on them (yea, unto the tenth generation) if you don’t want to get caught out, and the worth of your argument rests on the authority of your sources.

That did actually matter in first year by the way: the most common mistake made in WWII essays was sourcing Holocaust information from the web, which apparently — no, this isn’t a story from personal experience — means you run a high risk of relying on Holocaust-denying websites. (The alternative is that those essays were all by young Holocaust deniers, but given the number of people whinging that the course was insufficiently Marxist I think my classmates’ ideologies lay elsewhere.)

Now authority gets murky, and geeks want the numbers here. But secretly, as Martin Pool points out, “humans are cunningly designed to do trust calculations in firmware” (yes, even people who trust their conscious mind more than their firmware). See also Dorothea Salo on heuristics for evaluating source reliability.

Of course, encyclopedias have different standards, because otherwise you’d get a bibliography that stood twice as high as the encyclopedia stack. (Less of a problem on DVD or the web, mind you!) I believe the system is that they rely more directly on authority: rather than sourcing the article’s facts from authorities, you get an authority to write the article. Wikipedia can’t go this way, so it is left with two choices for establishing authority: have a really good reputation and a bunch of caring people, or have more citations.

Citations are my secret shame, by the way: once you get used to following them and discovering interesting background information, you get addicted. I wouldn’t say no to more citations on Wikipedia. (Take Franz Ferdinand, for example, since we all knew I had to look up that date in a couple of places: “[n]o evidence has been found to support suggestions that his low-security visit to Sarajevo was arranged […] with the intention of exposing him to the risk of assassination”. Huh? Well, it would be interesting to read about who made the allegations and about the hunt for evidence, yes?)

Compare the issue of fact checking in software.

Although it has been argued (not, I think, by Eric S. Raymond, but by people extending his “many eyes” argument) that security problems in Free Software ought to be discovered very quickly because of sharp-eyed people reading the code, bug reports tend to be inspired by the code not doing something you want it to do. (Except in the case of Metacity, which I think would need to display an obscene and anatomically implausible insult to me, whilst running around in the background deleting all my files after emailing them one-by-one to my worst enemy, before I’d believe I’d found a bug in it.)

This is a similar problem to that of finding dull smoking statistics inserted into Wikipedia by attackers or simply by authors who got their facts wrong: the less your bug displays itself to non-hostile people in the course of their usage, the less likely it is to be reported. It’s even worse, in fact, because while I can’t see any real reason for hostile people to search Wikipedia for false facts aside from saying “hah, take that insecure Internet, back to the World Book!”, there are a lot of reasons for hostile people to search for holes in your software.

I think the argument for relying on authority is weaker here too. Being an authority on history involves having mastery of an enormous number of facts and a considerable number of arguments, together with a nose for intellectual trends, some good mates, and a lot of cut-and-thrust. But while code authorities write a lot of code, and understand a larger amount of it, they don’t by and large earn their status in a competitive arena where their code is successful only if other people’s code is wrong, incomplete or doubtful.

This has a considerable advantage in production time, as anyone who’s familiar with existing tools intended to introduce proofs of correctness knows. However, I think it has a minor cost, in that there’s nowhere near the same incentive to engage with and criticise other people’s code, unless you’re a criminal. Of course I know that there are code auditors who aren’t searching for criminal opportunities, but the professional incentive for everyone to do it is lacking.

And in some ways, Wikipedia also lacks this professional incentive. In a volunteer project, rather than a cut-throat professional world where everyone is fighting for tenure, the incentives for fact checking are weaker. It’s essentially reducible to the old problem of getting people to write documentation: you can pay them, you can hurt them (this is how you get people to report bugs), or you can make it sexy. Ouch.