Computational linguistics

February 15, 2013

Citation delusions: "The most influential paper Gerard Salton never wrote"

In trying to finalise my PhD revisions, I am giving some background on text categorisation.

Extremely briefly, the problem of text categorisation is this: you have a document and some (usually pre-defined, unless you’re clustering) categories. For example, the categories might be news and editorial. Or academic article, newspaper article and blog entry. The choice of categories is application dependent.

Then you have a document you wish to assign to a category. Is it news, or editorial? The typical way of doing this is to assemble a set of training examples: pre-assigned news and editorial pieces. Then you measure the similarity of your new document to the pre-assigned collections, and whichever category it is most like is your document’s category. You might notice that I have not here defined “measure the similarity” and “most like”: that’s often the research question. How can you represent the collections efficiently so that they can be compared against new documents? What are good measures of similarity?
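As a minimal sketch of the procedure just described, here is a toy categoriser. The similarity measure is a deliberately naive assumption of mine (the fraction of a document's distinct words that also appear in a category's training collection), and the function names and example texts are invented for illustration. Real systems use far better representations and measures.

```python
def overlap_similarity(document, collection):
    """Fraction of the document's distinct words that occur in the collection."""
    doc_words = set(document.lower().split())
    if not doc_words:
        return 0.0
    seen = set()
    for example in collection:
        seen.update(example.lower().split())
    return len(doc_words & seen) / len(doc_words)


def categorise(document, training):
    """Return the category whose training collection the document is most like."""
    return max(training, key=lambda cat: overlap_similarity(document, training[cat]))


# Hypothetical pre-assigned training examples for two categories.
training = {
    "news": [
        "the minister announced a new budget today",
        "police confirmed the road was closed",
    ],
    "editorial": [
        "we believe the budget is a mistake",
        "in our view the minister should resign",
    ],
}

print(categorise("we believe the minister should resign", training))  # editorial
```

Swapping in a different `overlap_similarity` is exactly where the research questions above come in.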

A fairly common way to picture this is (for historical reasons, as we’ll see) as a vector. For each word in the vocabulary (typically the set of terms used across all the documents in the training examples, although sometimes you might try to smooth the morphology out or similar), you construct a numerical representation. Say the vocabulary is no-good, bad, rotten, and a document reads “no-good no-good bad”: you might describe it as a vector (2, 1, 0), showing two uses of the first vocabulary item, one of the second and none of the third. (Again, whether you count vocabulary items, or weight them in various ways, is a research question. You may also notice that this counting-of-occurrences model is a “bag of words” approach, that is, it does not distinguish between “bad rotten” and “rotten bad” even though in language word order and syntactic structure are meaningful. It’s possible to transform the vectors so that this orthogonality of individual words does not hold.)
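The counting described above can be sketched in a few lines of Python. The function name `count_vector` is mine, and splitting on whitespace is an assumption standing in for real tokenisation.

```python
from collections import Counter


def count_vector(document, vocabulary):
    """Count how often each vocabulary item occurs in the document."""
    counts = Counter(document.split())
    # Counter returns 0 for items that never occur.
    return [counts[term] for term in vocabulary]


vocabulary = ["no-good", "bad", "rotten"]

print(count_vector("no-good no-good bad", vocabulary))  # [2, 1, 0]

# Bag of words: word order makes no difference to the vector.
print(count_vector("bad rotten", vocabulary) == count_vector("rotten bad", vocabulary))  # True
```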

For reasons that I won’t go into here, I am trying to discuss this model briefly in my PhD thesis — actually, more briefly than I did above — and therefore looking to cite the originator of the idea. I started coming across citations in other papers that looked something like: “Gerard Salton [and others] (1975). A vector space model for information retrieval.” Sounds good. It’s got the key words in it, and quite a few citations!

I like to sight before citing though, which means I found this interesting paper:

David Dubin (2004). The Most Influential Paper Gerard Salton Never Wrote, Library Trends 52(4):748–764.

Gerard Salton is often credited with developing the vector space model (VSM) for information retrieval (IR). Citations to Salton give the impression that the VSM must have been articulated as an IR model sometime between 1970 and 1975. However, the VSM as it is understood today evolved over a longer time period than is usually acknowledged, and an articulation of the model and its assumptions did not appear in print until several years after those assumptions had been criticized and alternative models proposed. An often cited overview paper titled “A Vector Space Model for Information Retrieval” (alleged to have been published in 1975) does not exist, and citations to it represent a confusion of two 1975 articles, neither of which were overviews of the VSM as a model of information retrieval. Until the late 1970s, Salton did not present vector spaces as models of IR generally but rather as models of specific computations. Citations to the phantom paper reflect an apparently widely held misconception that the operational features and explanatory devices now associated with the VSM must have been introduced at the same time it was first proposed as an IR model.

Naturally such a subtle treatment of the history of the model is not great for my immediate purposes: I need That One Citation! (As best I can tell from Dubin, if I have to pick one it should be G. Salton (1979). Mathematics and information retrieval. Journal of Documentation, 35(1), 1–29.) But it’s fun to come across the analysis of an idea in this form.

Update: if you want a reasonable overview of text classification/topic classification/topic assignment, the survey of choice seems to be Fabrizio Sebastiani (2002). Machine learning in automated text categorization, ACM Computing Surveys, 34(1):1–47. You know, modulo 11 years now.

November 21, 2011 (updated June 11, 2016)

Computational linguists

xkcd suddenly exploded in my circles in 2006, thanks to the comic Randall Munroe calls Computational Linguists and most people refer to as “Fuck Computational Linguistics” getting around at the annual conference of the Association for Computational Linguistics.

There have been requests for the xkcd store to sell it before, but it’s never been done.

I just ordered a batch through Sticker Mule, both of the full comic and of a smaller badge version I did. (They will do proofs of them, I’ll be interested to see if the “Fuck” bugs them.) In order to do so I did a vector version of the comic (via Inkscape’s “trace bitmap”), and because the original comic, and these variants, are under Creative Commons Attribution NonCommercial, I can share them with you here. If you want them, order copies from the sticker vendor of your choice!

Full comic:
Indicative PNG | Compressed Inkscape SVG | PDF (fonts as paths)

Smaller badge-like variant:

Compressed Inkscape SVG | PDF (fonts as paths)

The vector versions aren’t very clean, but neither is the original comic, so I’m hoping these look like the spirit of the original, rather than a nasty hack.

Reminder: these are licensed for free noncommercial use (the precise condition is noncommercial use with attribution to the original author, modifications OK). So don’t sell them!

March 27, 2009 (updated April 11, 2026)

Ada Lovelace Day wrap 2: Karen Spärck Jones elsewhere

Yes, this does mean that a third of these things is coming, but I wanted to point to some other profiles of Karen Spärck Jones, aside from my brief one. At least at the present time, she’s on the first page of most profiled Ada Lovelace Day subjects. I was really pleased to learn more about this inspiring scientist.

Martin Belam has a long profile quoting extensively from Spärck Jones’s interviews and speeches and focussing on her own career progression: she worked with Margaret Masterman at the Cambridge Language Research Unit. “You have no conception of how narrow the career options were [for women],” is one of Belam’s quotes. Another of her stories reminds me of more recent stories Pia Waugh has told me about parental resistance playing a role in girls not choosing computing careers (these days it’s apparently the perceived low earnings and limited career prospects of programmers from the point of view of ambitious parents, so at least something has changed):

We were trying to get at girls in schools [to take up computing] and we knew we had to get to the teachers first. We found that the spread of computing in the administrative and secretarial world has completely devalued it. When one of the teachers suggested to the parents of one girl that perhaps she should go into computing the parents said: ‘Oh we don’t want Samantha just to be a secretary’. That’s nothing to do with nerdiness, but the fact that it’s such a routine thing.

Bill Thompson was a student of Spärck Jones’s, and writes about her influence on him as a fellow philosopher turned computer scientist. He also wrote her obituary for The Times (and, in 2003, that of her husband, fellow computer scientist Roger Needham).

IT journalist Brian Runciman remembers Spärck Jones as the most interesting woman he’s ever interviewed in “Computing’s too important to be left to men”. (“I think it’s very important to get more women into computing. My slogan is: Computing is too important to be left to men” seems to be Spärck Jones’s best known quote.) In the interview with him, she talked about how her ideas permeate modern search engine implementations.

She scored smaller mentions from:

Tom Simonite in New Scientist: Celebrating Ada Lovelace: the ‘world’s first programmer’
Rose Tinted Web: Ada Lovelace Day
Peter Turney (himself a well known computational linguist) lists her among others at Ada Lovelace Day
Mariya Genzel on Twitter

March 24, 2009 (updated April 11, 2026)

Ada Lovelace Day profile: Karen Spärck Jones

Let’s create new role models and make sure that whenever the question “Who are the leading women in tech?” is asked, that we all have a list of candidates on the tips of our tongues… To take part All you need to do is… pick your tech heroine and then publish your blog post any time on Tuesday 24th March 2009. It doesn’t matter how new or old your blog is, what gender you are, what language you blog in, or what you normally blog about – everyone is invited.

This is a profile of a woman in technology for Ada Lovelace Day.

Karen Spärck Jones by Markus Kuhn (modifications by Mary Gardiner) is licensed under a Creative Commons Attribution 2.5 Australia License.
Based on a work at commons.wikimedia.org.

I first heard about Karen Spärck Jones, who was a senior scientist in my field of computational linguistics, in 2007 as part of my paying job, which is as the editorial assistant for Computational Linguistics. Just before she died, Spärck Jones wrote Computational Linguistics: What About the Linguistics? which we published posthumously as the Last Words column for Vol. 33, No. 3. (Spärck Jones was aware both that she was dying and that her column was going to appear under the heading ‘Last Words’.) I was never able to correspond with her directly: she died before we even had the camera ready copies done.

Spärck Jones’s academic career began in 1957, and was funded entirely by grant money until 1994: most academics will recognise this as a hard road, requiring researchers to fund their own positions with grant money awarded in cycles.

Spärck Jones was the originator of the Inverse Document Frequency measure in information retrieval (1972, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, 28:11–21), which is nearly ubiquitously used as part of the measure of the importance of various words contained in documents when searching for information. (The word ‘the’, for example, is very unimportant, as it occurs in essentially all documents, thus having high document frequency and low inverse document frequency.) She had a long history in experimental investigations of human language (most computational linguists are now in this business). She was also at one time president of the Association for Computational Linguistics.
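To make the ‘the’ example concrete, here is a small sketch using one common textbook formulation, log(N / n), where N is the number of documents and n is the number containing the term. This is an assumption on my part and not necessarily the exact form given in the 1972 paper. The function name and toy documents are also invented for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """log(N / n): N documents in total, n of them containing the term."""
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)


documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the retrieval system ranked the documents",
]

# 'the' is in every document, so its IDF is log(3/3) = 0.
print(inverse_document_frequency("the", documents))  # 0.0

# 'retrieval' is in only one document, so it scores higher.
print(inverse_document_frequency("retrieval", documents))
```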

Awards Spärck Jones won in her lifetime include Fellowships of the American and European Artificial Intelligence societies, Fellowship of the British Academy, the ACL Lifetime Achievement Award and the Lovelace Medal of the British Computer Society.

Elsewhere: Spärck Jones’s obituary in Computational Linguistics and Wikipedia.