Monday, December 20, 2010


Various research disciplines have been trending towards 'big data' in recent years. This has, of course, been most evident in the sciences, but I've enjoyed reading a recent publication by a team at Harvard including, amongst others, the psycholinguist Steven Pinker: someone has finally gone wild with Google Books. A summary of their findings can be found here.

The project has taken about 5 million of the digitised volumes and reduced them to a UTF-8 encoded stream of 'n-grams'. A '1-gram' is a string of characters uninterrupted by a space (roughly, 1 word), and an n-gram is a sequence of up to 5 such 1-grams in this instance. This mechanism actually came about to avoid copyright concerns - no one can actually 'read' the books; they're dealing with data, not information per se - and the simplicity of the resulting database made me wonder about the level of metadata in this mix.
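To make the n-gram idea a little more concrete, here is a minimal sketch of how such a table could be built. It is not the Harvard team's actual pipeline: the helper name `extract_ngrams` and the sample sentence are my own inventions, and I'm assuming that splitting on whitespace is a good enough stand-in for their tokenisation.

```python
from collections import Counter

def extract_ngrams(text, max_n=5):
    """Count every n-gram of 1..max_n words in text (hypothetical helper)."""
    # A 1-gram here is simply a run of characters with no spaces in it.
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = extract_ngrams("the cat sat on the mat")
print(counts["the"])      # a 1-gram that appears twice in the sample
print(counts["the cat"])  # a 2-gram that appears once in the sample
```

Note that the original sentence cannot be read back from the resulting counts once many books are pooled together, which is roughly why this form sidesteps the copyright problem.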

One of the key questions we've been tasked with evaluating at the end of our first semester of digital asset management is whether, broadly speaking, detailed metadata is really necessary, or if fully-searchable text will do - essentially whether Google fully takes care of our searching needs, making all other methods redundant. This 'culturomics' project raises two issues: the possible advent of a new phase in the digital humanities, and a great case study for a Google research model.

Staying with the metadata question for the time being, the apparent simplicity of their n-gram UTF-8 stream is necessarily backed up by metadata, and quite a lot of it if you look at all the metadata originally harvested by the Google Books project that ultimately forms the basis for this work. The reason that fully-searchable text can never work by itself is that data requires context to become information, and that context can only be provided by metadata. In the case of Google Books, the n-gram data derived from it is meaningless without an accurate date and place of publication.
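A toy example shows why the date matters so much. Everything in this sketch is made up for illustration (the three 'books', the `yearly_counts` helper and the field names), but it shows how a word count only becomes a trend once each count can be pinned to a year:

```python
from collections import defaultdict

# Hypothetical records: each book is its text plus whatever metadata we hold.
books = [
    {"year": 1900, "text": "the telegraph is new"},
    {"year": 1950, "text": "the television is new"},
    {"year": None, "text": "the telegraph is old"},  # publication date unknown
]

def yearly_counts(books, word):
    """Count occurrences of word per publication year, setting undated books aside."""
    by_year = defaultdict(int)
    undated = 0
    for book in books:
        hits = book["text"].split().count(word)
        if book["year"] is None:
            undated += hits  # the data exists, but cannot be placed in time
        else:
            by_year[book["year"]] += hits
    return dict(by_year), undated

timeline, undated = yearly_counts(books, "telegraph")
print(timeline)  # counts that can be plotted against time
print(undated)   # counts that are just numbers without context
```

The undated book still contains the word, but without its metadata that occurrence tells us nothing about when the word was in use.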

The Harvard team selected the 5 million books with the highest quality metadata (and OCR), then carefully filtered them through various algorithms for enhanced accuracy (even then, they were left with a few percentage points of error). It would seem that the larger the data set, the more straightforward your metadata needs to be, so my current perspective is that the Google (re)search model, like culturomics, is complementary to the type of traditional focused research that requires detailed metadata. Sometimes you just need to drill down within a specific collection or domain with more expert metadata than Google can provide.

As for the question of a shift in research patterns within the humanities, Jon Orwant, director of digital humanities initiatives at Google, has described the culturomics method as something that can "complement" traditional research models, rather than being an end in itself. This is worth bearing in mind when some have labelled the technique as crude - it seems like a useful tool, with a lot of potential, but not a replacement for traditional research models. Processing this sort of data in isolation has the potential for all kinds of problems, and whether this new work will spark a field of stand-alone "culturomics" is another question entirely.
