Monday, December 20, 2010

"Culturomics"

Various research disciplines have been trending towards 'big data' in recent years. This has, of course, been most evident in the sciences, but I've enjoyed reading a recent publication by a team at Harvard including, amongst others, the psycholinguist Steven Pinker: someone has finally gone wild with Google Books. A summary of their findings can be found here.

The project has taken about 5 million of the digitised volumes and reduced their text to UTF-8 strings from which 'n-grams' are extracted - a 1-gram is an unbroken run of characters (usually a word), and an n-gram is a sequence of up to five of them in this instance. This mechanism actually came about to avoid copyright concerns - no one can actually 'read' the books; they're dealing with data, not information per se - and the simplicity of the resulting database made me wonder about the level of metadata in this mix.
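
A minimal sketch of what that extraction might look like (my own illustration, not the team's actual pipeline): split a UTF-8 text into 1-grams on whitespace, then slide a window over the tokens to produce everything up to 5-grams.

    from collections import Counter

    def extract_ngrams(text, max_n=5):
        """Count every n-gram (n = 1..max_n) in a UTF-8 text.
        A 1-gram here is simply a whitespace-delimited token."""
        tokens = text.split()
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        return counts

    # e.g. extract_ngrams("the United States of America") includes the
    # 1-gram "the" and the 5-gram "the United States of America"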

One of the key questions we've been tasked with evaluating at the end of our first semester of digital asset management is whether, broadly speaking, detailed metadata is really necessary, or if fully-searchable text will do - essentially whether Google fully takes care of our searching needs, making all other methods redundant. This 'culturomics' project raises two issues: the possible advent of a new phase in the digital humanities, and a great case study for a Google research model.

Staying with the metadata question for the time being, the apparent simplicity of their n-gram UTF-8 stream is necessarily backed up by metadata - quite a lot of it, if you look at everything originally harvested by the Google Books project that ultimately forms the basis for this work. The reason that fully-searchable text can never work by itself is that data requires context to become information, and that context can only be provided by metadata. The n-gram data derived from Google Books is meaningless without an accurate date and place of publication for each volume.
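
To make that concrete, imagine the n-gram records as something like the hypothetical rows below (the values and layout are invented for illustration): without the publication year attached to each count, there is no way to turn raw frequencies into the trends-over-time that the whole project is built on.

    from collections import defaultdict

    # Hypothetical records: (ngram, publication_year, match_count).
    records = [
        ("radio", 1890, 12),
        ("radio", 1925, 48000),
        ("radio", 1990, 31000),
    ]

    def frequency_by_year(records, ngram):
        """Aggregate match counts per publication year for one n-gram."""
        series = defaultdict(int)
        for gram, year, count in records:
            if gram == ngram:
                series[year] += count
        return dict(sorted(series.items()))

    print(frequency_by_year(records, "radio"))
    # {1890: 12, 1925: 48000, 1990: 31000}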

The Harvard team selected the 5 million books with the highest quality metadata (and OCR), then carefully filtered them through various algorithms for enhanced accuracy (even then, they were left with a few percentage points of error). It would seem that the larger the data set, the more straightforward your metadata needs to be, so my current perspective is that the Google (re)search model, like culturomics, is complementary to the type of traditional, focused research that requires detailed metadata. Sometimes you just need to drill down within a specific collection or domain with more expert metadata than Google can provide.
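
I don't know the team's actual selection criteria beyond 'best metadata and OCR', but the shape of that step is easy to imagine - something like a threshold filter on OCR confidence plus the presence of the few metadata fields the analysis actually depends on (the field names below are my own, purely illustrative):

    def select_volumes(volumes, min_ocr_confidence=0.9):
        """Keep only volumes whose OCR quality clears a threshold and whose
        core metadata (date and place of publication) is present."""
        selected = []
        for vol in volumes:
            if (vol.get("ocr_confidence", 0) >= min_ocr_confidence
                    and vol.get("pub_year") is not None
                    and vol.get("pub_place")):
                selected.append(vol)
        return selected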

As for the question of a shift in research patterns within the humanities, Jon Orwant, director of digital humanities initiatives at Google, has described the culturomics method as something that can "complement" traditional research models, rather than being an end in itself. This is worth bearing in mind when some have labelled the technique as crude - it seems like a useful tool, with a lot of potential, but not a replacement for traditional research models. Processing this sort of data in isolation has the potential for all kinds of problems, and whether this new work will spark a field of stand-alone "culturomics" is another question entirely.

Thursday, December 16, 2010

The future of digital memory

Here's a sobering bellwether for the state of our immersion in the digital realm, recently pointed out to me by a colleague who's been engaged with the issue for some time: what do we want to happen to our digital legacy after we die? Almost all of us have one, given the extent to which we rely, in one way or another, on digital and online content, so should we care about providing our nearest and dearest with what amounts to a 'digital will'? For some of the questions this raises, you can see Dave Thompson's fascinating FAQ on digital wills here. Undoubtedly, this generation represents a watershed for future memory and history. Given the pace of things, we may well be around long enough to experience its impact.

A recent article by the BBC alludes to the fact that many of us probably wouldn't want our digital lives to be shared out - we might well be more inclined to erase our data than to leave a digital will. Despite the very real consequences of an individual's online actions (think about legal action relating to Twitter feeds, amongst other things), the medium retains the suggestion of anonymity, informality and, consequently, greater personal freedom, but is it something that we would wish to represent our memory?

Of course, in the UK, society itself is online (e-government, e-science, e-learning &c.), so where does this leave the future of our collective memory and the history of our times? The question is so large that I'm going to keep it rhetorical, but I'm particularly interested in how this also feeds into issues surrounding digital preservation in the cultural and public sectors. Some of these have been raised in a recent Discover Magazine blog article discussing a publication on the long-term storage of e-service records in instances where there is a legal obligation to retain the information for a very long time (100 years, in the case of some e-government data). It presents the thesis that there is too much data, and most of it is not safe in the long term - so why not use analogue storage methods in conjunction with digital media and get the best of both worlds?

Perhaps the solution in this context is to save digital data onto lots of microfilm, as the authors suggest, but it highlights the fact that the digital drive has never been coupled with sustainability. In memory institutions like libraries and museums, the move towards digital is extremely expensive and represents a major investment, despite the risks. Digital content is nonetheless one of the best levers available to cultural heritage institutions at present: it prioritises user access, which translates into institutional relevance and compatibility with current information demands. Sustainability is really a bonus that everyone wants to achieve, planning as best they can, but long-term storage is a mighty complex problem and a conversation that will endure.

Monday, December 13, 2010

Metadata vs. ontology

It's been rather a long hiatus in blogging terms, and an even longer one from the issues that I professed would form the core of this blog. So I'm returning again to DAM, and I wanted to come back in with a topic that's been cropping up again and again throughout my course: information integration and interoperability. This becomes an issue when you want to be able to search across domains (or even across institutions within the same domain), since each has its own metadata structures and terminological systems.

Why is this important? Essentially, research patterns these days are increasingly digital, remote (i.e. web-based) and cross-disciplinary. Online content is predominantly public-facing and we find now that 'engagement' and 'discovery' are as important as traditional focused research goals when it comes to making information accessible. It's important that information can be found via different locations and pathways, so it needs to be linked. If anyone's used WorldCat, then they've experienced some of the potential of this simple idea.

So how to do it? Something like WorldCat is pretty straightforward - it's lots and lots of metadata, harvested and converted to WorldCat (MARC) format. Metadata for libraries has a long history (MARC goes back to the 1960s) and works well for describing the traditional book format, but it does have difficulty accommodating new media and other types of objects. The metadata solution to this problem has been Dublin Core, a 'lowest common denominator' set of terms that can theoretically describe anything in only 15 base text fields, prioritising interoperability above domain-specific detail.
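
For a sense of just how minimal that is, here's a sketch of a Dublin Core-style record as a plain mapping of the fifteen elements (the values are invented for illustration; a real record would normally be serialised as XML or RDF):

    # The fifteen Dublin Core elements, with invented values.
    dublin_core_record = {
        "title":       "A Child's History of England",
        "creator":     "Dickens, Charles",
        "subject":     "Great Britain -- History",
        "description": "Popular history written for children.",
        "publisher":   "Bradbury & Evans",
        "contributor": "",
        "date":        "1853",
        "type":        "Text",
        "format":      "print",
        "identifier":  "example-id-0001",
        "source":      "",
        "language":    "en",
        "relation":    "",
        "coverage":    "England",
        "rights":      "Public domain",
    }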

The problems with this solution have been several. Fundamentally, trading away richer, domain-specific description for interoperability isn't always acceptable to specific communities. As such, efforts have been made to add additional fields to Dublin Core for greater detail, but this just returns us to the problem of interoperability - it's another data set that doesn't work outside of your domain. A metadata system, which is based on terminology, also doesn't allow for interrelations between objects beyond the applied terms they have in common.

Enter, then, the ontology. Briefly, for those who might not know what an ontology is, or perhaps think it sounds vaguely Kantian (as I did until fairly recently), it can probably most easily be described as a formal logic (artificial intelligence, if you like) for a computer system, which can only 'know' what you tell it to know and how to know it.

Unlike metadata, which uses simplified terminological structures written by humans for human consumption, an ontology provides a formal system for data integration that can be far more complex. At its best, an ontology can recover the context and concepts behind the simplifications of terminological systems, focusing instead on an object itself and how it relates to other objects. Its further advantage is that, by representing objects through formal relationships, it frees them from the constraints of domain-specific metadata, allowing a top-level ontology to search across multiple domains.
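
To make that a little more concrete, here is a toy sketch of my own (not any particular standard such as CIDOC CRM or OWL): flat metadata tags objects with shared terms, while an ontology states typed relationships between objects that a query can then traverse.

    # Flat metadata: objects share terms, but nothing relates them to each other.
    metadata = {
        "painting_42":  {"subject": ["London", "bridges"]},
        "photograph_7": {"subject": ["London", "bridges"]},
    }

    # Ontology-style statements: (subject, relationship, object) triples.
    triples = [
        ("painting_42",  "depicts",    "Westminster_Bridge"),
        ("photograph_7", "depicts",    "Westminster_Bridge"),
        ("Westminster_Bridge", "located_in", "London"),
        ("painting_42",  "created_by", "unknown_artist"),
    ]

    def related_to(entity, triples):
        """Find everything linked to an entity by a typed relationship,
        in either direction."""
        direct = {(p, o) for s, p, o in triples if s == entity}
        reverse = {(p, s) for s, p, o in triples if o == entity}
        return direct | reverse

    # e.g. related_to("Westminster_Bridge", triples) connects the painting,
    # the photograph and London through explicit, typed relationships.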

It would seem that such a system might better serve the 'engagement' and 'discovery' side of a user's needs, never mind providing a powerful research tool. But it's not as simple as my title makes out. Can a digital object truly 'exist' without metadata? It couldn't be found without it, so metadata remains the prime building block in this search for interoperability and information integration - everything else is built on top of that. A core metadata structure may not be the solution to these challenges on its own, but it needs to retain a strong presence. Add to that the reluctance of many in the cultural sector to commit the time and effort required to master the ways of ontologies, and it would seem that the jury is still out.