This is a post I made on Growstuff Talk to propose some initial steps towards interoperability for open food projects. If you have comments, probably best to make them on that post.
I wanted to post about some concepts from my past open data work which have been very much in my mind when working on Growstuff, but which I’m not sure I’ve ever expressed in a way that helps everyone understand their importance.
Just for background: from 2007-2011 I worked on Freebase, a massive general-purpose open data repository which was acquired by Google in 2010 and now forms part of their “Knowledge” area. While working at Google I also worked as a liaison between Google search/knowledge and the Wikimedia Foundation, and presented at a Wikimedia data summit where we proposed the first stages of what would become Wikidata — an entity-based data store for all of Wikimedia’s other projects.
Freebase and Wikidata are part of what is broadly known as the Semantic Web, which has to do with providing data and meaning via web technologies, using common data formats etc.
The Semantic Web movement has several different branches, ranging from the extremely abstract and academic, to the quite mundane and pragmatic. Some of the more common bits of Semantic Web technology you might have come across are microformats, for instance, which let you add semantic meaning to your HTML markup, for instance for defining the meanings of links to things like licenses or for marking up recipes on food blogs and the like. There is also Semantic Mediawiki which adds some semantic features on top of a wiki, to allow you to query for information in interesting ways; Practical Plants uses SMW and its search is based on this semantic data.
At the more academic end of the Semantic Web world are things like RDF which creates a directed graph of semantic data which can be queried via a language called SPARQL, and attempts to define data standards and ontologies for a wide range of purposes. These are generally heavyweight and mostly of interest to researchers, academics, etc, though some aspects of this work are starting to seep through into consumer technology.
This is all background, however. What I wanted to talk about was the single most important thing we learned while working on Freebase, which is this:
Entities must have unique identifiers.
Here’s what I mean. Let’s say you know three people all called Mary Smith. Then someone says, “It’s Mary Smith’s birthday today.” Which one are they referring to? You don’t know. In any system based around knowledge, you need to have some kind of unique ID for each entity to avoid ambiguity. So instead you might say, “Mary Smith, whose employee number is E453425” or “Mary Smith, whose email address is mary@example.com”, or “Mary Smith, whose primary key in our database is 789”.
When working on our proposal for phase 1 of Wikidata, one of the things we realised is that the Wikimedia community — all the languages of Wikipedia, the Wikimedia Commons, etc — lacked unique identifiers for real-world entities. For instance, Barack Obama was http://en.wikipedia.org/wiki/Barack_Obama on English Wikipedia and http://de.wikipedia.org/wiki/Barack_Obama on German Wikipedia and http://commons.wikimedia.org/wiki/Barack_Obama on Wikimedia Commons and http://en.wikinews.org/wiki/Category:Barack_Obama on Wikinews, but none of these was his definitive identifier.
Meanwhile, interwiki links — the links between English and German and French and Swahili and Korean wikipedias — were maintained by hand (or, actually, by a bot) that had to update every wikipedia whenever a page was added or changed on any of them. This was a combinatoric exercise: with 2 wikis, there are two links (A -> B and B