Open government and parsable data formats
I first became aware of these issues via Raymond Yee, who teaches at UC Berkeley and who I’ve worked with a bit, hosting an Open Govt meetup at Freebase’s office, and going over to speak to his class about Freebase. Anyway, Raymond has blogged on several occasions about the lack of clarity in Recovery.gov data format specifications and the difficulty in working with data that is theoretically open but impossible to query effectively.
To my mind, if you can’t readily query against the data, it’s not really open. It’s just standing a little way out of your reach, waving and taunting. The folks at the Open Government Working Group agree. Their Open Data Principles say:
5. Machine processable
Data are reasonably structured to allow automated processing.
They expand a bit on the wiki’s talk page, saying:
P Language Rule
You know you have a truly open format if you can build a parser for it in Perl, Python or PHP in an afternoon. That parser should be able to crawl through the dataset and dump the results into a SQL database. That doesn’t necessarily mean that the data is best handled with an SQL database (although most of this material will fall into that category) – just that it can be easily imported into one.
I’d take it further. If it takes more than an hour using a P-language and standard libraries (XML parser, etc) then the data’s insufficiently open. Ideally it would take around fifteen minutes. If you think that’s too stringent, keep in mind that there’s nothing stopping agencies from providing specialised scripts or libraries themselves, which would bring the time down to “run this script from the command line, giving your MySQL database details as parameters.” Even though my rocket scientist friends assure me that rocket science isn’t as hard as people think, this still isn’t rocket science.
Anyway, all this came to mind when I saw two links from Simon Willison this morning, pointing to posts from the Sunlight Foundation: No PDFs! and Adobe is bad for Open Government.
Simon says:
At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it in to something more useful. Even CSV is significantly more useful for many types of information.
CSV is a great format for open data, especially when that data takes a “rectangular” shape and includes numeric data. It’s easy to understand, easy to parse, and even non-programmers can load it up in Excel or Google Spreadsheets to take a look and make charts. I wish more people provided CSV data. But in the meantime, I’ll add my voice to the “Oh, God, noooo! Anything but PDF!” chorus.
Comments are closed.












CSV is a great format for open data [...]. It’s [...] easy to parse [...].
Um. Only if you restrict yourself to a subset of CSV, especially if you expect anyone to import the stuff into Excel at some point.
Excel is a bit too smart at times, and you’ll get weird things like ZIP codes being recognised as numeric (so leading zeroes disappear), numbers being recognised as dates or vice versa, etc.
Numbers with decimal places might also cause problems for people in locales where the decimal separator is not a dot – for example, in Germany, exporting to “CSV” from Excel will result in a *semicolon*-separated file with decimal *commas*.
Some applications support line breaks inside a field (some only if the field is surrounded by quotation marks), while others mess up completely in such a situation.
Tab-separated is a bit better, since it avoids the problem of the separator being embedded in the contents (most data is less likely to contain a tab than a comma, semicolon, or quotation mark). But even that will fall prey to the heuristics Excel and friends use to determine data type based on content.
Sure, but any kind of CSV format, once inspected, is still easier to write a parser for than PDF. And, arguably, better than XML for most applications. Perl’s Text::CSV module, which comes with the distribution by default, handles all the issues you describe above except those that are Excel’s fault. And quoting fields that aren’t really numeric (eg. zipcodes, phone numbers) will fix most of those.
It’s not that you can’t make some godawful messes with CSV, but they’re *still* more parsable than PDF.
I know we can’t ask old govt data to get future-perfect right away, but for greenfield projects, maybe it makes sense to sell JSON as the simplest option. JSON definitely simplifies the parse-and-go workflow compared to XML, and it isn’t subject to the frame issues of CSV.
Yeah, we mostly use JSON at Freebase.com and although I was skeptical at first (probably a hangover from early Javascript trauma) I am starting to really love working with it. XML does have some benefits though, like being able to specify that certain fields are required, or whatever.
After talking to a number of government agencies at the GovHack event over the weekend, they were quite positive about the idea of producing a tool that could be pointed at a geo data MID/MIF file and validate as “export grade”, making sure it has a default project, appropriate datum, ST_IsValid-passing polygons, and so on and so forth.
I’m hoping to produce it (or something like it) shortly.