Idle Thoughts – A Few More Approaches to Making CSV Files Queryable

How much more useful would CSV data be if it was queryable?

In a previous post (Using CSV Docs As a Database) I described a recipe for importing a CSV file into a Google spreadsheet so that the data it contained could be queried using the Google visualisation API. Whilst there isn’t a lot of technical knowledge required to republish the data in this way, there is still the overhead of requiring the user to log in to a Google account, create a spreadsheet, import the CSV document and then discover the spreadsheet URL. And bearing in mind the rule of thumb that two clicks is at least one click too many, this route to making documents republishable is likely to be seen as too complicated for many people.

So here are a few partial ideas about other ways in which we might be able to (re)publish CSV documents so that they become queryable. (That is, lightweight ways of providing a query interface to a CSV doc.)

Firstly, I wonder whether or not it would be possible to tweak the example Google Chart tools data source implementation example so that a user could just upload or link an already online CSV document as an external data source? The example provides a tutorial for how to use a CSV document as an external datasource by wrapping it with a (provided) Java library that implements the Google data source API; so I wonder: could this example be tweaked so that any CSV files uploaded to or placed in a specified directory/folder could be selected as the external datasource, meaning that a council officer could expose a queryable interface to a CSV document simply by pasting a copy of the CSV document into a particular folder? Or maybe modifying the example to become a service such that if it it was passed a link to an online CSV doc, it would allow that document to be treated as the external datasource and provide the query interface to it?

Secondly, it seems to me that YQL also offers a query interface to arbitrary online CSV documents? The post Analyzing World Cup Data with YQL on the Yahoo Developer blog gives an example of how to use YQL to write queries over a CSV document (although rather perversely they use a Google spreadsheet as the CSV source!) but we can presumably do a similar thing with data listed on data.gov.uk for example. Trying out the:
select * from csv where url=’http://example.com/whatever.csv’
approach on several URLs through an error (“Unable to parse data using default charset utf-8”), though apparently it is possible to force the select … from csv handler to accept a specified charset; (there are two problems that then arise for me at least: i) I’m not sure how to specify the charset; ii) I don’t know how to detect the charset of the files that are apparently not being recognised by YQL as utf-8. It does make me think, though, that guidance about setting charsets (maybe as the server specified MIME-TYPE for CSV files? Or am I talking b****ks? If not, maybe this Google apps script URLFetch function might help?!) may be required as part of the gotcha guidelines for publishing online open data?

Anyway, here’s an example of a CSV file indexed on data.gov.uk that I could query – Bournemouth Libraries:

That is:
select * from csv where url=’http://www.bournemouth.gov.uk/library/data/libraries.csv’ and col1 like ‘%BH8%’

(A list of the other query filters supported within the YQL SELECT … WHERE statement can be found here: Filtering Query Results (WHERE).)

Although I forget how at the moment, I seem to remember that it is possible to create parameterised short URL queries over YQL, so for example I could presumably now create a short URL query along the lines of http://example.com/yql/bournmeouthLibraries?postcode=???? that would take in the first half of a postcode, for example, and then query the Bournemouth libraries CSV document as a database for libraries whose addresses specify that postcode area?

Thirdly. I almost feel obliged to demonstrate a Yahoo Pipes implementation… so without any real explanation, how about the following?

(I suspect that for large CSV files, the only solution that would work would be the Google VIz API external datasource example. I know Yahoo Pipes borks on large HTML files, and I’ve had YQL time out trying to query a large XML file before now…)

[Note that this is a scheduled post and that I am on holiday at the moment – away from the net and without a computer to hand. Which is why I haven’t tried out any of the above, and don’t intend to for at least a week or two…;-)]

PS see also this bit of @codepo8 magic http://isithackday.com/csv-to-webservice/ (reminded of this via @OSandCMS)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering... View all posts by Tony Hirst

Rate this:

Share this:

Related

Author: Tony Hirst

One thought on “Idle Thoughts – A Few More Approaches to Making CSV Files Queryable”