So it seems no-one really saw why I got so excited by Dabble DB doing the linked data thing with data tables…
…so here’s an example…
First of all, importing some data via copy-and-paste:
…and we commit it:
All so simple, right…
So let’s pull some other data in from somewhere else; as CSV from a Google spreadsheet, perhaps?
(Note that the spreadsheet could have itself imported the data by scraping a table or list from an HTML page, or grabbing it via webservice with a RESTful API.)
So we import it:
…and commit it:
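Dabble DB does the fetch for you, but the import step amounts to pulling a CSV export URL and parsing the rows. Here’s a rough stdlib Python sketch of the idea (the spreadsheet key in the comment is hypothetical; published Google sheets expose a CSV export URL of roughly this shape):

```python
import csv
import io
import urllib.request

def parse_csv(text):
    """Parse CSV text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_csv(url):
    """Fetch a CSV export URL (e.g. a published Google spreadsheet)
    and return its rows as dicts."""
    with urllib.request.urlopen(url) as resp:
        return parse_csv(resp.read().decode("utf-8"))

# Hypothetical example URL for a sheet published to the web:
# rows = fetch_csv("https://spreadsheets.google.com/pub?key=SOMEKEY&output=csv")
```

That’s all the “import via CSV” step really is: a URL that serves plain text with a header row.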
I’m not sure what the caching/refresh policy in Dabble DB is. For example, if the Google spreadsheet data changes, will Dabble keep up with the changes, and how often? (Maybe someone from Dabble DB could post a comment to clarify this?;-)
And finally, we grab data for the third table by screenscraping a table from an HTML page – this page:
Give it the URL:
Select the table:
…and commit it:
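Dabble DB’s table scraper does this part for you; as a stand-in, here’s a minimal Python sketch (stdlib only, no error handling, illustration rather than production code) that pulls the cell text out of the table rows in an HTML page:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the rows of any <table> in an HTML page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None

def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    return parser.rows
```

In practice you’d fetch the page with `urllib.request` first; Dabble DB additionally lets you pick which table on the page you want, which this sketch glosses over.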
So now I have the tables, assembled by different means, that I used in the previous demo.
If I do some table linking in a similar way to the previous demo, I can get a table that lists grants awarded to different HEIs, along with their postcodes. (This doesn’t actually use the HTML table scraped data, but another mashup could…I could have added the Government Office Region(-ish) data to the table, for example.)
So just to be clear, here: this table is made up from columns from two separate tables. The JISC project data comes from one table, the HEI postcode location from another. The HEI homepage URI is common to both original data tables and is used to key the combined table.
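The linking step is essentially an inner join on that shared URI column. A plain-Python sketch of the idea (the field names are made up for illustration; Dabble DB does this through its linked-column UI rather than code):

```python
def link_tables(projects, locations, key="homepage"):
    """Inner-join two lists of row dicts on a shared key column,
    mimicking Dabble DB's linked-table behaviour: rows without a
    match in the other table simply drop out."""
    index = {row[key]: row for row in locations}
    merged = []
    for row in projects:
        loc = index.get(row[key])
        if loc is not None:
            merged.append({**row, **loc})
    return merged

# Hypothetical sample rows keyed on the HEI homepage URI:
projects = [{"homepage": "http://www.open.ac.uk/", "project": "Demo project"}]
locations = [{"homepage": "http://www.open.ac.uk/", "postcode": "MK7 6AA"}]
```

Note that this is where the join errors Gary mentions in the comments creep in: rows fall through when the URIs don’t match exactly, and multiply if the same URI appears twice in the location table.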
And then I can export the data…
…and shove it into a pipe – using CSV, I think?
Then we can filter on just the HEIs that have been awarded grants, and have been geocoded to somewhere in the UK:
And we can get a map:
…and the KML, geo-RSS etc…
… and maybe take the JSON output from the pipe and use it to drive a proportional symbol map, showing the number of projects awarded to each institution, for example…
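Pulling the pipe’s filter-and-geocode stage into one place, here’s a hedged Python sketch of what it’s doing. The geocoder is stubbed out with a hardcoded lookup purely for illustration; the real pipe hands the postcode to Yahoo’s location extractor, and today you’d call some postcode geocoding service instead:

```python
import json

def geocode_postcode(postcode):
    """Stand-in for a real geocoding call. Returns (lat, lon) or None.
    The single hardcoded entry is illustrative only."""
    sample = {"MK7 6AA": (52.02, -0.71)}
    return sample.get(postcode)

def pipe(rows):
    """Keep only rows that have a grant and geocode successfully, and
    emit a JSON feed of mappable items, roughly as the Yahoo pipe does."""
    items = []
    for row in rows:
        latlon = geocode_postcode(row.get("postcode", ""))
        if row.get("project") and latlon:
            items.append({"title": row["hei"], "lat": latlon[0], "lon": latlon[1]})
    return json.dumps({"items": items})
```

The JSON items list is exactly the sort of thing a proportional symbol map can consume: count the items per institution and size the markers accordingly.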
In the same way that Yahoo Pipes lets you do crazy stuff with lists, so Dabble DB lets you get funky with data tables… What’s not to like (except the lack of regular expressions for data cleaning, maybe…?;-)
So there we have it:
- some cut and pasted data in one table (HE location data), and a CSV imported table from a Google spreadsheet (the JISC project allocation data); (the HTML table scraped data is superfluous in this example);
- linked tables in Dabble DB to reconcile the data in the two tables;
- the mashed data table then gets exported from Dabble DB as CSV into a Yahoo pipe;
- the pipe geocodes the postcode location data for each HEI and exports the geo-coded feed as JSON;
Job done, hopefully ;-)
PS I’ve started reflecting a little on how I pull these mashup things together, and I think it’s a bit like chess… I’ve completely assimilated various patterns or configurations of particular data representations and how they can be processed that let me “see several moves ahead”. And in messing around with Dabble DB, it’s like I’ve just learned a new configuration, like a pin or a fork, that I didn’t really appreciate before; but now it’s something I “get”, and something I can look for, something that may be “several moves ahead”, whenever I get an urge to have a tinker… And that’s why I think this post, and the previous one on the topic, are maybe gonna be important if you want to keep up over the coming months…;-) Does that make sense…?
PPS @dfflanders and Ross, if you’re reading… being able to table or list scrape from HTML (so no embedded tables), or grab *simple* XML feeds into Google spreadsheets is one way of making data available. Fixing on some canonical URIs in a standard format for HEI homepages would also be a start… (EPSRC uses different – valid, but maybe deprecated? – URIs for homepages compared to the URIs listed in the scraped HERO database, for example? I’m not sure what sort of HEI homepage URIs JISC uses… the ones that the HEIs use themselves would be a start?)
11 thoughts on “Using Dabble DB in an Online Mashup Context”
Nice demo, really useful explanation for those not used to geohacking.
Do you have any hesitations or concerns about this linked data idea? Is it not just table joins in the SQL world? If so, then I’d expect trouble ahead. Even in simple joins, you get the 2 most basic errors that confuse people … rows that fall through joins or rows that multiply through joins. But in addition, you could have a slew of other problems linking data from different sources with different definitions covering different time periods and so on and so on. The result might look nice, but how meaningful would it be? I don’t know. It’s a cool demo, but it gives me the heebie-jeebies just thinking about how probable it is that gibberish would result except in the most trivial of data situations.
Hey, thanks for the post. We love to see how people are using our work. To answer your question, we don’t keep the data in synch unless you set up a scheduled import for the URL in question.
Here’s a link:
“Do you have any hesitations or concerns about this linked data idea? … it gives me the heebie-jeebies just thinking about how probable it is that gibberish would result except in the most trivial of data situations.”
Yes, I do have concerns, but the phase I’m in is exploring just how, in principle, we can start to wire the stuff together.
I’ve started trying to categorise different sorts of errors and conflicts I’m coming across, as well as unique keys that would, in an ideal world, work properly (ISBNs, postcodes and URIs are a start, but even there there’s a multitude of issues: different ISBNs for different editions of the same work, DOIs that can resolve to multiple locations, multiple URIs for the same resource, and so on).
Differences in definition of data that looks “similar” across different tables is one to be wary of and one that I hope to explore in later posts; I want to do some demos that show where it can go badly wrong, but I’m still assembling the toolbox at the moment, whilst keeping an eye out for potentially horrific combinations (ideas welcome;-)
Data integrity is a major issue too – data pulled from separate independent sources may be collected at different times, may have different data refresh rates, the tools I use may have different caching policies, fail to refresh when I thought they did, and so on.
One of the major things I’m working towards, though, is showing that:
a) this stuff can be done;
b) it’s hard to do reliably;
c) it may be hard to spot errors;
d) NOW do you see why data is important, now do you see why government databases talking to each other might be problematic, now do you see why the likes of Google and Dunnhumby have the potential to “do evil”?
What I’m doing (I think) is starting to break a taboo…which may or may not be a good thing.
And the cavalier attitude I have in these posts towards the way I mix and match data is deliberate… Because are you sure YOUR data isn’t being used that way by other people, knowingly or not?
I love your demos :)
Nice work! The skills to do this are still vanishingly scarce though, but we need more examples of this kind to demonstrate why publishing structured raw data has so much potential.
“Yes, I do have concerns, but the phase I’m in is exploring just how, in principle, we can start to wire the stuff together.”
Thanks, Tony, for providing context for what you’re doing. It’s a very cool thing. Very difficult, but very cool. And with important implications that I can only dimly begin to imagine. Best wishes.
Following up on Gary’s comments and Tony’s reply, I put together some data manually related to this mashup, and offered it to Tony to use. He noted that there was no attribution on the data – where I had got it from etc. I corrected this by putting in a free text attribution for where each bit of data had come from.
It then occurred to me that what I really needed was a way of including a machine readable attribution so that Tony could mash both the data, but carry through the attribution at the same time, so that in the final mash, Tony (and any viewers) could still link the data back to its original source and so decide if this was a valid use of the data.
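One minimal way to carry machine-readable attribution through a mash (sketched in Python; the `provenance` field name is made up for this illustration) is simply to stamp every row with its source before any merging happens, so the note travels with the data:

```python
def with_attribution(rows, source):
    """Attach a machine-readable source note to each row dict so it
    survives later joins and exports. 'provenance' is a hypothetical
    field name, not a standard."""
    return [{**row, "provenance": source} for row in rows]

# Hypothetical usage: tag rows before handing them over for mashing.
tagged = with_attribution(
    [{"hei": "Open University", "postcode": "MK7 6AA"}],
    "hand-compiled from HERO, 2009",
)
```

Any downstream join then carries the `provenance` column along like any other field, so the final table can still point each value back at where it came from.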
Thanks heaps for this post. I was putting a dataset together in Dabble DB and was hoping I could do some filtering through a RESTful interface, but alas this was not the case. Although I’d come across Yahoo Pipes a little while ago, I hadn’t considered using it to “richen up” the Dabble interface to give me what I wanted.
Admittedly I didn’t go with pipes, but rather went with YQL (just personal preference really).
Thanks again for reminding me to mash :)
Comments are closed.