Mash/Combining Data from Three Separate Sources Using Dabble DB
Over dinner with friends a couple of nights ago, I was asked how I typically approach problem solving tasks. Thinking about it, it’s a bottom-up AND top-down approach where I attack both ends of the problem (the “what I’ve got now” end and the “ultimate vision”) at the same time, in the hope that the tiny steps taken from each end meet up somewhere in the middle…
So for example, in the dev8D Dragon’s Den I mentioned the desire to put together a thematic choropleth map depicting the funding that’s going into different UK Government office regions as a result of JISC or EPSRC project awards. Here’s how I’ve started to work out how to do that…
(What follows gets a little involved at times, so the main trick to look out for is how to create a single data table by mashing together data from three separate data tables.)
At one end, is the output surface. A quick scout around turned up no flash components or KML overlays I could use on Google maps or ThematicMapping (ffs why can’t National Statistics make some free warez available???) so I opted for the amMap interactive map instead.
To plot the map, I need to be able to sum the value of project grants over lead HEIs within particular GORs (got that?;-) So where’s the data?
All over the place, that’s where…
- EPSRC Support By Organisation shows the total amount of current project funding awarded to each HEI by EPSRC;
Hmm, no GOR, no geolocation data… Which means I need a mapping from HEI to GOR…
- …but the closest I can find is a listing of the postcodes of each HEI: HERO screenscraper, and even that’s a scrape of another service…
- and finally, here’s a mapping from postcode areas to GORs: postcode area lookup table.
There’s a warning though: please note “regions” were recorded for my own visual aid and are NOT an attempt to tie in with current UK Administrative Regions.. Hmm – okay – add that one to the caveats/risk assessment. If the maps turns out very wrong, that’s EPSRC’s problem, right, for not making the data available in a clean way?!;-)
Okay, so those are the data sources: one contains HEI names and project funding data, one contains HEI names, location data (well, postcodes) and homepage URIs, and one contains mappings from postcode towns to UK regions (which loosely relate, possibly, to GORs).
Now at this point point I’ve already decided that I want to try use Dabble DB to somehow conflate the data from these three separate sources (though I’m not totally sure how… it’s just something I seem to remember from somewhere and somewhen a long time ago that Dabble DB supports if there are common fields – and matching strings – across different data tables).
Getting the data into Dabble DB is a copy and paste operation, but I’m going to take an intermediate step, highlighting and copy the tables from the separate web pages and pasting them into a Google spreadsheet. Why? Because I already know that this works and it’ll also let me cast an eye over the data to make sure it looks about right.
Looking at the HEI names from EPSRC and the HERO screenscrape, they don’t really match though, which means that Dabble DB won’t be able to use HEI names to idenify common rows in the HE location and EPSRC project tables. However, the HERO screenscrape page does have the HEI homepage URI, and a look beneath the “Go to Site” link on the EPSRC page shows that those links point to the HEI homepage…
…which means I should be able to link items in the EPSRC projects listing to items in the HEI location table by virtue of common homepage URIs.
and the URIs are exposed, so I can copy and paste the table and drop it into a spreadsheet, with the HERO data and postcode/region data in separate sheets.
A quick look over the URIs from both sources in the spreadsheets shows minor differences though – some URIs end with a “/” and others don’t (there are also a few broken scrapes that I tidy by hand); now if Dabble DB uses strict string matching to relate data in one table to data in another table (which I’d guess is likely) then missed matches will presumably occur?
So just to be safe, we need a data cleaning stage. To do this, I copy the data from the URI column in each spreadsheet, drop it into my TextWrangler text editor, and just clean up all the URIs so they end with a trailing / by searching for \.uk$ and replacing it with .uk/
Then I copy the URIs from the text editor and past them back into the appropriate column in the appropriate spreadsheet.
Looking at the postcode/GOR table, I need to get one or two letter postal town identifiers from the HEI postcodes, so to do this I copy the postcode column from the spreadsheet, and paste it into my text editor. This time I do a regular expression powered search and replace using this regexp: ([A-Z]+).* and replacing with \1
So now I have three spreadsheets on Google docs, which I can scan by eye to make sure they look okay, then easily copy and paste into separate tables (known as separate categories) in the same Dabble DB project, like this:
- the EPSRC data:
- HERO screenscrape data:
- and the postcode/region mapping data:
Now for the fun part; each of the above tables is a separate category, with separate column fields, in a Dabble DB project. It is possible to link a column with a similar column in another category, and consequently “pair” similar items in different tables. (So a column containing a particular URI, for example, in a row in one table/category can be related to a particular row in a particular column in another category/table, if the corresponding cell there contains the same URI (Dabble DB handles the actual pairings, you just have to link the columns).
So playing blind, I linked the URI column in the EPSRC category with a new category, which I called Meta:
This created a new table/category – Meta – with a couple of columns: a ‘Name” column, containing the URIs, and a column that linked back to corresponding entries in the EPSRC project category.
And then I did the same linking for the URI column in the HEI Location table/category, which automatically added another column in the Meta table that linked across to rows in the corresponding HEI Location table:
In the Meta category view, I can now add additional columns that are derived from columns in the other, linked tables. So for example, I can add a derived column corresponding to the value of project grants that is pulled in from the linked EPSRC projects column:
So my Meta table/category now looks like this:
Which is pretty clever I think..? ;-)
But then it gets more so… Suppose I link the Postcode town column from the HEI location table with the Postcode/Regional mapping table:
If you’ve been keeping up, you might now expect the UK HEI to be linked to from the Postcode/Region table, which it is:
But the link is symmetrical… and if one category is linked to a second category that is in turn linked to a third category, the columns from the first category can be used as derived columns in the second and the third category…
…which means in the Meta category, I can pull in columns derived from the Postcode/Region category via the HEI location category, first by grabbing the postcode town column into Meta:
To give this:
Then pull in a further derived field from the postcode town column from the Postcode/Region category:
And so now we have a rather more complete Meta category view containing linked items from all three tables (one of which is actually linked indirectly via one of the others):
Clever, eh??? So now I know how to annotate data in one table using data from another table if the two tables each have a column that contains similar data :-)
Okay, so now I have a table that contains rows that contain both project funds and UK regions info – so now I’m in a position to calculate the total amount of funds flowing into each region and then plot them on the thematic map…
…but this post is already way too long, so that’ll have to be for another day…
(Plus I’m not totally sure how to do it yet… and Mission Impossible is just starting (this is a scheduled post…;-)