Archive for the ‘Data’ Category
Every four years, I get blown away by the dedication of people who have spent the previous four years focussed on their Olympic Challenge (I find it hard to focus for more than an hour or two on any one thing!)
Anyway, I was intrigued to see this post on Google Maps Mania yesterday – Olympic Heat Maps – that displayed the Olympics medal table in the form of a heat map, along with several variants (medal tallies normalised against population, or GDP, for example).
The maps were neat, but static – they’d been derived by cutting and pasting a snapshot of a medals table into a Google spreadsheet, and then creating a Heat Map widget using the data…
So I had a look round for a ‘live’ data source for the medals table, didn’t find anything obvious, so looked for a widget that might be pulling on a hidden data source somewhere… Whereupon I found a reference to a WordPress Olympic Medal Tally widget…
A quick peek at the code shows the widget pulling on a data feed from the 08:08:08 Olympics blog, so I ‘borrowed’ the feed and some of the widget code to produce a simple HTML table containing the ISO country codes that the Google Heat Map widget requires, linked to it from a Google Spreadsheet (Google Spreadsheets Lets You Import Online Data) and created a live Olympic medal table map (top 10).
If you want to use the heat map as an iGoogle widget, here it is: Olympic Medal Table Map Widget.
Some time ago now, I wrote a post about progressive enhancement of HTML web pages, including some examples of how HTML data tables could be enhanced to provide graphical views of the data contained in them.
I’m not sure if anyone is actively maintaining progressive enhancement browser extensions (I haven’t checked) but here are a couple more possible enhancements released as part of the Google visualisation API, as described in Table Formatters make Visualization tables even nicer:
A couple of other options allow you to colour a table cell according to value (an implementation of the ‘format cell on value’ function you find in many spreadsheets), and a formatter that will “format numeric columns by defining the number of decimal digits, how negative values are displayed and more”, such as adding a prefix or suffix to each number.
Also in my feed reader recently was this post on Search Engines Extracting Table Data on the Web, which asks:
what if Google focused upon taking information from tables that contain meaningful data (as opposed to tables that might be used on a web page to control the formatting of part or all of a page)?
What if it took all those data filled tables, and created a separate database just for them, and tried to understand which of those tables might be related to each other? What if it then allowed for people to search through that data, or combine the data in those tables with other data that those people own, or that they found elsewhere on the Web?
and then links to a couple of recent papers on the topic.
Alternatively, could we map the data contained in a “regular” Google or Yahoo UI table to a Google spreadsheets like format – in which case, we might be able to use the Google Visualisation API Query Language? (I’m not sure if the Query Language can be applied directly to Google datatable objects?)
It’s not too hard then to imagine a browser extension that can be used to overlay a SQL-like query engine on top of pages containing Yahoo or Google datatables, essentially turning the page into a queryable database? Maybe even Ubiquity could be used to support this?
Early last week, Google announced a Google Flu trends service, that leverages the huge number of searches on Google to provide a near real-time indicator of ‘flu outbreaks in the US. Official reports from medical centres and doctors can lag actual outbreaks by up to a couple of weeks, but by correlating search trend data with real medical data, the Google folks were able to show that their data led the the official reports.
John Naughton picked up on this service in his Networker Observer column this week, and responded to an email follow-up comment I sent him idly wondering what search terms might be indicators of recession in this post on Google as a predictor. “Jobseeker’s allowance” appears to be on the rise, unfortunately (as does “redundancy”).
For some time, I’ve been convinced that spotting clusters of related search terms, or meaningful correlations between clusters of search terms, is going to be big the next step towards, err, something(?!), and Google Flu trends is one of the first public appearances of this outside the search, search marketing and ad sales area.
Which is why, on the playful side, I tried to pitch something like Trendspotting to the Games With a Purpose (GWAP) folks (so far unreplied to!), the idea being that players would have to try to identify search terms who’s trends were correlated in some “folk reasonable” way. Search terms like “flowers” and “valentine”, for example, which appear to be correlated according to the Google Trends service:
Just out of interest, can you guess what causes the second peak? Here’s one way of finding out – take a look at those search terms on the Google Insights for Search service (like Google Trends on steroids!):
Then narrow down the date over which we’re looking at the trend:
By inspection, it looks like the peak hits around May, so narrow the trend display to that period:
If you now scroll down the Google Insights for Search page, you can see what terms were “breaking out” (i.e. being searched for in volumes way out of the the norm) over that period:
So it looks like a Mother’s Day holiday? If you want to check, the Mother’s Day breakout (and ranking in the top searches list) is even more evident if you narrow down the date range even further.
Just by the by, what else can we find out? That the “Mother’s Day” holiday at the start of May is not internationally recognised, maybe?
There are several other places that are starting to collect trend data – not just search trend data – from arbitrary sources, such as Microsoft Research’s DataDepot (which I briefly described in Chasing Data – Are You Datablogging Yet?) and Trendrr.
The Microsoft service allegedly allows you to tweet data in, and the Trendrr service has a RESTful API for getting data in.
Although I’ve not seen it working yet (?!), the DataDepot looks like it tries to find correlations between data sets:
Next stop convolution of data, maybe?
So whither the future? In an explanatory blog post on the flu trends service – How we help track flu trends – the Googlers let slip that “[t]his is just the first launch in what we hope will be several public service applications of Google Trends in the future.”
It’ll be interesting to see what exactly those are going to be?
PS I’m so glad I did electronics as an undergrad degree. Discrete maths and graph theory drove web 2.0 social networking theory algorithms, and signal processing – not RDF – will drive web 3.0…
What do you do when you scrape a data set, geocode it so you can plot it on a map, and find that the geocoding isn’t quite as good as you’d hoped?
I’d promised myself that I wasn’t going to keep on posting “yet another way of scraping data into Google spreadsheets then geocoding it with a Yahoo pipe” posts along the lines of Data Scraping Wikipedia with Google Spreadsheets, but a post on Google Maps mania – Water Quality Google Map – sent me off on a train of thought that has sort of paid dividends…
So first up, the post got me thinking about whether there are maps of Blue Flag beaches in the UK, and where I could find them. A link on the UK page of blueflag.org lists them: UK Blue Flag beaches, (but there is a key in the URL, so I’m not sure how persistent that URL is).
Pull it into a Google spreadsheet using:
Publish the CSV:
Geocode the beaches using a Yahoo pipe – rather than using the Pipe location API, I’m making a call to the Yahoo GeoPlanet/Where API – I’ll post about that another day…
Grab the KML from the pipe:
Now looking at the map, it looks like some of the markers may be mislocated – like the ones that appear in the middle of the country, hundreds of miles from the coast. So what it might be handy to do is use the scraped data as a buggy, downloaded data set that needs cleaning. (This means that we are not going to treat the data as “live” data any more.)
And here’s where the next step comes in… Google MyMaps lets you seed a map by importing a KML file:
The import can be from a desktop file, or a URL:
Import the KML from the Yahoo pipe, and we now have the data set in the Google MyMap.
So the data set in the map is now decoupled from the pipe, the spreadsheet and the original Blue Flag website. It exists as a geo data set within Google MyMaps. Which means that I can edit the markers, and relocate the ones that are in the wrong place:
And before the post-hegenomic tirade comes in (;-), here’s an attempt at capturing the source of the data on the Google MyMap.
So, to sum up – Google MyMaps can be used to import an approximately geo-coded data set and used to tidy it up and republish it.
PS dont forget you can also use Google Maps (i.e. MyMaps) for geoblogging
A quickie post, this one, to complement a post from a long time ago where I plotted out – as a network – the links between people who served on the same committee on the Isle of Wight Council (Visualising CoAuthors in Open Repository Online Papers, Part 3, half way through the post).
In this case, I trawled the Isle of Wight Council committees to populate the rows of a spreadsheet with column headings “Committee Name” and “Councillor”.
Pasting the results into Many Eyes gives an IW Council membership dataset that can be easily visualised. So for example, here’s a glanceable treemap showing the membership of each committee:
The search tool adds yet another dimension to the visualisation, in this case allowing us to pick out the various committees the searched for named individual sits on.
It strikes me that if the search tool supported Boolean expressions, such as AND and OR (maybe with each term being realised by a different colour bounding box?), it would be possible to explore the variation – or similarity – in make-up of different committees? On the first tree map, this approach would make it obvious which committees the same groups of people were sitting on?
And why would we want to do this? To identify potential clashes of interest, maybe, or a lack of variation in the composition of different committees that might, ideally, be independent of each other?
PS Hmm, I suppose you could use a similar visualisation to look at the distribution of named directors across FTSE 100 companies and their subsidiaries, suppliers and competitors, for example? ;-) Does anyone have simple lists of such information in a spreadsheet anywhere?;-)
Over dinner with friends a couple of nights ago, I was asked how I typically approach problem solving tasks. Thinking about it, it’s a bottom-up AND top-down approach where I attack both ends of the problem (the “what I’ve got now” end and the “ultimate vision”) at the same time, in the hope that the tiny steps taken from each end meet up somewhere in the middle…
So for example, in the dev8D Dragon’s Den I mentioned the desire to put together a thematic choropleth map depicting the funding that’s going into different UK Government office regions as a result of JISC or EPSRC project awards. Here’s how I’ve started to work out how to do that…
(What follows gets a little involved at times, so the main trick to look out for is how to create a single data table by mashing together data from three separate data tables.)
At one end, is the output surface. A quick scout around turned up no flash components or KML overlays I could use on Google maps or ThematicMapping (ffs why can’t National Statistics make some free warez available???) so I opted for the amMap interactive map instead.
To plot the map, I need to be able to sum the value of project grants over lead HEIs within particular GORs (got that?;-) So where’s the data?
All over the place, that’s where…
- EPSRC Support By Organisation shows the total amount of current project funding awarded to each HEI by EPSRC;
Hmm, no GOR, no geolocation data… Which means I need a mapping from HEI to GOR…
- …but the closest I can find is a listing of the postcodes of each HEI: HERO screenscraper, and even that’s a scrape of another service…
- and finally, here’s a mapping from postcode areas to GORs: postcode area lookup table.
There’s a warning though: please note “regions” were recorded for my own visual aid and are NOT an attempt to tie in with current UK Administrative Regions.. Hmm – okay – add that one to the caveats/risk assessment. If the maps turns out very wrong, that’s EPSRC’s problem, right, for not making the data available in a clean way?!;-)
Okay, so those are the data sources: one contains HEI names and project funding data, one contains HEI names, location data (well, postcodes) and homepage URIs, and one contains mappings from postcode towns to UK regions (which loosely relate, possibly, to GORs).
Now at this point point I’ve already decided that I want to try use Dabble DB to somehow conflate the data from these three separate sources (though I’m not totally sure how… it’s just something I seem to remember from somewhere and somewhen a long time ago that Dabble DB supports if there are common fields – and matching strings – across different data tables).
Getting the data into Dabble DB is a copy and paste operation, but I’m going to take an intermediate step, highlighting and copy the tables from the separate web pages and pasting them into a Google spreadsheet. Why? Because I already know that this works and it’ll also let me cast an eye over the data to make sure it looks about right.
Looking at the HEI names from EPSRC and the HERO screenscrape, they don’t really match though, which means that Dabble DB won’t be able to use HEI names to idenify common rows in the HE location and EPSRC project tables. However, the HERO screenscrape page does have the HEI homepage URI, and a look beneath the “Go to Site” link on the EPSRC page shows that those links point to the HEI homepage…
…which means I should be able to link items in the EPSRC projects listing to items in the HEI location table by virtue of common homepage URIs.
and the URIs are exposed, so I can copy and paste the table and drop it into a spreadsheet, with the HERO data and postcode/region data in separate sheets.
A quick look over the URIs from both sources in the spreadsheets shows minor differences though – some URIs end with a “/” and others don’t (there are also a few broken scrapes that I tidy by hand); now if Dabble DB uses strict string matching to relate data in one table to data in another table (which I’d guess is likely) then missed matches will presumably occur?
So just to be safe, we need a data cleaning stage. To do this, I copy the data from the URI column in each spreadsheet, drop it into my TextWrangler text editor, and just clean up all the URIs so they end with a trailing / by searching for \.uk$ and replacing it with .uk/
Then I copy the URIs from the text editor and past them back into the appropriate column in the appropriate spreadsheet.
Looking at the postcode/GOR table, I need to get one or two letter postal town identifiers from the HEI postcodes, so to do this I copy the postcode column from the spreadsheet, and paste it into my text editor. This time I do a regular expression powered search and replace using this regexp: ([A-Z]+).* and replacing with \1
So now I have three spreadsheets on Google docs, which I can scan by eye to make sure they look okay, then easily copy and paste into separate tables (known as separate categories) in the same Dabble DB project, like this:
- the EPSRC data:
- HERO screenscrape data:
- and the postcode/region mapping data:
Now for the fun part; each of the above tables is a separate category, with separate column fields, in a Dabble DB project. It is possible to link a column with a similar column in another category, and consequently “pair” similar items in different tables. (So a column containing a particular URI, for example, in a row in one table/category can be related to a particular row in a particular column in another category/table, if the corresponding cell there contains the same URI (Dabble DB handles the actual pairings, you just have to link the columns).
So playing blind, I linked the URI column in the EPSRC category with a new category, which I called Meta:
This created a new table/category – Meta – with a couple of columns: a ‘Name” column, containing the URIs, and a column that linked back to corresponding entries in the EPSRC project category.
And then I did the same linking for the URI column in the HEI Location table/category, which automatically added another column in the Meta table that linked across to rows in the corresponding HEI Location table:
In the Meta category view, I can now add additional columns that are derived from columns in the other, linked tables. So for example, I can add a derived column corresponding to the value of project grants that is pulled in from the linked EPSRC projects column:
So my Meta table/category now looks like this:
Which is pretty clever I think..? ;-)
But then it gets more so… Suppose I link the Postcode town column from the HEI location table with the Postcode/Regional mapping table:
If you’ve been keeping up, you might now expect the UK HEI to be linked to from the Postcode/Region table, which it is:
But the link is symmetrical… and if one category is linked to a second category that is in turn linked to a third category, the columns from the first category can be used as derived columns in the second and the third category…
…which means in the Meta category, I can pull in columns derived from the Postcode/Region category via the HEI location category, first by grabbing the postcode town column into Meta:
To give this:
Then pull in a further derived field from the postcode town column from the Postcode/Region category:
And so now we have a rather more complete Meta category view containing linked items from all three tables (one of which is actually linked indirectly via one of the others):
Clever, eh??? So now I know how to annotate data in one table using data from another table if the two tables each have a column that contains similar data :-)
Okay, so now I have a table that contains rows that contain both project funds and UK regions info – so now I’m in a position to calculate the total amount of funds flowing into each region and then plot them on the thematic map…
…but this post is already way too long, so that’ll have to be for another day…
(Plus I’m not totally sure how to do it yet… and Mission Impossible is just starting (this is a scheduled post…;-)
Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).
The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).
Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.
Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.
As regular OUseful.info readers will know, one of the export formats from Google spreadsheets is CSV – Comma Separated Variable data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a qiuck demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.
So to start, let’s look at the the RAE2008 results data – University research department rankings (you can find the actual data here: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ.
If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv
Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as the the end of the file:
(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).
Grabbing the actualt data, without the metadata, can be achieve by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv&range=A2:H2365.
If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:
We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008 (the full URL of the wikified data page is http://manyeyes.alphaworks.ibm.com/wikified/ousefulTestboard/GuardianUKRAERankings2008), create a new page and put a visualisation placeholder or two in it:
Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visulisation name?):
Select the settings you want for you visualisation, and hit save:
A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:
If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…