I *love* treemaps. If you’re not familiar with them, they provide a very powerful way of visualising categorically organised hierarchical data that bottoms out with a quantitative, numerical dimension in a single view.
For example, consider the total population of students on the degrees offered across UK HE by HESA subject code. As well as the subject level, we might also categorise the data according to the number of students in each year of study (first year, second year, third year).
If we were to tabulate this data, we might have columns: institution, HESA subject code, no. of first year students, no. of second year students, no. of third year students. We could also restructure the table so that the data was presented in the form: institution, HESA subject code, year of study, number of students. And then we could visualise it in a treemap… (which I may do one day… but not now; if you beat me to it, please post a link in the comments;-)
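The restructuring step described above (one column per year of study becoming one row per year of study) can be sketched in a few lines of Python. The column names and the single placeholder row are my own assumptions for illustration; they are not real HESA figures.

```python
from typing import Dict, List

# Assumed column names for the "wide" table; placeholder values only.
YEAR_COLUMNS = ["first year", "second year", "third year"]


def wide_to_long(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Turn one row per (institution, subject) into one row per year of study."""
    long_rows = []
    for row in rows:
        for year in YEAR_COLUMNS:
            long_rows.append({
                "institution": row["institution"],
                "subject": row["subject"],
                "year of study": year,
                "students": row[year],
            })
    return long_rows


wide = [
    {"institution": "Institution A", "subject": "Subject X",
     "first year": 120, "second year": 100, "third year": 90},
]

long_rows = wide_to_long(wide)
for r in long_rows:
    print(r)
```

The long form is the one a treemap tool tends to want: every categorical level gets its own column, and there is a single numerical column at the end.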
Instead, what I will show is how to visualise data from a sports championship, in particular the start of the Formula One 2011 season. This championship has the same entrants in each race, each a member of one of a fixed number of teams. Points are awarded for each race (that is, each round of the championship) and totalled across rounds to give the current standing. As well as the driver championship (based on points won by individual drivers), there is the team championship (where the points contribution from drivers within a team is totalled).
Here’s what the results from the third round (China) look like:
|Paul di Resta||Force India-Mercedes||0|
|Adrian Sutil||Force India-Mercedes||0|
F1 2011 Results – China, © 2011 Formula One World Championship Ltd
We can represent data from across all the races using a table of the form:
Sample of F1 2011 Results 2011, © 2011 Formula One World Championship Ltd
Here’s what it looks like when we view it in a treemap visualisation:
The size of the boxes is proportional to the (summed) values within the hierarchical categories. In the above case, the large blocks are the total points awarded to each driver across teams and races. (The team field might be useful if a driver were to change team during the season.)
I’m not certain, but I think the Many Eyes treemap algorithm populates the map using a sorted list of summed numerical values taken through the hierarchical path from left to right, top to bottom. Which means top left is the category with the largest summed points. If this is the case, in the above example we can directly see that Webber is in fourth place overall in the championship. We can also look within each blocked area for more detail: for example, we can see Hamilton didn’t score as many points in Malaysia as he did in the other two races.
One of the nice features about the Many Eyes treemap is that it allows you to reorder the levels of the hierarchy that is being displayed. So for example, with a simple reordering of the labels we can get a view over the team championship too:
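As a rough sketch of what a treemap tool has to compute, here is the summing-through-the-hierarchy step in Python, with the level order passed in as a parameter so that it can be reordered. This is not the Many Eyes algorithm (I don't know what that is); the largest-first ordering simply mirrors my guess above, and the rows are placeholders rather than actual championship results.

```python
from collections import defaultdict

# Placeholder rows (driver, team, race, points); not actual results.
ROWS = [
    {"driver": "Driver A", "team": "Team 1", "race": "Race 1", "points": 25},
    {"driver": "Driver A", "team": "Team 1", "race": "Race 2", "points": 18},
    {"driver": "Driver B", "team": "Team 1", "race": "Race 1", "points": 10},
    {"driver": "Driver C", "team": "Team 2", "race": "Race 1", "points": 18},
    {"driver": "Driver C", "team": "Team 2", "race": "Race 2", "points": 15},
]


def totals(rows, levels):
    """Sum points for every prefix of the hierarchy path given by `levels`."""
    sums = defaultdict(int)
    for row in rows:
        path = tuple(row[level] for level in levels)
        for depth in range(1, len(path) + 1):
            sums[path[:depth]] += row["points"]
    return dict(sums)


def top_level_order(rows, levels):
    """Top-level categories with their summed values, largest first."""
    sums = totals(rows, levels)
    top = [(path[0], value) for path, value in sums.items() if len(path) == 1]
    return sorted(top, key=lambda item: item[1], reverse=True)


# Driver championship view: driver is the outermost level.
by_driver = top_level_order(ROWS, ["driver", "team", "race"])
# Team championship view: same rows, levels reordered.
by_team = top_level_order(ROWS, ["team", "driver", "race"])

print(by_driver)
print(by_team)
```

The point is that the team view needs no new data, just a different ordering of the same category columns.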
What might be interesting would be to feed Protovis or the JIT with data dynamically from a Google Spreadsheet, for example, so that a single page could be used to display the treemap with the data being maintained in a spreadsheet.
Hmm, I wonder – does Google spreadsheets have a treemap gadget? Ooh – it does: treemap-gviz. It looks as if a bit of wrangling may be required around the data, but if the display works out then just popping the points data into a Google spreadsheet and creating the gadget should give an embeddable treemap display with no code required:-) (It will probably be necessary to format the data hierarchy by hand, though, requiring differently laid out data tables to act as source for individual and team based reports.)
So – how long before we see some “live” treemap displays for F1 results on the F1 blogs then? Or championship tables from other sports? Or is the treemap too confusing as a display for the uninitiated? (I personally don’t think so.. but then, I love macroscopic views over datasets:-)
PS see also More Olympics Medal Table Visualisations which includes a demonstration of a treemap visualisation over Olympic medal standings.
In a couple of posts last year (Hackable SPARQL Queries: Parameter Spotting Tutorial and First Dabblings With Pipelinked Linked Data) I started to explore how it might be possible to use Yahoo Pipes as an environment for sharing – and chaining together (albeit inefficiently) – queries to the data.gov.uk open transport data datastore.
Those posts concentrated on querying the datastore in order to find the location of traffic monitoring points according to various search criteria. In this post, I’ll show you one way of visualising traffic count data from a traffic count point using Many Eyes Wikified.
The first thing we need to do is come up with a query that will pull traffic monitoring data back from the transport datastore. My first port of call for finding a query to get me started is usually to search over the data.gov.uk Google group archive in my mailbox. As ever, @jenit had posted a ‘getting started’ solution:-)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?direction (SUM(?value) AS ?total)
traffic:count ?count .
?count a traffic:Count ;
traffic:category <http://transport.data.gov.uk/0/category/bus> ;
traffic:direction ?direction ;
traffic:hour ?hour ;
rdf:value ?value .
GROUP BY ?direction
I tweaked this a little (using guesswork and pattern matching, rather than understanding, Chinese Room style;-) to come up with a query that appears to pull out traffic count data for different categories of vehicle on a particular day from a particular monitoring point:
SELECT ?vehicle ?direction ?hour (SUM(?value) AS ?total)
traffic:count ?count .
?count a traffic:Count ;
traffic:category ?vehicle ;
traffic:direction ?direction ;
traffic:hour ?hour ;
rdf:value ?value .
REGEX(str(?hour), "^2008-05-02T") )
GROUP BY ?vehicle ?direction ?hour
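A query like this has two values you might want to vary: the monitoring point and the date. Here is a minimal Python sketch of templating those two values. Note the assumptions: the `WHERE { … }` and `FILTER ( … )` wrappers are the usual SPARQL shape rather than something copied from the listing above, `$point` stands for the URI of a traffic count point, and the example URI is a made-up placeholder, not a real datastore identifier.

```python
from datetime import date
from string import Template

# Assumed query shape: the WHERE/FILTER wrappers and the <$point> subject
# are illustrative; only the triple patterns come from the query above.
QUERY = Template("""\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
SELECT ?vehicle ?direction ?hour (SUM(?value) AS ?total)
WHERE {
  <$point> traffic:count ?count .
  ?count a traffic:Count ;
    traffic:category ?vehicle ;
    traffic:direction ?direction ;
    traffic:hour ?hour ;
    rdf:value ?value .
  FILTER ( REGEX(str(?hour), "^${day}T") )
}
GROUP BY ?vehicle ?direction ?hour
""")


def build_query(point_uri: str, day: str) -> str:
    """Fill in the count point URI and an ISO date (YYYY-MM-DD)."""
    date.fromisoformat(day)  # raises ValueError if the date is malformed
    return QUERY.substitute(point=point_uri, day=day)


# Hypothetical count point URI, for illustration only.
query = build_query("http://example.org/count-point/0000", "2008-05-02")
print(query)
```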
I then turned this into a Yahoo Pipes query block, using the date and traffic monitoring point as input parameters:
Here’s how to do that:
Having got the query into the pipes ecosystem, I knew I should be able to get the data out of the pipe as JSON data, or as a CSV feed, which could then be wired into other web pages or web applications. However, to get the CSV output working, it seemed like I needed to force some column headings by defining attributes within each feed item:
To tidy up the output a little bit more, we can sort it according to time and day and traffic count by vehicle type:
It’s then easy enough to grab the CSV output of the pipe (grab the RSS or JSON URI and just change the &_render=rss or &_render=json part of the URI to &_render=csv) and wire it into somewhere else – such as into Many Eyes Wikified:
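If you'd rather not edit the URI by hand, the `_render` swap can be sketched with Python's standard URL tools. The pipe URL below is a made-up placeholder (host, `_id` value and `date` argument are all assumptions); only the `_render` argument comes from the recipe above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_render(url: str, fmt: str) -> str:
    """Return `url` with its _render argument replaced by `fmt`."""
    parts = urlsplit(url)
    # Keep every argument except any existing _render value.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "_render"]
    query.append(("_render", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical pipe URL, for illustration only.
rss_url = "http://pipes.example.com/pipe.run?_id=HYPOTHETICAL&_render=rss&date=2008-05-02"
csv_url = with_render(rss_url, "csv")
print(csv_url)
```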
Doing a couple of quick views over the data in Many Eyes Wikified, it seemed as if there was some duplication of counting, in that the number of motor vehicles appeared to be the sum of a number of more specific vehicle types:
Looking at the car, van, bus, HGV and motorbike figures we get:
So I made the judgement call to remove the possibly duplicate motor-vehicle data from the data feed and reimported the data into Many Eyes Wikified (by adding some nonsense characters, e.g. &bleurghh, to the feed URI so that Many Eyes thought it was a new feed).
It was then easy enough to create some interactive visualisations around the traffic data point. So for example, here we have a bubble chart:
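The check-then-drop step can be sketched outside the pipe too. In this Python sketch the column names, the category labels and the counts are all placeholder assumptions; the idea is just to confirm that the aggregate row matches the sum of the specific vehicle types before throwing it away.

```python
import csv
import io

# Placeholder feed: column names, category labels and counts are assumptions.
FEED = """vehicle,direction,hour,total
car,N,08:00,300
van,N,08:00,40
bus,N,08:00,10
motor-vehicle,N,08:00,350
"""


def split_aggregate(rows, aggregate="motor-vehicle"):
    """Separate the specific vehicle rows from the aggregate rows."""
    detail = [r for r in rows if r["vehicle"] != aggregate]
    aggregates = [r for r in rows if r["vehicle"] == aggregate]
    return detail, aggregates


rows = list(csv.DictReader(io.StringIO(FEED)))
detail, aggregates = split_aggregate(rows)

detail_sum = sum(int(r["total"]) for r in detail)
aggregate_sum = sum(int(r["total"]) for r in aggregates)

# If the two sums agree, the aggregate rows really are double counting.
print(detail_sum, aggregate_sum)
```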
Do you spot anything about traffic flow going North versus South at 8am compared to 5pm?
Let’s explore that in a little more detail with a matrix chart:
This shows us the relative counts for different vehicle types, again by time of day. Notice the different distribution of van traffic compared to car traffic over the course of the day.
A treemap gives a slightly different take on the same data – again, we can see how there is a difference between North and South flowing volumes at different times of day within each category:
One thing that jumps out at me from the treemap is how symmetrical everything seems to be at noon?!
All the above visualisations are interactive, so click through on any of the images to get to the interactive version (Java required).
As to how to find traffic monitoring point IDs – try this.
PS a disadvantage of the above recipe is that to generate a visualisation for a different traffic point, I’d need to use the desired parameters when grabbing the CSV feed from the pipe, and then create new Many Eyes Wikified data pages and visualisation pages. However, using nothing more than a couple of web tools, I have managed to prototype a working mockup of a visualisation dashboard for traffic count data that could be given to a developer as a reference specification for a “proper” application. And in the meantime, it’s still useful as a recipe… maybe?
PPS While I was playing with this over the weekend, it struck me that if school geography projects ever do traffic monitoring surveys, it’s now possible for them to get hold of “real” data. If there are any school geography teachers out there who’d like to bounce around ways of making this data useful in a school context, please get in touch via a comment below :-)
One of the, err, side projects I’ve been looking at with a couple of people from the OBU has been bouncing around a few ideas about how we might “wrap” coverage of Formula One races with some open educational resources.
So with the first race of the new season over, I thought I’d have a quick play with some of the results data…
First off, where to get the results info? An API source doesn’t seem to be available anywhere that I’ve found as a free service, but the FIA media centre do publish a lot of the data (albeit in a PDF format): F1 Media Centre – Melbourne Grand Prix, 2009.
To get the data into an appropriate form required a little bit of processing (for example, recasting the race lap chart to provide the ranking per lap ordered by driver) but as ever, most of the charts fell out easily enough (although a couple more issues were raised – like being able to specify the minimum y-axis range value on a bar chart, for example).
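For what it's worth, the lap chart recasting looks something like this in Python. I'm assuming the lap chart arrives as one list per lap, with drivers in running order; the driver names are placeholders, and this is a sketch of the idea rather than the processing I actually did.

```python
def positions_by_driver(lap_chart):
    """Recast a lap chart into one list of positions per driver.

    lap_chart[i] lists the drivers in running order at the end of lap i+1.
    A driver missing from a lap (e.g. retired) gets None for that lap.
    """
    result = {}
    for lap_index, order in enumerate(lap_chart):
        for position, driver in enumerate(order, start=1):
            laps = result.setdefault(driver, [None] * len(lap_chart))
            laps[lap_index] = position
    return result


# Placeholder lap chart, for illustration only.
LAP_CHART = [
    ["Driver A", "Driver B", "Driver C"],
    ["Driver B", "Driver A", "Driver C"],
    ["Driver B", "Driver A"],  # Driver C no longer running
]

ranks = positions_by_driver(LAP_CHART)
for driver, laps in ranks.items():
    print(driver, laps)
```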
Anyway, you can find the charts linked to from here: Australia Lap Times visualisation.
In the meantime, here are some examples (click through to reach the interactive original).
First up, a scatter plot to compare lap times for each driver across the race:
Secondly, a line chart to compare time series lap times across different drivers:
This bar chart view lets you compare the lap times for each driver over a subset of laps:
A “traditional” drivers standings chart for each lap:
Finally, this bar chart can be run as an animation (sort of) to show the rank of each driver for each lap during the race:
There are a few more data sets (e.g. pitting behaviour) that I haven’t had a look at yet, but if and when I do, I will link to them from the Australia Lap Times visualisation page on Many Eyes Wikified.
PS If you’re really into thinking about the data, maybe you’d like to help me think around how to improve the “Pit stop strategist” spreadsheet I started messing around with too?! ;-)
PPS It’s now time for the 2010 season, and this year, there’s some McLaren car telemetry data to play with. For example, here’s a video preview of my interactive McLaren data explorer.
I really should make this the last post on the Guardian data store for a while, because I’ve got a stack of other things I really ought to be doing instead, but it struck me that the following demo might open up some people’s eyes as to what’s possible when you have several data sets that play nicely…
It shows you how to take data from two different spreadsheets and link it together to create a third data set that contains elements from the two original ones. (If you just want to cut to the end, here’s a (visualised) reason why it might not be such a happy idea to go to Southampton Solent if you want to study Architecture or Planning: Student Happiness on Planning and Architecture courses. Now ask yourself the question: how would I (err, that is, you) have produced that sort of chart?)
This whole idea is in and around the area of Linked Data, but doing it the hard way…
If you don’t know what Linked Data is, watch this:
So let’s have a look at the Guardian University Tables/ Satisfaction Data that has been uploaded to Google Spreadsheets:
You’ll notice there are lots of sheets in there, covering the different subject areas – like “Architecture”, for instance:
Importantly, the format of the names of the institutions is consistent across spreadsheets.
So I was wondering – if I was a student wanting to study either planning or architecture, how could I find out which institutions had a good satisfaction rating, or low student to staff ratio, across both those subjects? (Maybe I’m not sure which subject I want to do, so if I choose the wrong one and try to switch course, I know I’m not going to switch into a duff department.)
That is, I’d like to be able to see a single table containing the data from both the overall results table as well as the Architecture and Planning tables.
Now if the data was in a database, and if I spoke SQL, this would be quite easy to do (hint: look up SQL JOIN). But I only have my browser to hand, so what to do…?
Dabble DB provides one answer… How? Here’s how…
Start off by creating a new application:
I’m going to seed it with a table containing the names of HEIs as listed in column B of the overall data table by importing just that column, as CSV data, from the Google spreadsheet:
Pull the data in:
So now we have the data:
Okay – let’s import a couple more tables, say the data for Planning and Architecture areas.
First, Planning – here’s the CSV:
Click on “More…” and you’ll be offered the chance to Add a New Category.
Take that opportunity:-)
You hopefully get the idea…
Exactly as before…
Now do the same for the Architecture data:
So now I have three tables – known as categories – in Dabble DB.
Let’s link them… that is, let’s make the data from one category available to another.
Firstly, I’ll link the Architecture data to the table that just lists the HEIs – click on the Name of Institution column to pop-up a menu and select Configure:
We’re going to Link the column to Entries in another one:
In particular, we’re going to tell Dabble DB that the Names of Institutions in the Architecture table are the same things as the institutions in the HEI category/table:
If you look at the HEIs category, you’ll see the Architecture column has been brought in:
We can now do the same for Planning (remember, pop up the Name of Institution menu and Configure it to Link Entries).
The next step is to pull in some data from the two linked categories. How about we compare the Teaching Satisfaction scores for these two subjects?
Click on the column header for one of the linked categories – say the planning one, select Add Derived Field and then the field you want to pull in:
The data gets pulled in…
(Oops – this is all a bit sparse; maybe I should have used a different field, such as Average Entry Tariff? Never mind, let’s push on…)
Add the corresponding derived field for the Architecture courses:
If you click on the “unsaved View” label, you can save this data table:
To tidy up the table, let’s hide the duplicated Name columns and resave:
To give something like this:
A nice feature of Dabble DB is that it makes it easy to export data from any given view:
So if we grab the CSV URI:
We can take it to, I dunno, Many Eyes Wikified?
Create a placeholder for a visualisation (the data page is ousefulTestboard/StudentHappinessPlanningArchitecture):
Or just type the text yourself:
Click through to create the viz:
We’ll have the scatter plot:
The empty cells in the data columns may cause Many Eyes Wikified to think the data is Text – it’s not, it’s Number:
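I don't know how Many Eyes Wikified actually guesses column types, but the general problem is easy to sketch: a column should still count as numeric if every non-empty cell parses as a number, with the gaps treated as missing values. A minimal Python illustration, using placeholder cell values:

```python
def looks_numeric(cells):
    """True if there is at least one non-empty cell and all of them parse as numbers."""
    non_empty = [c.strip() for c in cells if c.strip()]
    if not non_empty:
        return False
    try:
        for c in non_empty:
            float(c)
    except ValueError:
        return False
    return True


def to_numbers(cells):
    """Convert cells to floats, mapping empty cells to None (missing)."""
    return [float(c.strip()) if c.strip() else None for c in cells]


# Placeholder column with gaps, as in the sparse satisfaction data.
column = ["82", "", "75.5", " "]

print(looks_numeric(column))
print(to_numbers(column))
```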
Now customise the view… I could just have every spot the same, but Architecture is my first preference, so let’s just highlight the places where students are happiest doing that subject (click through to play with the visualisation):
UCL seems to do best:
So here’s a recap – we’ve essentially JOINed data from two separate spreadsheets from Google Spreadsheets to create a new data table in Dabble DB, then visualised it in Many Eyes.
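For comparison, here is roughly what the same linking looks like if the data is in a database and you do speak SQL. This is a sketch using Python's built-in `sqlite3` module; the table names, column names and scores are placeholders I've made up, not the Guardian figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three placeholder tables, mirroring the three Dabble DB categories.
conn.executescript("""
CREATE TABLE heis (name TEXT PRIMARY KEY);
CREATE TABLE architecture (name TEXT, satisfaction REAL);
CREATE TABLE planning (name TEXT, satisfaction REAL);

INSERT INTO heis VALUES ('Institution A'), ('Institution B'), ('Institution C');
INSERT INTO architecture VALUES ('Institution A', 80), ('Institution B', 70);
INSERT INTO planning VALUES ('Institution A', 75), ('Institution C', 65);
""")

# LEFT JOIN keeps every institution, leaving gaps (NULL/None) where a
# subject is not offered; the join key is the consistent institution name.
rows = conn.execute("""
SELECT h.name, a.satisfaction, p.satisfaction
FROM heis AS h
LEFT JOIN architecture AS a ON a.name = h.name
LEFT JOIN planning AS p ON p.name = h.name
ORDER BY h.name
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

As with the Dabble DB route, everything hinges on the institution names being written the same way in each table.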
Can you see now why privacy hacks don’t like the idea of linked data in government, or across companies?
So here’s your weekend homework – create a data set that identifies entry requirements across the various engineering subject areas, and try to find a way of visualising it ;-)
Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).
The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on Many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).
Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.
Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.
As regular OUseful.info readers will know, one of the export formats from Google spreadsheets is CSV – Comma Separated Values data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a quick demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.
So to start, let’s look at the RAE2008 results data – University research department rankings (you can find the actual data here: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ).
If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv
Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as at the end of the file:
(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).
Grabbing the actual data, without the metadata, can be achieved by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv&range=A2:H2365.
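Both halves of that step can be sketched in Python: building the export URL from a key and a range, and the "bit of binary search" for the last row of data. The `has_data` function is a hypothetical probe (in practice you'd fetch a one-row range and see whether anything comes back); here it's faked so the search itself can be shown, and the search assumes the data rows form one unbroken run.

```python
from urllib.parse import urlencode

BASE = "http://spreadsheets.google.com/pub"


def csv_range_url(key: str, cell_range: str) -> str:
    """Build the CSV export URL for a cell range of a published spreadsheet."""
    args = {"key": key, "output": "csv", "range": cell_range}
    # safe=":" keeps the colon in a range like A2:H2365 readable.
    return BASE + "?" + urlencode(args, safe=":")


def last_data_row(has_data, low: int, high: int) -> int:
    """Largest row n in [low, high] for which has_data(n) is true.

    Assumes has_data(low) is true and that rows with data form a single
    unbroken run starting at `low`.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if has_data(mid):
            low = mid
        else:
            high = mid - 1
    return low


url = csv_range_url("phNtm3LmDZEM-RqeOVUPDJQ", "A2:H2365")
print(url)

# Stand-in probe: pretend rows 2..2365 hold data.
last_row = last_data_row(lambda n: n <= 2365, 2, 5000)
print(last_row)
```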
If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:
We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008 (the full URL of the wikified data page is http://manyeyes.alphaworks.ibm.com/wikified/ousefulTestboard/GuardianUKRAERankings2008), create a new page and put a visualisation placeholder or two in it:
Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visualisation name?):
Select the settings you want for your visualisation, and hit save:
A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:
If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…
A couple of weeks ago, I posted a workaround for Creating Your Own Results Charts for Surveys Created with Google Forms. With the release of Many Eyes Wikified, it’s now possible to power Many Eyes visualisations from online data (e.g. as described in Many Eyes Wiki Dashboard – Online Visualisation Tools That Feed From Online Data Sources).
So I was wondering – would it be possible to just pull data from the results spreadsheet for a Google form, and visualise it directly in Many Eyes without having to do any results processing on the spreadsheet side?
First step – find a form. I created a test one some time ago, doodling ideas for a mobile survey form, and it already contains some data, so that’s a start: Demo Mobile User Form.
Second step – get the results file as CSV: Mobile survey results:
Hmm – Many Eyes Wikified doesn’t see the columns…???
It’s ok with different subsets though… e.g. this one:
(Note that I can’t seem to specify “to end of column” in the Google spreadsheet CSV export? e.g. setting the range to A1:J doesn’t work:-( So I need to define an arbitrary final row…)
Trying out the visualisations on this data, I can sort of get the text cloud visualisation to work:
Unfortunately, in many of the chart types, there doesn’t seem to be the ability to plot a count of particular results(?).
For example, in the bubble chart, I can’t seem to plot bubble size as a count of the results in each results category? (Would I expect to be able to do that…? Hmmm… I think so…?!) Instead, I can only plot size according to data values in one of the numerical columns?
In many cases, in order to plot sensible visualisations that process and display the form results data, I need to be able to count the occurrence of different results classes within a results column. A count option is available in the Matrix chart, but not in many of the other visualisation types?
There’s also the issue that many of the results contain multiple items; so for example, in answer to the question “What do you use your mobile phone for?” we might get the answers Voice calls, Text Messaging/SMS, Web search, Maps/directions, Camera (stills) (selected from a drop down list on the original form).
What would be really nice would be the ability to specify a delimiter/separator to split out the different results in a particular column, then let Many Eyes enumerate the different possible answer choices in that column, and count on each one. So for example, I’d like to select a bubble chart based on the column “What do you use your mobile phone for?” and have Many Eyes identify the different segments, (Voice Calls, Web Search etc), count the occurrence of each of those and plot each segment as a bubble, with size proportional to the counted occurrence of the segment in the results.
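The split-and-count behaviour I'm after is simple enough to sketch in Python. The answer strings below are placeholders built from the choices mentioned above, and I'm assuming a comma separator; a choice that itself contained a comma would need a different delimiter.

```python
from collections import Counter


def count_choices(answers, separator=","):
    """Split each multi-choice answer on `separator` and count every choice."""
    counts = Counter()
    for answer in answers:
        for choice in answer.split(separator):
            choice = choice.strip()
            if choice:  # skip blanks from empty answers
                counts[choice] += 1
    return counts


# Placeholder responses to "What do you use your mobile phone for?"
ANSWERS = [
    "Voice calls, Text Messaging/SMS, Web search",
    "Voice calls, Maps/directions",
    "Text Messaging/SMS, Camera (stills)",
    "",
]

counts = count_choices(ANSWERS)
for choice, n in counts.most_common():
    print(choice, n)
```

Each (choice, count) pair is then exactly what a bubble chart needs: one bubble per choice, sized by the count.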
In the meantime, I suppose it’s always possible to process the results in the spreadsheet as demonstrated in Creating Your Own Results Charts for Surveys Created with Google Forms and then just export the CSV of the particular question results tables to Many Eyes Wikified? Or alternatively, design questions that work nicely when the raw results are passed to Many Eyes Wikified?
Aren’t blog comments wonderful things? Today, I learned from a comment by Nicola on Visualising Financial Data In a Google Spreadsheet Motion Chart that Many Eyes can now be used to visualise live data via Many Eyes Wikified.
Wikified has apparently been in beta for a month or two (somehow I missed it…) but it was launched as a public service earlier this week: Many Eyes Wikified now open to the public:
Many Eyes Wikified is a “remix” of Many Eyes, using a wiki markup syntax to enable you to easily edit datasets and lay out visualizations side-by-side.
It also functions just like a normal wiki: you can collaboratively edit pages, add explanations or documentation to your visuals, see a page’s edit history, and revert changes.
Unlike a normal wiki, you can embed content from your blog or other data source within Wikified and visualize it. You can also embed the content you make in Wikified elsewhere, just like you can in Many Eyes.
I have to admit to hitting a few, err, issues with Many Eyes wikified whilst playing with it on an old Mac, but the promise is just, like, awesome, dude…
So what’s in store?
First up, you can add data to a page by simply copying and pasting a CSV table into it. So far, so Many Eyes – except that the page where you paste your content is actually a wiki page – so you can have all sorts of explanatory text in the page as well.
What’s really useful, though – and something I’ve been wanting for some time – is the ability to pull live data into the wiki page from another online source.
So far I’ve only tried pulling in CSV data from a Google spreadsheet, but as that seems to work okay, I assume pulling in CSV data from a Yahoo! pipe, or DabbleDB database should work too.
(I’m not sure if Many Eyes Wikified will pull in other data types too, such as TSV? Please add a comment to this post if you find out…)
Once you have a data page defined, you can call on that data from a visualisation within another page. This is where I hit a wobbly… I could create a page, and get a stub for the visualisation okay:
And I got the link that let me fire up the visualisation editor:
And I even got the viz editor:
You’ll notice that the data table has been pulled in, with the ability to set the data type for each column, and a toolbar is provided that lets you select the desired Many Eyes style visualisation type – with no typing and no programming required…:-)
(However, when I tried to change the visualisation type on my 10.4 OS/X Mac, I just got thrown back to the wikified home page…:-(
Anyway – the promise is there, and from examples like Nicola’s dashboard, it seems as if other people have been coping fine with the visualisation editor…
…which brings me neatly to the idea of the Wikified dashboards…
Many Eyes Wikified allows you to define “dashboards”, which are essentially URI path namespaces within which you can collect a series of separate pages. I’m not sure if you can assert ownership or edit privileges over dashboards, though? At the moment, it looks as if all pages are editable by anyone, in true public/open wiki style…
So to sum up, Many Eyes visualization tools are now available as endpoints for wholly online data mashups. May the fun begin…