Twitter Powered Subtitles for Conference Audio/Videos on Youtube

Last week, I wrote a post on hHow people Tweeted Through Carter on “Delivering Digital Britain” at NESTA, where I created a slideshow of tweets posted in response to the NESTA Delivering Britain event/video stream and used them to illustrate the audio recording of Lord Carter’s presentation.

Chatting to @liamgh last week, I mentioned how i was stumped for an easy way to do this. He suggested creating a subtitles feed, and then uploading it to Youtube, along with the audio recording (doh!).

So – here’s a proof of concept result of doing just that:

(Actually, I’m not sure that subtitles/closed captions work in the embedded movies, so you may have to click through: proof of concept use of Tweets as subtitles: Carter at ‘Delivering Digital Britain’, part 1. NB it is possible force embedded videos to show captions, as
Learn More: Showing captions by default
describes. Simply add &cc_load_policy=1 to the embed code; so e.g. in WordPress, you would use something like [youtube= ].

And here’s how I did it…

The first thing to do was check out how Youtube handled subtitles: Getting Started: Adding / Editing captions.

The trick is to upload a textfile with lines that look something like this:
Text shown at 3 min 14.159 sec for an undefined length of time.

Text shown at 2 min 20.25 sec, until 2 min 23.8 sec

Secondly – getting the list of tweets hashtagged with #carter over the period Lord Carter was speaking (i.e. the period covered by the video). For the original proof of concept, I used the tweets from the spreadsheet of scraped tweets that @benosteen grabbed for me, though it later occurred to me I could get the tweets direct from a Twitter search feed (as I’ll show in a minute).

Scraped tweet timestamp

The question now was how to get the timecode required for the subtitles file from the timestamp associated with each tweet. Note here that the timecode is the elapsed time from the start of the video. The solution I came up with was to convert the timestamps to universal time (i.e. seconds since midnight on January 1st 1970) and then find the universal time equivalent of the first tweet subtitle; subtracting this time from the universal time of all the other tweets would give the number of seconds elapsed from the first tweet, which I could convert to the timecode format.

As to how I got the universal time values – I used a Yahoo pipe (Twitter CSV to timeingsd):

At this point, it’s probably worth pointing out that I didn’t actually need to call on @benosteen’s tweetscraper – I could just use the Twitter search API (i.e. the Twitter advanced search feed output) to grab the tweets. How so? Like this:

Looking at the results of this query, we see the timing is a little off – we actually need results from 8.30am, the actual time of the event:

Which is where this comes into play – searching for “older” results:

If you click on “Older” you’ll notice a new argument is introduced into the search results page URL – &page=:

…which means that by selecting appropriate values for rpp= and page= we can tunnel in on the results covering from a particular time by looking at “older” results pages, and grabbing the URL for the page of results covering the time period we want:

Then we can grab the feed:

which gives us something like this:

which we hack a little to get to the right period:

(For more info on the Twitter search API, see the Twitter Search API Documentation wiki.)

NB while I’m at it, note that there’s a corollary hack here that might
come in useful somewhere, or somewhen, else – getting a Twitter search feed into a Google spreadsheet (so we can, for example,process it as a CSV file published from the spreadsheet):


That is:

Okay – back to the main thread – and a tweak to the pipe to let us ingest the feed, rather than the spreadsheet CSV:

Just by the by, we can add a search front end to the pipe if we want:

and construct the Twitter search API URI accordingly:

(The date formatter converts the search date to the format required by the Twitter search API; it was constructed according to PHP: strftime principles.)

Ok – so let’s recap where we’re at – we’ve now got a pipe that will give us universal timecoded tweets (that’s not so far for such a long post to here, is it?!) If we take the JSON feed from the pipe into an HMTL page, we can write a little handler that will produce the subtitle file from it:

Here’s the code to grab the pipe’s JSON output into an HTML file:

var pipeUrl="";

function ousefulLoadPipe(url){

var d=document;
var s=d.createElement('script');

var pipeJSON=url+"&_render=json&_callback=parseJSON";

Here’s the JSON handler:

function parseJSON(json_data){
var caption; var timestamp=0; var mintime=json_data.value.items[0]['datebuilder'].utime;
for (var i=0; itimestamp) mintime=timestamp;

for (var j=json_data.value.items.length-1; j>=0; j--) {
if (j>0) timeEnd=(1*json_data.value.items[j-1]['datebuilder'].utime)-3; else timeEnd=10+1*json_data.value.items[j]['datebuilder'].utime;
if (timeEnd<timestamp) timeEnd=timestamp+2;
var subtitle=timecode+","+timeEnd+"

Here’s the timecode formatter:

//String formatter from:
String.prototype.pad = function(l, s){
return (l -= this.length) > 0 ? (s = new Array(Math.ceil(l / s.length) + 1).join(s)).substr(0, s.length) + this + s.substr(0, l - s.length) : this; };

function getTimeCode(seconds){
var timecode="00:"+(Math.floor(seconds / 60)).toFixed().pad(2, "0") + ":" + (seconds % 60).toFixed().pad(2, "0")+".0";
return timecode;

(I generated the timecodes in part using a string formatter from

Copy and paste the output into a text file and save it with the .sub suffix, to give a file which can then be uploaded to Youtube.

So that’s the subtitle file – how about getting the audio into Youtube? I’d already grabbed an audio recording of Carter’s presentation using Audacity (wiring the “headphones out” to the “microphone in” on my laptop and playing the recording from the NESTA site), so I just clipped the first 10 minutes (I think Youtube limits videos to 10 mins?) and saved the file as a wav file, then imported it into iMovie (thinking I might want to add some images, e.g. from photos of the event on flickr). This crib – iMovie Settings for Upload to YouTube – gave me the settings I needed to export the audio/video from my old copy of iMovie to a file format I could upload to Youtube (I think more recent versions of iMovie support a “Share to Youtube” option?).

I then uploaded this file, along with the subtitles file:

So there we have it: Twitter subtitle/annotations (pulled from a Twitter search feed) to the first part of Lord Carter’s presentation at Delivering Digital Britain…

PS Also on the Twitter front, O’Reilly have started watching Twitter for links to interesting stories, or into particular debates: Twitscan: The Debate over “Open Core”.

Chatting to @cheslincoln the other night, we got into a discussion about whether or not Twitter could be used to support a meaningful discussion or conversation, given the immediacy/short lived nature of tweets and the limited character count. I argued that by linking out to posts to support claims in tweets, “hyper-discussions” were possible. By mining “attention trends” (a term I got from misreading a tweet of Paul Walk’s that scaffold a conversation, it’s possible to create a summary post of a conversation, or argument, like the O’Reilly one?

See also this post from Paul Walk: Anything you quote from Twitter is always out of context.

Twitter Powered Youtube Subtitles, Reprise: Anytime Commenting

One of the things that attracts me to serialised feeds (as well as confusing the hell out of me) is the possibility of letting people subscribe to, and add, comments in “relative time”…

… that is, as well as viewing the content via a serialised feed, the comments feed should also be serialised (with timestamps for each comment calculated relative to the time at which the person commenting started receiving the serialised feed).

Applying this to the idea of tweeted Youtube movie subtitles (Twitter Powered Subtitles for Conference Audio/Videos on Youtube) in which every tweet made during a presentation at or around that presentation becomes a subtitle on a recording of that presentation, it strikes me that a similar model is possible.

That is, different individuals could watch a Youtube video at different times, tweeting along as they do so, and then these tweets could be aggregated according to relative timestamps to provide a single, combined set of subtitles.

So how might this work in practice? Here’s a thought experiment run through…

Firstly, it’d probably be convenient to set up a twitter account to send the tweets to (say @example, for example).

Create a tag for the video – this could be something like #yt:tBmFzF8szpo for the video at

(Alan Levine reminded me about flickr machine tags earlier today, which are maybe also worth considering in this respect, e.g. as a source of inspiration for a tagging convention?)

Grab a ctrl-C copy of the phrase @example #yt:tBmFzF8szpo for quick pasting into a new tweet, and then start watching the video, tweeting along as you do so…

To generate your subtitle feed, you can then do a search based on Tweets from your username (which would be @psychemedia in my case) to e.g. @example, with hashtag #yt:tBmFzF8szpo, and maybe also using a date range.

(You could augment the Yahoo pipe I used in the twitter subtitle generator proof of concept to remove the hashtag when generating the feed used for subtitling?)

The actual subtitle file generator could then pull in several different subtitle feeds from separate people, relativise their timestamps relative to the time of the first tweet (which could maybe use a keyword, too – such as “START”: @example START #yt:tBmFzF8szpo;-) and then produce an aggregated subtitle feed.

As more people watched the video (maybe including the subtitles to date), their feeds could be added to the aggregating subtitle file generator, and the subtitle file updated/refreshed.

Individuals could even rewatch the video and create new feeds for themselves to join in the emerging conversation…

(Okay, so it’s maybe slower than just reading through the comments, having to replay the video in real time to read the tweets, but this is a sort of thought experiment, right, albeit one that can be implemented quite easily…;-)

PS In one of the comments to Show and Translate YouTube Captions Matt Cutts gave an example of a URL that “will search for the word “china” in videos with closed captions” [ ] (although I’m not sure how well it works?).

So I’m thinking – if live tweets from an event can be associated with a video of an event (maybe because the video is posted with a link to a (now out of date!) upcoming record for that event in order to anchor it in time) then being able to search the tweets as captions/subtitles provides a crib for deeplink searching into the video? (But then, I guess the Goog is looking at audio indexing anyway?)

PPS I just came across another tool for adding subtitles to Youtube videos, as well as videos from other online video sites –

It’s worth looking at, maybe?

PPPS see also Omnisio, a recent Google acquisition that offers “select clips from videos you find on YouTube and other video sites, and easily post them on your profile page or blog. Even better, you and your friends can add comments directly in the video!”.

And there’s more: “With Omnisio you make and share your own shows by assembling clips from different videos.” Roll on the remixes :-)

PPPPS Martin implemented anytime commenting

Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs

Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).

The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).

Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.

Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.

As regular readers will know, one of the export formats from Google spreadsheets is CSV – Comma Separated Variable data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a qiuck demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.

So to start, let’s look at the the RAE2008 results data – University research department rankings (you can find the actual data here:

If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this:

Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as the the end of the file:

(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).

Grabbing the actualt data, without the metadata, can be achieve by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL

If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:

We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008 (the full URL of the wikified data page is, create a new page and put a visualisation placeholder or two in it:

Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visulisation name?):

Select the settings you want for you visualisation, and hit save:

A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:

If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…

See more examples here: RAE 2008 Treemap; RAE 2008 Bubble Diagram; RAE 2008 bar chart.

Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes

In Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs I showed how to pull data from Google spreadsheets (uploaded there by the Guardian as part of their Data Store initiative and visualise it using Many Eyes Wikified.

Unfortunately, one of the downsides of using Many Eyes Wikified is that you can’t filter the imported data or select subsets of rows from it (nor can you choose to just ignore particular columns in the visualisation editor – which would be a really handy thing to be able to do, and probably quite easy to implement?)

So for example, when looking at the RAE 2008 Data, it might be quite useful to be able to just visualise the data for a single institution, such as the Open University. But how can we do this?

One way would be to find a way of searching the data in the spreadsheet, and then only extracting the rows that contained the search term – such as “Open University” – in the institution name column. But I don’t know if that’s possible in Google Spreadsheets, (though it is possible in Zoho DB, which supports all manner of queries in SQL like dialects; but that’s for another day…;-).

An alternative way is to grab the whole of the spreadsheet and then just filter out all the rows that don’t contain the filter term in the desired column… which is an idea that came into my mind last night on my way home, and which appears to work quite well, as this Filtering UK RAE 2008 CSV Data from Google Docs pipe shows:

If your pipe imported a CSV file, as this one does (in fact, if the items being pushed out of the pipe have a y:row attribute set giving a row/count number for each item) then you can get a CSV file out of the pipe too:

Which in this case means we can filter through just the RAE 2008 data for a particular institution, grab the CSV URL for that data, and pull it into Many Eyes wikified in the same way as described before:

Here it is:

And once we have a wiki data page for it, we can visualise it – I’ve popped some examples up here: OU RAE 2008 performance.

For example, how about a bubble diagram view showing our 4* performance across the various units of assessment:

(The number is the percentage of submissions in that unit of assessment achieving the specified grade.)

Or how about this one – an interactive bar chart showing the percentages of 3* and 4* submissions in each unit of assessment:

If you look back at the pipe interface, you’ll see you can use the same pipe to pull out the data for any of the HEIs:

The pipe output CSV URI also makes this clear:

In the next post on this topic, I’ll show you how to create a rather more interesting RSS feed from this sort of pipe, including some Google chart URIs created dynamically within the pipe…

But for now, it’s time to go foraging for dinner…

Creating Google Charts From CSV Data Inside a Yahoo Pipe

One of the problems with processing CSV data using a Yahoo pipe – as described in Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes – is that the RSS output of the pipe is not very well formed. It’s fine for outputting as CSV, or even JSON, but it’s not so hot as RSS.

So in this post, I’ll show you how to get a proper RSS feed out, with a meaningful title and sensible description for each item, and as an added bonus, a chart image generated from the CSV data using the Google Chart API.

So to start, let’s grab some CSV (I’m using the Guardian Data Store RAE 2008 data on Google Spreadhseets again), name the columns ourselves, and generate a sensible title for each feed item:

For the item description data, I’m going to create a simple listing of the data values:

(Don’t worry about where the chartURI came from just for the moment…)

What this loop element does is create a string for each item in the feed, and then output is as the description element for the corresponding item. The data elements within each item are displayed in a faux table within the description. (Note that I could have constructed a proper HTML data table had I wanted to…)

So here’s description (without the chartURI image element shown):

That is:

So that’s part 1 – constructing title and description elements for each feed item, where a feed item is derived from a CSV row.

Now let’s see how to create a chart image from the data using the Google Chart API.

The Google Charts API provides a RESTful API to a chart generating service – which is to say: you can give it a URI containing some data and specifying a chart type, and it will give you an image back in return, with your data graphed appropriately (see also: RESTful Image Generation – When Text Just Won’t Do).

So for example, here’s a pie chart:

And here’s its URI:×300&chd=t:10,10,45,35,0&chl=1*|2*|3*|4*|unclassifed&chtt=RAE2008%20(Open%20University%20)%20-%20Art%20and%20Design

So using our EYES, we can see different things in that URI:

  • the domain that serves the image:
  • the chart type: cht=p
  • the chart size/dimensions: &chs=450×300
  • the chart data: &chd=t:10,10,45,35,0
  • the chart labels: &chl=1*|2*|3*|4*|unclassifed
  • and the chart title:

And now we can construct URIs of that form within the pipe:

Now I do a couple o fbits of tidying up – the chartURI string just needs making into a nice URI – replace the spaces with their ASCII code representation, for example:

As we currently have no link element defined for the feed items, lets use the chart URI:


So there we have it – and RSS feed from the pipe containing the “tabulated” data from each CSV row, and a pie chart to display it:

From the pipe RSS feed URI:
you’ll notice the search term for the particular HEI you’re interested in, which means you should be easily able to hack it to give the appropriate feed for your own HEI:-)

“JOIN”ing Data from the Guardian Data Store Student Satisfaction Data

I really should make this the last post on the Guardian data store for a while, because I’ve got a stack of other things I really ought to be doing instead, but it struck me that the following demo might open up some peoples eyes as to what’s possible when you have several data sets that play nicely…

It shows you how to take data from two different spreadsheets and link it together to create a third data set that contains elements from the two original ones. (If you just want to cut to the end, here’s a (visualised) reason why it might not be such a happy idea to go to Southampton Solent if you want to study Architecture or Planning: Student Happiness on Planning and Architecture courses. Now ask yourself the question: how would I (err, that is, you) have produced that sort of chart?)

This whole idea is in and around the area of Linked Data, but doing it the hard way…

If you don’t know what Linked Data is, watch this:

So let’s have a look at the Guardian University Tables/ Satisfaction Data that has been uploaded to Google Spreadsheets:

You’ll notice there are lots of sheets in there, covering the different subject areas – like “Architecture”, for instance:

Importantly, the format of the names of the institutions is consistent across spreadsheets.

So I was wondering – if I was a student wanting to study either planning or architecture, how could i find out which institutions had a good satisfaction rating, or low student to staff ratio, across both those subjects? (Maybe I’m not sure which subject I want to do, so if I choose the wrong one and try to switch course, I know I’m not going to switch into a duff department.)

That is, I’d like to be able to see a single table containing the data from both the overall results table as well as the Architecture and Planning tables.

Now if the data was in a database, and if I spoke SQL, this would be quite easy to do (hint condition: look up sql JOIN). But I only have my browser to hand, so what to do…?

Dabble DB provides one answer… How? Here’s how…

Start off by creating a new application:

I’m going to seed it with a table containing the names of HEIs as listed in column B of the overall data table by importing just that column, as CSV data, from the Google spreadsheet:

Pull the data in:

So now we have the data:

Okay – let’s import a couple more tables, say the data for Planning and Architecture areas.

First, Planning – here’s the CSV:

Click on “More…” and you’ll be offered the chance to Add a New Category.

Take that opportunity:-)

You hopefully get the idea…

Exactly as before…

Now do the same for the Architecture data:

So now i have three tables – known as categories – in Dabble DB.

Let’s link them… that is, let’s make the data from one category available to another.

Firstly, I’ll link the Architecture data to the table that just lists the HEIs – click on the Name of Institution column to pop-up a menu and select Configure:

We’re going to Link the column to Entries in another one:

in particular, we’re going to tell Dabble DB that the Names of Institutions in the Architecture table are the same things as the institions in the HEI category/table:

If you look at the HEIs category, you’ll see the Architecture column has been brought in:

We can now do the same for Planning (remember, pop up the Name of Institution menu and Configure it to Link Entries).

The next step is to pull in some data from the two linked categories. How about we compare the Teaching Satisfaction scores for these two subjects?

Click on the column header for one of the linked categories – say the planning one, select Add Derived Field and then the field you want to pull in:

The data gets pulled in…

(Oops – this is all a bit sparse; maybe I should have used a different filed, such as Average Entry Tariff? Never mind, let’s push on…)

Add the corresponding derived field for the Architecture courses:

If you click on the “unsaved View” label, you can save this data table:

To tidy up the table, let’s hide the duplicated Name columns and resave:

To give something like this:

A nice feature of Dabble DB is that is makes it easy to export data from any given view:

So if we grab the CSV URI:

We can take it to, I dunno, Many Eyes Wikified?

Here it comes…:

Many eyes wikified data

Create a placeholder for a visulisation (the data page is ousefulTestboard/StudentHappinessPlanningArchitecture):

Or just type the text yourself:

Click through to create the viz:

We’ll have the scatter plot:

The empty cells in the data columns may cause Many Eyes Wikified to think the data is Text – it’s not, it’s Number:

Now customise the view… I could just have every spot the same, but Architecture is my first preference, so let’s just highlight the places where students are happiest doing that subject (click through to play with the visualisation):

UCL seems to do best:

So here’s a recap – we’ve essentially JOINed data from two separate spreadsheets from Google Spreadsheets to create a new data table in Dabble DB, then visualised it in Many Eyes.

Can you see now why privacy hacks don’t like the idea of linked data in government, or across companies?

So here’s you’re weekend homework – create a data set that identifies entry requirements across the various engineering subject areas, and try to find a way of visualising it ;-)

Visualising Lap Time Data – Australian Grand Prix, 2009

One of the, err, side projects I’ve been looking at with a couple of people from the OBU has been bouncing around a few ideas about how we might “wrap” coverage of Formula One races with some open educational resources.

So with the first race of the new season over, I thought I’d have a quick play with some of the results data…

First off, where to get the results info? An API source doesn’t seem to be available anywhere that I’ve found as a free service, but the FIA media centre do publish a lot of the data (albeit in a PDF format): F1 Media Centre – Melbourne Grand Prix, 2009.

For convenience as much as anything, I thought I’d use Many Eyes Wikified to produce a set of visualisations based on the lap time data and the race lap chart.

To get the data into an appropriate form required a little bit of processing (for example, recasting the race lap chart to provide the ranking per lap ordered by driver) but as ever, most of the charts fell out easily enough (although a couple more issues were raised – like being able to specify the minimum y-axis range value on a bar chart, for example).

Anyway, you can find the charts linked to from here: Australia Lap Times visualisation.

In the meantime, here are some examples (click through to reach the interactive original).

First up, a scatter plot to compare lap times for each driver across the race:

Secondly, a line chart to compare time series lap times across different drivers:

This bar chart views lets you compare the lap times for each driver over a subset of laps:

A “traditional” drivers standings chart for each lap:

Finally, this bar chart can be run as an animation (sort of) to show the rank of each driver for each lap during the race:

There are a few more data sets (e.g. pitting behaviour) that I haven’t had a look at yet, but if and when I do, I will link to them from the Australia Lap Times visualisation page on Many Eyes Wikified.

PS If you’re really into thinking about the data, maybe you’d like to help me think around how to improve the “Pit stop strategist” spreadsheet I started messing around with too?! ;-)

PPS It’s now time for the 2010 season, and this year, there’s some Mclaren car telemetry data to play with. For example, here’s a video preview of my interactive Mclaren data explorer.

Visualising MPs’ Expenses Using Scatter Plots, Charts and Maps

A couple of days ago, the Guardian’s @datastore announced that a spreadsheet of UK MPs’ (Members of Parliament) expenses had been posted to the Guardian OpenPlatform datastore on Google Spreadsheets.

Just because, I though it would be nice to visualise the spreadsheet using some Many Eyes Wikified charts, so I had a look at the data, and sighed a little: in many of the spreadsheet cells was a pound sign, and Many Eyes doesn’t like those – it just wants numbers… So I went in to Yahoo pipes to create a pipe to tidy up the CSV output of the spreadsheet so I could pipe it into Many Eyes Wikified… and drew a blank: I couldn’t get the pipe to work (no CSV – just HTML (it turns out I was using the wrong URL pattern from the spreadsheet – doh!)). So I exported the CSV, reg-exped it in a text editor, adn uploaded it to create a new spreadsheet. (Which reminds me: note to self – create a tidy-upper pipe fed from the datastore and refactor the wikified data page to feed from the pipe…)

[Many Eyes Wikified is no longer available as a service – to replicate the following visulisations, you need to upload the data to Many Eyes (the none wikified version…). I think this is the spreadsheet I was pulling in to the Wikified service…]

So anyway, here are some interactive ways of visualising MPs’ expenses data using Many Eyes wikified

Firstly, a bar char – select which expenses category you’d like to chart and then view the ranked distribution by sorting by values. If you mouse over any of the bars, you’ll see which MP made that claim:

Second up, a block histogram view. This chart is good for looking at the natural distribution of different claim categories. The search box makes it easy to search for your MP by name:

Again, mousing over any of the blocks identifies the name of the MP making that claim.

Thirdly, a scatter plot. This display lets you compare an MP’s claims across two categories, and potentially bring in a thrid category using the dot size:

As with the other visulisations, mouse over any point to see which MP it belongs to.

By the by, along the way I did a couple of other Yahoo pipes – one to extract expenses by MP name, (which simply pulls in CSV from the spreadsheet, then filters on an MP’s surname), the other MPs’ expenses by postcode. The latter pipe actually embeds the foemer, and works by looking up the name of the MP by postcode, using the TheyWorkForYou API; this name is then passed in to an embedded ‘expenses by name’ pipe.

Anyway, back to the viz biz: Charles Arthur generously picked up on my tweets announcing the visualisations with a blog post on the Guardian data blog (Visualising MP expenses) in a post that included the tease:

But what we need now is a dataset which shows constituency distances
from Westminster, so that we can compare that against travel. And perhaps someone else can work out the travelling MPs’ carbon footprints based on whether they went by air or rail or car

No fair… Okay – so where to get the location data for each MP. Well, the TheyWorkForYou API came to my rescue again. One call (to getConstituencies) pulled out details of each constituency, which included the lat/long geo-coordinated of the ‘centre’ of each constituency (along with the co-ordinates of the bounding box round each constituency… maybe I’ll use those another time ;-) A second call (to getMPs) pulled out all the MPs, and their constituencies. Loading both sets of data into different sheets on Dabble DB, meant I could then link them together by constituency name (for more on linking data in Dabble DB, see Mash/Combining Data from Three Separate Sources Using Dabble DB and Using Dabble DB in an Online Mashup Context).

Adding the MP data into Dabble DB after a further bit of cleaning – removing things like Mr, Miss, and Sir from the firstnames etc – and linking by MP name meant that I could now generate a single data view that exposed MPs by name, constituency, and expense claims, along with the geolocation of the midpoint of their constituency.

After grabbing the CSV feed out of this Dabble DB view into a pipe, and tidying up the data a little once again (eg removing commas in the formatted numbers), it was an easy matter to pull the JSON output from the pipe into a map, and plot different coloured markers depending what ‘band’ the MPs’ total expenses fell into. Here’s a snapshot of that first map:

All well and good – what’s nice about this view is that it’s quite easy to see which MPs appear to be claiming disproportionately more than other MPs with constituencies in a similar area. (There may be good reason for this, like, err… whatever. This tool is just a starting point for sensemaking round the data, right?!;-). If you click on one of the markers you can pop up a little info window, too (rather sparse in this first demo):

In that first map, the only expenses data I was exposing, and mapping, was the total travel expenses claimed. So over a coffee this afternoon, I created a richer view, and tweaked the map code to let me inspect a couple of other data sets. You can find the map here: MPs’ travel expenses map.

So for example, we can look at mileage claims:

Or the total expenses claimed for living away from the primary home:

One thing these quick to put together maps show is how powerful map based displays can be used to get a feel for local differences where there is a difference. (There may well be a good reason for this, of course; including errors in the data set being used…)

It’s also interesting to use the map based displays in conjunction with other chart based visualisations, such as the MPs’ expenses visualisations on Many Eyes Wikified, to explore the data in a far more natural way than trying to make sense of a spreadsheet containing the MPs’ expenses data.

Enjoy :-)

PS the code is all as is; if it’s broken and the visualisations are consequently wrong/misleading, then I apologise in advance… ;-)

PPS See also: My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup, which describes 6 different recipes for playing with Guardian openplatform resources. And if you’re into F1, see Visualising Lap Time Data – Australian Grand Prix, 2009 ! ;-)

PPPS see also MPs’ Expenses by Constituency, Sort Of…, where I plot a couple of really colourful proportional symbol maps based on total travel expenses…

How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps

So I was struggling for a quick hit blog post to publish today (busy:-(, but then I got a tweet from @paulbradshaw asking “Any ideas how you could make mashup showing the frequency of certain words in hashtagged tweets – e.g. tagcloud.”

Hmm – like this maybe?

create word cloud from hashtag feed


[NOTE – you need to encode the hashtag as %23 in the feed URI.]

I call this technique a screencaptutorial… (err….?!)

[UPDATE: I don’t think this hack works any more, at least not directly (I don’t think a link to the RSS feed is provided any more from the search results page. You can however construct a URL that will search or the 100 most recent tweets containing your search term(s): The rpp argument specifies the number of results per page, and the result_type argument gets you the most recent, rather than “most popular” tweets. Note that the SEARCHTERM needs escaping if it’s a multi-word search phrase for example. In this case, space characters get encoded as %20, and punctuation may also need encoding. Use this encoder set to “encodeURI” to encode the URL for you… ]

The screen capture was made using Jing, and the white background comes from an empty text editor document exploded to fill the screen.

For more info on manipulating Twitter search feeds, see Twitter Powered Subtitles for Conference Audio/Videos on Youtube.

PS I’m not sure whether the wordle app generates a static word cloud from a feed, or a more dynamic one? (That is, does it just grab the feed contents at the time the word cloud is created and use those to generate a one-hit word cloud, or does it keep sampling the the feed? If you want a live word cloud, then a better way is to import the feed into a Google spreadsheet, publish the spreadsheet, take a CSV output from it and drop it into Many Eyes wikified. Or create a web page of your own and generate the word cloud from the feed (maybe pulling it into the page as JSON via Yahoo pipe, so you can get around having to use a proxy to pull the feed into the page) using a word cloud javascript library such as Dynacloud, Cloudinizr or Cloudy.

MPs Expenses by Constituency (Sort Of…)

A few weeks ago, I posted several maps visualising MPs’ expenses (Visualising MPs’ Expenses Using Scatter Plots, Charts and Maps). A couple of days later, I created another map that I didn’t post at the time, partly becuase it’s very approximate, but it does demonstrate something I haven’t logged on before – how to do overlays on Google maps…

So here’s a the link: MPs expenses block map.

The blocks are defined using the bounding box co-ordinates for each MP’s constituency as made available by TheyWorkForYou (specifically, using the getGeometry API call).

The data set for the map was constructed by adding bounding box data for each constituency to a Dabble DB table, and then joining it with expenses data from another table.

PS following this tweet from @ElrikMerlin “Oh, that IS cool. What happens if you colour boxes by party and simply have area proportional to amount?” I knocked up a quick proportional symbol map that shows the total travel expenses claimed by party, where the circle diameter is proportional to the total expenses and the colour denotes the party.

MPs total travel expenses by party

But’ that’s enough for now… this is supposed to be a holiday weekend, after all…!