Posts Tagged ‘ddj’
…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…
A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…
*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)
So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)
Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…
One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a two step process:
1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using the this Yahoo Pipe and output the data as CSV using a URL with the pattern (replace ddj with your desired hashtag):
2) take the column of twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing to give the list of Twitter usernames using the hashtag.
3) run a script that finds the twitter numerical IDs of users given their Twitter usernames listed in a simple text file:
import tweepy auth = tweepy.BasicAuthHandler("????", "????") api = tweepy.API(auth) f =open('hashtaggers.txt') f2=open('hashtaggersIDs.txt','w') f3=open('hashtaggersIDnName.txt','w') for uid in f: user=api.get_user(uid) print uid f2.write(str(user.id)+'\n') f3.write(uid+','+str(user.id)+'\n') f.close() f2.close() f3.close()
Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.
4) run another script (gist here – note this code is far from optimal or even well behaved; it needs refactoring, and also tweaking so it plays nice with the Twitter API) that gets lists of the friends and the followers of each hashtagger, from their Twitter id and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers, generate three files where the edges represent: i) link between an individual and other hashtaggers (“inner” edges within the community); ii) link between hashtagger and not hashtaggers (“outside” edges from the community); iii) links between hashtagger and hashtaggers as well as not hashtaggers);
5) an edit of the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:
nodedef> name VARCHAR,label VARCHAR
edgedef> user VARCHAR,friend VARCHAR
Having got a gdf formatted file, we can load it straight in to Gephi:
In 3d view we can get something like this:
Node size is proportional to number of hashtag users following a hashtagger. Colour (heat) is proportional to number of hashtaggers follwed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node size means lots of other hashtaggers follow that person. A hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to allow you to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)
Use of different layouts, colour/text size etc etc can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:
(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)
If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:
Clicking on a node allows you to explore who a particular hashtagger is following:
Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:
As well as looking at the internal structure of the hashtag community, we camn look at all the friends and/or followers of the hashtaggers. THe graph for this is rather large (70k nodes and 90k edges), so after a lazyweb reuest to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).
If we load in the graph of “outer friends” (that is the people the hashtaggers follow who are not hashtaggers) and filter the graph to only show nodes who have more than 5 or so incoming edges we can see which Twitter users are followed by large numbers of the hashtaggers, but who have not been using the hashtag themselves. Becuase the friends/followers lists return Twitter numercal IDs, we have to do a look up on Twitter to find out the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
Okay, that/s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if any more graphically talented than I individuals would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them:-)
PS It also strikes me that having got the list of hashtaggers, I need to package up this with a set of tools that would let you:
– create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked to homepages);
– find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).
Yesterday, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre.
Here are the slides I used in my talk, such as they are… I really do need to annotate them with links, but in the meantime, if you want to track any of the examples down the best way is to probably just search this blog ;-)
(Readers might also be interested in my slides from News:Rewired (although they make even less sense without notes!))
Although most of the slides will be familiar to longtime readers of this blog, there is one new thing in there: the first sketch if a network diagram showing how some of my favourite online apps can work together based on the online file formats they either publish or consume (the idea being once you can get a file into the network somewhere, you can route it to other places/apps in the network…)
The graph showing how a handful of web apps connect together was generated using Graphiz, with the graph defined as follows:
GoogleSpreadsheet -> CSV;
GoogleSpreadsheet -> “<GoogleGadgets>”;
GoogleSpreadsheet -> “[GoogleVizDataAPI]“;
CSV -> GoogleSpreadsheet;
YahooPipes -> CSV;
YahooPipes -> JSON;
CSV -> YahooPipes;
JSON -> YahooPipes;
XML -> YahooPipes;
“[YQL]” -> JSON;
“[YQL]” -> XML;
YahooPipes -> KML;
I had intended to build this up “live” in a text editor using the GraphViz Mac client to display the growing network, but in the end I just showed a static image.
At the time, I’d also forgotten that there is an experimental Graphviz chart generator made available via the Google Charts API, so here’s the graph generated via a URL using the Google Charts API:
Here’s the chart playground view:
PS if the topics covered by the #ddj event appealed to you, you might also be interested in the P2PU Open Journalism on the Open Webcourse, the “syllabus” of which is being arranged at the moment (and which includes at least one week on data journalism) and which will run over 6 weeks, err, sometime; and the Web of Data Data Journalism Meetup in Berlin on September 1st.
Having managed to get F1 timing data data through my cobbled together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review the stories that sometimes get hidden in the heat of the race (that data journalism trick of using visualisation as an analytic tool for story discovery, for example).
Whilst I was on holiday, reading a chapter in Beautiful Visualization on Gapminder/Trendalyser/Google Motion Charts (it seems the animations may be effective when narrated, as when Hans Rosling performs with them, but for the uninitiated, they can simply be confusing…), it struck me that I should be able to view some of the timing data in the motion chart…
So here’s a first attempt (going against the previously identified “works best with narration” bit of best practice;-) – F1 timing data (China 2011) in Google Motion Charts, the video:
Visualising the China 2011 F1 Grand Prix in Google Motion Charts
If you want to play with the chart itself, you can find it here: F1 timing data (China 2011) Google Motion Chart.
The (useful) dimensions are:
- lap – the lap number;
- pos – the car/racing number of each driver;
- trackPos – the position in the race (the racing position);
- currTrackPos – the position on the track (so if a lapped car is between the leader and second place car, their respective currtrackpos are 1, 2, 3);
- pitHistory – the number of pit stops to date
The timeToLead, timeToFront and timeToBack measures give the time (in seconds) between each car and the leader, the time to the car in the racing position ahead, and the time to the car in racing position behind (these last two datasets are incomplete at the moment… I still need to calculate this missing datapoints…). The elapsedTime is the elapsed racetime for each car at the end of each measured lap.
The time starts at 1900 because of a quirk in Google Motion Charts – they only work properly for times measured in years, months and days (or years and quarters) for 1900 onwards. (You can use years less than 1900 but at 1899 bad things might happen!) This means that I can simply use the elapsed time as the timebase. So until such a time as the chart supports date:time or :time as well as date: stamps, my fix is simply to use an integer timecount (the elapsed time in seconds) + 1900.
Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.
I don’t know how good a statistician you need to be to distinguish between Anscombe’s quartet, but the differences are obvious to the eye:
Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a viualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.
Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…
So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…
Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…
Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2001 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).
Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)
Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)
lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)
I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)
Prior programming experience suggests this should also work…
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))
Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):
boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)
Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
Hmm… can we set the filter value dynamically I wonder?
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
Seems like it…?:-) We can also combine interactive controls:
manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval),outline=outline),maxval=slider(100,140),outline = checkbox(FALSE, "Show outliers"))
Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…
The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.
[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldnlt stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]
To try to bring a bit of focus back to this blog, I’ve started a new blog – F1 Data Junkie: http://f1datajunkie.blogspot.com (aka http://bit.ly/F1DataJunkie) – that will act as the home for my “procedural” F1 Data postings. I’ll still post the occasional thing here – for example, reviewing the howto behind some of the novel visualisations I’m working on (such as practice/qualification session utilisation charts, and race battle maps), but charts relating to particular races, will, in the main, go onto the new blog….
I’m hoping by the end of the season to have an automated route of generating different sorts of visual reviews of practice, qualification and race sessions based on both official timing data, and (hopefully) the McLaren telemetry data. (If anyone has managed to scrape and decode the Mecedes F1 live telemetry data and is willing to share it with me, that would be much appreciated:-)
I also hope to use the spirit of F1 to innovate like crazy on the visualisations as and when I get the chance; I think that there’s a lot of mileage still to come in innovative sports graphics/data visualisations*, not only for the stats geek fan, but also for sports journalists looking to uncover stories from the data that they may have missed during an event. And with a backlog of data going back years for many sports, there’s also the opportunity to revisit previous events and reinterpret them… Over this weekend, I’ve been hacking around a few old scripts to to to automate the production of various data formatters, as well as working on a couple of my very own visualisation types:-) So if you want to see what I’ve been up to, you should probably pop over to F1 Data Junkie, the blog… ;-)
*A lot of innovation is happening in live sports graphics for TV overlays, such as the Piero system developed by the BBC, or the HawkEye ball tracking system (the company behind it has just been bought by Sony, so I wonder if we’ll see the tech migrate into TVs, start to play a role in capturing data that crosses over in gaming (e.g. Play ALong With the Real World), or feed commercial data augmentations from Sony to viewers via widgets on Sony TVs…
There’ve also been recent innovations in sports graphics in the press and online. For example, seeing this interactive football chalkboard on the Guardian website, that lets you pull up, in a graphical way, stats reviews of recent and historical games, or this Daily Telegraph interactive that provides a Hawk-Eye analysis of the Ashes (is there an equivalent for The Master golf anywhere, I wonder, or Wimbledon tennis? See also Cricket visualisation tool), I wonder why there aren’t any interactive graphical viewers over recent and historical F1 data…. (or maybe there are? If you know of any – or know of any interesting visualisations around motorsport in general and F1 in particular, please let me know in the comments…:-)
The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers appearance at Google I/O suggests:
(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)
As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)
[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher
I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…
PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data
PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visulisation is the thing, then here’s a few things to think about…
Ben Fry interviewed at Where 2.0 2011
I don’t have time to chase this just now, but it could be handy… Over the last few months, several of Alasdair Rae (University of Sheffield) Google Fusion Tables generated maps have been appearing on the Guardian Datablog, including one today showing the UK’s new Parliamentay constituency boundaries.
Looking at Alasdair’s fusion table for English Indices of Deprivation 2010, we can see how it contains various output area codes as well as KML geometry shape files that can be used to draw the boundaries on map.
On the to do list, then, is to a set of fusion tables that we can use to generate maps from datatables containing particular sorts of output area code. Because it’s easy to join two fusion tables by a common column, we’d then have a Google Fusion Tables simple recipe for thematic maps:
1) get data containing output area or constituency codes;
2) join with the appropriate mapping fusion table to annotate original data with appropriate shape files;
3) generate map…
I wonder – have Alasdair or anyone from the Guardian Datablog/Datastore team already published such a tutorial?
PS Ah, here’s one example tutorial: Peter Aldhous: Thematic Maps with Google Fusion Tables [PDF]
PPS for constituency boundary shapefiles as KML see http://www.google.com/fusiontables/DataSource?dsrcid=1574396 or the Guardian Datastore’s http://www.google.com/fusiontables/exporttable?query=select+col0%3E%3E1+from+1474106+&o=kmllink&g=col0%3E%3E1
TheyWorkForYou also publish boundary files via the TheyWorkForYou API. The Edina ShareGeo site also publish shapefiles derived from Ordnance Survey/OS data for 2010 Westminster Constituencies (for a quick start guide to using shapefiles in R, see Amateur Mapmaking: Getting Started With Shapefiles).
You may or may not have noticed that the Boundary Commission released their take on proposed parliamentary constituency boundaries today.
They could have released the data – as data – in the form of shape files that can be rendered at the click of a button in things like Google Maps… but they didn’t… [The one thing the Boundary Commission quango forgot to produce: a map] (There are issues with publishing the actual shapefiles, of course. For one thing, the boundaries may yet change – and if the original shapefiles are left hanging around, people may start to draw on these now incorrect sources of data once the boundaries are fixed. But that’s a minor issue…)
Instead, you have to download a series of hefty PDFs, one per region, to get a flavour of the boundary changes. Drawing a direct comparison with the current boundaries is not possible.
The make-up of the actual constituencies appears to based on their member wards, data which is provided in a series of spreadsheets, one per region, each containing several sheets describing the ward makeup of each new constituency for the counties in the corresponding region.
It didn’t take long for the data junkies to get on the case though. From my perspective, the first map I saw was on the Guardian Datastore, reusing work by University of Sheffield academic Alasdair Rae, apparently created using Google Fusion Tables (though I haven’t see a recipe published anywhere? Or a link to the KML file that I saw Guardian Datablog editor Simon Rogers/@smfrogers tweet about?)
[I knew I should have grabbed a screen shot of the original map...:-(]
It appears that Conrad Quilty-Harper (@coneee) over at the Telegraph then got on the case, and came up with a comparative map drawing on Rae’s work as published on the Datablog, showing the current boundaries compared to the proposed changes, and which ties the maps together so the zoom level and focus are matched across the maps (MPs’ constituencies: boundary changes mapped):
Interestingly, I was alerted to this map by Simon tweeting that he liked the Telegraph map so much, they’d reused the idea (and maybe even the code?) on the Guardian site. Here’s a snapshot of the conversation between these two data journalists over the course of the day (reverse chronological order):
Here’s the handshake…
I absolutely love this… and what’s more, it happened over the course of four or five hours, with a couple of technology/knowledge transfers along the way, as well as evolution in the way both news agencies communicated the information compared to the way the Boundary Commission released it. (If I was evil, I’d try to FOI the Boundary Commission to see how much time, effort and expense went into their communication effort around the proposed changes, and would then try to guesstimate how much the Guardian and Telegraph teams put into it as a comparison…)
At the time of writing (15.30), the BBC have no data driven take on this story…
And out of interest, I also wondered whether Sheffield U had a take…
PS By the by, the DataDrivenJournalism.net website relaunched today. I’m honoured to be on the editorial board, along with @paulbradshaw @nicolaskb @mirkolorenz @smfrogers and @stiles, and looking forward to seeing how we can start to drive interest, engagement and skills development in, as well as analysis and (re)use of, and commentary on, public open data through the data journalism route…
PPS if you’re into data journalism, you may also be interested in GetTheData.org, a question and answer site in the model of Stack Overflow, with an emphasis on Q&A around how to find, access, and make use of open and public datasets.
If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that the journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that the journalists should also make their notes available regarding how they worked the data. (The same is true in public reports, where summary statistics and charts are included in a report, along with a link to the raw data, but no transparency in how the summary reports/charts were actually produced from the data.)
In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and produce documents that essentially generate their output from some sort of “source code”?
For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:
You can find the actual “source code” for that document here: bowel cancer funnel plot working notes If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.
The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled in to the document form a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.
This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?
PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)