OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Search Results

My Slides from the Data Driven Journalism Round Table (ddj)

Yesterday, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre.

Here are the slides I used in my talk, such as they are… I really do need to annotate them with links, but in the meantime, if you want to track any of the examples down, the best way is probably just to search this blog ;-)

(Readers might also be interested in my slides from News:Rewired (although they make even less sense without notes!))

Although most of the slides will be familiar to longtime readers of this blog, there is one new thing in there: the first sketch of a network diagram showing how some of my favourite online apps can work together, based on the online file formats they either publish or consume (the idea being that once you can get a file into the network somewhere, you can route it to other places/apps in the network…)

The graph showing how a handful of web apps connect together was generated using Graphviz, with the graph defined as follows:

digraph apps {
  // edges point from an app to a format it emits,
  // or from a format to an app that can consume it
  GoogleSpreadsheet -> CSV;
  GoogleSpreadsheet -> "<GoogleGadgets>";
  GoogleSpreadsheet -> "[GoogleVizDataAPI]";
  "[GoogleVizDataAPI]" -> JSON;
  CSV -> GoogleSpreadsheet;
  YahooPipes -> CSV;
  YahooPipes -> JSON;
  CSV -> YahooPipes;
  JSON -> YahooPipes;
  XML -> YahooPipes;
  "[YQL]" -> JSON;
  "[YQL]" -> XML;
  CSV -> "[YQL]";
  XML -> "[YQL]";
  CSV -> "<ManyEyesWikified>";
  YahooPipes -> KML;
  KML -> "<GoogleEarth>";
  KML -> "<GoogleMaps>";
  "<GoogleMaps>" -> KML;
  RDFTripleStore -> "[SPARQL]";
  "[SPARQL]" -> RDF;
  "[SPARQL]" -> XML;
  "[SPARQL]" -> CSV;
  "[SPARQL]" -> JSON;
  JSON -> "<JQueryCharts_etc>";
}
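To render the graph locally, the definition can be saved to a file (apps.dot, say) and passed through the Graphviz command line tools; something along the lines of the following should produce a PNG (the neato and fdp layout engines give alternative layouts):

dot -Tpng apps.dot -o apps.png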

I had intended to build this up “live” in a text editor using the GraphViz Mac client to display the growing network, but in the end I just showed a static image.

At the time, I’d also forgotten that there is an experimental Graphviz chart generator made available via the Google Charts API, so here’s the graph generated via a URL using the Google Charts API:

Graphviz chart via Google Charts API
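From what I remember of the experimental API, the chart type is set to Graphviz and the graph definition is passed in via the chl parameter; the exact URL pattern below is from memory, so treat it as an assumption rather than gospel:

http://chart.apis.google.com/chart?cht=gv&chl=digraph{GoogleSpreadsheet->CSV;CSV->YahooPipes}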

Here’s the chart playground view:

Google Charts API Live chart playground

PS if the topics covered by the #ddj event appealed to you, you might also be interested in the P2PU Open Journalism on the Open Web course, the “syllabus” of which is being arranged at the moment (and which includes at least one week on data journalism) and which will run over 6 weeks, err, sometime; and the Web of Data Data Journalism Meetup in Berlin on September 1st.

Written by Tony Hirst

August 25, 2010 at 6:00 pm

Posted in Presentation


Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community

…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…

ddj - PDF export

A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…

*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
import tweepy

# the consumer key/secret and single access token key/secret values
# can be copied straight from the Twitter application control panel
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)

So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)

Hashtag community

Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…

One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a five step process:

1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using this Yahoo Pipe, and output the data as CSV using a URL with the following pattern (replace ddj with your desired hashtag):
http://pipes.yahoo.com/pipes/pipe.run?_id=1ec044d91656b23762d90b64b9212c2c&_render=csv&q=%23ddj

2) take the column of Twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing, to give the list of Twitter usernames using the hashtag.
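By way of illustration, that step might look something like this in Python (the column heading is an assumption; it depends on how the pipe labels its CSV output):

import csv

# Assumption: the pipe's CSV output includes the hashtaggers' Twitter
# homepage URLs in a column headed "link"
f_in = open('pipeOutput.csv')
f_out = open('hashtaggers.txt', 'w')

for row in csv.DictReader(f_in):
  f_out.write(row['link'].replace('http://twitter.com/', '') + '\n')

f_in.close()
f_out.close()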

3) run a script that finds the Twitter numerical IDs of users given their usernames, listed one per line in a simple text file:

import tweepy

auth = tweepy.BasicAuthHandler("????", "????")
api = tweepy.API(auth)

f = open('hashtaggers.txt')
f2 = open('hashtaggersIDs.txt','w')
f3 = open('hashtaggersIDnName.txt','w')

for line in f:
  uid = line.strip()  # strip the trailing newline, or it pollutes the output files
  user = api.get_user(uid)
  print uid
  f2.write(str(user.id)+'\n')
  f3.write(uid+','+str(user.id)+'\n')

f.close()
f2.close()
f3.close()

Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.

4) run another script (gist here – note this code is far from optimal, or even well behaved; it needs refactoring, and also tweaking so it plays nicely with the Twitter API) that gets lists of the friends and the followers of each hashtagger from their Twitter ID, and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers), it generates three files where the edges represent: i) links between an individual and other hashtaggers (“inner” edges within the community); ii) links between a hashtagger and non-hashtaggers (“outside” edges from the community); iii) links between a hashtagger and both hashtaggers and non-hashtaggers.
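By way of a sketch of the “inner friends” part of that script, assuming tweepy’s friends_ids call (which returns the numerical IDs of the accounts a user follows), and with no rate limit handling (which the real script needs in order to play nicely with the API):

import tweepy

auth = tweepy.OAuthHandler("????", "????")
auth.set_access_token("????", "????")
api = tweepy.API(auth)

# the numerical IDs generated in step 3
hashtaggers = set(line.strip() for line in open('hashtaggersIDs.txt'))

f_out = open('innerFriends.csv', 'w')
for uid in hashtaggers:
  # keep just the friends edges that point back into the hashtag community
  for fid in api.friends_ids(user_id=uid):
    if str(fid) in hashtaggers:
      f_out.write(uid + ',' + str(fid) + '\n')
f_out.close()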

5) edit the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:

nodedef> name VARCHAR,label VARCHAR
23309699,tedeschini
23751864,datastore
13691922,nicolaskb

17474531,themediaisdying
2224401,munkyfonkey

edgedef> user VARCHAR,friend VARCHAR
23309699,17250069
23309699,91311875

2224401,878051
2224401,972651

Having got a gdf formatted file, we can load it straight into Gephi:

ddj hashtag community

In 3d view we can get something like this:

ddj - hashtag community

Node size is proportional to the number of hashtag users following a hashtagger. Colour (heat) is proportional to the number of hashtaggers followed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node means lots of other hashtaggers follow that person, and a hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)

Different layouts, colours, text sizes and so on can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:

ddj - PDF export

(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)

If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:

gexfwalker - full network

Clicking on a node allows you to explore who a particular hashtagger is following:

gexfwalker - node explorer

Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:

Gephi - preview settings

As well as looking at the internal structure of the hashtag community, we can look at all the friends and/or followers of the hashtaggers. The graph for this is rather large (70k nodes and 90k edges), so after a lazyweb request to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).

If we load in the graph of “outer friends” (that is, the people the hashtaggers follow who are not hashtaggers themselves) and filter it to only show nodes with more than five or so incoming edges, we can see which Twitter users are followed by large numbers of the hashtaggers but who have not been using the hashtag themselves. Because the friends/followers lists return Twitter numerical IDs, we have to do a lookup on Twitter to find the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
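A minimal sketch of that lookup, assuming tweepy wraps the call as lookup_users and that the numerical IDs have been saved one per line:

import tweepy

auth = tweepy.OAuthHandler("????", "????")
auth.set_access_token("????", "????")
api = tweepy.API(auth)

ids = [int(line) for line in open('outerFriendIDs.txt')]

# the lookup call resolves up to 100 numerical IDs per request,
# so work through the list in batches of 100
for i in range(0, len(ids), 100):
  for user in api.lookup_users(user_ids=ids[i:i+100]):
    print str(user.id) + ',' + user.screen_name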

Okay, that’s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if anyone more graphically talented than I am would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them :-)

PS It also strikes me that having got the list of hashtaggers, I need to package this up with a set of tools that would let you:
– create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked-to homepages);
– find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).

(See also Topic and Event based Twittering – Who’s in Your Community?)

Written by Tony Hirst

August 25, 2010 at 9:08 pm

Posted in Tinkering, Visualisation


Data Driven Journalism – Survey

The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context for some very powerful data handling tools, as Simon “Guardian Datastore” Rogers’ appearance at Google I/O suggests:


(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)

As I start to doodle ideas for an open online course on something along the lines of “visually, data”, to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre Survey on Data Journalism, to try to find out what might actually be useful to journalists… ;-)

[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher.

I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…

PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data

PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visualisation is the thing, here are a few things to think about…


Ben Fry interviewed at Where 2.0 2011

Written by Tony Hirst

May 13, 2011 at 12:56 pm

A Tinkerer’s Toolbox: Data Driven Journalism

Earlier this week, I popped over to Lincoln to chat to @josswinn and @jmahoney127 about their ON Course course data project (I heartily recommend Jamie’s ON Course project blog), hopefully not setting them off down too many ratholes, erm, err, oops?!, as well as bewildering a cohort of online journalism students with a rapid fire presentation about data driven journalism…

I think I need to draw a map…

Written by Tony Hirst

March 23, 2012 at 9:49 pm

Posted in Anything you want, Presentation


Against Powerpoint – Animated Talks

Watching several of the sessions at #ili2010 earlier this week, I was struck by how many presenters are now using what I think of as a Presentation Zen style of slide, where full screen photographs are overlaid with half a dozen or so words…

…and over the summer, I attended two consecutive events, the JISC Innovation Forum:

JIF2010 session cartoon capture

and the European Journalism Centre’s Data Driven Journalism event:

EJC DDJ session cartoon capture

where the organisers had arranged for a “doodler” to visually record the key points from each session in a cartoon-like way, providing a “live capture” overview of the presentation.

Reviewing these visual captures post hoc is different to seeing them being constructed, of course, because the way the argument developed over time is missing: without the dynamic aspect of the presentation, it’s hard to see how the case was built up. (The same is true of looking at a flip chart record of a brainstorming session, or a mindmap of a topic.) Yes, you might be able to take the static capture, try to organise it, and develop insight based on things that seem to go together, but you miss out on the dynamic way in which the story was constructed, or told.

However, it is possible to use video to capture in an animated way the construction of the visual record of a spoken word event, as the RSA Animate videos do:

If you ever have access to an OHP, you can of course dynamically construct your own slides, as many lecturers and school teachers used to do, and possibly still do. One of my favourite and fondly remembered exponents of this approach, from watching Sunday morning TV as a child, was Edward de Bono:

An advantage of this approach, of course, is that it helps with pacing…

PS @herrdoktorc reminded me that Prezi can provide animated direction to a presentation and that is not a million miles away from the experience you might get from watching an RSA Animate video, at least on those occasions when the violent Prezi swirls don’t induce nausea and a feeling of seasickness in the audience;-) For a good example of Prezi in use, see Scott Leslie’s Open Educator as DJ.

PPS I’m also reminded of the sketchcast site, which I guess is to de Bono what Prezi is to RSA Animate?! I guess you could achieve a similar effect by screencapturing an online whiteboard session, for example…

PPPS While on the topic of alternative presentation styles, here’s a link to the Cooliris 2D slide grid style presentation I saw @cogdog deliver at Ed-Media 2009 in Hawaii last year.

Written by Tony Hirst

October 16, 2010 at 10:22 am

Posted in Anything you want


What is a Data Journalist?

Job ads come and go, so I thought I’d capture the main elements of this one from the BBC:

Data Journalist – Role Purpose and Aims

You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.

You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily new stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.

Key Knowledge and Experience

You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
You will have significant journalistic experience gained ideally from working in an international news environment.
The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
Experience of producing and developing data driven web content at a senior level within time and budget constraints.
A thorough understanding of the BBC World Service’s aims and the part this initiative plays in meeting them.
Excellent communication and interpersonal skills with ability to present information concisely to a broad audience including journalists and commissioning editors. You should be able to demonstrate the ability to influence, negotiate with and persuade others.
Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.
You will work in a multimedia way, when appropriate, liaising with online but also radio and TV and specialist output producers as required, from a range of language services. You will help lead the development of computer-assisted reporting skills in the wider news specials team.

MAIN RESPONSIBILITIES

To identify a range of significant statistics and data-driven stories that can be developed and result in finished graphics that can be used across BBC News websites.
To take a lead role in devising compelling ways of telling data-driven stories on the web, working with specials team designers, developers and journalists as required. Also liaising with radio and TV and specialist output producers across World Service as required, providing a joined-up multi-platform proposition for the audience.
To work with senior stakeholders and programme teams and be an internal expert who can interpret and concisely explain the significance of data to others, and related good practice.
Support the College of Journalism – to help devise training sessions in order to spread the knowledge and best practices of data driven journalism
To help inform the future development by FM&T of tools which enable data-driven stories to be told more quickly and effectively on the web.
To keep abreast of developments in data driven journalism, and pursue collaboration with other teams working on the same area, both within the BBC and also with external organisations.
Willingness to work across a range of online production skills in a flexible manner to BBC standards and values.
Using their own initiative, the successful candidate will be required to build relationships with major sources of content (e.g. BBC networks, programmes, external interest groups) and promote opportunities for cross-media production.

COMPETENCIES

Editorial Judgement
Makes the right editorial and policy decisions based upon a clear understanding of the BBC’s distinctive news agenda, the requirements of news and current affairs coverage.

Planning & Organising
Is able to think ahead in order to establish an effective and appropriate course of action for self and others. Prioritises and plans activities taking into account all the relevant issues and factors such as deadlines and resource requirements.

Analytical Thinking
Able to simplify complex problems, processes or projects into component parts, explore and evaluate them systematically.

Creative Thinking
Is able to transform creative ideas/impulses into practical reality. Can look at existing situations and problems in novel ways and come up with creative solutions.

Resilience
Can maintain personal effectiveness by managing own emotions in the face of pressure, setbacks or when dealing with provocative situations. Can demonstrate an approach to work that is characterised by commitment, motivation and energy.

Communication
The ability to get one’s message understood clearly by adopting a range of styles, tools and techniques appropriate to the audience and the nature of the information.

Influencing and Persuading
Ability to present sound and well reasoned arguments to convince others. Can draw from a range of strategies to persuade people in a way that results in agreement or behaviour change.

Managing Relationships and Team Working
Able to build and maintain effective working relationships with a range of people. Highly effective team player.

Written by Tony Hirst

December 4, 2010 at 12:54 pm

Posted in Jobs


UK Journalists on Twitter

A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about (Journalists on Twitter: how do Britain’s news organisations tweet?).

Tweetminster maintains separate lists of tweeting journalists for several different media groups, so it was easy to grab the names on each list, use the Twitter API to pull down the names of people followed by each person on the list, and then graph the friend connections between folk on the lists. The result shows that the hacks follow each other quite closely:

UK Media Twitter echochamber (via tweetminster lists)

Nodes are coloured by media group/Tweetminster list, and sized by PageRank, as calculated over the network using the Gephi PageRank statistic.

The force directed layout shows how folk within individual media groups tend to follow each other more intensely than they do people from other groups, but that said, inter-group following is still high. The major players across the media tweeps as a whole seem to be @arusbridger, @r4today, @skynews, @paulwaugh and @BBCLauraK.
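For what it’s worth, here’s a minimal sketch of the collection step, assuming tweepy’s list_members and friends_ids calls (the list owner and slug are illustrative guesses rather than the actual Tweetminster list names):

import tweepy

auth = tweepy.OAuthHandler("????", "????")
auth.set_access_token("????", "????")
api = tweepy.API(auth)

# hypothetical list slug: grab the screen names on one Tweetminster list...
members = [m.screen_name for m in api.list_members('tweetminster', 'guardian')]

# ...then, for each member, pull the IDs of the people they follow,
# ready for filtering down to just the intra-list (journalist) edges
friends = {}
for name in members:
  friends[name] = api.friends_ids(screen_name=name)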

I can generate an SVG version of the chart, and post a copy of the raw Gephi GDF data file, if anyone’s interested…

PS if you’re interested in trying out Gephi for yourself, you can download it from gephi.org. One of the easiest ways in is to explore your Facebook network.

PPS for details on how the above was put together, here’s a related approach: Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community.

For a slightly different view over the UK political Twittersphere, see Sketching the Structure of the UK Political Media Twittersphere. And for the House and Senate in the US: Sketching Connections Between US House and Senate Tweeps

Written by Tony Hirst

April 10, 2011 at 11:16 am

First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Journalism Centre round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between Anscombe’s quartet, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…). You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here are the commands I used (the data needs reading into a data frame first; the filename is assumed):

lapTimeFuel <- read.csv("lapTimeFuel.csv")  # load the laptime/fuel adjusted laptime data
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, the up arrow key steps back through previously entered commands; so you can re-run an earlier command by hitting the up arrow a few times, then return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
manipulate(
plot(cars, xlim=c(0,x.max)),
x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval), outline=outline),
maxval=slider(100,140),
outline=checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plotting done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes in choosing visual statistics approaches that don’t break any of the assumptions the underlying statistical method makes about the data – assumptions that need to hold for the method to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldn't stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]

Written by Tony Hirst

May 5, 2011 at 1:56 pm

Sketched Thoughts On Data Journalism Technologies and Practice

Over the last year or two, I’ve given a handful of talks to postgrad and undergrad students broadly on the topic of “technology for data driven journalism”. The presentations are typically uncompromising, which is to say I assume a lot. There are many risks in taking such an approach, of course, as waves of confusion spread out across the room… But it is, in part, a deliberate strategy intended to shock people into an awareness of some of the things that are possible with tools that are freely available for use in the desktop and browser based sheds of today’s digital tinkerers… Having delivered one such presentation yesterday, at UCA, Farnham, here are some reflections on the whole topic of “#ddj”. Needless to say, they do not necessarily reflect even my opinions, let alone those of anybody else;-)

The data-driven journalism thing is being made up as we go along. There is a fine tradition of computer assisted journalism, database journalism, and so on, but the notion of “data driven journalism” appears to have rather more popular appeal. Before attempting a definition, what are some of the things we associate with ddj that might explain the recent upsurge of interest around it?

  • access to data: this must surely be a part of it. In one version of the story we might tell, the arrival of Google Maps, and the reverse engineering of an API to it by Paul Rademacher for his April 2005 “Housing Maps mashup”, opened up people’s eyes to the possibility of map-based mashups; a short while later, in May 2005, Adrian Holovaty’s Chicago Crime Map showed how the same mashup idea could be used as an example of “live”, automated and geographically contextualised reporting of crime data. Mashups were all about appropriating web technologies and web content, building new “stuff” from pre-existing “stuff” that was already out there. And as an idea, mashups became all the rage way back then, offering as they did the potential for appropriating, combining and re-presenting elements of different web applications and publications without the need for (further) programming.
    In March 2006, a year or so after the first demonstration of the Housing Maps mashup, and in part as a response to the difficulty in getting hold of latitude and longitude data for UK based locations that was required to build Google Maps mashups around British locations, the Guardian Technology supplement (remember that? It had Kakuro puzzles and everything?!;-) launched the “Free Our Data” campaign (history). This campaign called for the free release of data collected at public expense, such as the data that gave the latitude and longitude for UK postcodes.
    The early promise of, and popular interest in, “mashups” waxed, and then waned; but there was a new tide rising in the information system that is the web: access to data. The mashups had shown the way forward in terms of some of the things you could do if you could wire different applications together, but despite the promise of no programming it was still too techie, too geeky, too damned hard and fiddly for most people; and despite what the geeks said, it was still programming, and there often still was coding involved. So the focus changed. Awareness grew about the sorts of “mashup” that were possible, so now you could ask a developer to build you “something like that”, as you pointed to an appropriate example. The stumbling block now was access to the data to power an app that looked like that, but did the same thing for this.
    For some reason, the notion of “open” public data hit a policy nerve, and in the UK, as elsewhere, started to receive cross-party support. (A brief history of open public data in a UK context is illustrated in the first part of Open Standards and Open Data.) The data started to flow, or at least, started to become both published (through mandated transparency initiatives, such as the release of public accounting data) and requestable (for example, via an extension to FOI by the Protection of Freedoms Act 2012).
    We’ve now got access in principle and in practice to increasing amounts of data, we’ve seen some of the ways in which it can be displayed and, to a certain extent, started to explore some of the ways in which we can use it as a source for news stories. So the time is right in data terms for data driven journalism, right?
  • access to visualisation technologies: it wasn’t very long ago when it was still really hard to display data on screen using anything other than canned chart types – pie charts, line charts, bar charts (that is, the charts you were introduced to in primary school. How many chart types have you learned to read, or create, since then?). Spreadsheets offer a range of grab-and-display chart generating wizards, of course, but they’re not ideal when working with large datasets, and they’re typically geared for generating charts for reports, rather than being used analytically. The visual analysis mantra – Overview first, zoom and filter, then details-on-demand – (coined in Ben Shneiderman’s 1997 article A Grander Goal: A Thousand-Fold Increase in Human Capabilities, I think?) arguably requires fast computers and big screens to achieve the levels of responsiveness that are required for interactive usage, and we have those now…

There are, however, still some considerable barriers to access:

  • access to clean data: you might think I’m repeating myself here, but access to data and access to clean data are two separate considerations. A lot of the data that’s out there and published is still not directly usable (you can’t just load it into a spreadsheet and work on it directly); things that are supposed to match often don’t (we might know that Open Uni, OU and Open University refer to the same thing, but why should a spreadsheet?); number columns often contain things that aren’t numbers (such as commas or other punctuation); dates are provided in a wide variety of formats that we can recognise as such, but a computer can’t – at least, not unless we give it a bit of help; data gets misplaced across columns; character encodings used by different applications and operating systems don’t play nicely; typos proliferate; and so on. So whose job is it to clean the data before it can be inspected or analysed? (For a sketch of what this sort of cleaning can look like, see the example after this list.)
  • access to skills and workflows: engineering practice tends to have a separation between the notion of “engineer” and “technician”. Over-generalising and trivialising matters somewhat, engineers have academic training, and typically come at problems from a theory dominated direction; technicians (or technical engineers) have the practical skills that can be used to enact the solutions produced by the engineers. (Of course, technicians can often suggest additional, or alternative, solutions, in part reflecting a better, or more immediate, knowledge about the practical considerations involved in taking one course of action compared to another.) At the moment, the demarcation of roles (and skills required at each step of the way) in a workflow based around data discovery, preparation, analysis and reporting is still confused.
  • What questions should we ask? If you think of data as a source, with a story to tell: how do you set about finding that source? Why do you even think you want to talk to that source? What sorts of questions should you ask that source, and what sorts of answer might you reasonably expect it to provide you with? How can you tell if that source is misleading you, lying to you, hiding something from you, or is just plain wrong? To what extent do you or should you trust a data source? Remember, every cell in a spreadsheet is a fact. If you have a spreadsheet containing a million data cells, that’s a lot of fact checking to do…
  • low or misplaced expectations: we don’t necessarily expect Journalism students to know how to drive a spreadsheet, let alone run or apply complex statistics, or even have a great grasp on “the application of number”; but should they? I’m not totally convinced we need to get them up to speed with yesterday’s tools and techniques… As a tool builder/tool user, I keep looking for tools and ways of using tools that may be thought of as emerging “professional” tools for people who work with data on a day-to-day basis, but wouldn’t class themselves as data scientists, or data researchers; tools for technicians, maybe. When presenting tools to students, I try showing the tools that are likely to be found on a technician’s workbench. As such, they may look a little bit more technical than tools developed for home use (compare a socket set from a trade supplier with a £3.50 tool-roll bargain offer from your local garage), but that’s because they’re quality tools that are fit for purpose. And as such, it may take a bit of care, training and effort to learn how to use them. But I thought the point was to expose students to “industry-strength” ideas and applications? And in an area where tools are developing quite quickly, students are exactly the sort of people we need to start engaging with them: 1) at the level of raising awareness about what these tools can do; 2) as a vector for knowledge and technology transfer, getting these tools (or at least, ideas about what they can do) out into industry; 3) for students so inclined, recruiting those students for the further development of the tools, recruiting power users to help drive requirements for future iterations of the tools, and so on. If the journalism students are going to be the “engineers” to the data wrangler technicians, it’ll be good for them to know the sorts of things they can reasonably ask their technicians to help them to do… Which is to say, the journalists need exposing to the data wrangling factory floor.
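By way of a concrete example of the “clean data” problem mentioned above, here’s a minimal sketch using the Python pandas library (the file and column names are made up for illustration):

import pandas as pd

# Hypothetical dirty spreadsheet export: comma-grouped numbers and
# dates entered in a mix of formats
df = pd.read_csv('spendingData.csv')

# "1,250,000" is a string as far as the spreadsheet is concerned;
# strip the separators so the column really is numeric
df['Amount'] = df['Amount'].str.replace(',', '').astype(float)

# normalise the various date representations into a single date type
df['Date'] = pd.to_datetime(df['Date'])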

Although a lot of the #ddj posts on this OUseful.info blog relate to tools, the subtext is all about recognising data as a medium, the form particular datasets take, and the way in which different tools can be used to work with these forms. In part this leads to a consideration of the process questions that can be asked of a data source based on identifying natural representations that may be contained within it (albeit in hidden form). For example, a list of MPs hints at a list of constituencies, which have locations, and therefore may benefit from representation in a geographical, map based form; a collection of emails might hint at a timeline based reconstruction, or network analysis showing who corresponded with whom (and in what order), maybe?

And finally, something that I think is still lacking in the formulation of data journalism as a practice is an articulation of the process of discovering the stories from data: I like the notion of “conversations with data” and this is something I’ll try to develop over forthcoming blog posts.

PS see also @dkernohan’s The campaigning academic?. At the risk of spoiling the punchline (you should nevertheless go and read the whole thing), David writes: “There is a space – in the gap between academia and journalism, somewhere in the vicinity of the digital humanities movement – for what I would call the “campaigning academic”, someone who is supported (in a similar way to traditional research funding) to investigate issues of interest and to report back in a variety of accessible media. Maybe this “reporting back” could build up into equivalence to an academic reward, maybe not.

These would be cross-disciplinary scholars, not tied to a particular critical perspective or methodology. And they would likely be highly networked, linking in both to the interested and the involved in any particular area – at times becoming both. They might have a high media profile and an accessible style (Ben Goldacre comes to mind). Or they might be an anonymous but fascinating blogger (whoever it is that does the wonderful Public Policy and The Past). Or anything in between.

But they would campaign, they would investigate, they would expose and they would analyse. Bringing together academic and old-school journalistic standards of integrity and verifiability.”

Mixed up in my head – and I think in David’s – is the question of “public accounting”, as well as sensemaking around current events and trends, and the extent to which it’s the role of “the media” or “academic” to perform such a function. I think there’s much to be said for reimagining how we inform and educate in a network-centric web-based world, and it’s yet another of those things on my list of things I intend to ponder further… See also: From Academic Privilege to Consultations as Peer Review.

Written by Tony Hirst

November 6, 2012 at 2:39 pm

Posted in Infoskills, onlinejournalismblog

Tagged with ,

Recently (And Not So Recently) Advertised Data Jobs…

I’m doing a couple of talks to undergrad and postgrad students next week – on data journalism at the University of Lincoln, and on open data at the University of Southampton – so I thought I’d do a quick round up of recently advertised data related jobs that I could reuse for an employability slide…

So, here are some of the things I’ve noticed recently:

  • The Technology Strategy Board, funders of many a data related activity (including the data vouchers for SMEs), are advertising for a Lead Technologist – Data Economy (£45,000 to £55,000):

    The UK is increasingly reliant on its service economy, and on the ability to manage its physical economy effectively, and it exports these capabilities around the world. Both aspects of this are heavily dependent on the availability of appropriate information at the right place and time, which in turn depends on our ability to access and manipulate diverse sources of data within a commercial environment.

    The internet and mobile communications and the ready availability of computing power can allow the creation of a new, data-rich economy, but there are technical, human and business challenges still to be overcome. With its rich data resources, inventive capacity and supportive policy landscape, the UK is well placed to be the centre of this innovation.

    Working within the Digital team, to develop and implement strategies for TSB’s interventions in and around the relevant sectors.

    This role requires the knowledge and expertise to develop priorities for how the UK should address this opportunity, as well as the interpersonal skills to introduce the relevant communities of practice to appropriate technological solutions. It also requires a knowledge of how innovation works within businesses in this space, to allow the design and targeting of TSB’s activities to effectively facilitate change.

    Accessible tools include, but are not restricted to, networking and community building, grant-funding of projects at a wide range of scales, directing support services to businesses, work through centres such as the Open Data Institute and Connected Digital Economy Catapult, targeted procurement through projects such as LinkedGov, and inputs to policy. The role requires drawing upon this toolkit to design a coordinated programme of interventions that has impact in its own right and which also coordinates with other activities across TSB and the wider innovation landscape.

  • Via the EJC, a relayed message from the NICAR-L mailing list about a couple of jobs going with The Times and Sunday Times:

    A couple of jobs that might be of interest to NICAR members here at the Times of London…

    The first is an investigative data journalist role, joining the new data journalism unit which will work across both The Times and The Sunday Times.

    The other is an editorial developer role: this will sit within the News Development Team and will focus on anything from working out how we tell stories in richer more immersive ways, to creating new ways of presenting Times and Sunday Times journalism to new audiences.

    Please get in touch if you are interested!

    Lucia
    Head of news development, The Times and Sunday Times
    @lucia_adams

Not a job ad as such, but an interesting recent innovation from the Birmingham Mail:

We’ve launched a new initiative looking at the numbers behind our city and the stories in it.
‘Behind The Numbers’ is all about the explosion in ‘data’: information about our hospitals and schools, crime and the way it is policed, business and sport, arts and culture.
We’d like you to tell us what data you’d like us to publish and dig into. Email suggestions to ben.hurst@trinitymirror.com. Follow @bhamdatablog on Twitter for updates or to share ideas.

This was also new to me: FT Data, a stats/datablog from the FT? FullFact is another recent addition to my feed list, with a couple of interesting stories each day and plenty of process questions and methodological tricks that can be, erm, appropriated ;-) Via @JackieCarter, the Social Statistics blog looked interesting, but the partial RSS feed is a real turn off for me so I’ll probably drop it from my reader pretty quickly unless it turns up some *really* interesting posts.

Here are some examples of previously advertised jobs…

  • A job that was being advertised at the end of last year (now closed) by the Office for National Statistics (ONS) (current vacancies) was for the impressive sounding Head of Rich Content Development:

    The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.

    Key Responsibilities:

    1. Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
    2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
    3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
    4. Works closely with business areas to identify, assess and commission new rich-content projects.
    5. Provides vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
    6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
    7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
    8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.

  • From December 2010, a short term contract at the BBC for a data journalist:

    The team is looking for a creative, tech-savvy data journalist (computer-assisted reporter) to join its website specials team to work with our online journalists, graphic designer and development teams.

    Role Purpose and Aims

    You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.

    You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily new stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.

    Key Knowledge and Experience

    You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
    You will have significant journalistic experience gained ideally from working in an international news environment.
    The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
    You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
    You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
    Experience of producing and developing data driven web content at a senior level within time and budget constraints.

    Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
    You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.

    FWIW, it’s probably worth remembering that the use of data is not necessarily a new thing… for example, this post – The myth of the missing Data Scientist – does a good job debunking some of the myths around “data science”.

Written by Tony Hirst

February 1, 2013 at 1:02 pm

Posted in Jobs

Tagged with ,
