I was pleased to be invited back to the University of Lincoln again yesterday to give a talk on data journalism to a couple of dozen or so journalism students…
I was hoping to generate a copy of the slides (as images) embedded in a markdown version of the notes but couldn’t come up with a quick recipe for achieving that…
When I get a chance, it looks as if the easiest way will be to learn some VBA/Visual Basic for Applications macro scripting… So for example:
If anyone beats me to it, I’m actually on a Mac, so from the looks of things on Stack Overflow, hacks will be required to get the VBA to actually work properly?
One of the big issues when it comes to working with data is that things are often done most easily using small fragments of code. Something I’ve been playing with recently are IPython notebooks, which provide an interactive browser based interface to a Python shell.
The notebooks are built around the idea of cells of various types, including header and markdown/HTML interpreted cells and executable code cells. Here are a few immediate thoughts on how we might start to use these notebooks to support distance and self-paced education:
The output from executing a code cell is displayed in the lower part of the cell when the cell is run. Code execution causes state changes in the underlying IPython session, the current state of which is accessible to all cells.
Graphical outputs, such as chart objects generated using libraries such as matplotlib, can also be displayed inline (not shown).
There are several ways we might include a call to action for students:
* invitations to run code cells;
* invitations to edit and run code cells;
* invitations to enter commentary or reflective notes.
We can also explore ways of providing “revealed” content. One way is to make use of the ability to execute code, for example by importing a crib class that we can call…
Here’s how we might prompt its use:
All code cells in a notebook can be executed one after another, or a cell at a time (You can also execute all cells above or below the cursor). This makes for easy testing of the notebook and self-paced working through it.
A helper application, nbconvert, allows you to generate alternative versions of a notebook, whether as HTML, python code, latex, or HTML slides (reveal.js). There is also a notebook viewer available that displays an HTML view of a specified ipynb notebook. (There is also an online version: IPython notebook viewer.)
Another advantage of the browser based approach is that the IPython shell can run in a virtual machine (for example, Cursory Thoughts on Virtual Machines in Distance Education Courses) and expose the notebook as a service that can be accessed via a browser on a host machine:
A simple configuration tweak allows notebook files and data files to occupy a folder that is shared across both the host machine and the guest virtual machine.
It is also possible to run remote notebook servers that can be accessed via the web. It would be nice if institutional IT services could support the sort of Agile EdTech that Jim Groom has been writing about recently, that would allow course teams developers to experiment with this sort of technology quickly and easily, but in the meantime, we can still do it ourselves…
For example, I am currently running a couple of virtual machines whose configurations I have “borrowed” from elsewhere – Matthew Russell’s Mining the Social Web 2nd Edition VM, and @datamineruk’s pandas infinite-intern machine.
I’ve only really just started exploring what may be possible with IPython notebooks, and how we might be able to use them as part of an e-learning course offering. If you’ve seen any particularly good (or bad!) examples of IPython notebooks being used in an educational context, please let me know via the comments…:-)
As part of a new course I’m working on, the course team has been making use of shared Google docs for working up the course proposal and “D0″ (zero’th draft; key topics to be covered in each of the weekly sessions). Although the course production hasn’t been approved yet, we’ve started drafting the actual course materials, with an agreement to share them for comment via Google docs.
The approach I’ve taken is to created a shared folder with the rest of the course teams, and set up documents for each of the weekly sessions I’ve taken the lead on.
The documents in this folder are all available to other members of the course team – for reference and /or comment – at any time, and represent the “live”/most current version of each document I’m working on. I suspect that others in the course team may take a more cautious approach, only sharing a doc when it’s in a suitable state for handover – or at least, comment – but that’s fine too. My docs can of course be used that way as well – no-one has to look at them until I do “hand them over” for full comment at the end of the first draft stage.
But what if others, such as the course team chair or course manager, do want to keep check on progress over the coming weeks?
The file listing shown above doesn’t give a lot away about the stare of each document, not even a file size, only when it was last worked on. So it struck me that it might be useful to have a visual indicator (such as a horizontal progress bar) about the progress on each document so that someone looking at the listing would know whether there was any point opening a document to have a look inside at all…
..because at the current time, a lot of the docs are just stubs, identifying tasks to be done.
Progress could be measured by proxy indicators, such as file size, “page count” equivalent, or line count. In these cases, the progress meter could be updated automatically. Additional insight could be provided by associating a target line count or page length metadata element, providing additional feedback to the author about progress with respect to that target. If a document exceeds the planned length, the progress meter should carry on going, possibly with a different colour denoting the overrun.
There are a couple of problems at least with this approach – documents that are being worked on may blend scruffy working notes along with actual “finished” text; several versions of the same paragraph may exist as authors try out different approaches, all adding to the line count. Long copied chunks from other sources may be in the text as working references, and so on.
So how about an additional piece of metadata for docs additionally tagged as “task” type in which a user can set a quick progress percentage estimate (a slider widget would make this easy to update) that is displayed in a bar on the file listing. Anyone checking the folder could then – at a glance – see which docs were worth looking at based on progress within the document-as-task. (Of course, having metadata available also opens up the possibility of additional mission creeping features, rulesets for generating alerts when a doc hits a particular percentage completion, for example.)
I’m not looking for more project management tools to take time away from a task, but in this case think the simple addition of a “progress” metadata element could weave an element of project management support into this sort of workflow? (changing the title of the doc would be another way – eg adding (20% done) to the title…
Thinks: hmm, I procrastinating, aren’t I? I should really be working on one of those docs…;-)
Over the last couple of years, I’ve settled into using R an python as my languages of choice for doing stuff:
- R, because RStudio is a nice environment, I can blend code and text using R markdown and knitr, ggplot2 and Rcharts make generating graphics easy, and reshapers such as plyr make wrangling with data realtvely easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
- python, because it’s an all round general purpose thing with lots of handy libraries, good for scraping, and a joy to work with in iPython notebook…
Sometimes, however, you know – or remember – how to do one thing in one language that you’re not sure how to do in another. Or you find a library that is just right for the task hand but it’s in the other language to the one in which you’re working, and routing the data out and back again can be a pain.
How handy it would be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven’t tried any of these recipes yet…):
Using R inside Python Programs
Whilst python has a range of plotting tools available for it, such as matplotlib, I haven’t found anything quite as a expressive as R’s ggplot2 (there is a python port of ggplot underway but it’s still early days and the syntax, as well as the functionality, is still far from complete as compared to the original [though not a far as it was given the recent update;-)] ). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?
(See also: ggplot2 in Python: A major barrier broken.)
Using python Inside R
Whilst one of the things I often want to do in python is plot R style ggplots, one of the hurdles I often encounter in R is getting data in in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and WIndows support appears to be moot. What would be nice would be for RStudio to include some magic, or be able to support python based chunks…
(See also: Calling Python from R with rPython.)
(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists use code for – as a tool that comes to hand and that can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)
Every so often I do a round up of job openings in different areas. This is particular true around year end, as I look at my dwindling salary (no more increments, ever, and no hope of promotion, …) and my overall lack of direction, and try to come up with sort sort of resolution to play with during the first few weeks of the year.
The data journalism phrase has being around for some time now (was it really three and half years ago since Data Driven Journalism Round Table at the European Journalism Centre? (FFS, what have I been doing for the last three years?!:-( and it seems to be maturing a little. We’ve had the period of shiny, shiny web apps requiring multiskilled development teams and designers working with the hacks to produce often confusing and wtf am I supposed to be doing now?! interactives and things seem to be becoming a little more embedded… Perhaps…
My reading (as an outsider) is that there is now more of a move towards developing some sort of data skillbase that allows journalists to do “investigative” sorts of things with data, often using very small data sets or concise summary datasets. To complement this, there seems to be some sort of hope that visually appealing charts can be used to hook eyeballs into a story (rather than pushing eyeballs away) – Trinity Mirror’s Ampp3d (as led by Martin Belam) is a good example of this, as is the increasing(?) use of the DataWrapper library.
From working with the School of Data, as well as a couple of bits of data journalism not-really-training with some of the big news groups, I’ve come to realise there is probably some really basic, foundational work to be done in the way people think (or don’t think) about data. For example, I don’t believe that people in general read charts. I think they may glance at them, but they don’t try to read them. They have no idea what story they tell. Given a line chart that plots some figure over time. How many people ride the line to get a feel for how it really changed?
Hans Rosling famously brings data alive with his narrative commentary around animated development data charts, including bar charts…
But if you watch the video with the sound off, or just look at the final chart, do you have the feeling of being told the same story? Can you even retell yourself the story by looking at the chart. And how about if you look at another bar chart? Can you use any of Hans Rosling’s narrative or rhetorical tricks to help you read through those?
(The rhetoric of data (and the malevolent arts of persuasion) is something I want to ponder in more depth next year, along with the notion of data aesthetics and the theory of forms given a data twist.)
Another great example of narrated data storytelling comes from Kurt Vonnegut as he describes the shapes of stories:
Is that how you read a line chart when you see one?
One thing about the data narration technique is that it is based around the construction of a data trace. There is a sense of anticipation about where the line will go next, and uncertainty as to what sort of event will cause the line to move one way or another. Looking back at a completed data chart, what points do we pick from it that we want to use as events in our narration or reading of it? (The lines just connect the points – they are processional in the way they move us from one point of interest to the next, although the gradient of the line may provide us with ideas for embellishing or decorating the story a little.)
It’s important to make art because the people that get the most out of art are the ones that make it. It’s not … You know there’s this idea that you go to a wonderful art gallery and it’s good for you and it makes you a better person and it informs your soul, but actually the person who’s getting the most out of any artistic activity is the person who makes it because they’re sort of expressing themselves and enjoying it, and they’re in the zone and you know it’s a nice thing to do. [Grayson Perry, Reith Lectures 2013, Lecture 2, Q&A [transcript, PDF], audio]
In the same way, the person who gets the most out of a chart is the person who constructed it. They know what they left in and what they left out. They know why the axes are selected as they are, why elements are coloured or sized as they are. They know the question that led up to the chart and the answers it provides to those questions. They know where to look. Like an art critic who reads their way round a painting, they know how to read one or many different stories from the chart.
The interactives that appeared during the data journalism wave from a couple of years ago sought to provide a playground for people to play with data and tells their own stories with it. But they didn’t. In part because they didn’t know how to play with data, didn’t know how to use it in a constructive way as part of a narrative, (even a made up, playful narrative). And in part this comes back to not knowing how to read – that is, recover stories from – a chart.
It is often said that a picture saves a thousand words, but if the picture tells a thousand word story, how many of us try to read that thousand word story from each picture or chart? Maybe we need to use a thousand words as well as the chart? (How many words does Hans Rosling use? How many, Kurt Vonnegut?)
When producing a chart that essentially represents a summary of a conversation with have had with a dataset, it’s important to remember that for someone looking at the final chart it might not make as much sense in absence of the narrative that was used to construct it. Edward de Bono’s constructed illustrations helps read a the final image through recalling his narrative. But if we just look at a “completed” sketch from one of his talks, it will probably be meaningless.
One of the ideas that works for me when I reflect on my own playing with data is that it is a conversation. Meaning is constructed through the conversation I have with a dataset, and the things it reveals when I pose particular questions to it. In many cases, these questions are based on filtering a dataset, although the result may be displayed in many ways. The answers I get to a question inform the next question I want to ask. Questions take the form of constructing this chart as opposed to that chart, though I am free to ask the same question in many slightly different ways if the answers don’t appear to be revealing of anything.
It is in this direction – of seeing data as a source that can be interviewed and coaxed into telling stories – that I sense elements of the data journalism thang are developing. This leads naturally into seeing data journalism skills as core investigative style skills that all journalists would benefit from. (Seeing things as data allows you to ask particular sorts of question in very particular ways. Being able to cast things into a data form – as for example in Creating Data from Text – Regular Expressions in OpenRefine) – so that they become amenable to data-style queries, is the next idea I think we need to get across…
So what are the jobs that are out at the moment? Here’s a quick round-up of some that I’ve spotted…
- Data editor (Guardian): “develop and implement a clear strategy for the Data team and the use of data, numbers and statistics to generate news stories, analysis pieces, blogs and fact checks for The Guardian and The Observer.
You will take responsibility for commissioning and editing content for the Guardian and Observer data blogs as well as managing the research needs of the graphics department and home and foreign news desks. With day-to-day managerial responsibility for a team of three reporters / researchers working on the data blog, you will also be responsible for data analysis and visualisation: using a variety of specialist software and online tools, including Tableau, ARCGis, Google Fusion, Microsoft Access and Excel”
Perpetuating the “recent trad take” on data journalism, viewed as gonzo journalist hacker:
- Data Journalist [Telegraph Media Group]: “[S]ource, sift and surface data to find and generate stories, assist with storytelling and to support interactive team in delivering data projects.
“The Data Journalist will sit within the Interactive Data team, and will work with a team of designers, web developers and journalists on data-led stories and in developing innovative interactive infographics, visualisations and news applications. They will need to think and work fast to tackle on-going news stories and tight deadlines.
- One of the most exciting opportunities that I can see around data related published is in new workflows and minimising the gap between investigative tools and published outputs. This seems to me a bit risky in that it seems so conservative when it comes to getting data outputs actually published?
Designer [Trinity Mirror]: “Trinity Mirror’s pioneering data unit is looking for a first-class designer to work across online and print titles. … You will be a whizz with design software – such as Illustrator, Photoshop and InDesign – and understand the principles of designing infographics, charts and interactives for the web. You will also be able to design simple graphical templates for re-purposing around the group.
“You should have a keen interest in current affairs and sport, and be familiar with – and committed to – the role data journalism can play in a modern newsroom.”
- [Trinity Mirror]: Can you take an API feed and turn it into a compelling gadget which will get the whole country talking?
“Trinity Mirror’s pioneering data unit is looking for a coder/developer to help it take the next step in helping shape the future of journalism. …
“You will be able to create tools which automatically grab the latest data and use them to create interactive, dynamically-updated charts, maps and gadgets across a huge range of subjects – everything from crime to football. …
“The successful candidate will have a thorough knowledge of scraping techniques, ability to manage a database using SQL, and demonstrable ability in at least one programming language.”
But there is hope about the embedding of data skills as part of everyday journalistic practice:
- Culture report (Guardian): “We are looking for a Culture Reporter to generate stories and cover breaking news relating to Popular Culture, Film and Music … Applicants should also have expertise with digital tools including blogging, social media, data journalism and mobile publishing. “
- Investigations Correspondent [BBC Newsnight]: “Reporting to the Editor, the Investigations Correspondent will contribute to Newsnight by producing long term investigations as well as sometimes contributing to big ongoing stories. Some investigations will take months, but there will also be times when we’ll need to dig up new lines on moving the stories in days.
“We want a first rate reporter with a proven track record of breaking big stories who can comfortably work across all subject areas from politics to sport. You will be an established investigative journalist with a wide range of contacts and sources as well as having experience with a range of different investigative approaches including data journalism, Freedom Of Information (FOI) and undercover reporting.”
- News Reporter, GP [Haymarket Medical Media]: “GP is part of Haymarket Medical Media, which also produces MIMS, Medeconomics, Inside Commissioning, and mycme.com, and delivers a wide range of medical education projects. …
“Ideally you will also have some experience of data journalism, understand how social media can be used to enhance news coverage and have some knowledge of multimedia journalism, including video and blogs.”
- Reporter, ENDS Report [Haymarket]: “We are looking for someone who has excellent reporting and writing skills, is enthusiastic and able to digest and summarise in depth documents and analysis. You will also need to be comfortable with dealing with numbers and statistics and prepared to sift through data to find the story that no one else spots.
“Ideally you will have some experience of data journalism, understand how social media can be used to enhance news coverage.”
Are there any other current ones I’m missing?
I think the biggest shift we need is to get folk treating data as a source that responds to a particular style of questioning. Learning how to make the source comfortable and get it into a state where you can start to ask it questions is one key skill. Knowing how to frame questions so that discover the answers you need for a story are another. Choosing which bits of the conversation you use in a report (if any – maybe the conversation is akin to a background chat?) yet another.
Treating data as a source also helps us think about how we need to take care with it – how not to ask leading questions, how not to get it to say things it doesn’t mean. (On the other hand, some folk will undoubtedly force the data to say things it never intended to say…
“If you torture the data enough, nature will always confess” [Ronald Coase]
[Disclaimer: I started looking at some medical data for Haymarket.]
Via @wilm, I notice that it’s time again for someone (this time at the Wall Street Journal) to have written about the scariness that is your Google personal web history (the sort of thing you probably have to opt out of if you sign up for a new Google account, if other recent opt-in by defaults are to go by…)
It may not sound like much, but if you do have a Google account, and your web history collection is not disabled, you may find your emotional response to seeing months of years of your web/search history archived in one place surprising… Your Google web history.
Not mentioned in the WSJ article was some of the games that the Chrome browser gets up. @tim_hunt tipped me off to a nice (if technically detailed, in places) review by Ilya Grigorik of some the design features of the Chrome browser, and some of the tools built in to it: High Performance Networking in Chrome. I’ve got various pre-fetching tools switched off in my version of Chrome (tools that allow Chrome to pre-emptively look up web addresses and even download pages pre-emptively*) so those tools didn’t work for me… but looking at chrome://predictors/ was interesting to see what keystrokes I type are good predictors of web pages I visit…
* By the by, I started to wonder whether webstats get messed up to any significant effect by Chrome pre-emptively prefetching pages that folk never actually look at…?
In further relation to the tracking of traffic we generate from our browsing habits, as we access more and more web/internet services through satellite TV boxes, smart TVs, and catchup TV boxes such as Roku or NowTV, have you ever wondered about how that activity is tracked? LG Smart TVs logging USB filenames and viewing info to LG servers describes not only how LG TVs appear to log the things you do view, but also the personal media you might view, and in principle can phone that information home (because the home for your data is a database run by whatever service you happen to be using – your data is midata is their data).
there is an option in the system settings called “Collection of watching info:” which is set ON by default. This setting requires the user to scroll down to see it and, unlike most other settings, contains no “balloon help” to describe what it does.
At this point, I decided to do some traffic analysis to see what was being sent. It turns out that viewing information appears to be being sent regardless of whether this option is set to On or Off.
you can clearly see that a unique device ID is transmitted, along with the Channel name … and a unique device ID.
This information appears to be sent back unencrypted and in the clear to LG every time you change channel, even if you have gone to the trouble of changing the setting above to switch collection of viewing information off.
It was at this point, I made an even more disturbing find within the packet data dumps. I noticed filenames were being posted to LG’s servers and that these filenames were ones stored on my external USB hard drive.
Hmmm… maybe it’s time I switched out my BT homehub for a proper hardware firewalled router with a good set of logging tools…?
PS FWIW, I can’t really get my head round how evil on the one hand, or damp squib on the other, the whole midata thing is turning out to be in the short term, and what sorts of involvement – and data – the partners have with the project. I did notice that a midata innovation lab report has just become available, though to you and me it’ll cost 1500 squidlly diddlies so I haven’t read it: The midata Innovation Opportunity. Note to self: has anyone got any good stories to say about TSB supporting innovation in micro-businesses…?
PPS And finally, something else from the Ilya Grigorik article:
The HTTP Archive project tracks how the web is built, and it can help us answer this question. Instead of crawling the web for the content, it periodically crawls the most popular sites to record and aggregate analytics on the number of used resources, content types, headers, and other metadata for each individual destination. The stats, as of January 2013, may surprise you. An average page, amongst the top 300,000 destinations on the web is:
– 1280 KB in size
– composed of 88 resources
– connects to 15+ distinct hosts
Is it any wonder that pages take so long to load on a mobile phone off the 3G netwrok, and that you can soon eat up your monthly bandwidth allowance!
Whenever I make a data related FOI request or a request for data within my own organisation, I try to ask for copies of the data as data in the form of a spreadsheet, CSV file or database dump. One advantage of this approach is that it can help establish precedents that argue against the non-disclosure of similar datasets from other organisations. (It also provides an FOI Route to Real (Fake) Open Data via WhatDoTheyKnow, a means by which we can build an index of data files released in response to FOI requests made via WhatDoTheyKnow. I also note a technique from Paul Bradshaw for requesting schemas and data dictionaries as a basis for making specific requests where whole database dump requests may not be possible, or for enriching/decoding datasets that have been released.)
It hadn’t occurred to me to try to request data explicitly (and sneakily?!;-) in a form that if lazily produced might actually contain more data than the publisher intended, but it strikes me that this could be a handy social engineering trick. Take this case, for example, as described in the ICO blog: The risk of revealing too much, which describes in part how responses to FOI requests made via WhatDoTheyKnow may accidentally be revealing personally identifiable information:
The issue relates to responses to freedom of information (FOI) requests provided in spreadsheets, which are inadvertently revealing personal information. Public authorities will often respond to requests by supplying the information requested in spreadsheet format. Sometimes that will be in the form of a ‘pivot table’, which can neatly summarise the information, without revealing the underlying personal information the summary is based on.
Unfortunately, it has come to our attention that public authorities are not always properly removing the underlying data before disclosing. Pivot tables, both in Microsoft Excel and other spreadsheet programs, retain a copy of the source data used. This information is hidden from view, but is easily accessible.
This is just a variant on revealing the document metadata that describes tracked changes in a Word document (e.g. Microsystems white paper: What Lies Beneath YourDocuments May Embarrass, Hurt or Cost You) in that it relies on the user being unaware of how the document is actually structured or what metadata it may contain, and represents another example of how our folk understanding of IT is often at odds to what’s actually going on inside an application.
I don’t know if it’s still the case, but chart objects embedded in a Word doc from an Excel spreadsheet used to carry the original data from which the chart was derived with them, and which therefore provides another way of getting hold of actual data points… (For example, a document publisher may think they are giving you an image of a chart, not appreciating that the chart is actually constructed from data values contained within the chart object.)
Of course, if data were published using simple text formats, such as CSV, there would be nowhere to hide any embarrassing metadata…