How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is fine as far as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code, a means of importing the data used in the analysis, and the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded-in data, then third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do just that. (See here for an example.)
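By way of illustration, here’s a minimal sketch of what such a document might look like; the dataset URL and the column names in the chart are placeholders rather than anything taken from the linked example:

````
---
title: "Example reproducible report"
output: html_document
---

```{r load-data}
# Load the data used in the analysis directly from its published location
# (placeholder URL)
df <- read.csv("https://example.org/some_open_dataset.csv")
```

```{r summary-chart}
# The chart is generated from the loaded data when the document is knitted,
# rather than being pasted in as an image
library(ggplot2)
ggplot(df, aes(x = category, y = value)) + geom_col()
```
````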

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a copy of a dataset I’d downloaded onto my own computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:

  • either added a local copy of the data to the repository and checked that the script linked to it via a relative path;
  • and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.
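For example, a minimal R sketch of the two options might look something like this (the file path and URL are placeholders):

```r
# Option 1: a local copy of the data committed to the repository, linked
# relatively so the script still works when someone else clones the repo
#df <- read.csv("data/original_datafile.csv")

# Option 2: load the data directly from its original download URL
# (placeholder URL; the report text would also note the web page the
# download link was found on)
df <- read.csv("https://example.org/path/to/original_datafile.csv")
```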

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it. (This often needs a few more tweaks when trying to write a reusable downloader! A stop-gap is to provide the URL in the reproducible report document, explicitly instruct the reader to download the dataset locally using their own credentials, and then load it in from the local copy.)
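In sketch form, that stop-gap pattern amounts to something like this (the file name and URL are placeholders):

```r
# The report quotes the source URL, but actually loads a local copy that the
# reader has downloaded using their own credentials (placeholder names)
local_copy <- "data/protected_dataset.csv"

if (!file.exists(local_copy)) {
  stop("Please download the dataset from https://example.org/secure/dataset ",
       "using your own credentials and save it as ", local_copy)
}

df <- read.csv(local_copy)
```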

Or, as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add cases where data is commercial and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free but licence-key-restricted services whose outputs may not be shareable (or at least require attribution), such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that allows a third party to tag the original dataset themselves and then analyse it.
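In outline, such a recipe might look something like the following sketch; the endpoint, header and response handling are purely illustrative placeholders rather than the API of any of the services named above, and each reader supplies their own API key:

```r
library(httr)

# Hypothetical tagging function: POST some text to an entity extraction
# service (placeholder endpoint) and return the parsed response
tag_text <- function(text, api_key) {
  resp <- POST("https://api.example.com/entities",
               add_headers(`x-api-key` = api_key),
               body = list(text = text), encode = "json")
  content(resp)
}

# Each reader runs the tagging step themselves, with their own key
#entities <- lapply(original_dataset$text, tag_text,
#                   api_key = Sys.getenv("NER_API_KEY"))
```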

Computer Spirits…

I doubt there are many readers of this blog who aren’t familiar with science fiction guru Arthur C. Clarke’s adage that “[a]ny sufficiently advanced technology is indistinguishable from magic”. And there may even be a playful few who invoke Rowlingesque spells on the command line using Harry Potter bash aliases. So I was wondering again today about what other magical or folkloric ideas could be used to help engage folks’ curiosity about how the world of tech works, and maybe teach computing-related ideas through stories.

For example, last week I noticed that a reasonable number of links on Wikipedia point to the Internet Archive.

I also picked up, from a recent Recode/Decode podcast interview between the person you may know as the awesomest tech interviewer ever, Kara Swisher, and Internet Archive champion Brewster Kahle, that bots do the repair work. So things like User:InternetArchiveBot and/or CyberBot II, maybe? Broken links are identified, and link references updated to point to archival copies. (For more info, see: More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine and Fixing broken links in Wikipedia (especially the comments).)

Hmm… helpful bots.. like helpful spirits, or Brownies in a folkloric sense. Things that come out at night and help invisibly around the home…

And if there are helpful spirits, there are probably malicious ones too. The code equivalent of boggarts and bogles that cause mischief or mayhem – robot phone callers, or scripts that raise pop-ups when you’re trying to read a post online, for example? Maybe if we start to rethink online tech inconveniences as malevolent spirits we’ll find better ways to ignore or dispel them?! Or at least find a way to engage people in thinking about them, and from that work out how best to get rid of them or banish them from our lives?

PS the problem of Link Rot is an issue for maintaining OU course materials too. As materials are presented year on year, link targets move away and/or die. Sometimes the materials are patched with a corrected link to wherever the resource moved to, other times we refresh materials and find a new resource to link to. But generally, I wonder, why don’t we make like Wikipedia and get a Brownie to help? Are there Moodle bots to do helpful work like this around the VLE?

Reuse and Build On – IW Broadband Reports

A couple of weeks ago I posted a demo of how to automate the production of a templated report (catchment for GP practices by LSOA on the Isle of Wight) using Rmd and knitr (Reporting in a Repeatable, Parameterised, Transparent Way).

Today, I noticed another report, with data, from the House of Commons Library on Superfast Broadband Coverage in the UK. It reports at the ward level rather than the LSOA level the GP report was based on, so I wondered how easy it would be to reuse the GP/LSOA code for a broadband/ward map…

After fighting with the Excel data file (metadata rows before the header and at the end of the table, cruft rows between the header and the data table proper) and the R library I was using to read the file (it turned the data into a tibble with spacey column names I couldn’t get to work with ggplot, rather than a dataframe – I ended up saving to CSV and then loading it back in again…), not many changes were required to the code at all… What I really should have done was abstract the code into an R file (and maybe some importable Rmd chunks) and try to get the script down to as few lines of bespoke code as possible for handling the new dataset – maybe next time…
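For what it’s worth, the cleanup boils down to something like this sketch (the skip count and column names are placeholders – the real ones depend on the spreadsheet):

```r
library(readxl)

# Skip the metadata rows sitting above the header row
raw <- read_excel("broadband_ward_data.xlsx", sheet = 1, skip = 3)

df <- as.data.frame(raw)               # ggplot was happier with a plain data.frame
names(df) <- make.names(names(df))     # get rid of the spacey column names
df <- df[!is.na(df$Ward.code), ]       # drop cruft rows with no ward code
```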

The code is here and example PDF here.

I also had a quick play at generating a shiny app from the code (again, cutting and pasting rather than abstracting into a separate file and importing… I guess at least now I have three files to look at when trying to abstract the code, and to test against…!)

Shiny code here.
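For the record, the bare bones of such an app look something like the following sketch; practice_codes and plot_practice_map are hypothetical stand-ins for the cut-and-pasted map code rather than anything in the linked file:

```r
library(shiny)

ui <- fluidPage(
  selectInput("practice", "GP practice:", choices = practice_codes),
  plotOutput("choropleth")
)

server <- function(input, output) {
  output$choropleth <- renderPlot({
    # Hypothetical helper wrapping the choropleth-drawing code for one practice
    plot_practice_map(input$practice)
  })
}

shinyApp(ui, server)
```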

So what?

So this has got me thinking – what are the commonly produced “types” of report or report section, and what bits of common/reusable code would make it easy to generate new automation scripts, at least at a first pass, for a new dataset?
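One obvious candidate is a generic “join a data table to some boundaries and draw a choropleth” helper, something like the following sketch (the package choices and the column/file names in the example call are mine for illustration, not lifted from the scripts above):

```r
library(sf)       # read geoJSON boundary files
library(dplyr)    # join data to boundaries
library(ggplot2)  # draw the map

choropleth <- function(boundaries_path, data, area_code_col, value_col) {
  boundaries <- st_read(boundaries_path, quiet = TRUE)
  merged <- left_join(boundaries, data, by = area_code_col)
  ggplot(merged) +
    geom_sf(aes(fill = .data[[value_col]]), colour = NA) +
    scale_fill_viridis_c(na.value = "grey90") +
    theme_void()
}

# e.g. ward level broadband coverage (placeholder file/column names):
#choropleth("wards.geojson", broadband_df, "ward_code", "superfast_coverage")
```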

Sharing the Data Load

A few weeks ago, I popped together a post listing a few Data Journalism Units on Github. These repos (that is, repositories) are being used to share code (for particular interactives, for example), data, and analysis scripts. They’re starting to hint at ways in which support for public reproducible local data journalism might start to emerge from developing (standardised) data repositories and reproducible workflows built around them.

Here are a handful of other signals that I think support this trend that I’ve come across in the last few weeks (if they haven’t appeared in your own feeds, a great shortcut to many of them is via @digidickinson’s weekly Media Mill Gazette):

Organisations:

Applications:

Data:

And here’s another one, from today – the Associated Press putting together a pilot with data publishing platform data.world “to help newsrooms find local stories within large datasets” (Localizing data, quantifying stories, and showing your work at The Associated Press). I’m not sure what the pilot will involve, but the rationale sounds interesting:

Transparency is important. It’s a standard we hold the government to, and it’s a standard we should hold the press to. The more journalists can show their work, whether it’s a copy of a crucial document or the data underlying an analysis, the more reason their audience has to accept their findings (or take issue with them in an informed way). When we share our data and methodology with our members, those journalists give us close scrutiny, which is good for everyone. And when we can release the data more broadly and invite our readers to check our work, we create a more secure grounding for the relationship with the reader.

:-) S’what we need… Show your working…

Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find Jupyter notebooks provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio, and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.

For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.
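For the interactive case, a minimal sketch of the leaflet map (the object, file and column names are placeholders, not the actual report code) is:

```r
library(sf)
library(leaflet)

# Assumed local copy of the LSOA boundary file, with a patient count column
# already joined on (both names are placeholders)
lsoa <- st_read("lsoa_boundaries.geojson", quiet = TRUE)

pal <- colorNumeric("Blues", lsoa$patients)

leaflet(lsoa) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(patients), fillOpacity = 0.7, weight = 1,
              label = ~paste0(lsoa_code, ": ", patients, " patients"))
```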

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.
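The core of that workflow is just a loop around rmarkdown::render(), passing a different params list each time; in sketch form (the practice codes and file names here are placeholders):

```r
library(rmarkdown)

# The base template declares the parameter in its YAML header, e.g.:
#   params:
#     practice: "default"
practices <- c("J84004", "J84006")   # placeholder practice codes

for (p in practices) {
  render("gp_practice_report.Rmd",
         output_format = "pdf_document",
         output_file = paste0("report_", p, ".pdf"),
         params = list(practice = p))
}
```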

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are to look for sensible ways of comparing:

  • the workflow associated with each,
  • the ramp-up skills required, and the blockers (cultural, as well as administrative/organisational – h/t @dasbarrett) associated with getting started with new tools such as Jupyter or RStudio, and
  • the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report-generating template to Rmd or ipynb, what sorts of issues would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.

Tabloid Data Journalism?

At the risk of coming across as a bit snobbish, this ad for a Data Journalist for The Penny Hoarder riled me somewhat…

Do you have a passion for telling stories with data? We’re looking for a data journalist who can crunch statistics about jobs, budgeting, spending and saving — and produce compelling digital content that resonates with our readers. You should have expertise in data mining and analysis, and the ability to present the results in conversational, fun articles and/or telling graphics.

As our data journalist, you will produce revealing, clickable, data-driven articles and/or graphics, plus serve as a resource for our growing team of writers and editors. We envision using data sources such as the Bureau of Labor Statistics and U.S. Census Bureau to report on personal finance issues of interest to our national readership of young professionals, coupon fans and financially striving people of all ages. We want to infuse our blog with seriously interesting data while staying true to our vibe: fun, weird, useful.

Our ideal candidate…
– …
– Can write in a bloggy, conversational voice that emphasizes what the data means to real people
– Has a knack for identifying clicky topics and story angles that are highly shareable
– Gets excited when a blog post goes viral
– …

According to Wikipedia (who else?!;-), Tabloid journalism is a style of journalism that emphasizes sensational crime stories, gossip columns about celebrities and sports stars, junk food news and astrology.

(Yes, yes, I know, I know, tabloid papers can also do proper, hard hitting investigative journalism… But I’m thinking about that sense of the term…)

So what might tabloid data journalism be? See above?

PS ish prompted by @SophieWarnes, it’s probably worth mentioning the aborted Ampp3d project in this context… eg Ampp3d launches as ‘socially-shareable data journalism’ site, Martin Belam talks about Trinity Mirror’s data journalism at Ampp3d and The Mirror Is Making Widespread Cuts To Its Online Journalism.

PPS …and a write-up of that by Sophie: Is there room for ‘tabloid data journalism’?

Open Educational Resources from Government and Parliament

When I mentioned to a colleague yesterday that the UK Parliamentary Library publishes research briefings and reports – on topics of emerging interest, as well as in support of legislation – that often provide a handy, informed, and politically neutral overview of a subject area, and so could make for useful learning resources, the question was asked whether or not they might have anything on the “internet of things”. The answer is not much, but it got me thinking a bit more about the range of documents and document types produced across Parliament and Government that can be used to educate and inform, as well as contribute to debate.

In other words, to what extent might such documents be used in an educational sense, whether by providing knowledge and information about a topic, providing a structured review of a topic area and the issues associated with it, raising questions about an issue, or reporting on an analysis of it? (There are also opportunities for learning from some of the better Parliamentary debates, for example in terms of how to structure an argument or explore the issues around a topic, but Hansard is out of scope of this post!)

(Also note that I’m coming at this as a technologist, interested as much in the social processes, concerns and consequences associated with science and technology as in the deep equations and principles that tend to be taught as the core of the subject, at least in HE. And that I’m interested not just in how we can support the teaching and learning of current undergrads, but also in how we can enculturate them into the availability and use of certain types of resource that are likely to continue being produced into the future, and as such provide a class of resources that will continue to support their learning and education once they leave formal education.)

So, using IoT as a hook to provide examples, here’s the range of documents I came up with. (At some point it may be worth tabulating these to properly summarise the sorts of information the reports might contain, the communicative point of each document (to inform, persuade, provide evidence for or against something, etc.), and any political bias that may be likely (in policy docs, for example).)

Parliamentary Library Research Briefings

The Parliamentary Library produces a range of research briefings covering matters of general interest (Commons Briefing Papers, Lords Library Notes) – perhaps identified through multiple questions asked of the Library by members? – as well as background to legislation (Commons Debate Packs, Lords in Focus), through the Commons and Lords Libraries respectively.

Some of the research briefings include data sets (do a web search for site:http://researchbriefings.files.parliament.uk/ filetype:xlsx) which can also be quite handy.

There are also POSTnotes from the Parliamentary Office of Science and Technology, aka POST.

For access to briefings on matters currently in the House, the Parliament website provides timely/handy pages that list briefing documents for matters in the House today/this week. In addition, there are feeds available for recent briefings from all three sources: the Commons Briefing Papers feed, Lords Library Notes feed, and POSTnotes feed. If you’re looking for long reads and still use a feed reader, get subscribing;-)

Wider Parliamentary Documents

The Parliament website also supports navigation of topical issues such as Science and Technology, as well as sub-topics, such as Internet and Cybercrime. (I’m not sure how the topics/sub-topics are identified or how the graph is structured… That may be one to ask about when I chat to Parliamentary Library folk next week?:-)

Within the topic areas, relevant Commons and Lords Library research briefings are listed, as well as POSTnotes, Select Committee Reports and Early Day Motions.

(By the by, it’s also worth noting that chunks of the Parliament website are currently in scope of a website redesign.)

Government Documents

Along with legislation currently going through Parliament, which is published on the Parliament website (along with Hansard reports that record, verbatim(-ish!), proceedings of debates in either House), explanatory notes provided by the Government department bringing a bill forward give additional, supposedly more accessible/readable, information around it.

Reports are also published by government offices. For example, the Blackett review (2014) on the Internet of things was a Government Office for Science report from the UK Government Chief Scientific Adviser at the time (The Internet of Things: making the most of the Second Digital Revolution). Or how about a report from the Intellectual Property Office on Eight great technologies: The internet of things.

Briefing documents also appear in a variety of other guises. For example, competitions (such as the Centre for Defence Enterprise (CDE) competition on security for the internet of things, or Ofcom’s consultation on More radio spectrum for the Internet of Things) and consultations may both provide examples of how to start asking questions about a particular topic area (questions that may help to develop critical thinking, prompt critical reflection, or even provide ideas for assessment!).

Occasionally, you can also turn up a risk assessment or cost-benefit analysis, such as this Department for Business, Energy & Industrial Strategy Smart meter roll-out (GB): cost-benefit analysis.

European Parliamentary Research Service

In passing, it’s also worth noting that the European Parliamentary Research Service also does briefings, such as this report on The Internet Of Things: Opportunities And Challenges, as well as publishing resources linked from topic-based pages, such as the Digital Single Market theme page on The Internet of Things.

Summary

In providing support for all members of the House, the Parliamentary research services must produce research briefings that can be used by both sides of the House. This may stand in contrast to documents produced by Government that may be influenced by particular policy (and political) objectives, or formal reports published by industry bodies and the big consultancies (the latter often producing reports that are either commissioned on behalf of government or published to try to promote some sort of commercial interest that can be sold to government) that may have a lobbying aim.

As I’ve suggested previously (News, Courses and Scrutiny and Learning Problems and Consultation Based Curricula), maybe we could/should be making more use of them as part of higher education course readings, not just as a way of getting a quick, NPOV view over a topic area, but also as a way of introducing students to a form of free and informed content, produced in a timely way in response to issues of the day. In short, a source that will continue to remain relevant and current over the coming years, as students (hopefully) become lifelong, continuing learners.