Tagged: ddj

Tilted Axes and a Reminder About “Glanceability”

Via the Flowing Data blog (Roger Federer career in rankings and wins), this pretty chart from an SRF story, 20 Jahre, 20 Titel.

The rotated axes balance the chart and make the achievements of Connors, Federer and Lendl more “central” to the story. (SRF are a Swiss news organisation…)

I quickly copied and rotated the chart, and here’s how it looks with the axes arranged more traditionally:

The composition feels unbalanced, and at a glance feels like there is less of a story in the chart. (Glanceability… that takes me back…. I first heard the phrase from James Cridland – Learnin’ from Virgin – then, as I noted in Powerpoint Presentations That Support Glanceability, Lorcan Dempsey dug into it a little bit more: Glanceability.)

It also reminds me of “banking to 45 degrees”, “the idea that the average line slope in a line chart should be 45º. This has been dubbed banking to 45º and has turned into one of the bits of common wisdom in visualization as determining the ideal aspect ratio [although it more specifically relates to] the comparison between the slopes of two lines, and the slope is the average between those two lines. So if the goal is to be able to compare the rates of change between lines, the 45º average slope makes sense as a rule” (from Robert Kosara’s blog post on Aspect Ratio and Banking to 45 Degrees).

The original statement comes from Cleveland, W.S., McGill, M.E. and McGill, R., (1988) The shape parameter of a two-variable graph, Journal of the American Statistical Association, 83(402), pp.289-300 [JSTOR],  and I think I should probably read it again…

By the by, SRF are one of the best news orgs I know for sharing their working – the recipes for the above story can be found on their Github repo at srfdata/2018-01-roger-federer – so I could probably have recreated the unrotated chart directly from that source, if I’d had the time to play.

PS see also: Five ways to read a scatterplot on the Datawrapper blog.

Reproducible Notebooks Help You Track Down Errors…

A couple of days ago on the Spreadsheet Journalism blog, I came across a post on UK Immigration Raids.

The post described a spreadsheet process for reshaping a couple of wide format sheets in a spreadsheet into a single long format dataset to make the data easier to work with.

One of my New Year resolutions was to try to look out for opportunities to rework and compare spreadsheet vs. notebook data treatments, so here’s a Python pandas reworking of the example linked to above in a Jupyter notebook: UK Immigration Raids.ipynb.

You can run the notebook “live” using Binder/Binderhub:

Look in the notebooks folder for the UK Immigration Raids.ipynb notebook.
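For a flavour of what the notebook does, here’s a minimal sketch of that sort of wide-to-long reshape using pandas; the filename, sheet names and column names below are hypothetical, so check the notebook itself for the real working:

```python
import pandas as pd

# Hypothetical filename, sheet names and column names - the real ones are in the linked notebook
sheets = pd.read_excel("immigration_raids.xlsx", sheet_name=["2015", "2016"])

long_frames = []
for year, df_wide in sheets.items():
    # One row per (area, month) observation, rather than one column per month
    df_long = df_wide.melt(id_vars=["Area"], var_name="Month", value_name="Raids")
    df_long["Year"] = year
    long_frames.append(df_long)

raids = pd.concat(long_frames, ignore_index=True)
```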

A few notes on the process:

  • there was no link to the original source data in the original post, although there was a link to a local copy of it;
  • the original post had a few useful cribs that I could use to cross check my working with, such as the number of incidents in Bristol;
  • the mention of dirty data made me put in a step to find mismatched data;
  • my original notebook working contained an error – which I left in the notebook to show how it might propagate through, and how we might then try to figure out what the error was, having detected it.

As an example, it could probably do with another iteration to demonstrate a more robust process with additional checks that transformations have been correctly applied to data along the way.
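By way of illustration, and reusing the hypothetical names from the sketch above, the sort of checks I have in mind are simple assertions that the reshape hasn’t changed the data, plus spot checks against figures quoted in the original post:

```python
# The reshape shouldn't change the overall totals across the original wide sheets
wide_total = sum(df.drop(columns=["Area"]).sum().sum() for df in sheets.values())
assert raids["Raids"].sum() == wide_total

# A spot check against a crib from the original post, eg the Bristol incident count
# (compare the printed value against the figure quoted there)
print(raids.loc[raids["Area"] == "Bristol", "Raids"].sum())
```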

Anyway, that’s another of my New Year resolutions half implemented: *share your working, share your mistakes*.

Fragment – Carillion-ish

A quick sketch of some companies that are linked by common directors based on a list of directors seeded from Carillion PLC.

The data was obtained from the Companies House API using the Python chwrapper package and some old code of my own that I’ll share once I get a chance to strip the extraneous cruft out of the notebook it’s in.

The essence of the approach / recipe is an old one that I used to use with OpenCorporates data, as described here: Mapping corporate networks with Opencorporates.

Note the sketch doesn’t make claims about anything much. The edges just show that companies are linked by the same directors.
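In the meantime, here’s a rough sketch of the general recipe using requests and networkx rather than my actual notebook code; the endpoint paths and response field names are as I recall them from the Companies House API docs, and the seed company number is a placeholder, so treat the whole thing as an assumption rather than a recipe you can run blind:

```python
import requests
import networkx as nx

API_KEY = "YOUR_COMPANIES_HOUSE_API_KEY"  # register with Companies House to get one
BASE = "https://api.company-information.service.gov.uk"

def get_json(path):
    # The Companies House API uses HTTP basic auth with the API key as the username
    r = requests.get(BASE + path, auth=(API_KEY, ""))
    r.raise_for_status()
    return r.json()

G = nx.Graph()
SEED = "SEED_COMPANY_NUMBER"  # placeholder - the company number of the seed company (eg Carillion PLC)

# Directors (and other officers) of the seed company
for officer in get_json(f"/company/{SEED}/officers").get("items", []):
    name = officer.get("name")
    # Each officer record links to a list of that officer's appointments across companies
    appts_path = officer.get("links", {}).get("officer", {}).get("appointments")
    if not (name and appts_path):
        continue
    for appt in get_json(appts_path).get("items", []):
        company = appt.get("appointed_to", {}).get("company_name")
        if company:
            # Edge: this officer is (or was) appointed to this company
            G.add_edge(name, company)
```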

A better approach may be to generate a network based on control / ownership registration data but I didn’t have any code to hand to do that (it’s on my to do list for my next company data playtime!).

One way of starting to use this sort of structure is to match companies that appear in the network with payments data to see the actual extent of public body contracting with Carillion group companies. For other articles on Carillion contracts, see eg Here’s the data that reveals the scale of Carillion’s big-money government deals.

Asking Questions of CSV Data, Using SQL In the Browser, With Franchise

Notebook style interfaces, in which content blocks are organised in separate cells that can be moved up or down a document, are starting to look as if their time may have come. Over the last week, I’ve come across two examples.

The first, an early preview of the OU’s under-development OpenCreate authoring environment, uses an outliner style editor to support the creation of a weekly study planner and topics within each week, and a notebook style interface for editing the topical content pages. I would show screenshots but I’ve pre-emptively been told not to post videos or screenshots…

The second is an open project – a live demo and the code repository are available – and it comes in the form of Franchise, a simple app service that delivers a rich, browser based SQL query engine for querying simple data files (read the strapline and the name makes punful sense!).

Launching the service provides you with an interface that lets you connect to a database, or load in a data file, either by selecting it from a file browser or just dragging it onto the launch page.

Uploading a CSV document creates a SQLite3 database containing the data in a single table.
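Under the hood, that’s essentially the familiar CSV-to-SQLite step; as a point of comparison, here’s a minimal Python sketch of the same idea (the filenames and table name are hypothetical):

```python
import sqlite3
import pandas as pd

# Load a CSV file (hypothetical filename)
df = pd.read_csv("mydata.csv")

# Push the data into a single table in a SQLite3 database, as Franchise does on upload
conn = sqlite3.connect("mydata.db")
df.to_sql("mydata", conn, if_exists="replace", index=False)

# ...and then query it back out using SQL
print(pd.read_sql_query("SELECT * FROM mydata LIMIT 10", conn))
```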

Selecting the data table generates a SQL query that reveals the column names. Running the query generates a preview of the data table and also makes the output queryable as a “tagged” table.

The resulting table can then be queried using the tag name:

You can also use the download button to download the results table in a variety of formats:

If the resulting table has numerical columns, you can display the table using a range of charts, such as a bar chart.

For now, it looks as if the charts are quite simplistic – for example, we can’t stack or group the bars:

Other chart types are available, in a context sensitive way, depending on the data. For example, if there are two numerical columns we can plot a scatter chart. Line charts are also available.

If the dataset contains latitude and longitude data, we can use it to plot points on a map.

For those new to SQL, there’s a handy cribsheet at the bottom of the page:


(If SQL is new to you, you might also find things like this handy: Asking Questions of Data – Garment Factories Data Expedition.)

We can also add textual commentary to the notebook in the form of markdown cells.

The markdown is styled responsively – but I couldn’t see how to get to a “preview” mode where the styling is applied but the markdown modifiers are hidden?

Cells are archived rather than deleted:

Although they can be deleted, as well as restored, from the archive.

Cells can also be reordered – click on the right hand sidebar of a cell to drag it into a slot above or below another cell, or alongside one.

Cells can also be duplicated, in which case they appear alongside the cloned cell.

The side by side view allows you to look at the effect of changing a query compared to its original form.

I was quite excited by the idea that you could download the notebook:

and export it as an HTML file:

I had expected this to generate a standalone HTML file, but that appears not to be the case, at least for now. Instead, the cell data is packed into a JSON object:

and then passed to either a local Franchise server, or the web based one.

As a quick tool for querying data, Franchise looks to be pretty handy, although you soon realise how lacking in control it is over chart styles and axis labelling, for example (at least in its early form). If you could export standalone HTML, it would also make it more useful as an asset generating tool, but I guess it’s still early days.

According to a release thread – Franchise – An Open-Source SQL Notebook (franchise.cloud) – it looks as if a standalone electron app version is on the way. (In the meantime, I followed the build instructions from the repo README to produce a quick docker container: psychemedia/franchise.)

The ability to get started querying data using SQL without the need to install anything offers a way of having a quick chat with a file based dataset. (I couldn’t get it to work with Excel or JSON files, and didn’t try a SQL file or connecting to a separate database server.)

At the moment, I don’t think you can connect to a Google spreadsheet, so you have to download one, although a SQL-like API is available for Google Sheets (eg I used it for this simple SQL query interface to Google spreadsheets way back when).
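If you do want SQL-flavoured queries over a Google Sheet, one route is the Google Visualization API query endpoint, which accepts a SQL-like query and can return CSV; here’s a hedged sketch (the sheet ID is a placeholder and the sheet needs to be publicly readable):

```python
from urllib.parse import quote
import pandas as pd

SHEET_ID = "YOUR_SHEET_ID"  # placeholder - a publicly shared Google Sheet
query = "SELECT A, B WHERE B > 100"

# The gviz "tq" endpoint runs the SQL-like query and, with tqx=out:csv, returns CSV
url = (f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/gviz/tq"
       f"?tqx=out:csv&tq={quote(query)}")

df = pd.read_csv(url)
```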

From a getting started with data conversations perspective, though, this offers quite a nice on ramp to a SQL query environment without the need to worry about the DBA (database admin) chores of setting up a database, defining tables, importing the data and so on.

I also wonder if it might act as a gateway to more aggressive and powerful query engines that are capable of querying over large and multiple datasets contained in local files? Things like Apache Drill, for example?

See also:


Responsibilities and Required Skills in Recent “Data Journalist” Job ads…

A quick peek at a couple of recent job ads that describe what you might be expected to do…

From the BBC – Senior Data Journalist. In part, the role responsibilities include:

  • generating ideas for data-driven stories and for how they might be developed and visualized
  • exploring those ideas using statistical tools – and presenting them to wider stakeholders from a non-statistical background
  • reporting on and analysing data in a way that contributes to telling compelling stories on an array of news platforms
  • collaborating with reporters, editors, designers and developers to bring those stories to publication
  • exploring and summarizing data using relational database software
  • visualizing, and finding patterns in, spatial data using GIS software
  • using statistical tools to identify significant data trends
  • representing the data team and the Visual Journalism team at editorial meetings
  • leading editorially on data projects as required and overseeing the work of other data team colleagues
  • using skills and experience to advise on best approaches to data-led storytelling and the development and publication of data-led projects

Required technical skills include “a good understanding of statistics and statistical analysis; a strong grasp of how to clean, parse and query data; a good knowledge of some of the following: spreadsheet software, SQL, Python and R; demonstrable experience of visualising data, using either tools or scripts; experience of GIS or other mapping software; experience of gathering information from Freedom of Information requests”.

Desirable skills include “knowledge of basic scripting and HTML, as it might pertain to data visualization or data analysis and knowledge of several of the following; Carto, D3, QGIS, Tableau”.

And over at Trinity Mirror, there’s an open position for a data journalist, where role responsibilities include:

  • Having ideas for data-based stories and analysis, on a range of topics, which would be suitable for visualisation in regional newspapers across the group.
  • Working with a designer and the head of data journalism to come up with original and engaging ways of visualising this content.
  • Writing copy, as required, to accompany these visualisations.
  • Working on the data unit’s wider output, for print and web, as required by the head of data journalism.

As far as technical skills go, these “should include a broad knowledge of UK data sources, an ability to quickly and effectively interrogate data using spreadsheets, and an understanding of the pros and cons of different methods of visualising data”.  In addition, “[a]n ability to use scraping software, and some proficiency in using R, would be an advantage”.

Transparency in Parliament… And in Data Journalism?

Over the weekend, I picked up a copy of Parliament Ltd, a two hundred and fifty page rant (or should that be diatribe?!) against various MPs and Lords and their registered (and unregistered) interests. [Disclosure: I’ve picked up a few days paid work for the Parliamentary Digital Service this year.]

The book draws on data scraped from the Parliament website (presumably), as well as Companies House (via a collaboration – or business arrangement? I wasn’t quite sure..?! – with DueDil). As anyone who’s tried looking at registers of interests on the Parliament website will know, they’re not published in the friendliest of formats, and the data is not made available as a machine readable downloadable dataset.

Sources of “Interests” Data From Parliament

By the by, the registers maintained on the Parliament website include:

There’s also the register of all-party groups, which includes statements of benefits received by groups from third parties (links to old scrapers here, possibly?).

Another place we might look for associations between MPs/Lords and companies, or other organisations, is in Hansard. For example, Evan Odell recently published a dataset on Hansard Speeches and Sentiment that “provides information on each speech of ten words or longer, made in the House of Commons between 1980 and 2016, with information on the speaking MP, their party, gender and age at the time of the speech”. The R code is supplied, so we could presumably use that as a basis for running the transcripts through a named entity extractor to try to pull out the names of companies or organisation mentioned by each speaker (perhaps as well as something that looks out for declarations of interest mentioned whilst speaking?). It might also be interesting to try to match sentiment with organisation mentions?!
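As a hedged sketch of that named entity extraction step (spaCy is just one tool we might use here, and the dataset filename and column name are guesses, so check Evan’s repo for the actual schema):

```python
import pandas as pd
import spacy  # assumes the small English model has been installed via `python -m spacy download en_core_web_sm`

nlp = spacy.load("en_core_web_sm")

# Hypothetical local copy of the Hansard speeches dataset, with a column of speech text
speeches = pd.read_csv("hansard_speeches.csv")

def org_mentions(text):
    # Pull out organisation-typed named entities mentioned in a speech
    return {ent.text for ent in nlp(text).ents if ent.label_ == "ORG"}

speeches["orgs"] = speeches["speech"].astype(str).apply(org_mentions)
```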

Where companies are mentioned in a debate, and the debate leads to a division (that is, a vote), we can then use sources such as The Public Whip to download information scraped from the Parliament website about who voted how on which division, and perhaps look for MPs voting against their party line but in favour of a particular interest.

(If you know other sources of scraper code, or APIs offered over scraped versions of any of the above registers, please let me know via the comments and I’ll add them in. Also any registers I’ve missed…)

Other Sources of Data Relating to Members’ Parliamentary and Government Activities

By the by, the APPG post also led me to another old post on scraping Ministers’ meetings. For an idea of the sorts of thing currently disclosed (at a departmental level?), see e.g. Cabinet Office: ministers’ transparency publications. There are possibly other forms of declaration on other Government Department websites?

In relation to lobbying firms, there is the Office of the Registrar of Consultant Lobbyists.

Also outside Parliament, the Electoral Commission provide information about donations and loans to individuals (including MPs) and candidate spending and donations at elections.

Other Sources of Information About Members’ External Interests

Companies House can also be used to look up whether a particular named individual is or has been listed as a company officer (such as a director), or is a person of significant control (PSC, sometimes referred to as a “beneficial owner”) of a particular company. Whilst the PSC register is currently available as a bulk data download, the director information isn’t (at least, not without making a personal request). It can be accessed in a piecemeal fashion via the Companies House API though. Current and recently disqualified directors can be found via The Insolvency Service or the Companies House API. The Insolvency Service also publish information about Individual Insolvency (that is, bankruptcies).

Where individuals are associated with an organisation and are registered as a data controller, they should also be listed as an entry on the ICO Data Protection Register.

Evan’s Github account also hosts a fork of a repo published by the NCVO for import[ing] data from the Charity Commission data extract, data that presumably lists trustees, and again that can be used as the basis for finding associations between individuals and organisations.

At a local level, local councils hold a variety of public registers, detailing for example the names of individuals licensed to sell alcohol, or to act as operators of betting, sex or animal breeding establishments. The CQC publish data listing the names of individuals in charge of operating care homes. NHS England list names of GPs working at particular practices. And so on…

More generally, the Advisory Committee on Business Appointments (Acoba) has details of Appointments taken up by former ministers. (Acoba also report on Appointments taken up by former Crown servants.)

So What?

So that’s all so much data, and as Martin Williams points out in his book, it can take a lot of effort to pull the data into some sort of shape where you can use it. And with data sourced from various places, there may be issues associated with sharing the data on, once you have processed it.

To a certain extent, you might argue that Parliament is blocking “transparency” around members’ interests – and possible conflicts of interest – by publishing the data in a way that makes it difficult to process it as data without having to do a fair amount of work prepping the data. But I’m not sure how true that is. Journalists are, in part, competitive beasts, wanting to be the first to a story. If a dataset is well presented and comes with analysis scripts that identify story features and story points, essentially generating a press release around a dataset without much effort involved, there’s nothing there to find (nothing “hidden” in the data waiting for the intrepid journalist to reveal it). But when the data is messy and takes some effort to clean up, then the chances that anyone else will just stumble across the story point by chance are reduced. And when the data is “secret” but still publicly accessible, all the better. For example, it’s no surprise that a common request of Alaveteli (the platform underpinning FOI request site WhatDoTheyKnow) was from journalists wanting to be able to hide, or at least embargo, their requests, and (data) responses provided to them (h/t Chris Adams for that observation and link).

Another question that arises around journalists who do clean datasets and then analyse them, but who don’t then share their working (the data cleaning and analysis scripts), is the extent to which they are themselves complicit in acting against transparency. Why should we believe the journalists’ accusations or explanations without seeing what they are actually based on? (Maybe in cleaning the dataset, they threw away explicit declarations of interest because they were too messy to process, which then skewed the conclusions drawn from the data analysis?) By sharing analyses, you also provide others with the opportunity to spot errors in your working, or maybe even improve them (scary for some; but consider the alternative: you produce an analysis script that contains an error, and maybe reuse it, generating claims that are false and that cannot be supported by the data. Publishing those is not in your interest.) There also seems to be the implicit assumption that competitors are trying to steal your stories rather than find their own. They probably think and say the same about you. But who has the time to spend it all trying to crib over other people’s shoulders? (Other than me of course;-))

On the other hand, there may be some commercial or competitive intelligence advantage in having a cleaned dataset that you can work with efficiently, that is not available to other journalists, or that you believe may hide further stories. (A similar argument to the latter is often made by academic researchers who do not want to share their research data, lest someone else makes a discovery from it that eluded them.) But then, with a first mover advantage, you should be able to work with your data and link it to other data sets faster than your competitors. And if they are sharing data back too, then you may be able to benefit from their cleaned data and analysis scripts. Everyone gets to improve their game.

Another possible form of “competitive” advantage that comes from not publishing cleaned datasets or scripts is that it doesn’t tip the hand of the journalist and reveal investigative “trade secrets” to the subject or subjects of an investigation. For by revealing how a story was identified from a dataset, subjects may change their behaviour so as not to divulge information into the dataset in the same revealing way in the future.

One final consideration: when it comes to news stories, to what extent do part-time tinkerers and civic tech hackers such as myself spoil a possible story by doing a halfway hack on a dataset, bringing small scale attention to it, and as a consequence disabling or polluting it as a source of journalistic novelty/story-worthiness? Does anyone have examples of where a possible open data story was not pursued by the press because a local data geek blogger got there first?

How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code and a means to import the data used in the analysis, as well as the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded-in data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do. (See here for an example.)

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a local copy of a dataset I’d downloaded onto my local computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:

  • either added a local copy of the data to the repository and checked that the script linked to it correctly via a relative path;
  • and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.
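As a minimal sketch of what that looks like in practice (the URL and local path here are hypothetical placeholders, not the ones from the example above):

```python
import os
import pandas as pd

# Hypothetical URL and local path - the real ones belong in the report itself,
# along with a note of the web page the download link was found on
DATA_URL = "https://example.gov.uk/statistics/some-dataset.csv"
LOCAL_COPY = "data/some-dataset.csv"

# Prefer a local copy checked into the repository; otherwise fetch from the original source
df = pd.read_csv(LOCAL_COPY if os.path.exists(LOCAL_COPY) else DATA_URL)
```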

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it (this often needs a few more tweaks when trying to write the reusable downloader! A stop-gap is to provide the URL in the reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy).
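Something along the following lines is the sort of thing I mean, assuming the download is protected by simple HTTP authentication (the URL and credential placeholders are hypothetical, and other auth schemes will need their own tweaks):

```python
import requests
import pandas as pd

# Hypothetical authenticated download URL; the reader supplies their own credentials
AUTH_URL = "https://example.org/secure/dataset.csv"

resp = requests.get(AUTH_URL, auth=("YOUR_USERNAME", "YOUR_PASSWORD"))
resp.raise_for_status()

# Save a local copy, then load it in as in the unauthenticated case
with open("dataset.csv", "wb") as f:
    f.write(resp.content)

df = pd.read_csv("dataset.csv")
```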

Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free, but non-shareable (or at least attributable), license key restricted services, such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.