
Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R or python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix of population flows between the English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.
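That grouping step is just a group-and-sum over a long-format flows table. Here’s a minimal sketch of the idea in pandas (which, as I mention below, is where I actually worked through the wrangling first); the column names source, target and value, and the flow numbers, are purely illustrative:

import pandas as pd

# Hypothetical long-format flows table: one row per origin/destination flow
flows = pd.DataFrame({'source': ['E12000001', 'E12000002', 'W92000004'],
                      'target': ['S92000003', 'W92000004', 'E12000007'],
                      'value':  [1000, 500, 750]})

# Map each region/country code to a country using the leading letter of the code
lookup = {'E': 'England', 'W': 'Wales', 'S': 'Scotland', 'N': 'Northern Ireland'}
flows['source_country'] = flows['source'].str[0].map(lookup)
flows['target_country'] = flows['target'].str[0].map(lookup)

# Drop within-country flows, then sum what's left to country level
cross = flows[flows['source_country'] != flows['target_country']]
country_flows = cross.groupby(['source_country', 'target_country'], as_index=False)['value'].sum()
print(country_flows)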

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankeywidget package. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)
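For reference, the rendering side of the Python workflow is quite compact. Here’s a minimal sketch using ipysankeywidget, assuming the country_flows table from the sketch above (or any similar table of source/target/value rows):

from ipysankeywidget import SankeyWidget

# Turn each row of the aggregated country_flows table into a link dict
links = [{'source': row['source_country'],
          'target': row['target_country'],
          'value': row['value']}
         for _, row in country_flows.iterrows()]

# The widget renders when it is the last expression in a notebook cell
SankeyWidget(links=links)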

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…

Grouping Numbers that are Nearly the Same – Casual Clustering

There were a couple of reasons for tinkering with WRC rally data this year, over and above the obvious one of wanting to find a way to engage with motorsport at a data level. Specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying features (or storypoints) in the data that might provide a basis for making interesting text comments.

One way in to finding features is to look at visual representations of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the next post in this series, but for now, let’s consider the following question: how can we group numbers that are nearly the same? For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just nearly, the same time?

Via StackOverflow, I found the following handy fragment:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could limit the range of any grouping to a maximum distance between the first and the last number in a particular group – maybe I don’t want a group to have a range of more than 0.41, for example (that is, strictly more than a dodgy floating point 0.4…):

def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        # only extend the group if the new value is also within maxrange of the group's first element
        inmaxrange = True if maxrange is None else abs(x - groups[-1][0]) <= maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
[[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]

A downside of this is that we might argue we have mistakenly excluded a number from a group even though it is very close to the group’s last member, simply because that last member already sits close to the group range threshold…

In which case, we might pull numbers back into a group if they are really close to the group’s current last member, irrespective of whether we have gone past the originally specified group range:

def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0])<=maxrange
        inmaxminrange = False if maxminrange is None else abs(x-groups[-1][-1])<=maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.

I can also look up the group containing a particular time, to find the other times that are close to it:

trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
[2.1, 2.4, 2.5, 2.52]

PS I think the following vectorised pandas fragments assign group numbers to rows based on near matches of the numeric values in a specified column:

def numclustergroup(x, col, maxgap):
    x = x.sort_values(col)
    # start a new cluster wherever the gap to the previous value reaches maxgap
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    return x

def numclustergroup2(x, col, maxgap, maxrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    # cdiff: the gap to the previous value within each provisional cluster
    x['cdiff'] = x.groupby('cluster')[col].diff()
    # additionally split where the cumulative within-cluster gap exceeds maxrange
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() >= maxgap)).cumsum()
    return x.drop('cdiff', axis=1)

def numclustergroup3(x, col, maxgap, maxrange, maxminrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    x['cdiff'] = x.groupby('cluster')[col].diff()
    # ...but only split if the value is also more than maxminrange away from the previous one
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() >= maxgap)) & (x[col].diff() > maxminrange)).cumsum()
    return x.drop('cdiff', axis=1)

#Test
import pandas as pd

uu = pd.DataFrame({'x': list(range(0, 8)), 'y': [1.3, 2.1, 7.4, 3.9, 4.6, 2.5, 2.4, 2.52]})
numclustergroup(uu,'y',0.35)
numclustergroup2(uu,'y',0.35,0.41)
numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as True whenever you want to increase the group number.
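To make that concrete, here’s a tiny standalone illustration of the boolean-test-plus-cumsum trick on a made-up series of values:

import pandas as pd

s = pd.Series([2.1, 2.4, 2.5, 2.52, 3.9, 4.6, 7.4])

# True wherever the gap to the previous value is big enough to start a new group
newgroup = s.diff() > 0.35

# The cumulative sum of the booleans gives a running group number
print(newgroup.cumsum().tolist())
# [0, 0, 0, 0, 1, 2, 3]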

Transparency in Parliament… And in Data Journalism?

Over the weekend, I picked up a copy of Parliament Ltd, a two hundred and fifty page rant (or should that be diatribe?!) against various MPs and Lords and their registered (and unregistered) interests. [Disclosure: I’ve picked up a few days paid work for the Parliamentary Digital Service this year.]

The book draws on data scraped from the Parliament website (presumably), as well as Companies House (via a collaboration – or business arrangement? I wasn’t quite sure..?! – with DueDil). As anyone who’s tried looking at registers of interests on the Parliament website will know, they’re not published in the friendliest of formats, and the data is not made available as a machine readable downloadable dataset.

Sources of “Interests” Data From Parliament

By the by, the registers maintained on the Parliament website include the Register of Members’ Financial Interests, the Register of Lords’ Interests, and the Register of Interests of Members’ Secretaries and Research Assistants.

There’s also the register of all-party groups, which includes statements of benefits received by groups from third parties (links to old scrapers here, possibly?).

Another place we might look for associations between MPs/Lords and companies, or other organisations, is in Hansard. For example, Evan Odell recently published a dataset on Hansard Speeches and Sentiment that “provides information on each speech of ten words or longer, made in the House of Commons between 1980 and 2016, with information on the speaking MP, their party, gender and age at the time of the speech”. The R code is supplied, so we could presumably use that as a basis for running the transcripts through a named entity extractor to try to pull out the names of companies or organisations mentioned by each speaker (perhaps as well as something that looks out for declarations of interest mentioned whilst speaking?). It might also be interesting to try to match sentiment with organisation mentions?!
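By way of illustration, pulling organisation names out of a speech is a one-liner once a named entity model is loaded. Here’s a minimal sketch using spaCy; the speech text is made up, and you’d obviously need to loop this over the actual Hansard transcripts:

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

speech = "I recently met representatives of Acme Holdings plc to discuss the matter."

# Pull out the entities tagged as organisations
orgs = [ent.text for ent in nlp(speech).ents if ent.label_ == 'ORG']
print(orgs)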

Where companies are mentioned in a debate, and the debate leads to a division (that is, a vote), we can then use sources such as The Public Whip to download information scraped from the Parliament website about who voted how on which division, and perhaps look for MPs voting against their party line but in favour of a particular interest.

(If you know other sources of scraper code, or APIs offered over scraped versions of any of the above registers, please let me know via the comments and I’ll add them in. Also any registers I’ve missed…)

Other Sources of Data Relating to Members’ Parliamentary and Government Activities

By the by, the APPG post also led me to another old post on scraping Ministers’ meetings. For an idea of the sorts of thing currently disclosed (at a departmental level?), see e.g. the Cabinet Office’s ministers’ transparency publications. There are possibly other forms of declaration on other Government Department websites?

In relation to lobbying firms, there is the Office of the Registrar of Consultant Lobbyists.

Also outside Parliament, the Electoral Commission provide information about donations and loans to individuals (including MPs) and candidate spending and donations at elections.

Other Sources of Information About Members’ External Interests

Companies House can also be used to look up whether a particular named individual is or has been listed as a company officer (such as a director), or is a person of significant control (PSC, sometimes referred to as a “beneficial owner”) of a particular company. Whilst the PSC register is currently available as a bulk data download, the director information isn’t (at least, not without making a personal request). It can be accessed in a piecemeal fashion via the Companies House API though. Current and recently disqualified directors can be found via The Insolvency Service or the Companies House API. The Insolvency Service also publish information about Individual Insolvency (that is, bankruptcies).
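As a sketch of that piecemeal route, the Companies House API supports a simple officer search, authenticated by passing an API key as the username of an HTTP Basic auth pair; something along the following lines (the name searched for, and the response fields printed, are illustrative):

import requests

API_KEY = 'YOUR_COMPANIES_HOUSE_API_KEY'  # register on the Companies House developer hub for a key

# Search the officer register for a (hypothetical) named individual
r = requests.get('https://api.company-information.service.gov.uk/search/officers',
                 params={'q': 'John Smith'},
                 auth=(API_KEY, ''))

for item in r.json().get('items', []):
    print(item.get('title'), '-', item.get('address_snippet'))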

Where individuals are associated with an organisation and are registered as a data controller, they should also be listed as an entry on the ICO Data Protection Register.

Evan’s Github account also hosts a fork of a repo published by the NCVO for import[ing] data from the Charity Commission data extract, data that presumably lists trustees, and again that can be used as the basis for finding associations between individuals and organisations.

At a local level, local councils hold a variety of public registers, detailing for example the names of individuals licensed to sell alcohol, or to act as operators of betting, sex or animal breeding establishments. The CQC publish data listing the names of individuals in charge of operating care homes. NHS England list names of GPs working at particular practices. And so on…

More generally, the Advisory Committee on Business Appointments (Acoba) has details of Appointments taken up by former ministers. (Acoba also report on Appointments taken up by former Crown servants.)

So What?

So that’s all so much data, and as Martin Williams points out in his book, it can take a lot of effort to pull the data into some sort of shape where you can use it. And with data sourced from various places, there may be issues associated with passing the data on once you have processed it.

To a certain extent, you might argue that Parliament is blocking “transparency” around members’ interests – and possible conflicts of interest – by publishing the data in a way that makes it difficult to process as data without having to do a fair amount of work prepping it. But I’m not sure how true that is. Journalists are, in part, competitive beasts, wanting to be the first to a story. If a dataset is well presented and comes with analysis scripts that identify story features and story points, essentially generating a press release around a dataset without much effort involved, there’s nothing there to find (nothing “hidden” in the data waiting for the intrepid journalist to reveal it). But when the data is messy and takes some effort to clean up, then the chances that anyone else will just stumble across the story point by chance are reduced. And when the data is “secret” but still publicly accessible, all the better. For example, it’s no surprise that a common request of Alaveteli (the platform underpinning FOI request site WhatDoTheyKnow) was from journalists wanting to be able to hide, or at least embargo, their requests, and the (data) responses provided to them (h/t Chris Adams for that observation and link).

Another question that arises around journalists who do clean datasets and then analyse them, but who don’t then share their working (the data cleaning and analysis scripts), is the extent to which they are themselves complicit in acting against transparency. Why should we believe the journalists’ accusations or explanations without seeing what they are actually based on? (Maybe in cleaning the dataset, they threw away explicit declarations of interest because they were too messy to process, which then skewed the conclusions drawn from the data analysis?) By sharing analyses, you also provide others with the opportunity to spot errors in your working, or maybe even improve on it (scary for some; but consider the alternative: you produce an analysis script that contains an error, and maybe reuse it, generating claims that are false and that cannot be supported by the data. Publishing those is not in your interest.) There also seems to be an implicit assumption that competitors are trying to steal your stories rather than find their own. They probably think and say the same about you. But who has the time to spend it all trying to crib over other people’s shoulders? (Other than me, of course;-))

On the other hand, there may be some commercial or competitive intelligence advantage in having a cleaned dataset that you can work with efficiently that is not available to other journalists or that you believe may hide further stories. (A similar argument to the latter is often made by academic researchers who do not want to share their research data, lest someone else makes a discovery from it that eluded them.) But then, with a first mover advantage, you should be able to work with your data and link it to other data sets faster than your competitors. And if they are sharing data back too, then you may be able to benefit from their cleaned data and analysis scripts. Everyone gets to improve their game.

Another possible form of “competitive” advantage that comes from not publishing cleaned datasets or scripts is that it doesn’t tip the hand of the journalist and reveal investigative “trade secrets” to the subject or subjects of an investigation. For by revealing how a story was identified from a dataset, you may prompt subjects to change their behaviour so as not to divulge information into the dataset in the same revealing way in the future.

One final consideration: when it comes to news stories, to what extent do part-time tinkerers and civic tech hackers such as myself spoil a possible story by doing a halfway hack on a dataset, bringing small scale attention to it, and as a consequence disabling or polluting it as a source of journalistic novelty/story-worthiness? Does anyone have examples of where a possible open data story was not pursued by the press because a local data geek blogger got there first?

Computer Spirits…

I doubt there are many readers of this blog who aren’t familiar with science fiction guru Arthur C. Clarke’s adage that “[a]ny sufficiently advanced technology is indistinguishable from magic”. And there may even be a playful few who invoke Rowlingesque spells on the commandline using Harry Potter bash aliases. So I was wondering again today about what other magical or folkloric ideas could be used to help engage folks’ curiosity about how the world of tech works, and maybe teach computing related ideas through stories.

For example, last week I noticed that a reasonable number of links on Wikipedia point to the Internet Archive.

I also picked up from a recent Recode/Decode podcast interview between the person you may know as the awesomest tech interviewer ever, Kara Swisher, and Internet Archive champion, Brewster Kahle, that bots do the repair work. So things like the User:InternetArchiveBot and/or CyberBot II, maybe? Broken links are identified, and link references updated to point to archival copies. (For more info, see: More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine and Fixing broken links in Wikipedia (especially the comments).)

Hmm… helpful bots.. like helpful spirits, or Brownies in a folkloric sense. Things that come out at night and help invisibly around the home…

And if there are helpful spirits, there are probably malicious ones too. The code equivalent of boggarts and bogles that cause mischief or mayhem – robot phone callers, or scripts that raise pop-ups when you’re trying to read a post online, for example? Maybe if we start to think of online tech inconveniences as malevolent spirits we’ll find better ways to ignore or dispel them?! Or at least find a way to engage people into thinking about them, and from that work out how best to get rid of them or banish them from our lives?

PS the problem of Link Rot is an issue for maintaining OU course materials too. As materials are presented year on year, link targets move away and/or die. Sometimes the materials are patched with a corrected link to wherever the resource moved to, other times we refresh materials and find a new resource to link to. But generally, I wonder, why don’t we make like Wikipedia and get a Brownie to help? Are there Moodle bots to do helpful work like this around the VLE?
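For what it’s worth, the core of such a link-mending brownie is pretty simple: check whether a link still resolves and, if it doesn’t, ask the Wayback Machine’s availability API for an archived copy. Here’s a minimal sketch (the URLs are made up, and a real bot would need rather more politeness and error handling):

import requests

urls = ['http://example.com/some-old-page', 'https://www.bbc.co.uk/news']

for url in urls:
    try:
        ok = requests.head(url, allow_redirects=True, timeout=10).status_code < 400
    except requests.RequestException:
        ok = False
    if not ok:
        # Ask the Wayback Machine whether it holds an archived snapshot of the broken link
        r = requests.get('https://archive.org/wayback/available', params={'url': url}, timeout=10)
        snapshot = r.json().get('archived_snapshots', {}).get('closest', {})
        print(url, '->', snapshot.get('url', 'no archived copy found'))
    else:
        print(url, 'OK')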

Reuse and Build On – IW Broadband Reports

A couple of weeks ago I posted a demo of how to automate the production of a templated report (catchment for GP practices by LSOA on the Isle of Wight) using Rmd and knitr (Reporting in a Repeatable, Parameterised, Transparent Way).

Today, I noticed another report, with data, from the House of Commons Library on Superfast Broadband Coverage in the UK. This reports at the ward level rather than the LSOA level the GP report was based on, so I wondered how easy it would be to reuse the GP/LSOA code for a broadband/ward map…

After fighting with the Excel data file (metadata rows before the header and at the end of the table, cruft rows between the header and the data table proper) and the R library I was using to read the file (it turned the data into a tibble, with spacey column names I couldn’t get to work with ggplot, rather than a dataframe – I ended up saving to CSV then loading it back in again…), not many changes were required to the code at all… What I really should have done was abstract the code into an R file (and maybe some importable Rmd chunks) and try to get the script down to as few lines of bespoke code as possible for handling the new dataset – maybe next time…
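For comparison, the equivalent clean-up in pandas is mostly a matter of telling the reader which rows to ignore and then fixing the column names. Here’s a minimal sketch – the filename, sheet name, row counts and column names are illustrative, not those of the actual Commons Library file:

import pandas as pd

# Skip the metadata rows above the header and the notes below the table
df = pd.read_excel('broadband-coverage.xlsx', sheet_name='Ward data',
                   skiprows=3, skipfooter=5)

# Replace the "spacey" column names with something easier to work with
df.columns = [str(c).strip().lower().replace(' ', '_') for c in df.columns]

# Drop any remaining cruft rows with no value in a (hypothetical) key column
df = df.dropna(subset=['ward_code'])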

The code is here and example PDF here.

I also had a quick play at generating a shiny app from the code (again, cutting and pasting rather than abstracting into a separate file and importing… I guess at least now I have three files to look at when trying to abstract the code, and to test against…!)

Shiny code here.

So what?

So this has got me thinking – what are the commonly produced “types” of report or report section, and what bits of common/reusable code would make it easy to generate new automation scripts, at least at a first pass, for a new dataset?

Sharing the Data Load

A few weeks ago, I popped together a post listing a few Data Journalism Units on Github. These repos (that is, repositories) are being used to share code (for particular interactives, for example), data, and analysis scripts. They’re starting to hint at ways in which support for public reproducible local data journalism might start to emerge from developing (standardised) data repositories and reproducible workflows built around them.

Here are a handful of other signals I’ve come across in the last few weeks that I think support this trend (if they haven’t appeared in your own feeds, a great shortcut to many of them is via @digidickinson’s weekly Media Mill Gazette):

Organisations:

Applications:

Data:

And here’s another one, from today – the Associated Press putting together a pilot with data publishing platform data.world “to help newsrooms find local stories within large datasets” (Localizing data, quantifying stories, and showing your work at The Associated Press). I’m not sure what the pilot will involve, but the rationale sounds interesting:

Transparency is important. It’s a standard we hold the government to, and it’s a standard we should hold the press to. The more journalists can show their work, whether it’s a copy of a crucial document or the data underlying an analysis, the more reason their audience has to accept their findings (or take issue with them in an informed way). When we share our data and methodology with our members, those journalists give us close scrutiny, which is good for everyone. And when we can release the data more broadly and invite our readers to check our work, we create a more secure grounding for the relationship with the reader.

:-) S’what we need… Show your working…

Tabloid Data Journalism?

At the risk of coming across as a bit snobbish, this ad for a Data Journalist for The Penny Hoarder riled me somewhat…

Do you have a passion for telling stories with data? We’re looking for a data journalist who can crunch statistics about jobs, budgeting, spending and saving — and produce compelling digital content that resonates with our readers. You should have expertise in data mining and analysis, and the ability to present the results in conversational, fun articles and/or telling graphics.

As our data journalist, you will produce revealing, clickable, data-driven articles and/or graphics, plus serve as a resource for our growing team of writers and editors. We envision using data sources such as the Bureau of Labor Statistics and U.S. Census Bureau to report on personal finance issues of interest to our national readership of young professionals, coupon fans and financially striving people of all ages. We want to infuse our blog with seriously interesting data while staying true to our vibe: fun, weird, useful.

Our ideal candidate…
– …
– Can write in a bloggy, conversational voice that emphasizes what the data means to real people
– Has a knack for identifying clicky topics and story angles that are highly shareable
– Gets excited when a blog post goes viral
– …

According to Wikipedia (who else?!;-), Tabloid journalism is a style of journalism that emphasizes sensational crime stories, gossip columns about celebrities and sports stars, junk food news and astrology.

(Yes, yes, I know, I know, tabloid papers can also do proper, hard hitting investigative journalism… But I’m thinking about that sense of the term…)

So what might tabloid data journalism be? See above?

PS Partly prompted by @SophieWarnes, it’s probably worth mentioning the aborted Ampp3d project in this context… eg Ampp3d launches as ‘socially-shareable data journalism’ site, Martin Belam talks about Trinity Mirror’s data journalism at Ampp3d and The Mirror Is Making Widespread Cuts To Its Online Journalism.

PPS …and a write-up of that by Sophie: Is there room for ‘tabloid data journalism’?