Data Journalism Units on Github

Working as I do with an open notebook (this blog, my github repos, pinboard and twitter), I value works shared by other people too. Often, this can lead to iterative development, as one person sees an opportunity to use someone else’s work for a slightly different purpose, or spots a way to improve upon it.

A nice example of this that I witnessed in more or less realtime a few years ago was when data journalists from the Guardian and the Telegraph – two competing news outlets – bounced off each others’ work to produce complementary visualisations demonstrating electoral boundary changes (Data Journalists Engaging in Co-Innovation…). (By the by, boundary changes are up for review again in 2018 – the consultation is still open.)

Another example comes from when I started looking for cribs around building virtual machines to support OU course delivery. Specifically, the Infinite Interns idea for distinct (and disposable) virtual machines that could be used to support data journalism projects (about).

Today, I variously chanced across several Github repositories containing data, analyses, toolkits and application code from a couple of different news outlets. Github was originally developed as a social coding environment where developers could share and collaborate on software projects. But over the last few years, it’s also started to be used to share data and (text) documents, as well as reproducible data analyses – and not just by techies.

A couple of different factors have contributed to this, I think, factors that relate as much to how Github can be used to preview and publish documents as to how it acts as a version control and issue tracking system:

options

Admittedly, using git and Github can be really complicated and scary, but you can also use it as a place to pop documents and preview them or publish them, as described above. And getting files in is easy too – just upload them via the web interface.

Anyway, that’s all by the by… The point of this post was to try to pull together a small collection of links to some of the data journalism units I’ve spotted sharing stuff on Github, and see to what extent they practice “reproducible data journalism”. (There’s also a Github list – Github showcase – open journalism.) So for example:

  • one of the first news units I spotted sharing research in a reproducible way was BuzzFeedNews and their tennis betting analysis. A quick skim of several of the repos suggests they use a similar format – a boilerplate README with a link to the story, the data used in the analysis, and a Jupyter notebook containing python/pandas code to do the analysis. They also publish a handy directory to their repos, categorised as Data and Analyses, Standalone Datasets, Libraries and Tools, and Guides. I’m liking this a lot…
  • fivethirtyeight: There are also a few other data related repos at the top level, eg guns-data. Hmm… Data but no reproducible analyses?
  • SRF Data – srfdata (data-driven journalism unit of Swiss Radio and TV): several repos containing Rmd scripts (in English) analysing election related data. More of this, please…
  • FT Interactive News – ft-interactive: separate code repos for different toolkits (such as their nightingale-charts chart tools) and applications; a lot of the applications seem to appear in subscriber only stories – but you can try to download the code and run it yourself… Good for sharing code, though the paywall stops the sharing of executed examples;
  • New York Times – NYTimes: plenty of developer focussed repos, although the gunsales repo creates an R package that bundles a preloaded dataset with routines to visualise the data, and the ingredient phrase tagger is a natural language parser trained to tag food recipe components. (Makes me wonder what other sorts of trained taggers might be useful…) One for the devs…
  • Washington Post – washingtonpost: more devops repos, though they have also dropped a database of shootings (as a CSV file) as one of the repos (data-police-shootings). I’d hoped for more…
  • NYT Newsroom Developers: another developer focussed collection of repos, though rather than focussing on just front end tools there are also scrapers and API helpers. (It might actually be worth going through all the various news/media repos to build a metalist/collection of API wrappers, scrapers etc. i.e. tools for sourcing data). I’d half expected to see more here, too…?
  • Wall Street Journal Graphics Team – WSJ: not much here, but picking up on the previous point there is this example of an AP ballot API wrapper; Sparse…
  • The Times / Sunday Times – times: various repos, some of them link shares; the data one collects links to a few datasets and related stories. Also a bit sparse…
  • The Economist – economist-data-team: another unloved account – some old repos for interactive HTML applications; Another one for the devs, maybe…
  • BBC England Data Unit – BBC-Data-Unit: a collection of repositories, one per news project. Recent examples include: Dog Fights and Schools Chemical Alerts. Commits seem to come from a certain @paulbradshaw… Repos seem to include a data file and a chart image. How to run the analysis/create the chart from the data is not shared… Could do better…

From that quick round up, a couple of impressions. Firstly, BuzzFeedNews seem to be doing some good stuff; the directory listing they use that breaks down different sorts of repos seems sensible, and could provide the basis for a more scholarly round up than the one presented here. Secondly, we could probably create some sort of matrix view over the various repos from different providers, that would allow us, for example, to see all the chart toolkits, or all the scrapers, or all the API wrappers, or all the election related stuff.

If you know of any more I should add to the list, please let me know via the comments below, ideally with a one or two line summary as per the above to give a flavour of what’s there…

I’m also mindful that a lot of people working for the various groups may also be publishing to personal repositories. If you want to suggest names for a round up of those, again, please do so via the comments.

PS I should really be recording the licenses that folk are releasing stuff under too…

Opportunities for Doing the Civic Thing With Open and Public Data

I’ve been following the data thing for a few years now, and it’s been interesting to see how data related roles have evolved over that time.

For my own part, I’m really excited to have got the chance to work with the Parliamentary Digital Service [PDS] [blog] for a few days this coming year. Over the next few weeks, I hope to start nosing around Parliament and the Parliamentary libraries, getting a feel for The Life of Data there, as well as getting in touch with users of Parliamentary data more widely (if you are one, or aspire to be one, say hello in the comments to see if we can start to generate more opportunities for coffee…:-)

I’m also keen to see what the Bureau of Investigative Journalism’s Local Data Lab, headed up by Megan Lucero, starts to get up to. There’s still a chance to apply for a starting role there if you’re “a journalist who uses computational method to find stories, an investigative or local journalist who regularly uses data, a tech or computer science person who is interested in local journalism or a civic tech person keen to get involved”, and the gig looks like it could be a fun one:

  • We will take on datasets that have yet to be broken down to a local level, investigate and reveal stories not yet told and bring this to local journalists.
  • We will be mobile, agile and innovative. The team will travel around the country to hear the ideas and challenges of regional reporters. We will listen, respond and learn so to provide evidence-based solutions.
  • We will participate in all parts of the process. Every member will contribute to story ideas, data wrangling, problem solving, building and storytelling.
  • We will participate in open journalism. We will publish public interest stories that throw light on local and national issues. We will open our data and code and document our shortcomings and learnings. We will push for greater transparency. We will foster collaboration between reporters and put power into regional journalism.

I’m really hoping they start a fellowship model too, so I can find some way of getting involved and maybe try to scale some of the data wrangling I will be doing around Isle of Wight data this year to wider use. (I wonder if they’d be interested in a slackbot/datawire experiment or two?!) After all, having split the data out for one local area, it’s often trivial to change the area code and do the same for another:
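(A minimal hypothetical sketch of what that parameterisation might look like – the filename, the AreaCode column and the particular code value are purely illustrative:)

import pandas as pd

#Hypothetical sketch: the same wrangling, parameterised by local authority area code
#(the filename and AreaCode column are made up for illustration)
def dataForArea(df, areacode):
    '''Filter a multi-area dataset down to a single local authority area.'''
    return df[df['AreaCode'] == areacode]

alldata = pd.read_csv('some_national_dataset.csv')
iw = dataForArea(alldata, 'E06000046')  #Isle of Wight, for example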

(It’ll also be interesting to see how the Local Data Lab might complement things like the BBC Local Reporting Scheme, or feed leads into the C4CJ led “representative network for community journalism”.)

Data journalism job ads are still appearing, too. A recent call for a Senior Broadcast Journalist (Data), BBC Look North suggests the applicant should be able:

  • To generate ideas for data-driven stories and for how they might be developed and visualized.
  • To explore those ideas using statistical tools – and present them to wider stakeholders from a non-statistical background.
  • To report on and analyse data in a way that contributes to telling compelling stories on an array of news platforms.
  • To collaborate with reporters, editors, designers and developers to bring those stories to publication.
  • To use statistical tools to identify significant data trends.

The ad suggests that required skills include good knowledge of Microsoft Excel, a strong grasp of how to clean, parse and query data as well as database management*, [and] demonstrable experience of visualising data and using visualisation tools such as SPSS, SAS, Tableau, Refine and Fusion Tables.

* I’m intrigued as to what this might mean. At the entry level, I like to think this means getting data into something like SQLite and then running SQL queries over it? It’s also worth remembering that Google Sheets exposes an SQL-like interface you can query (example, about).
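By way of illustration, here’s a minimal sketch of that entry level workflow – pop a CSV into SQLite via pandas, then run SQL queries over it (the filename and column names are made up):

import sqlite3
import pandas as pd

#Load a (hypothetical) spending CSV into a SQLite database table...
conn = sqlite3.connect('spending.db')
df = pd.read_csv('transparency_spending.csv')
df.to_sql('spending', conn, if_exists='replace', index=False)

#...then query it with SQL
pd.read_sql_query('SELECT SupplierName, SUM(Amount) AS total FROM spending '
                  'GROUP BY SupplierName ORDER BY total DESC LIMIT 10', conn)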

When I started pottering around in the muddy shores of “data journalism” as it became a thing, Google Sheets, Fusion Tables and Open (then Google) Refine were the tools I tried to promote because I saw them as a relatively easy way in to working with data. But particularly with the advent of accessible working environments like RStudio and Jupyter notebooks, I have moved very much more towards the code side. This is perceived as a much harder sell – it requires learning to code – but it’s amazing what you can do with a single line of code, and in many cases someone has already written that line, so all you have to do is copy it; environments like Jupyter notebooks also provide a nicer (simpler) environment for trying out code than scary IDEs (even the acronym is impenetrable;-). As a consequence of spending more time in code, it’s also made me think far more about reproducible and transparent research (indeed, “reproducible data journalism”), as well as the idea of literate programming, where code, text and, particularly in research workflows, code outputs, together form a (linear) narrative that make it easier to see and understand what’s going on…
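For example, given a suitably tidy CSV published somewhere on the web, a single (if long) line of pandas will grab it and chart it – the URL and column names below are made up, but the pattern is the point:

import pandas as pd

#One line: fetch a CSV from a URL and plot total spend by directorate (illustrative URL/columns)
pd.read_csv('http://example.com/spending.csv').groupby('Directorate')['Amount'].sum().plot(kind='barh');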

As well as the data journalism front, I’ve also kept half an eye on how academic libraries have been engaging with data issues, particularly from an “IT skills” development perspective. Generally, they haven’t, although libraries are often tasked with supporting research data management projects, as this job ad posted recently by the University of Michigan (via @lorcanD) for a data workflows specialist shows:

The Research Data Workflows Specialist will advance the library’s mission to create and sustain data services for the c­ampus that support the mission of the University of Michigan researchers through Research Data Services (RDS), a new and growing initiative that will build the next generation of data curation systems. A key focus of this position will be to understand and connect with the various disciplinary workflows on campus in order to inform the development of our technical infrastructure and data services.

I suspect this is very much associated with research data management. It seems to me that there’s still a hole when it comes to helping people put together their own reproducible research toolchains and technology stacks (as well as working out what sort of toolchain/stack is actually required…).

Finally, I note that NotWestminster is back later next month in Huddersfield (I managed to grab a ticket last night). I have no idea what to expect from the event, but it may generate a few ideas for what I can usefully do with Island data this year…

PS just spotted another job opportunity in a related opendata area: Data Labs and Learning Manager, 360 Giving.

A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quiet year on the Wrangling F1 Data With R front, with a not-even-aborted start at a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I’d start tinkering with a wider range of motorsport data.

The start of the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I might give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

world_rally_championship_-_results_monte_carlo_-_wrc_com

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

wrc_results_scraper1

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0")
#Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart 

wrc_results_scraper2

The WRC results cover all entrants to the rally, but not all the cars are classed as full-blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

#df_overall is assumed to be the overall classification dataframe (e.g. as returned by getStageResults()),
#with a groupClass column identifying each entrant's class
rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

wrc_results_scraper3

The position is actually the position of the driver across all entry classes, not just RC1. This means if a driver has a bad day, they could be placed well down the all-class field; but that’s not of too much interest if all we’re interested in is the in-class ranking.

So what about if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label to identify each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['rank'], d['driverName'])

wrc_results_scraper4

This chart is a bit cleaner, but now we lose information about the lower placed in-class drivers, in particular their overall position when other classes are taken into account too…

The FIA recover this information in their stage chart displays, which report on the evolution of the race for the top 10 cars overall (irrespective of class) and show excursions in interim stages outside the top 10 “below the line”, annotating them further with their overall classification on the corresponding stage.

stage_chart___federation_internationale_de_l_automobile

We can use this idea by assigning a “re-rank” to each car if it is positioned outside the size of the class.

#Reranking...
RC1SIZE=12 #Assumed size of the RC1 class entry (12 cars in this rally, as discussed below)
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank'] +RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

wrc_results_scraper5

The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes, and then bunching those placed outside the group size. (Only 11 names are shown because one of the entries retired right at the start of the rally.)

So for example, in this case we see how Neuville, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

#Label any excursions outside the top RC1SIZE places with the overall position on that stage
for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows():
    ax.text(d['stage']-0.1, d['xrank'], d['pos'],
            bbox=dict(boxstyle='round,pad=0.3', color='pink')) #facecolor='none',edgecolor='black',

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

#Flip the y-axis
plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

wrc_results_scraper6

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.
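At its simplest, the templated text part amounts to little more than the following sort of sketch (the column names and figures are illustrative, not from a real dataset):

import pandas as pd

#Illustrative row standing in for a summarised dataset
df = pd.DataFrame([{'month': 'October 2016', 'directorate': 'Adult Services', 'amount': 1234567}])

#A simple sentence template populated from each row
template = 'In {month}, the {directorate} directorate spent a total of £{amount:,.0f}.'
for _, row in df.iterrows():
    print(template.format(**row))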

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

exceldata2slide

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

iw_transparency_spending_data

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

jupyterslidetextgen

{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}
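For reference, the “appropriate metadata element” mentioned above is the per-cell slideshow setting that the nbconvert slides exporter picks up on; created programmatically (here via the standalone nbformat package), a slide cell looks something like this:

import nbformat.v4 as nb4

#A markdown cell flagged to start a new slide via its slideshow metadata
cell = nb4.new_markdown_cell('# Spend by directorate',
                             metadata={"slideshow": {"slide_type": "slide"}})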

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

demo_-_interactive_shell_for_blockdiag_-_blockdiag_1_0_documentation

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.
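(The seaborn restyling amounts to a one-liner along these lines, applied before the charts are drawn:)

import seaborn as sns

#Switch subsequent matplotlib/pandas charts onto one of seaborn's built-in themes
sns.set_style('whitegrid')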

testfullslidenotebook2_slides1

The next slide reviews the capital versus revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all the other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.

testfullslidenotebook2_slides2

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

testfullslidenotebook2_slides3

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed, in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

Accounting for the 80% – A Quick Pareto Principle Filter for pandas

Having decided (again) to try to do something with local government transparency spending data this year, I thought I’d take the tack of generating some simple reports that just identify the significant spending areas within a particular directorate or service area.

It’s easy enough to render dozens of charts that show some bars bigger than others, and from this suggest the significant spending areas, but this still requires folk to spend time reading those charts, and runs the risk that they don’t “read” from the chart what you wanted them to… (This is one way in which titles and captions can help…)

dirspend

So how about putting the Pareto Principle, or 80/20 rule, to work, where 80% of some effect or other (such as spending) is accounted for by 20% of contributory effects (such as spending in a particular area, or to a particular set of suppliers)?

In other words, is one way in to the spending data to use it simply to see what accounts for 80%, or thereabouts, of monthly spend?

Here’s a quick function that tries to do something like that, that can be applied to a pandas Series:

def paretoXY(s, x, y,threshold=0):
    ''' Return a list until you account for X% of the whole and remainders are less than Y% individually.
        The threshold percentage allows you to hedge your bets and check items just past the threshold. '''
    #Generate percentages, and sort, and find accumulated total
    #Exclude any negative payments that can make the cumulative percentage go > 100% before we get to them
    df=pd.DataFrame( s[s>0].sort_values(ascending=False) )
    df['pc']= 100*s/s.sum()
    df['cum_pc']=df['pc'].cumsum()
    #Return ordered items that either together account at least for X% of the total,
    # and/or individually account for at least Y% of the total
    #The threshold provides a fudge factor on both components...
    return df[ (df['cum_pc']-df['pc'] <= x+ x*threshold/100) | (df['pc'] >= y-y*threshold/100) ]
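For example, applied to a made-up set of payments totalled by supplier:

import pandas as pd

#Illustrative data - total spend by supplier as a pandas Series
df = pd.DataFrame({'Supplier Name': ['A', 'B', 'C', 'D', 'E'],
                   'Amount': [50000, 30000, 12000, 5000, 3000]})
spend_by_supplier = df.groupby('Supplier Name')['Amount'].sum()

#Items that together account for 80% of the spend, plus any others contributing at least 5%
paretoXY(spend_by_supplier, 80, 5)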

iw_transparency_spending_-_adult_services

The resulting report simply describes the components that either together make up 80% (or whatever) of the total in each area, or that represent a significant contribution (howsoever defined) in their own right to the corresponding total.

In the above case, the report describes the significant expense types in capital or revenue streams for each directorate for a given month’s spending data.

The resulting dataframe can also be converted to a crude text report summarising percentage contributions to specific areas easily enough:

iw_transparency_spending_-_adult_services2
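Something like the following (continuing the made-up spend_by_supplier example above) is enough to generate that sort of summary:

#Crude templated sentences from the paretoXY() output
significant = paretoXY(spend_by_supplier, 80, 5)
for name, row in significant.iterrows():
    print('{} accounted for {:.1f}% of the total spend.'.format(name, row['pc']))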

Automatically Generating Two Dimensional Reveal.js Documents Using Jupyter Notebooks

One of the things I finally got round to exploring whilst at the Reproducible Research Using Jupyter Notebooks curriculum development hackathon was the ability to generate slideshows from Jupyter notebooks.

The underlying slideshow presentation framework is reveal.js. This uses a 1.5(?) dimensional slide geometry, so slides can transition left to right, or you can transition down to subslides off a single slide.

This got me wondering… could I use a notebook/script to generate a reveal.js slideshow that could provide a convenient way of navigating automatically generated slideshows made up from automatically generated data2text slides?

The 1.5/two dimensional component would mean that slides could be structured by topic horizontally, with subtopic vertically downwards within a topic.

A quick test suggests that this is absolutely doable…

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4
#(In more recent installs, the standalone nbformat package exposes the same calls:
# import nbformat as nb; import nbformat.v4 as nb4)

test=nb4.new_notebook()
test.cells.append(nb4.new_markdown_cell('# Test slide1',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('# Test slide2',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide2 extra content line 1\n\nSlide2 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Test slide3',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 1',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 2',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('# Slide4',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide4 extra content line 1\n\nSlide4 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Slide4 subslide',metadata={"slideshow": {"slide_type": "subslide"}}))

nbf='testslidenotebook.ipynb'
nb.write(test,nbf)

#Generate and render slideshow
!jupyter nbconvert $nbf --to slides --post serve

Let the fun begin…:-)

PS here’s a second pass:

def addSlideComponent(notebook, content, styp=''):
    if styp in ['slide','fragment','subslide']: styp={"slideshow": {"slide_type":styp}}
    else: styp={}
    notebook.cells.append(nb4.new_markdown_cell(content, metadata=styp))

test=nb4.new_notebook()
addSlideComponent(test,'# Test2 slide1','slide')
addSlideComponent(test,'# Test slide2','slide')
addSlideComponent(test,'Slide2 extra content line 1\n\nSlide2 extra content line 2')
addSlideComponent(test,'# Test slide3','slide')
addSlideComponent(test,'Slide3 fragment 1','fragment')
addSlideComponent(test,'Slide3 fragment 2','fragment')
addSlideComponent(test,'# Slide4','slide')
addSlideComponent(test,'Slide4 extra content line 1\n\nSlide2 extra content line 1')
addSlideComponent(test,'# Slide4 subslide','subslide')

nbf='testslidenotebook2.ipynb'
nb.write(test,nbf)

Weekly Subseries Charts – Plotting NHS A&E Admissions

A post on Richard “Joy of Tax” Murphy’s blog a few days ago caught my eye – Data shows January is often the quietest time of the year for A & E departments – with a time series chart showing weekly admission numbers to A&E from a time when the numbers were produced weekly (they’re now produced monthly).

In a couple of follow up posts, Sean Danaher did a bit more analysis to reinforce the claim, again generating time series charts over the whole reporting period.

For me, this just cries out for a seasonal subseries plot. These are typically plotted over months or quarters, and show for each month (or quarter) the year on year change of an indicator value. Rendering weekly subseries plots is a bit more cluttered – 52 weekly subcharts rather than 12 monthly ones – but still doable.

I haven’t generated subseries plots from pandas before, but the handy statsmodels Python library has a charting package that looks like it does the trick. The documentation is a bit sparse (I looked to the source…), but given a pandas dataframe and a suitable period based time series index, the chart falls out quite simply…
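For the standard monthly case, the helper amounts to something like the following sketch (dummy data, purely for illustration):

import pandas as pd
import statsmodels.graphics.tsaplots as tsaplots

#Dummy monthly series with a monthly PeriodIndex
idx = pd.period_range('2012-01', '2016-12', freq='M')
y = pd.Series(range(len(idx)), index=idx, dtype=float)

#One subchart per calendar month, showing that month's values year on year
tsaplots.month_plot(y);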

Here’s the chart and then the code… the data comes from NHS England, A&E Attendances and Emergency Admissions 2015-16 (2015.06.28 A&E Timeseries).

DO NOT TRUST THE FOLLOWING CHART

aeseasonalsubseries

(Yes, yes I know; needs labels etc etc; but it’s a crappy graph, and if folk want to use it they need to generate a properly checked and labelled version themselves, right?!;-)

import pandas as pd
# !pip3 install statsmodels
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsaplots
import matplotlib.pyplot as plt

!wget -P data/ https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/04/2015.06.28-AE-TimeseriesBaG87.xls

dfw=pd.read_excel('data/2015.06.28-AE-TimeseriesBaG87.xls',skiprows=14,header=None,na_values='-').dropna(how='all').dropna(axis=1,how='all')
#Faff around with column headers, empty rows etc
dfw.ix[0,2]='Reporting'
dfw.ix[1,0]='Code'
dfw= dfw.fillna(axis=1,method='ffill').T.set_index([0,1]).T.dropna(how='all').dropna(axis=1,how='all')

dfw=dfw[dfw[('Reporting','Period')].str.startswith('W/E')]

#pandas has super magic "period" datetypes... so we can cast a week ending date to a week period
dfw['Reporting','_period']=pd.to_datetime(dfw['Reporting','Period'].str.replace('W/E ',''), format='%d/%m/%Y').dt.to_period('W') 

#Check the start/end date of the weekly period
#dfw['Reporting','_period'].dt.asfreq('D','s')
#dfw['Reporting','_period'].dt.asfreq('D','e')

#Timeseries traditionally have the datey-timey thing as the index
dfw=dfw.set_index([('Reporting', '_period')])
dfw.index.names = ['_period']

#Generate a matplotlib figure/axis pair to give us easier access to the chart chrome
fig, ax = plt.subplots()

#statsmodels has quarterly and monthly subseries plot helper functions
#but underneath, they use a generic seasonal plot
#If we groupby the week number, we can plot the seasonal subseries on a week number basis
tsaplots.seasonal_plot(dfw['A&E attendances']['Total Attendances'].groupby(dfw.index.week),
                       list(range(1,53)),ax=ax)

#Tweak the display
fig.set_size_inches(18.5, 10.5)
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=90);

As to how you read the chart – each line shows the trend over years for a particular week’s figures. The week number is along the x-axis. This chart type is really handy for letting you see a couple of things: year on year trend within a particular week; repeatable periodic trends over the course of the year.

A glance at the chart suggests weeks 24-28 (months 6/7 – so June/July) are the busy times in A&E?

PS the subseries plot uses pandas timeseries periods; see eg Wrangling Time Periods (such as Financial Year Quarters) In Pandas.
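For example, casting a week ending date to a weekly period and then recovering the start and end days of that week:

import pandas as pd

#Cast a week ending date to a weekly period...
p = pd.Timestamp('2015-06-28').to_period('W')

#...then recover the start and end days of that weekly period
p.asfreq('D', 's'), p.asfreq('D', 'e')
#(Period('2015-06-22', 'D'), Period('2015-06-28', 'D'))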

PPS Looking at the chart, it seems odd that the numbers always go up in a group. Looking at the code:

def seasonal_plot(grouped_x, xticklabels, ylabel=None, ax=None):
    """
    Consider using one of month_plot or quarter_plot unless you need
    irregular plotting.

    Parameters
    ----------
    grouped_x : iterable of DataFrames
        Should be a GroupBy object (or similar pair of group_names and groups
        as DataFrames) with a DatetimeIndex or PeriodIndex
    """
    fig, ax = utils.create_mpl_ax(ax)
    start = 0
    ticks = []
    for season, df in grouped_x:
        df = df.copy() # or sort balks for series. may be better way
        df.sort()
        nobs = len(df)
        x_plot = np.arange(start, start + nobs)
        ticks.append(x_plot.mean())
        ax.plot(x_plot, df.values, 'k')
        ax.hlines(df.values.mean(), x_plot[0], x_plot[-1], colors='k')
        start += nobs

    ax.set_xticks(ticks)
    ax.set_xticklabels(xticklabels)
    ax.set_ylabel(ylabel)
    ax.margins(.1, .05)
    return fig

there’s a df.sort() in there – which I think should be removed, assuming that the data presented is pre-sorted in the group?