Algorithmic Pareidolia

[Image: Google search result for "define pareidolia"]

Exactly…

(According to the Collins English Dictionary, pareidolia (noun): the imagined perception of a pattern or meaning where it does not actually exist, as in considering the moon to have human features.)

Python Code Stepper / Debugger / Tutor for Jupyter Notebooks – nbtutor

Whilst reviewing / scoping* possible programming editor environments for the new level 1 courses, one of the things I was encouraged to look at was Philip Guo’s interactive Python Tutor.

According to the original writeup (Philip J. Guo. Online Python Tutor: Embeddable Web-Based Program Visualization for CS Education. In Proceedings of the ACM Technical Symposium on Computer Science Education (SIGCSE), March 2013), the application has an HTML front end that calls on a backend debugger: “the Online Python Tutor backend takes the source code of a Python program as input and produces an execution trace as output. The backend executes the input program under supervision of the standard Python debugger module (bdb), which stops execution after every executed line and records the program’s run-time state.”

The current version of the online tutor supports a wider range of languages – Python, Java, JavaScript, TypeScript, Ruby, C, and C++ – which presumably have their own backend interpreter and use a common trace response format?

The tutor itself allows you to step through code snippets a line at a time, displaying a trace of the current variable values.

Another nice feature of the Online Python Tutor, though it was a bit ropey when I first tried it out a few months ago, was the shared session support, whereby a learner and a tutor see the same session via a shared link, with an additional chat box allowing them to chat over the shared experience in realtime.

The Online Python Tutor also allows URLs to saved programs (“tutorials”) to be generated and shared (link to the demo shown in the movie above); the code is actually passed via the URL.

One of the problems with the Online Python Tutor is that it requires a network connection so that the code can be passed to the interpreter back end, executed to generate the code trace, and then passed back to the browser. It didn’t take long for folk to start embedding the tutor in an iframe to give a pseudo-tracing experience in the notebook context, but now the Online Python Tutor inspired nbtutor extension makes cell based tracing against the local python kernel possible**.

The nbtutor extension provides cell by cell tracing (when running a cell, all the code in the cell is executed, the trace returned, and then made available for visualising). Note that all variables in scope are displayed in the trace, even if they have been set in other cells outside of the nbtutor magic. (I’m not sure if there’s a setting that allows you just to display the variables that are referenced within the cell?) It is also possible to clear all variables in the global scope via a magic parameter, with a prompt to confirm that you really do want to clear out all those variable values.
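For reference, basic usage looks something like the following, with each chunk below going in its own notebook cell; the --reset/--force flag names are from memory, so double check them against the nbtutor README.

%load_ext nbtutor

%%nbtutor
#Step through this cell's execution trace in the nbtutor visualiser
a = 1
b = 2
c = a + b

%%nbtutor --reset --force
#--reset clears the global scope first; --force skips the confirmation prompt
total = sum(range(5))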

I’m not sure what the best way would be to go about framing nbtutor exercises in a Jupyter notebook context, but I note that the notebooks used to support the MPR213 (Programming and Information Technology) course from the Department of Mechanical and Aeronautical Engineering in the Faculty of Engineering, Built Environment and Information Technology at the University of Pretoria now include nbtutor examples.

Footnotes:

* A cynic might say scoping in the sense of not seriously considering anything other than the environments that had already been decided on before the course production process had really started… ;-) I also preferred BlockPy over Scratch, for example. My feeling was that if the OU was going to put developer effort in (the original claim was we wouldn’t have to put effort into Scratch, though of course we are because Scratch wasn’t quite right…) we could add more value to the OU and the community by getting involved with BlockPy, rather than a programming environment developed for primary school kids. Looking again at the “friendly” error messages that the BlockPy environment offers, I’m starting to wonder if elements of that could be reused for some IPython notebook magic…

** Again, I’m of the mind that were it 20 years ago, porting the Online Python Tutor to the Jupyter notebook context might have been something we’d have considered doing in the OU…

Displaying Differences in Jupyter Notebooks – nbdime / nbdiff

One of the challenges of working with Jupyter notebooks to date has been the question of diffing, spotting the differences between two versions of the same notebook. This made collaborative authoring and reviewing of notebooks a bit tricky. It also acted as a brake on using notebooks for student assessment. It’s easy enough to set an exercise using a templated notebook and then get students to work through it, but marking the completed notebook in return can be a bit fiddly. (The nbgrader system addresses this in part, but at the expense of the overhead of having to use additional nbgrader formatting and markup.)

However, there’s ongoing effort now around nbdime (docs). Given past success in getting Jupyter notebook previews displayed in Github, it wouldn’t be unreasonable to think that the diff view might make it into that environment at some point too…

At the moment, nbdime works from the command line. It can produce a text diff in the console, or launch a notebook viewer in the browser that shows differences between two notebooks.
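There’s also a Python level API; if I’m reading the nbdime docs right, something like the following should hand you the raw diff structure programmatically (the notebook filenames are just placeholders):

import nbformat
import nbdime

#Load the two versions of the notebook
nb_a = nbformat.read('original.ipynb', as_version=4)
nb_b = nbformat.read('edited.ipynb', as_version=4)

#diff_notebooks() returns a list of diff entries describing cell level changes
diff = nbdime.diff_notebooks(nb_a, nb_b)
print(len(diff), 'top level changes')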

The differ works on a cell by cell basis and highlights changes and additions. (Extra emphasis on the changed text in a markdown cell doesn’t seem to work at the moment?)

[Image: nbdime notebook diff view]

If you change the contents of a code cell, or the outputs of a code cell have changed, those differences are identified too. (Note the extra emphasis in the code cell on the changed text, but not in the output.)

[Image: nbdime diff view showing a changed code cell and its output]

To improve readability, you can collapse the display of changed code cell output.

[Image: nbdime diff view with the changed code cell output collapsed]

Where cell outputs include graphical objects, differences to these are highlighted too.

[Image: nbdime diff view highlighting differences in graphical cell outputs]

(Whilst I note that Github has various tools for exploring the differences between two versions of the same image, I suspect that sort of comparison will be difficult to achieve inline in the notebook differencer.)

I suspect one common way of using nbdime will be to compare the current state of a notebook with a checkpointed version. (Jupyter notebooks autosave the current state of the notebook quite regularly. If you force a save, the current state is saved but a “checkpoint” version of the notebook is also saved to a hidden folder. If things go really wrong with your current notebook, you can restore it to the checkpointed version.)

If you’ve saved a checkpoint of a notebook, and want to compare the current (autosaved) version with it, you need to point to the checkpointed file in the checkpoint folder: nbdiff-web .ipynb_checkpoints/MY_FILE-checkpoint.ipynb MY_FILE.ipynb. It’d be nice if a switch could handle this automatically, eg nbdiff-web --compare-checkpoint MY_FILE.ipynb. (It would also be nice if the nbdiff command could force the notebook to autosave before a diff is run, but I’m not sure how that could be achieved?)
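In the meantime, a few lines of Python can fake that switch by building the checkpoint path from the naming convention described above and handing it to nbdiff-web (just a sketch, assuming the default checkpoint location):

import os
import subprocess

def diff_against_checkpoint(notebook_path):
    #Jupyter saves checkpoints as .ipynb_checkpoints/NAME-checkpoint.ipynb alongside the notebook
    dirname, fname = os.path.split(notebook_path)
    stem = fname[:-len('.ipynb')]
    checkpoint = os.path.join(dirname, '.ipynb_checkpoints', stem + '-checkpoint.ipynb')
    #Launch the browser based differ against the checkpointed version
    subprocess.call(['nbdiff-web', checkpoint, notebook_path])

#e.g. diff_against_checkpoint('MY_FILE.ipynb')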

It also strikes me that when restoring from a checkpoint, it might be possible to combine the restoration action with the differencer view so that you can decide which bits of the current notebook you might want to keep (i.e. essentially treat the differences between the current and checkpointed version as conflicts that need to be resolved?)

This is probably pushing things a bit far, but I also wonder if lightweight, inline, cell level differencing would be possible, given that each cell in a running notebook has an undo feature that goes back multiple steps?

Finally, a note about using the differencer to support marking. The differencer view is an HTML file, so whilst you can compare a student’s notebook with the original, you can’t edit their notebook directly in the differencer to add marks or feedback. (I really do need to have another play with nbgrader, I think…)

PS It’s also worth noting that SageMathCloud has a history slider that lets you run over different autosaved versions of a notebook, although differences are not highlighted.

PPS Thinks: what I’d like is a differencer that generates a new notebook with addition/deletion cells highlighted and colour styled so that I could retain – or delete – the cell and add cells of my own… Something akin to track changes, for example. That way I could run different cells, add annotations, etc etc (related issue).

Data Journalism Units on Github

Working as I do with an open notebook (this blog, my github repos, pinboard and twitter), I value works shared by other people too. Often, this can lead to iterative development, as one person sees an opportunity to use someone else’s work for a slightly different purpose, or spots a way to improve upon it.

A nice example of this that I witnessed in more or less realtime a few years ago was when data journalists from the Guardian and the Telegraph – two competing news outlets – bounced off each others’ work to produce complementary visualisations demonstrating electoral boundary changes (Data Journalists Engaging in Co-Innovation…). (By the by, boundary changes are up for review again in 2018 – the consultation is still open.)

Another example comes from when I started to look for cribs around building virtual machines to support OU course delivery. Specifically, the Infinite Interns idea for distinct (and disposable) virtual machines that could be used to support data journalism projects (about).

Today, I variously chanced across a couple of Github repositories containing data, analyses, toolkits and application code from a couple of different news outlets. Github was originally developed as a social coding environment where developers could share and collaborate on software projects. But over the last few years, it’s also started to be used to share data and (text) documents, as well as reproducible data analyses – and not just by techies.

A couple of different factors have contributed to this, I think, that relate as much to how Github can be used to preview and publish documents, as act as a version control and issue tracking system:

[Image: options for previewing and publishing documents on Github]

Admittedly, using git and Github can be really complicated and scary, but you can also use it as a place to pop documents and preview them or publish them, as described above. And getting files in is easy too – just upload them via the web interface.

Anyway, that’s all by the by… The point of this post was to try to pull together a small collection of links to some of the data journalism units I’ve spotted sharing stuff on Github, and see to what extent they practice “reproducible data journalism”. (There’s also a Github list – Github showcase – open journalism.) So for example:

  • one of the first news units I spotted sharing research in a reproducible way was BuzzFeedNews and their tennis betting analysis. A quick skim of several of the repos suggests they use a similar format – a boilerplate README with a link to the story, the data used in the analysis, and a Jupyter notebook containing python/pandas code to do the analysis. They also publish a handy directory to their repos, categorised as Data and Analyses, Standalone Datasets, Libraries and Tools, Guides. I’m liking this a lot…
  • fivethirtyeight: There are also a few other data related repos at the top level, eg guns-data. Hmm… Data but no reproducible analyses?
  • SRF Data – srfdata (data-driven journalism unit of Swiss Radio and TV): several repos containing Rmd scripts (in English) analysing election related data. More of this, please…
  • FT Interactive News – ft-interactive: separate code repos for different toolkits (such as their nightingale-charts chart tools) and applications; a lot of the applications seem to appear in subscriber only stories – but I guess you can try to download the code and run it yourself… Good for sharing code, but poor in that the paywall stops the sharing of executed examples;
  • New York Times – NYTimes: plenty of developer focussed repos, although the gunsales repo creates an R package that works with a preloaded dataset and routines to visualise the data and the ingredient phrase tagger is a natural language parser trained to tag food recipe components. (Makes me wonder what other sorts of trained taggers might be useful…) One for the devs…
  • Washington Post – washingtonpost: more devops repos, though they have also dropped a database of shootings (as a CSV file) as one of the repos (data-police-shootings). I’d hoped for more…
  • NYT Newsroom Developers: another developer focussed collection of repos, though rather than focussing on just front end tools there are also scrapers and API helpers. (It might actually be worth going through all the various news/media repos to build a metalist/collection of API wrappers, scrapers etc. i.e. tools for sourcing data). I’d half expected to see more here, too…?
  • Wall Street Journal Graphics Team – WSJ: not much here, but picking up on the previous point there is this example of an AP ballot API wrapper; Sparse…
  • The Times / Sunday Times – times: various repos, some of them link shares; the data one collects links to a few datasets and related stories. Also a bit sparse…
  • The Economist – economist-data-team: another unloved account – some old repos for interactive HTML applications; Another one for the devs, maybe…
  • BBC England Data Unit – BBC-Data-Unit: a collection of repositories, one per news project. Recent examples include: Dog Fights and Schools Chemical Alerts. Commits seem to come from a certain @paulbradshaw… Repos seem to include a data file and a chart image. How to run the analysis or create the chart from the data is not shared… Could do better…

From that quick round up, a couple of impressions. Firstly, BuzzFeedNews seem to be doing some good stuff; the directory listing they use that breaks down different sorts of repos seems sensible, and could provide the basis for a more scholarly round up than the one presented here. Secondly, we could probably create some sort of matrix view over the various repos from different providers, that would allow us, for example, to see all the chart toolkits, or all the scrapers, or all the API wrappers, or all the election related stuff.

If you know of any more I should add to the list, please let me know via the comments below, ideally with a one or two line summary as per the above to give a flavour of what’s there…

I’m also mindful that a lot of people working for the various groups may also be publishing to personal repositories. If you want to suggest names for a round up of those, again, please do so via the comments.

PS I should really be recording the licenses that folk are releasing stuff under too…

Opportunities for Doing the Civic Thing With Open and Public Data

I’ve been following the data thing for a few years now, and it’s been interesting to see how data related roles have evolved over that time.

For my own part, I’m really excited to have got the chance to work with the Parliamentary Digital Service [PDS] [blog] for a few days this coming year. Over the next few weeks, I hope to be starting to nose around Parliament and the Parliamentary libraries getting a feel for The Life of Data there, as well as getting in touch with users of Parliamentary data more widely (if you are one, or aspire to be one, say hello in the comments to see if we can start to generate more opportunities for coffee…:-)

I’m also keen to see what the Bureau of Investigative Journalism’s Local Data Lab, headed up by Megan Lucero, starts to get up to. There’s still a chance to apply for a starting role there if you’re “a journalist who uses computational method to find stories, an investigative or local journalist who regularly uses data, a tech or computer science person who is interested in local journalism or a civic tech person keen to get involved”, and the gig looks like it could be a fun one:

  • We will take on datasets that have yet to be broken down to a local level, investigate and reveal stories not yet told and bring this to local journalists.
  • We will be mobile, agile and innovative. The team will travel around the country to hear the ideas and challenges of regional reporters. We will listen, respond and learn so to provide evidence-based solutions.
  • We will participate in all parts of the process. Every member will contribute to story ideas, data wrangling, problem solving, building and storytelling.
  • We will participate in open journalism. We will publish public interest stories that throw light on local and national issues. We will open our data and code and document our shortcomings and learnings. We will push for greater transparency. We will foster collaboration between reporters and put power into regional journalism.

I’m really hoping they start a fellowship model too, so I can find some way of getting involved and maybe try to scale some of the data wrangling I will be doing around Isle of Wight data this year to wider use. (I wonder if they’d be interested in a slackbot/datawire experiment or two?!) After all, having split the data out for one local area, it’s often trivial to change the area code and do the same for another:

(It’ll also be interesting to see how the Local Data Lab might complement things like the BBC Local Reporting Scheme,  or feed leads into the C4CJ led “representative network for community journalism”.)

Data journalism job ads are still appearing, too. A recent call for a Senior Broadcast Journalist (Data), BBC Look North suggests the applicant should be able:

  • To generate ideas for data-driven stories and for how they might be developed and visualized.
  • To explore those ideas using statistical tools – and present them to wider stakeholders from a non-statistical background.
  • To report on and analyse data in a way that contributes to telling compelling stories on an array of news platforms.
  • To collaborate with reporters, editors, designers and developers to bring those stories to publication.
  • To use statistical tools to identify significant data trends.

The ad suggests that required skills include good knowledge of Microsoft Excel, a strong grasp of how to clean, parse and query data as well as database management*, [and] demonstrable experience of visualising data and using visualisation tools such as SPSS, SAS, Tableau, Refine and Fusion Tables.

* I’m intrigued as to what this might mean. At an entry level, I like to think this means getting data into something like SQLite and then running SQL queries over it. It’s also worth remembering that Google Sheets exposes an SQL-like interface that you can query (example, about).
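By way of example, here’s the sort of entry level thing I have in mind: use pandas to drop a CSV file into SQLite and then run SQL queries over it (the filename and column names are made up for the purposes of the example):

import sqlite3
import pandas as pd

#Load a CSV file into a SQLite database table
conn = sqlite3.connect('spending.db')
pd.read_csv('spending.csv').to_sql('spending', conn, if_exists='replace', index=False)

#Then query it with SQL
q = 'SELECT supplier, SUM(amount) AS total FROM spending GROUP BY supplier ORDER BY total DESC LIMIT 10'
print(pd.read_sql_query(q, conn))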

When I started pottering around in the muddy shores of “data journalism” as it became a thing, Google Sheets, Fusion Tables and Open (then Google) Refine were the tools I tried to promote because I saw them as a relatively easy way in to working with data. But particularly with the advent of accessible working environments like RStudio and Jupyter notebooks, I have moved very much more towards the code side. This is perceived as a much harder sell – it requires learning to code – but it’s amazing what you can do with a single line of code, and in many cases someone has already written that line, so all you have to do is copy it; environments like Jupyter notebooks also provide a nicer (simpler) environment for trying out code than scary IDEs (even the acronym is impenetrable;-). As a consequence of spending more time in code, it’s also made me think far more about reproducible and transparent research (indeed, “reproducible data journalism”), as well as the idea of literate programming, where code, text and, particularly in research workflows, code outputs, together form a (linear) narrative that make it easier to see and understand what’s going on…
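For what it’s worth, here’s the sort of single line I mean, grabbing a spending CSV from a URL, totalling the spend by directorate and charting it (the URL and column names are placeholders):

import pandas as pd

#One line: read the data, aggregate it, chart it
pd.read_csv('http://example.com/spending.csv').groupby('directorate')['amount'].sum().plot(kind='barh');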

As well as the data journalism front, I’ve also kept half an eye on how academic libraries have been engaging with data issues, particularly from an “IT skills” development perspective. Generally, they haven’t, although libraries are often tasked with supporting research data management projects, as this job ad posted recently by the University of Michigan (via @lorcanD) for a data workflows specialist shows:

The Research Data Workflows Specialist will advance the library’s mission to create and sustain data services for the campus that support the mission of the University of Michigan researchers through Research Data Services (RDS), a new and growing initiative that will build the next generation of data curation systems. A key focus of this position will be to understand and connect with the various disciplinary workflows on campus in order to inform the development of our technical infrastructure and data services.

I suspect this is very much associated with research data management. It seems to me that there’s still a hole when it comes to helping people put their own reproducible research toolchains and technology stacks together (as well as working out what sort of toolchain/stack is actually required…).

Finally, I note that NotWestminster is back later next month in Huddersfield (I managed to grab a ticket last night). I have no idea what to expect from the event, but it may generate a few ideas for what I can usefully do with Island data this year…

PS just spotted another job opportunity in a related opendata area: Data Labs and Learning Manager, 360 Giving.

A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quiet year on the Wrangling F1 Data With R front, with a not-even-aborted start at a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I’d start tinkering with a wider range of motorsport data.

The start to the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I might give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

[Image: stage results for Rallye Monte-Carlo on wrc.com]

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

[Image: scraped stage results dataframe]

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0")
#Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)
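For example, a typical rally delta such as +1:23.4 should come back as something like:

regularTimeString('+1:23.4')
#Timedelta('0 days 00:01:23.400000')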

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart 
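Something like the following then pulls back the data for the full rally (the year and rally identifier are placeholders for whatever values the WRC live ticker URLs use for Monte Carlo; note also that the groupClass column used in the charts below isn’t one of the scraped columns, so presumably needs adding via a separate scrape or merge step):

#YEAR and RALLYID are placeholders matching the parameters in the WRC results URLs
df_stage, df_overall = getStageResults(YEAR, RALLYID, list(range(1, 18)))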

[Image: stage results dataframe with timedelta columns added]

The WRC results cover all entrants to the rally, but not all the cars are classed as fully blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

[Image: chart of overall position by stage for the RC1 drivers]

The position is actually the position of the driver across all entry classes, not just RC1. This means if a driver has a bad day, they could be placed well down the all-class field; but that’s not of too much interest if all we’re interested in is in-class ranking.

So what about if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label to identify each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, i+1, d['driverName'])

[Image: chart of in-class rank by stage, with driver name labels]

This chart is a bit cleaner, but now we lose information around the lower placed in-class drivers, in particular the information about their overall position when other classes are taken into account too…

The way the FIA recover this information in their stage charts, which report on the evolution of the rally for the top 10 cars overall (irrespective of class), is to show excursions in interim stages outside the top 10 “below the line”, annotating them further with the car’s overall classification on the corresponding stage.

[Image: FIA stage chart]

We can use this idea by assigning a “re-rank” to each car that is positioned outside the class size (RC1SIZE in the code below).

#Reranking...
RC1SIZE=12 #Number of RC1 class entries
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank'] +RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

[Image: chart showing the in-class race evolution, with out-of-class placings bunched below the line]

The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes, and then bunching those placed outside the group size. (Only 11 names are shown because one of the entries retired right at the start of the rally.)

So for example, in this case we see how Neuville, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

#Label excursions outside the top RC1SIZE places with the overall position on that stage
for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows():
    ax.text(d['stage']-0.1, d['xrank'], d['pos'],
            bbox=dict(boxstyle='round,pad=0.3', color='pink')) #or: facecolor='none', edgecolor='black'

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

#Flip the y-axis
plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

[Image: final stage chart with excursions outside the top 12 labelled and the y-axis inverted]

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.
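As a crude example of the sort of templated text I mean, a sentence template can simply be filled in from each row of a dataframe (the dataframe and wording here are invented for the example):

import pandas as pd

df = pd.DataFrame({'directorate': ['Adult Services', 'Childrens Services'],
                   'total': [1250000, 980000]})

#Fill a simple sentence template from each row
template = 'The {directorate} directorate accounted for £{total:,.0f} of spend this month.'
for _, row in df.iterrows():
    print(template.format(directorate=row['directorate'], total=row['total']))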

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

[Image: blockdiag sketch of the Excel-to-Powerpoint slide recipe (source below)]

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

[Image: Isle of Wight Council transparency spending data]

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.
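To give a flavour of what “appropriately configured slide cells” means in practice, here’s a minimal sketch that builds a notebook with the slideshow metadata element set on each cell, which nbconvert can then render as a reveal.js presentation (the cell contents and filename are placeholders):

import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell

nb = new_notebook()
#The slideshow metadata element is the one the notebook's slide cell toolbar sets
nb.cells.append(new_markdown_cell('# Spending by directorate',
                                  metadata={'slideshow': {'slide_type': 'slide'}}))
nb.cells.append(new_markdown_cell('- Adult Services: £1.2m',
                                  metadata={'slideshow': {'slide_type': 'fragment'}}))
nbformat.write(nb, 'report_slides.ipynb')

#Then render the notebook as a reveal.js slideshow with something like:
#  jupyter nbconvert report_slides.ipynb --to slides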

[Image: blockdiag sketch of the Jupyter notebook data-to-slides recipe (source below)]

{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss (ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time, the slide should scroll down to follow them).

The structure of the presentation is broadly as follows:

[Image: blockdiag sketch of the presentation structure]

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

[Image: summary slide showing spend by directorate]

The next slide reviews the capital versus revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.
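Roughly speaking, the selection works along these lines, with spend assumed to be a pandas Series of totals keyed by expenses type or supplier (a sketch of the heuristic rather than the actual code):

import pandas as pd

def pareto_items(spend, threshold=0.8, alsoinclude=0.05):
    #Order items by spend and work out each item's share of the total
    spend = spend.sort_values(ascending=False)
    share = spend / spend.sum()
    #Keep items until the running total reaches the threshold...
    keep = (share.cumsum() - share) < threshold
    #...plus any other item accounting for more than alsoinclude of the total
    return spend[keep | (share > alsoinclude)]

#e.g. pareto_items(df.groupby('expensestype')['amount'].sum())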

[Image: slide showing capital versus revenue spend for a directorate, broken down by expenses type]

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

[Image: subslide showing the major suppliers for a spending area]

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…