Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset.You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.

__dropbox_parlidata_rdemos_-_rstudio

For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

  • to look for sensible ways of comparing the workflow associated with each,
  • the ramp-up skills required, and blockers (including cultural blockers (also administrative / organisational blockers, h/t @dasbarrett)) associated with getting started with new tools such as Jupyter or RStudio, and
  • the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, and what sorts of issue would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.

Tabloid Data Journalism?

At the risk of coming across as a bit snobbish, this ad for a Data Journalist for The Penny Hoarder riled me somewhat…

Do you have a passion for telling stories with data? We’re looking for a data journalist who can crunch statistics about jobs, budgeting, spending and saving — and produce compelling digital content that resonates with our readers. You should have expertise in data mining and analysis, and the ability to present the results in conversational, fun articles and/or telling graphics.

As our data journalist, you will produce revealing, clickable, data-driven articles and/or graphics, plus serve as a resource for our growing team of writers and editors. We envision using data sources such as the Bureau of Labor Statistics and U.S. Census Bureau to report on personal finance issues of interest to our national readership of young professionals, coupon fans and financially striving people of all ages. We want to infuse our blog with seriously interesting data while staying true to our vibe: fun, weird, useful.

Our ideal candidate…
– …
– Can write in a bloggy, conversational voice that emphasizes what the data means to real people
– Has a knack for identifying clicky topics and story angles that are highly shareable
– Gets excited when a blog post goes viral
– …

According to Wikipedia (who else?!;-), Tabloid journalism is a style of journalism that emphasizes sensational crime stories, gossip columns about celebrities and sports stars, junk food news and astrology.

(Yes, yes, I know, I know, tabloid papers can also do proper, hard hitting investigative journalism… But I’m thinking about that sense of the term…)

So what might tabloid data journalism be? See above?

PS ish prompted by @SophieWarnes, it’s probably worth mentioning the aborted Ampp3d project in this context… eg Ampp3d launches as ‘socially-shareable data journalism’ site, Martin Belam talks about Trinity Mirror’s data journalism at Ampp3d and The Mirror Is Making Widespread Cuts To Its Online Journalism.

Open Educational Resources from Government and Parliament

Mentioning to a colleague yesterday that the UK Parliamentary library published research briefings and reports on topics of emerging interest, as well as to support legislation, that often provided a handy, informed, and politically neutral  overview of a subject area that could make for a useful learning resource, the question was asked whether or not they might have anything on the “internet of things”. The answer is not much, but it got me thinking a bit more about the range of documents and document types produced across Parliament and Government that can be used to educate and inform, as well as contribute to debate.

In other words, to what extent might such documents be used in an educational sense, whether in the sense of providing knowledge and information about a topic, providing a structured review of a topic area and the issues associated with it, raising questions about an issue, or reporting on an analysis of it. (There are also opportunities for learning from some of the better Parliamentary debates, for example in terms of how to structure an argument, or explore the issues associated with an issue, but Hansard is out of scope of this post!)

(Also note that I’m coming at this as a technologist, interested as much in the social processes, concerns and consequences associated with science and technology as much as the deep equations and principles that tend to be be taught as the core of the subject, at least in HE. And that I’m interested not just on how we can support the teaching and learning of current undergrads, but also how we can enculturate them into the availability and use of certain types of resource that are likely to continue being produced into the future, and as such provide a class of resources that will continue to support the learning and education of students once they leave formal education.)

So using IoT as a hook to provide examples, here’s the range of documents I came up with. (At some point it maybe worth tabulating this to properly summarise the sorts of information these reports might came, the communicative point of the document (to inform, persuade, provide evidence for or against something, etc), and any political bias that may be likely (in policy docs, for example).

Parliamentary Library Research Briefings

The Parliamentary Library produces a range of research briefings to cover matters of general interest (Commons Briefing papers, Lords Library notes), perhaps identified through multiple questions asked of the Library by members?, as well as background to legislation (Commons Debate Packs, Lords in Focus), through the Commons and Lords Libraries respectively.

Some of the research briefings include data sets (do a web search for site:http://researchbriefings.files.parliament.uk/ filetype:xlsx) which can also be quite handy.

There are also POSTnotes from the Parliamentary Office of Science and Technology, aka POST.

For access to briefings on matters currently in the House, the Parliament website provides timely/handy pages that list briefings documents for matters in the House today/this week. In addition, there are feeds available for recent briefings from all three: Commons Briefing Papers feed, Lords Library Notes feed, POSTnotes feed. If you’re looking for long reads and still use a feed reader, get subscribing;-)

Wider Parliamentary Documents

The Parliament website also supports navigation of topical issues such as Science and Technology, as well as sub-topics, such as Internet and Cybercrime. (I’m not sure how the topics/sub-topics are identified or how the graph is structured… That may be one to ask about when I chat to Parliamentary Library folk next week?:-)

Within the topic areas, relevant Commons and Lords related Library research briefings are listed, as well as
POSTnotes, Select Committee Reports and Early Day Motions.

(By the by, it’s also worth noting that chunks of the Parliament website are currently in scope of a website redesign.)

Government Documents

Along with legislation currently going through Parliament that is published on the Parliament website (along with Hansard reports that record, verbatim(-ish!) proceedings of debates in either House), explanatory notes provided by the Government department bringing a bill provide additional, supposedly more accessible/readable, information around it.

Reports are also published by government offices. For example, the Blackett review (2014) on the Internet of things was a Government Office for Science report from the UK Government Chief Scientific Adviser at the time (The Internet of Things: making the most of the Second Digital Revolution). Or how about a report from the Intellectual Property Office on Eight great technologies: The internet of things.

Briefing documents also appear in a variety of other guises. For example, competitions (such as the Centre for Defence Enterprise (CDE) competition on security for the internet of things, or Ofcom’s consultation on More radio spectrum for the Internet of Things) and consultations may both provide examples of how to start asking questions about a particular topic area (questions that may help to develop critical thinking, prompt critical reflection, or even provide ideas for assessment!).

Occasionally, you can also turn up a risk assessments or cost benefit analysis, such as this Department for Business, Energy & Industrial Strategy Smart meter roll-out (GB): cost-benefit analysis.

EC Parliamentary Research Service

In passing, it’s also worth noting that the EC Parliamentary Research Service also do briefings, such as this report on The Internet Of Things: Opportunities And Challenges, as well as publishing resources linked from topic based pages, such as the Digital Single Market them topic page on The Internet of Things.

Summary

In providing support for all members of the House, the Parliamentary research services must produce research briefings that can be used by both sides of the House. This may stand in contrast to documents produced by Government that may be influenced by particular policy (and political) objectives, or formal reports published by industry bodies and the big consultancies (the latter often producing reports that are either commissioned on behalf of government or published to try to promote some sort of commercial interest that can be sold to government) that may have a lobbying aim.

As I’ve suggested previously, (News, Courses and Scrutiny and Learning Problems and Consultation Based Curricula), maybe we could/should be making more use of them as part of higher education course readings, not just as a way of getting a quick, NPOV view over a topic area, bus also as a way of introduce students to a form of free and informed content, produced in timely way in response to issues of the day. In short, a source that will continue to remain relevant and current over the coming years, as students (hopefully) become lifelong, continuing learners.

Personalised Parliamentary Printing On Demand?

After starting to reread my 6th edition copy of How Parliament Works over the weekend, which is now notably dated, I had a quick poke around Amazon looking to see whether there’s a more recent edition (there is…). In doing so, I saw various mentions to  historical “Standing Orders of the House of Commons“. A quick search of the Parliament website turned up an appropriate page, and a link to a PDF of the 2016 orders.

Having a print copy of such a document to leave laying around means I’ll able to start to pick up stuff from it using osmotic reading(?!;-), but I couldn’t find anywhere to buy such a copy. And printing it out on looseleaf A4 is way too much like faffing around.

However, it seems that on the one hand Parliamentary licensing is quite liberal (the Open Parliament License), and on the other, no-one would know anyway if I uploaded the PDF and got it printed on-demand, in bound copy, private access style, from Lulu:

content_creation_wizard

 

With a two or three quid for postage, that comes in at less than the “cover price” of £10 too.

Which got me thinking… maybe I should try to find some other reference material to bundle into the “book” too? The additional page charge for another couple of hundred pages makes no difference to the marginal cost of the postage etc…

(Unfortunately, Parliament doesn’t distribute an electronic copy of Erskine May. Instead, you need a library, or several hundred quid to give to Lexis Nexis.)

It’s a shame Lulu closed their API down, too… that could have been a useful way of eg auto-generating some POD/book printed copies of report and consultation document readings that I typically open into tabs and then never read. (Osmotic reading of long form content through a screen is something I still struggle to do…)

PS If you’ve never tried a Lulu book before, here’s one I prepared earlier… ;-)

With Tech as My Witness

We will, I think, be seeing increasing use of the surveillance devices we’ve carry with us and have installed in homes as sources of “tech witness” evidence in the courts…

For example, at the end of last year were reports of the prosecution of a 2015 crime in which the police requested copies of records (court papers, 08/26/2016 01:36 PM SEARCH WARRANT FILED) from Amazon’s audio surveillance device, the Amazon Echo (BBC, Guardian, Independent; the article that broke the story from The Information is subscription only).

ck_image_present2_

Form the justification of the request for the search warrant:

ck_image_present2

On ??, the Honorable Judge ?? reviewed and approved a search warrant for ??’s residence once again, located at ??, specifically for the search and seizure of electronic devices capable of storing and transmitting any form of data that could be related to this investigation. Officers executed this search warrant on this same date and during the course of the search, I located an Amazon Echo device in the kitchen, lying on the kitchen counter next to the refrigerator, plugged into the wall outlet. I had previously observed this device in the same position and state during the previous search warrant on ??.

While searching ??’ residence, we discovered numerous devices that were used for “smart home” services, to include a “Nest” thermometer that is Wi-Fi connected and remotely controlled, a Honeywell alarm system that included door monitoring alarms and motion sensor in the living room, a wireless weather monitoring system outside on the back patio, and WeMo devices in the garage area for remote-activated lighting purposes that had not been opened yet. All of these devices, to include the Amazon Echo device, can be controlled remotely using a cell phone, computer, or other device capable of communicating through a network and are capable of interacting with one another through the use of one or more applications or programs. Through investigation, it was learned that during the time period of ??’s time at the residence, music was being wirelessly streamed throughout the home and onto the back patio of the residence, which could have been activated and controlled utilizing the Amazon Echo device or an application for the device installed on ??’s cell Apple iPhone.

The Amazon Echo device is constantly listening for the “wake” command of “Alexa” or “Amazon,” and records any command, inquiry, or verbal gesture given after that point, or possibly at all times without the “wake word” being issued, which is uploaded to Amazon.com’s servers at a remote location. It is believed that these records are retained by Amazon.com and that they are evidence related to the case under investigation.

On ??, Amazon.com was served with a search warrant that was reviewed and approved by Circuit Court Judge ?? on the same date. The search warrant was sent through Amazon’s law enforcement email service and was also sent through United States Postal Service Certified Mail to their corporate headquarters in Tumwater, Washington. The search warrant was received by Amazon through the mail on ??, and representatives with Amazon have been in contact with this agency since receiving the search warrant. In speaking with their law enforcement liaison, Greg Haney, I was informed on two separate occasions that Amazon was in possession of the requested data in the search warrant but needed to consult with their counsel prior to complying with the search warrant. As of ??, Amazon has not provided our agency with the requested data and an extension for the originally ordered search warrant was sought.

After being served with the second search warrant, Amazon did not comply with providing all of the requested information listed in the search warrant, specifically any information that the Echo device could have transmitted to their servers. This agency maintains custody of the Echo device and it has since been learned that the device contains hardware capable of storing data, to potentially include time stamps, audio files,\nor other data. It is believed that the device may contain evidence related to this investigation and a search of the device itself will yield additional data pertinent to this case.

Our agency has also maintained custody of ??’s cell phone, an LG Model LG—E980, and ??’s cell phone, a Huawei Nexus cell phone, that was seized from ?? as a result of his arrest on ??, and we have been unable to access the data stored on the devices due to a passcode lock on them. Despite efforts to obtain the passcode, the devices could not be accessed. Our agency now has the ability to utilize data extraction methods that negate the need for passcodes and efforts to search ?? and ??’s devices will continue upon issuance of this warrant.

Today, via @charlesarthur (& also Schneier), I notice a story describing how Cops use pacemaker data to charge homeowner with arson, insurance fraud. (I found some (court records Middletown, Butler County, 16CRA04386) but couldn’t find/see the filing for the warrant?) It seems that “[p]olice set out to disprove ??’s story … by obtaining a search warrant to collect data from [his] pacemaker. WLWT5 reported that the cops wanted to know “??’s heart rate, pacer demand and cardiac rhythms before, during and after the fire.”

This builds on previous examples of Fitbit data being called on as evidence in at least of couple of US court cases, challenging claims made by individuals that they were engaged in one sort of behaviour when their logged physiological data suggested they were not.

And of course, many cars now have their own black box, which is likely to include ever more detailed data logs. For example, a recent report by the US Depart of Transportation National Highway Traffic Safety Administration (NHTSA) included reference to “data logs, image files, and records related to the crashes … provided by Tesla in response to NHTSA subpoenas.”

It’ll be interesting to see the extent to which contemporary data/video/audio collecting devices will be viewed as reliable (or unreliable) witnesses, and further down the line, the extent to which algorithmic classifications are trusted. For example, in using OCR to extract the text from the scanned PDF of the court filing shown above, which for some reason I had to convert to a JPG image before Apache Tika running on docker cloud would extract text from it, I noticed on one page it has mis-recognised Amazon servers as Amazon sewers.

PS in passing, I’m quite amazed at how much personal information is made available via public documents associated with the justice system in the US.

PPS in passing, I note this (now closed) consultation from ofgem on mandatory half-hour settlement (which is to say, logging your energy usage every half an hour). FWIW, here’s the ICO’s response.

Ideas for Data Journalism Exercises Number 237 – “Prescribing Cuts”

Via Andy Dickinson’s Media Mill Gazette open data / data journalism newsletter (issue 92), I notice that Croydon Clinical Commissioning Group appears to have taken a decision to stop prescribing specialist baby formula.

Hmmm…

Although hospital prescription data is not typically released as public data (though I wonder, is it FOIable?), which ruled out a quick Sunday morning data dive chasing the weekend newspaper story that Drugs firms are accused of putting cancer patients at risk over price hikes, prescribing data is available for GPs, both as an open data download and via the openprescribing API.

So a wondering for a possible data dive… For GPs in a particular CCG (easy enough to find), could we find prescriptions relating to the baby milk formulas mentioned in the Croydon story (Nutramigen and Neocate) and then see how related prescribing – and costs of prescribing – have changed over the last 12 months?

Yet another thing to add to the “could do this if my time was my own” list…

Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching

One of the reasons I dive into motorsport results and timing data every so often is that it gives me a quite limited set of data to play with. In turn, this means I have to get creative when it comes to reshaping the data to see what visuals I can pull out of it, as generating derived datasets to see what other story forms and insights might be hidden in there.

One of the things I hope to do with the WRC data is push a bit more on automatically generating text-based race reports from the data. Part of the trick here is spotting patterns that can be be mapped onto textual tropes, common sorts of phrase or sentence that you are likely to see in the more vanilla forms of sports reporting. (“X led the race from the start”, “Despite a poor start to the stage, Y went on to win it, N seconds ahead of Z in second place” and so on.)

So how can we spot the patterns? One way is to write a SQL query that detects a particular pattern in the data and uses that to flag a possible event (for example, Detecting Undercuts in F1 Races Using R). Another might be to cast the data as a graph and then detect features using graph based algorithms (eg Identifying Position Change Groupings in Rank Ordered Lists).

During the middle of last night, I woke up wondering whether or not it would be possible to cast simple feature components as symbols and then use a regular expression pattern matcher to identify a particular sort of pattern from a symbolic string. So here’s a quick proof of concept…

From the WRC Monte Carlo 2107 rally, stage 3, some split times and rank positions at each split.

wrc_results_scraper10

Here’s a visual representation of the same (the number labels are rank position at each split, the y-axis is the delta to the fastest time recorded over that split (the “sector time”, if you will, derived data from the original results data).

wrc_results_scraper_and_mentions8

For each driver, you may be able to spot several shapes. For example, Ogier is way behind at the first split, but then gains over the rest of the stage, Kreeke and Breen lose time at the second split, Hanninen loses it on the final part of the stage, and so on. Can we code for these different patterns, and then detect them?

wrc_results_scraper11

So that seems to work okay… Now all I need to do is come up with some suitable symbolic encodings and pattern matching strings…

Hmmm… Vague memories… I wonder if there are any symbolic dynamics algorithms or finite state machine grammar parsers I could make use of?