
Standards or Interoperability?

An interesting piece, as ever, from Tim Davies (Slow down with the standards talk: it’s interoperability & information quality we should focus on) reflecting on the question of whether we need more standards, or better interoperability, in the world of (open) data publishing. Tim also links out to Friedrich Lindenberg’s warnings about 8 things you probably believe about your data standard, which usefully mock some of the claims often casually made about standards adoption.

My own take on many standards in this area is that conventions are the best we can hope for, and that even then they will be interpreted in a variety of ways, which means you have to be forgiving when trying to read them. All manner of monstrosities have been published in the guise of being HTML or RSS, so parsers have had to do the best they could to get the mess into a consistent internal representation on the consumer side of the transaction. Publishers can help by testing that whatever they publish does appear to parse correctly with the current “industry standard” importers, ideally open code libraries. It’s then up to the application developers to decide which parser to use, or whether to write their own.

It’s all very well standardising your data interchange format, but the application developer will then want to work on that data using some other representation in a particular programming language. Even if you have a formal standard interchange format, and publishers stick to it religiously and unambiguously, you will still get different parsers generating internal representations for the application code to work on that are potentially very different, and may even have different semantics. [I probably need to find some examples of that to back up that claim, don’t I?!;-)]

I also look at standards from the point of view of trying to get things done with the tools that are out there. I don’t really care if a geojson feed is strictly conformant with any geojson standard that’s out there, I just need to know that something claimed to be published as geojson works with whatever geojson parser the Leaflet Javascript library uses. I may get frustrated by the various horrors that are published using a CSV suffix, but if I can open them using pandas (a Python programming library), RStudio (an R programming environment) or OpenRefine (a data cleaning application), I can work with them.
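By way of a minimal sketch of what that forgiving reading can look like in practice (the URL is made up, and the encodings to try are just a guess at the usual suspects):

```python
import pandas as pd

# A hypothetical council spending CSV; the URL is illustrative only
url = "http://example.org/spending/2015-01-transactions.csv"

# Published CSVs are often not UTF-8, so try a couple of likely encodings in turn
df = None
for encoding in ("utf-8", "latin-1"):
    try:
        df = pd.read_csv(url, encoding=encoding)
        break
    except UnicodeDecodeError:
        continue

print(df.head())
```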

At the data level, if councils published their spending data using the same columns and the same number, character and data formats for those columns, it would make aggregating those datasets much easier. But even then, different councils may use nominally the same columns differently. Spending area codes, or directorate names, are not necessarily standardised across councils, so just having a spending area code or directorate name column (similarly identified) in each release doesn’t necessarily help.

What is important is that data publishers are consistent in what they publish so that you can start to take into account their own local customs and work around those. Of course, internal consistency is also hard to achieve. Look down any local council spending data transaction log and you’ll find the same company described in several ways (J. Smith, J. Smith Ltd, JOHN SMITH LIMITED, and so on), some of which may match the way the same company is recorded by another council, some of which won’t…
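To give a flavour of the sort of workaround this pushes you towards, here is a minimal sketch of crude name normalisation and approximate matching using just the Python standard library (the payee names are made up):

```python
import difflib
import re

def normalise(name):
    """Crude normalisation: uppercase, strip punctuation and common company suffixes."""
    name = re.sub(r"[^\w\s]", "", name.upper())
    name = re.sub(r"\b(LTD|LIMITED|PLC)\b", "", name)
    return " ".join(name.split())

payees = ["J. Smith", "J Smith Ltd", "JOHN SMITH LIMITED", "Acme Services plc"]
canonical = [normalise(p) for p in payees]

# Find candidate matches for a new payee string against payees we've already seen
query = normalise("John Smith Ltd.")
print(difflib.get_close_matches(query, canonical, n=3, cutoff=0.6))
```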

Stories are told from the Enigma codebreaking days of how the wireless listeners could identify Morse code operators by the cadence and rhythm of their transmissions, as unique to them as any other personal signature (you know that the way you walk identifies you, right?). In open data land, I think I can spot a couple of different people entering transactions into local council spending transaction logs, where the systems aren’t using controlled vocabularies and selection box or dropdown list entry methods, but instead support free text entry… Which is to say: even within a standard data format (a spending transaction schema) published using a conventional (though variously interpreted) document format (CSV) that may be variously encoded (UTF-8, ASCII, Latin-1), the stuff in the data file may be all over the place…

An approach I have been working towards for my own use over the last year or so is to adopt a working environment for data wrangling and analysis based around the Python pandas programming library. It’s then up to me how to represent things internally within that environment, and how to get the data into that representation within that environment. The first challenge is getting the data in, the second getting it into a state where I can start to work with it, the third getting it into a state where I can start to normalise it and then aggregate it and/or combine it with other data sets.

So for example, I started doodling a wrapper for nomis and looking at importers for various development data sets. I have things that call on the Food Standards Agency datasets (and, when I get round to it, their API) and scrape reports from the CQC website, I download and dump Companies House data into a database, and I have various scripts for calling out to various Linked Data endpoints.

Where different publishers use the same identifier schemes, I can trivially merge, join or link the various data elements. For approximate matching, I run ad hoc reconciliation services.
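As a trivial example of the merge case, in pandas terms (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical extracts sharing a company registration number identifier
spending = pd.DataFrame({"company_number": ["01234567", "07654321"],
                         "amount": [1200.00, 350.50]})
companies = pd.DataFrame({"company_number": ["01234567", "07654321"],
                          "name": ["ACME SERVICES LTD", "WIDGETS R US LTD"]})

# Trivial join on the shared identifier
linked = pd.merge(spending, companies, on="company_number", how="left")
print(linked)
```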

All this is to say that the world is messy and standardised things often aren’t. At the end of the day, integration occurs in your application, which is why it can be handy to be able to code a little, so you can whittle and fettle the data you’re presented with into a representation and form that you can work with. Wherever possible, I use libraries that claim to be able to parse particular standards and put the data into representations I can cope with, and where data is published in various formats or standards, I go for the option that I know has library support.

PS I realise this post stumbles all over the stack, from document formats (eg CSV) to data formats (or schemas). But it’s also worth bearing in mind that just because two publishers use the same schema, you won’t necessarily be able to sensibly aggregate the datasets across all the columns (eg in spending data again, some council transaction codes may be parseable and include dates, accession based order numbers and department codes, while others may just be jumbles of numbers). And just because two things have the same name and the same semantics, doesn’t mean the format will be the same (2015-01-15, 15/1/15, 15 Jan 2015, etc etc).
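For that last point, a quick sketch of one way of coping, using the dateutil parser that pandas also builds on (dayfirst being the assumption that ambiguous dates are UK style, day first):

```python
from dateutil import parser

dates = ["2015-01-15", "15/1/15", "15 Jan 2015"]

# dayfirst=True reads 15/1/15 as 15 January 2015, UK style
parsed = [parser.parse(d, dayfirst=True) for d in dates]
print(parsed)  # all three should come back as the same datetime
```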

Problems of Data Quality

One of the advantages of working with sports data, you might have thought, is that official sports results are typically good quality data. With a recent redesign of the Formula One website, the official online (web) source of results is now the FIA website.

As well as publishing timing and classification (results) data in a PDF format intended for consumption by the press, presumably, the FIA also publish “official” results via a web page.

But as I discovered today, using data from a scraper that scrapes results from the “official” web page rather than the official PDF documents is no guarantee that the “official” web page results bear any resemblance at all to the actual result.

[Image: the FIA’s Spanish Grand Prix 2015 qualifying official classification PDF compared with the Session Classifications page on the FIA website]

Yet another sign that the whole F1 circus is exactly that – an enterprise promoted by clowns…

Routine Sources, Court Reporting, the Data Beat and Metadata Journalism

In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:

Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while later work [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], pp. 114-15).

As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending data), or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.

To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
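The textualisation sketches are essentially just templating over the latest figures; something along these lines (the numbers and field names are made up):

```python
# A hypothetical latest-figures record for a local area
latest = {"area": "Anytown", "rate": 2.3, "change": -0.2, "month": "March 2015"}

template = ("Unemployment in {area} {direction} to {rate}% in {month}, "
            "a change of {change} percentage points on the previous month.")

# Pick the verb from the sign of the change, then fill in the template
direction = "fell" if latest["change"] < 0 else "rose"
print(template.format(direction=direction, **latest))
```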

Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
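For the numerical alerting case, a rough sketch of the sort of thing I mean (column names, values and the threshold are all hypothetical):

```python
import pandas as pd

# Hypothetical national dataset: one row per local authority
df = pd.DataFrame({"la_name": ["Anytown", "Otherville", "Somewhere", "Elsewhere"],
                   "indicator": [12.1, 3.4, 27.8, 9.9]})

N = 1  # how near the top or bottom counts as "worth an alert"
df["rank"] = df["indicator"].rank(ascending=False).astype(int)

local = "Somewhere"
row = df[df["la_name"] == local].iloc[0]
if row["rank"] <= N or row["rank"] > len(df) - N:
    print("{} is ranked {} of {} on this indicator - possible story?".format(
        local, row["rank"], len(df)))
```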

In the context of this post, I will be considering the role that metadata about court cases, as contained within court lists and court registers, might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which that metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, and to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view of the activity of the courts than the impression one might get of their behaviour simply from the balance of coverage provided by the media.


Creating Interactive Election Maps Using folium and IPython Notebooks

During the last couple of weeks of Cabinet Office Code Clubs, we’ve started to explore how we can use the python folium library to generate maps. Last week we looked at getting simple markers onto maps along with how to pull data down from a third party API (the Food Standards Agency hygiene ratings), and this week we demonstrated how to use shapefiles.

As a base dataset, I used Chris Hanretty et al.’s election forecasts data as a foil for making use of Westminster parliamentary constituency shapefiles. The dataset gives a forecast of the likelihood of each party winning a particular seat, so within a party we can essentially generate a heat map of how likely that party is to win each seat. So for example, here’s a forecast map for the Labour party:

[Image: map of Westminster constituencies shaded by the forecast likelihood of a Labour win]

Although the election data table doesn’t explicitly say which party has the highest likelihood of winning each seat, we can derive that from the data with a little bit of code to melt the original dataset into a form where a row indicates a constituency and party combination (rather than a single row per constituency, with columns for each party’s forecast), then grouping by constituency, sorting by forecast value and picking the first (highest) value. (Ties will be ignored…)

[Image: notebook screenshot of the code used to reshape the election forecast data]
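The reshaping step looks something like the following (the column names are guesses at the shape of the forecast table rather than necessarily the actual ones):

```python
import pandas as pd

# Hypothetical wide format forecast table: one row per constituency
forecasts = pd.DataFrame({"constituency": ["Seat A", "Seat B"],
                          "Lab": [0.55, 0.20],
                          "Con": [0.40, 0.70],
                          "LD": [0.05, 0.10]})

# Melt to long format: one row per constituency and party combination
long_form = pd.melt(forecasts, id_vars="constituency",
                    var_name="party", value_name="forecast")

# For each constituency, keep the party with the highest forecast likelihood
likely_winners = (long_form.sort_values("forecast", ascending=False)
                           .groupby("constituency", as_index=False)
                           .first())
print(likely_winners)
```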

We can then generate a map based on the discrete categorical values of which party has the highest forecast likelihood of taking each seat.

[Image: map of Westminster constituencies coloured by the party with the highest forecast likelihood of winning each seat]

An IPython notebook showing how to generate the maps can be found here: how to use shapefiles.

One problem with this sort of mapping technique for the election forecast data is that the areas we see coloured are representative of geographical area, not population size. Indeed, the population of each constituency is roughly similar, so our impression that the country is significantly blue is skewed by the relative areas of the forecast blue seats compared to the forecast red ones, for example.

Ways round this are to use cartograms, or regularly sized hexagonal boundaries, such as those described on Benjamin Hennig’s Views of the World website, from which the following image is republished (see also the University of Sheffield’s (old) Social and Spatial Inequalities Research Group election mapping project website):

[Image: UK 2010 general election results shown on a conventional constituency map and a hexagonal cartogram, for comparison]

(A hexagonal constituency KML file, coloured by 2010 results, and corresponding to constituencies defined for that election, can be found from this post.)

From Front Running Algorithms to Bot Fraud… Or How We’ve Lost Control of the Bits…

I’ve just finished reading Michael Lewis’ Flash Boys, a cracking read about algorithmic high frequency trading and how the code and communication systems that contribute to the way stock exchanges operate can be gamed by front-running bots. (For an earlier take, see also Scott Patterson’s Dark Pools; for more “official” takes, see things like the SEC’s regulatory ideas response to the flash crash of May 6, 2010, an SEC literature review on high frequency trading, or this Congressional Research Service report on High-Frequency Trading: Background, Concerns, and Regulatory Developments).

As the book describes, some of the strategies pursued by the HFT traders were made possible because of the way the code underlying the system was constructed. As Lessig pointed out way back when in Code and Other Laws of Cyberspace, and revisited in Codev2:

There is regulation of behavior on the Internet and in cyberspace, but that regulation is imposed primarily through code. The differences in the regulations effected through code distinguish different parts of the Internet and cyberspace. In some places, life is fairly free; in other places, it is more controlled. And the difference between these spaces is simply a difference in the architectures of control — that is, a difference in code.

The regulation imposed on the interconnected markets by code was gameable. Indeed, it seems that it could be argued that it was even designed to be gameable…

Another area in which the bots are gaming code structures is digital advertising. A highly amusing situation is described in the following graphic, taken from The Bot Baseline: Fraud in Digital Advertising (via http://www.ana.net/content/show/id/botfraud):

[Image: graphic from the ANA/White Ops report, The Bot Baseline: Fraud in Digital Advertising]

A phantom layer of “ad laundering” fake websites whose traffic comes largely from bots is used to generate ad-impression revenue. (Compare this with networks of bots on social media networks that connect to each other, send each other messages, and so on, to build up “authentic” profiles of themselves, at least in terms of traffic usage dynamics. Examples: MIT Technology Review on Fake Persuaders; or this preprint on The Rise of Social Bots.)

As the world becomes more connected and more and more markets become exercises simply in bit exchange between bots, I suspect we’ll be seeing more and more of these phantom layer/bot audience combinations on the one hand, and high-speed, market stealing, front running algorithms on the other.

PS Not quite related, but anyway: how you’re being auctioned in realtime whenever you visit a website that carries ads – The Curse of Our Time – Tracking, Tracking Everywhere.

PPS Interesting example of bots reading the business wires and trading on the back of them: The Wolf of Wall Tweet: A Web-reading bot made millions on the options market.

Data Journalism in Practice

For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.

Here’s a quick debrief-to-self of some of the things that came to mind…

There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and to contact each one that was going to be named to give them a right of reply. Contacting a couple of dozen locations takes time and diplomacy (and even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.

Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment, or to national, regional or neighbouring locale averages. For a data junkie, it can be easy to count things by group while overlooking the journalistic take that many of these counts could be used as the basis of a quick filler story, or a space-filling, info-snack glanceable breakout box in a larger story.
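In pandas terms, that kind of filler fact is often just a groupby away; a sketch, with made-up data and column names loosely modelled on the FSA ratings release:

```python
import pandas as pd

# Hypothetical slice of food hygiene ratings data
df = pd.DataFrame({"BusinessType": ["School", "School", "Restaurant", "Restaurant"],
                   "RatingValue": [5, 4, 5, 2]})

# Percentage of establishments in each category attaining the top (5) rating
pct_top = df.groupby("BusinessType")["RatingValue"].apply(
    lambda ratings: 100.0 * (ratings == 5).mean())
print(pct_top)
```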

Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic…) Related to this is the “so what?” question… (I guess for news, if you wouldn’t share it in the pub or over dinner having read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? Hmm… is “Liking” something the same as remarking on it? I get the feeling it’s less engaged…)

There’s a huge difference between the tinkering I do and production warez

I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things that can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook (a minimal sketch of the sort of map I mean appears after this list), but there are several problems with this:

  1. to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
  2. there is a certain “distance” between producing a map as a single HTML file and getting the map actually published. For example, the HTML page pulls in all manner of third party files (javascript, css, image tiles, marker-icon/css-sprite image files) and so on, which means working out whether (and if so, where) to host these various resources on a local production server so as not to inappropriately draw them down from third party servers.
  3. there isn’t much time available… so you need to think about what to offer. For example:
    • the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
    • there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.

    Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one-off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may be to iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, the first step was to get a marker-displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)

    This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…

    The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could add something to the telling of a story (each of which may require a particular set of skills), but also the whole raft of production related issues that then result (which could require a whole raft of other technical skills – skills I know that I don’t really have, for example, even given my quirky skillset…). And if the corporate IT folk take ownership of the publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.
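For reference, the quick folium sketch referred to above looks something like this (names, ratings and coordinates are made up; treat the details as a sketch rather than a production recipe):

```python
import folium

# Made-up inspection results; names, ratings and coordinates are illustrative only
places = [("Cafe A", 53.995, -1.540, "Rating: 5"),
          ("Takeaway B", 53.990, -1.545, "Rating: 1")]

# Centre the map roughly on Harrogate
m = folium.Map(location=[53.99, -1.54], zoom_start=13)

# Simple markers with popups - no custom colours, icons or faceted navigation
for name, lat, lon, rating in places:
    folium.Marker([lat, lon], popup="{} ({})".format(name, rating)).add_to(m)

# A single HTML file that still pulls in third party javascript, css and map tiles
m.save("food_ratings_map.html")
```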

Charts
I tend to use ggplot a lot in R for exploring datasets graphically, rather than for producing presentation graphics to support the telling of a particular story. Add to that the fact that I’m still not totally up to speed on charting in the python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (if space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example).

Some things I didn’t consider but that now come to mind are:

  1. how are charts practically handed over? (As Excel charts? as image files?)
  2. does a sub-editor or web-developer then process the charts somehow?
  3. for print, are there limitations on use of colour, line thickness, font-size and style?

Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:

  • text
  • data tables
  • charts
  • maps

Many code based workflows now allow you to “style” outputs in the same way you can style web pages (eg the CSS Zen Garden sites are all visually distinct but have exactly the same content – just the style is changed; thinks: data zen garden… hmmm… and related: chart redesigns…). For example, in the python environment, ggplot or Seaborn charts can be styled visually using themes to generate charts that can be saved as image files, or converted to interactive web charts (using eg mpld3, which converts base matplotlib charts (which ggplot and seaborn generate) to d3js interactive charts); alternatively, libraries such as pandas highcharts (or, in the R context, rCharts) let you generate interactive charts using well-developed javascript chart libraries.
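As a small illustration of the theme-it-then-hand-it-over idea (made-up data; assuming matplotlib and seaborn are installed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")  # change the look without touching the plotting code

fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], [3, 7, 5])
ax.set_title("Made-up counts by category")

# Hand over as a static image for print/web...
fig.savefig("chart.png", dpi=150)
# ...or convert to an interactive version with something like mpld3.fig_to_html(fig)
```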

If you want data tables, there are various libraries or tools for styling those too, but again the question of workflow, and the actual form in which items are handed over for print or web publication, needs to be considered.

Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score), taking it as-is as a trusted fact, given it was published by a trusted authoritative source, obtained directly from that source, and not processed locally, but then asking the manager of that establishment for a comment about how that score came about or what action they have taken as a result of getting it.

I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!

Human Factors
One of the things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage, which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)

Overall
Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)

Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.

Software Apps As Independent, Free Running, Self-Contained Services

The buzz phrase for elements of this (I think?) is microservices or microservice architecture (“a particular way of designing software applications as suites of independently deployable services”, [ref.]) but the idea of being able to run apps anywhere (yes, really, again…!;-) seems to have been revitalised by the recent excitement around, and rapid pace of development of, docker containers.

Essentially, docker containers are isolated/independent containers that can be run in a single virtual machine. Containers can also be linked together within the same VM so that they can talk to each other and yet remain isolated from the other containers running there. Containers can also expose services to the outside world.

In my head, this is what I think various bits and pieces of it look like…

[Image: sketch of how the various docker pieces fit together]

A couple of recent announcements from docker suggest to me at least one direction of travel that could be interesting for delivering distance education and remote and face-to-face training:

  • docker compose (fig, as was) – “with Compose, you define your application’s components – their containers, their configuration, links, volumes, and so on – in a single file, then you can spin everything up with a single command that does everything that needs to be done to get your application running.”
  • docker machine – “a tool that makes it really easy to go from ‘zero to Docker’. Machine creates Docker Engines on your computer, on cloud providers, and/or in your data center, and then configures the Docker client to securely talk to them.” [Like boot2docker, but supports cloud as well?]
  • Kitematic UI – “Kitematic completely automates the Docker installation and setup process and provides an intuitive graphical user interface (GUI) for running Docker containers on the Mac.” [Windows version coming soon]

I don’t think there is GUI support for configuration management provided out of docker directly, but presumably if they don’t buy up something like panamax they’ll be releasing their own version of something similar at some point soon?!

(With the data course currently in meltdown, I’m tempted to add a bit more to the confusion by suggesting we drop the monolithic VM approach and instead go for a containerised approach, which feels far more elegant to me… It seems to me that with a little bit of imagination, we could come up with a whole new way of supporting software delivery to students: eg an OU docker hub with an app container for each app we make available to students, container compositions for individual courses, a ‘starter kit’ DVD (like the old OLA CD-ROM) with a local docker hub to get folk up and running without big downloads, etc etc.) It’s unlikely to happen of course – innovation seems to be too risky nowadays, despite the rhetoric…:-(

As well as being able to run docker containers locally or in the cloud, I also wonder about ‘plug and play’ free running containers that run on a wifi enabled Raspberry Pi that you can grab off the shelf, switch on, and immediately connect to? So for example, a couple of weeks ago Wolfram and the Raspberry Pi Foundation announced the Wolfram Language and Mathematica on Raspberry Pi, for free [Wolfram’s Raspberry Pi pages]. There are also crib sheets for how to run docker on a Raspberry Pi (the downside of this being that you need ARM based images rather than x86 ones), which could be interesting?

So pushing the thought a bit further, for the mythical submariner student who isn’t allowed to install software onto their work computer, could we give them a Raspberry Pi running their OU course software as service they could remotely connect to?!

PS by the by, at the Cabinet Office Code Club I help run for Open Knowledge last week, we had an issue with folk not being able to run OpenRefine properly on their machines. Fortunately, I’d fired up a couple of OpenRefine containers on a cloud host so we could still run the session as planned…