Category: Anything you want

I So Want to Try a Docker/Kitematic ContainerBook…

So it seems that Chrome OS joins forces with VMware to accelerate the adoption of Chromebooks in the enterprise.

From a quick skim, it seems as if VMware’s Workspace ONE product, which is at the heart of the announcement, provides a secure online environment for launching managed, personally contextualised, virtualised services.

Which is just pushing more and more stuff to the web, and more and more requiring that always-on network connection.

What I keep thinking I’d like to have is a containerbook, rather than netbook. Think: Docker + Kitematic + Docker Compose + a browser.

Kitematic UI

Services/apps in the container(s) then either run as headless/machine accessed services, or expose an HTML UI accessed via a browser.

Which reminds me: Kitematic still doesn’t support Docker Compose, does it? (Is Panamax still a thing in this regard?)

PS another take would be a browser that had VirtualBox built in that could be used to run containers, or could otherwise access desktop virtualisation. This could all get a bit messy though… Cf. also things like Windows 10 S, which won’t run Chrome, or the Chrome OS requirement to use the browser that is the OS, rather than installing your own browser – such as Microsoft Edge, for example.

Visualising WRC Rally Stages With Relive?

A few days ago, via Techcrunch, I came across the Relive application, which visualises GPS traces of bike rides as 3D, Google Earth style animations built from a range of map data sources.

Data is uploaded using GPX, TCX, or FIT formatted data – all of which are new to me. Standard KML uploads don’t work – time stamps are required for each waypoint.
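As a quick sanity check before uploading, something like the following sketch could count the track points in a GPX file that lack a time stamp. It assumes a GPX 1.1 file (the namespace below) with points stored as trkpt elements; whether Relive applies exactly this test is a guess on my part, and the sample trace is made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by GPX 1.1 documents
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def trackpoints_missing_time(gpx_text):
    """Count the track points that have no <time> child element."""
    root = ET.fromstring(gpx_text)
    points = root.findall(".//gpx:trkpt", GPX_NS)
    return sum(1 for pt in points if pt.find("gpx:time", GPX_NS) is None)

# Made-up two point trace: the second point has no time stamp
sample = """<?xml version="1.0"?>
<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="50.70" lon="-1.29"><time>2017-05-01T10:00:00Z</time></trkpt>
    <trkpt lat="50.71" lon="-1.30"></trkpt>
  </trkseg></trk>
</gpx>"""

print(trackpoints_missing_time(sample))
```

A non-zero count suggests the trace would need time stamps adding (or interpolating) before it could be animated.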

Along the route, photographic waypoints can be added to illustrate the journey, which got me thinking: this could be a really neat addition to the Rally-maps.com website, annotating stage maps after a race with:

  • photographs from various locations on the stage;
  • images at each split point showing the leaderboard and time splits from each stage;
  • pace info, showing the relative pace across each stage, perhaps captured from a reconnaissance vehicle or zero car.

Alternatively, it might be something that the WRC – or Red Bull TV, who are providing online and TV coverage of this year’s rallies – could publish?

And if they want to borrow some of my WRC chart styles for waypoint images, I’m sure something could be arranged:-)

Local Election Fragments

Reusing stuff from before, a notebook with code to scrape Local Election Notice of Poll PDFs. Includes scripts for geocoding addresses, trying to find whether candidates live in ward or out of ward, searches for possible directorships of locally registered companies amongst the candidates:

[https://gist.github.com/psychemedia/f611f36dbdae5e744a434216690d6c47]

Other things that come to mind, with a bit more data:

  • is a candidate standing for re-election?
  • has a candidate stood previously (and for which party), and/or previously been a councillor?
  • how might committee membership change if a councillor loses their seat?
  • which seats are vulnerable based on previous voting numbers?
  • what are the demographics of council wards?

Tracking down Data Files Associated With Parliamentary Business

One of the ways of finding data-related files scattered around an organisation’s website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx for an Excel spreadsheet (csv and xls are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx site:parliament.uk and then opt to display the omitted results:

Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI responses on WhatDoTheyKnow as an “as if” open datastore).

Looking at the URLs, we see that data-containing files are strewn about the online Parliamentary estate (that is, the website;-)…
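If you find yourself running this sort of filetype-limited search a lot, the query string is easy enough to generate. Here’s a minimal sketch; the function names are my own invention, and the Google search URL is just one example of a search engine endpoint that accepts a q parameter.

```python
from urllib.parse import urlencode

def filetype_query(site, filetypes):
    """Build a query such as 'filetype:xlsx site:parliament.uk'."""
    types = " OR ".join("filetype:{}".format(ft) for ft in filetypes)
    # Bracket the alternatives if there is more than one filetype
    if len(filetypes) > 1:
        types = "({})".format(types)
    return "{} site:{}".format(types, site)

def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query and attach it to a search engine URL."""
    return "{}?{}".format(base, urlencode({"q": query}))

q = filetype_query("parliament.uk", ["xlsx"])
print(q)
print(search_url(q))
print(filetype_query("parliament.uk", ["xls", "csv", "xlsx"]))
```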

Freedom of Information Related Datasets

Parliament seems to be quite open in the way it handles its FOI responses, publishing disclosure logs and releasing datafile attachments rooted on https://www.parliament.uk/documents/foi/:

Written Questions

Responses to Written Questions often come with datafile attachments.

These files are posted to the subdomain http://qna.files.parliament.uk/qna-attachments.

Given the numeric key for a particular question, we can run a query on the Written Answers API to find details about the attachment:

Looking at the actual URL, something like http://qna.files.parliament.uk/qna-attachments/454264/original/28152%20-%20table.xlsx, it looks as if some guesswork is required to generate the URL from the data contained in the API response? (For example, how might original attachments be distinguished from other attachments, such as “revised” ones, maybe?)
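By way of illustration, here’s a sketch of the sort of guesswork involved, pieced together from the example URL pattern alone. The parameter names are hypothetical, as is the assumption that the numeric path element and the filename can both be pulled out of the API response; the “original” path element is simply copied from the example.

```python
from urllib.parse import quote

QNA_ROOT = "http://qna.files.parliament.uk/qna-attachments"

def guess_attachment_url(attachment_id, filename, variant="original"):
    """Guess an attachment URL as root/id/variant/url-encoded-filename."""
    # quote() percent-encodes the spaces in the filename as %20
    return "{}/{}/{}/{}".format(QNA_ROOT, attachment_id, variant, quote(filename))

print(guess_attachment_url(454264, "28152 - table.xlsx"))
```

Run against the values in the example above, this regenerates the same URL, but that says nothing about whether the pattern holds more generally.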

Written Statements

Written statements often come with one or more data file attachments.

The data files also appear on the http://qna.files.parliament.uk/ subdomain although it looks like they’re on a different path to the answered question attachments (http://qna.files.parliament.uk/ws-attachments compared to http://qna.files.parliament.uk/qna-attachments). This subdomain doesn’t appear to have the data files indexed and searchable on Google? I don’t see a Written Statements API on http://explore.data.parliament.uk/ either?

Deposited Papers

Deposited papers often include supporting documents, including spreadsheets.

Files are located under http://data.parliament.uk/DepositedPapers/Files/:

At the current time there is no API search over deposited papers.

Committee Papers

A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/ and can be found from committee written evidence web (HTML) pages rooted on the same path (example).

A web search for site:parliament.uk inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.

Parliamentary Research Briefings

Research briefings are published by Commons and Lords Libraries, and may include additional documents.

Briefings may be published along with supporting documents, including spreadsheets:

The files are published under the following subdomain and path:  http://researchbriefings.files.parliament.uk/.

The file attachments URLs can be found via the Research Briefings API.

This response is a cut down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the URI _about which the “naturally” identified resource (e.g. SN06643) is described.

Summary

Data files can be found variously around the Parliamentary website, including down the following paths:

(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)

What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?
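In the absence of such a search limit, a do-it-yourself version is to test pages for data file links after the fact. The following is a rough sketch using the Python standard library; the class and function names are mine, and the sample page and its example.xlsx link are made up for the purposes of the demo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File extensions we are treating as "data-y"
DATA_EXTENSIONS = (".xlsx", ".xls", ".csv")

class DataLinkFinder(HTMLParser):
    """Collect links whose path ends with a data-y file extension."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(DATA_EXTENSIONS):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))

def data_links(html, base_url):
    finder = DataLinkFinder(base_url)
    finder.feed(html)
    return finder.links

# Made-up page fragment
page = '<p><a href="/documents/foi/example.xlsx">Data</a> <a href="/about.html">About</a></p>'
print(data_links(page, "https://www.parliament.uk/"))
```

A page for which this returns a non-empty list is one that links to a data file, so its text could then be indexed or searched as a proxy for what the data is about.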

Data Cleaning – Finding Near Matches in Names

In the post What Nationality Did You Say You Were, Again? I showed how we could use the fuzzyset python library to try to reconcile user supplied nationalities entered via a free text entry form to “preferred” nationalities listed in the FCO Register of Country Names.

Here’s another quick example of how to use fuzzyset to help clean a list of names, possibly repeated, that may include near misses or partial matches.

import pandas as pd
names=['S. Smith', 'John Smith','James Brown','John Brown','T. Smith','John Brown']
df=pd.DataFrame({'name':names})

# Set the thresh value (0..1) to tweak match strength
thresh=0.8

import fuzzyset
names=df['name'].tolist()

cleaner = fuzzyset.FuzzySet()
collisions=[]
for name in names:
    maybe=cleaner.get(name)
    # If there is a possible match, get a list of tuples back: (score, matchstring)
    # The following line finds the highest match score
    m=max(maybe,key=lambda item:item[0]) if maybe is not None else (-1,'')
    # If there is no match, or the max match score is below the threshold value,
    if not maybe or m[0] < thresh:
        # assume that it's not a match and add name to list of "clean" names…
        cleaner.add(name)
    elif m[0] >= thresh:
        # But if there is a possible match, print a warning
        txt='assuming {} is a match with {} ({}) so not adding'.format(name,m[1],m[0])
        print(txt)
        # and add the name to a collisions list
        collisions.append((name,m))

#Now generate a simple report
print('------\n\n# Cleaning Report:\n\n## Match Set:\n{}\n\n------\n\n## Collisions:\n{}'.format(cleaner.exact_set, collisions))

The report looks something like this:

Sometimes, you may want to be alerted to exact matches; for example, if you are expecting the values to all be unique.

However, at other times, you may be happy to ignore duplicates, in which case you might consider dropping them from the names list. One way to do this is to convert the names list to a set, and back again, names=list(set(names)), although this changes the list order.

Alternatively, from the pandas dataframe column, just take unique values: names=df['name'].unique().tolist().
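If you don’t have the names in a dataframe, but still want to keep the original list order, one further option (in Python 3.7 or later, where dictionaries remember insertion order) is to go via dictionary keys rather than a set:

```python
names = ['S. Smith', 'John Smith', 'James Brown', 'John Brown', 'T. Smith', 'John Brown']

# Dictionary keys are unique and keep their first-seen order
deduped = list(dict.fromkeys(names))
print(deduped)
# ['S. Smith', 'John Smith', 'James Brown', 'John Brown', 'T. Smith']
```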

You may also want to know how many times duplicates (exact matches) occur. In such a case, we can list items that appear at least twice in the names list using the pandas dataframe value_counts() method:

#Get a count of the number of times each value occurs in a column, along with the value
vals=df['name'].value_counts()
#Select items where the value count is greater than one
vals[vals > 1]
#John Brown    2

PS Another way of detecting, and correcting, near matches is to use an application such as OpenRefine, in particular its clustering tool:
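To give a flavour of how a key collision style of clustering works, here’s a much simplified sketch of a fingerprint function: names that reduce to the same key are grouped together. This is my own cut down approximation rather than OpenRefine’s actual implementation (which does rather more normalisation), and the example names are made up.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Lowercase, strip punctuation, then sort and de-duplicate the tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values by fingerprint, keeping groups with more than one variant."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

print(cluster(['John Smith', 'Smith, John', 'john smith', 'James Brown']))
```

Unlike the fuzzyset approach, this only catches variants that differ in case, punctuation or word order; it won’t spot misspellings.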

Talking to Developers and Civic Hackers on Their Own Terms…

Looking at the (new to me) Lords Amendments  website yesterday, I wondered whether the search was being fed by an API call, or whether an API is available elsewhere to the underlying data. (An API is available – find it via explore.data.parliament.uk.)

There are a couple of ways of doing this. One way is to “View Source” (in Chrome, View -> Developer -> View Source), because as everybody *should* know, you can inspect the code running in your browser; another is to use Developer Tools (from the same browser menu) to look at the browser network activity and see what URLs are called when a new selection is made on the web page (the data has to come from somewhere, right? And again, you can look at this if you want to).

Anyway, it struck me that most folk don’t tend to use these tools, but those who do are probably interested in something you’re doing – either how the page was constructed to give a particular effect, or where the data is coming from. If you’re building screenscrapers, you’d typically look to the source too.

So if you’re trying to engage with developers, why not leave them messages where they’re likely to look? For example, if you want to promote an API, or perhaps if you’re recruiting. Which reminded me that the Guardian used to have an open developer recruitment ad running in their webpage source. Indeed, they still do:

So if your page is API powered somewhere along the line, and you want to promote the API, why not pop a message at the top of the page source?
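And from the other side of the fence, if you’re the one doing the scraping, it’s easy enough to surface any such messages. The sketch below pulls the HTML comments out of a page’s source using the Python standard library; the sample page and the wording of the comment in it are made up.

```python
from html.parser import HTMLParser

class CommentFinder(HTMLParser):
    """Collect the text of any HTML comments in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())

def source_comments(html):
    finder = CommentFinder()
    finder.feed(html)
    return finder.comments

# Made-up page with a message for developers at the top of the source
page = "<!-- Like APIs? See explore.data.parliament.uk --><html><body><p>Hi</p></body></html>"
print(source_comments(page))
```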

Or, as I learned from James Bridle (he of The New Aesthetic; you do follow that photoblog, right?), one of the most thought provoking artists around at the moment (I hesitate to say “digital artist” because that’s still an artist, right… erm…. (data) journalism… erm…  hypocrite…), why not use the console too?

James even provides a script to help…. welcome.js.

PS for a recent example of James’ work, which also invokes the idea of magic-related computing metaphors (cf. here, for example), see this recent interview: Meet the Artist Using Ritual Magic to Trap Self-Driving Cars.

PPS This has got me wondering whether we could actually deliver a “just below the surface” uncourse or training through HTML source, console messages and Javascript comments. Documented code with a view to teaching how to get the most out of an API, or how to do webdesign. The medium as the educational message. See also: Search Engine Powered Courses…

Tinkering With Parliament Data APIs: Commons Written Questions And Parliamentary Written Answers

So…. inspired by @philbgorman, I had a quick play last night with Parliament Written Questions data, putting together a recipe (output) for plotting a Sankey diagram showing the flow of questions from Members of the House of Commons by Party to various Answering Bodies for a particular parliamentary session.
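The data wrangling behind that sort of Sankey diagram boils down to counting (source, target) pairs. Here’s a minimal sketch of that step, rather than the recipe itself; the party and answering_body field names, and the sample records, are placeholders of mine and not the names used in the API response.

```python
from collections import Counter

def flow_counts(questions):
    """Count the questions flowing from each party to each answering body."""
    return Counter((q["party"], q["answering_body"]) for q in questions)

# Placeholder records, one per question
questions = [
    {"party": "Party A", "answering_body": "Department X"},
    {"party": "Party A", "answering_body": "Department X"},
    {"party": "Party A", "answering_body": "Department Y"},
    {"party": "Party B", "answering_body": "Department X"},
]

flows = flow_counts(questions)
for (party, body), count in flows.most_common():
    print(party, "->", body, count)
```

Each (party, answering body, count) triple then corresponds to one link in the Sankey diagram, with the count setting the link width.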

The response that comes back from the Written Questions API includes a question uin (unique identification number?). If you faff around with date settings on the Parliamentary Questions web page you can search for a question by this ID:

Here’s an example of the response from a bulk download of questions (by 2015/16 session) from the Commons Written Questions API, deep filtered by the uin:

If you tweak the _about URI, which I think refers to details about the question, you get the following sort of response, built around a numeric identifier (447753 in this case):

There’s no statement of the actual answer text in that response, although there is a reference to an answer resource, again keyed by the same numeric key:

The numeric key from the _about identifier is also used with both the Commons Written Questions API and the Parliamentary Questions Answered API.

For example, questions:

And answers:

The uin values can’t be used with either of these APIs, though?

PS I know, I know, the idea is that we just follow resource links (but they’re broken, right? the leading lda. is missing from the http identifiers), but sometimes it’s just as easy to take a unique fragment of the URI (like the numeric key) and then just drop it into the appropriate context when you want it. In this case, contexts are

IMHO, anyway… ;-)
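For what it’s worth, here’s a sketch of both approaches: grabbing the numeric key from the end of an _about style URI, and patching the missing lda. prefix onto the identifier’s hostname. The function names are mine, and the shape of the example URI is an assumption on my part rather than something copied from an API response.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def numeric_key(uri):
    """Return the trailing numeric path element of a URI, if there is one."""
    match = re.search(r"/(\d+)/?$", urlsplit(uri).path)
    return match.group(1) if match else None

def add_lda_prefix(uri):
    """Add a leading 'lda.' to the hostname, unless it is already there."""
    parts = urlsplit(uri)
    if parts.netloc.startswith("lda."):
        return uri
    return urlunsplit(parts._replace(netloc="lda." + parts.netloc))

# Assumed shape for an _about identifier
about = "http://data.parliament.uk/resources/447753"
print(numeric_key(about))
print(add_lda_prefix(about))
```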

PPS for a full list of APIs, see explore.data.parliament.uk