Category: Open Data

Bands Incorporated

A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on how the original search is seeded, the maps may also include elements of band members’ own personal holdings/interests. The following map, for example, is seeded just from the Iron Maiden LLP company number:

iron_maiden

If you know anything about the band, you’ll know Bruce Dickinson’s aircraft interests make complete sense…

That graph is actually a bipartite graph – nodes are either directors or companies. We can easily generate a projection of the graph that replaces the directors linking companies with edges representing “common director” links between companies:

ireonmaiden2.png

(The edges are actually weighted, so the greater the edge weight, the more directors there are in common between the linked companies.)
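If you want to have a go at that projection step yourself, the networkx package will do the heavy lifting; here’s a minimal sketch, with made up node names, of projecting a director/company bipartite graph onto the companies:

import networkx as nx
from networkx.algorithms import bipartite

#Toy bipartite graph: company nodes and director nodes (names purely illustrative)
B = nx.Graph()
B.add_nodes_from(['Company A', 'Company B', 'Company C'], bipartite=0)
B.add_nodes_from(['Director 1', 'Director 2'], bipartite=1)
B.add_edges_from([('Company A', 'Director 1'), ('Company B', 'Director 1'),
                  ('Company A', 'Director 2'), ('Company B', 'Director 2'),
                  ('Company C', 'Director 2')])

#Project onto the company nodes; edge weights count the directors two companies share
companies = [n for n, d in B.nodes(data=True) if d['bipartite'] == 0]
G = bipartite.weighted_projected_graph(B, companies)
print(G.edges(data=True))
#e.g. ('Company A', 'Company B', {'weight': 2}) - two directors in common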

In today’s Guardian, I notice they’re running a story about Radiohead’s company structure, with a parallel online piece, Radiohead’s corporate empire: inside the band’s dollars and cents, which shows how to get a story out of such a map, as well as how to re-present the original raw map to give it a bit more spatial semantic structure:

Radiohead_s_corporate_empire__inside_the_band_s_dollars_and_cents___Music___The_Guardian

(The story also digs into the financial reports from some of the companies.)

By way of comparison, here’s my raw map of Radiohead’s current company structure, generated from Companies House data seeded on the company number for Radiohead Trademark:

radiohead

It’s easy enough to grab the data for other bands. So how about someone like The Who? If we look in the immediate vicinity of The Who Group, we see core interests:

who1

But if we look for linkage to the next level of co-director links, we start to see other corporate groups that hold at least one shared interest with the band members:

who2

So what other bands incorporated in the UK might be worth mapping?

Want to Get Started With Open Data? Looking for an Introductory Programming Course?

Want to learn to code but never got round to it? The next presentation of OUr FutureLearn course Learn to Code for Data Analysis will teach you how to write your own programme code, a line at a time, to analyse real open data datasets. It starts on 6 June, 2016, runs for 4 weeks, and takes about 5 hours per week.

I’ve often thought that there are several obstacles to getting started with programming. Firstly, there’s the rationale or context: why bother/what could I possibly use programming for? Secondly, there are the practical difficulties: to write and execute programmes, you need to get a programming environment set up. Thirdly, there’s the so what: “okay, so I can programme now, but how do I use this in the real world?”

Many introductory programming courses reuse educational methods and motivational techniques or contexts developed to teach children (and often very young children) the basics of computer programming to set the scene: programming a “turtle” that can drive around the screen, for example, or garishly coloured visual programming environments that let you plug logical blocks together as if they were computational Lego. Great fun, and one way of demonstrating some of the programming principles common to all programming languages, but they don’t necessarily set you up for seeing how such techniques might be directly relevant to an IT problem or issue you face in your daily life. And it can be hard to see how you might use such environments or techniques at work to help you perform real tasks… (Because programmes can actually be good at that – automating the repetitive and working through large amounts of stuff on your behalf.) At the other extreme are professional programming environments, like geekily bloated versions of Microsoft Word or Excel, with confusing preference setups and menus and settings all over the place. And designed by hardcore programmers for hardcore programmers.

So the approach we’ve taken in the OU FutureLearn course Learn to Code for Data Analysis is slightly different to that.

The course uses a notebook style programming environment that blends text, programme code, and the outputs of running that code (such as charts and tables) in a single, editable web page accessed via your web browser.

Learn_to_Code_-_SageMathCloud

To motivate your learning, we use real world, openly licensed data sets from organisations such as the World Bank and the United Nations – data you can download and access for yourself – that you can analyse and chart using your own programme code. A line at a time. Because each line does its own thing, each line is useful, and you can see what each line does to your dataset directly.

So that’s the rationale: learn to code so you can work with data (and that includes datasets much larger than you can load into Excel…)
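To give a flavour of that line-at-a-time style (this isn’t actual course material, and the file and column names below are made up), a few notebook lines using the pandas package might look something like this:

import pandas as pd

#Load an openly licensed dataset previously downloaded as a CSV file (filename is illustrative)
df = pd.read_csv('world_bank_population.csv')
#Peek at the first few rows to see what we are working with
df.head()
#Keep just the rows for one year
recent = df[df['Year'] == 2014]
#Chart the ten largest values (assumes matplotlib is available in the notebook)
recent.sort_values('Population', ascending=False).head(10).plot(x='Country', y='Population', kind='bar')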

The practicalities of setting up the notebook environment still have to be negotiated, of course. But we try to help you there too. If you want to download and install the programming environment on your computer, you can do so, using the freely available Anaconda Scientific Computing Python Distribution. Or you can access an online version of the notebook-based programming environment via SageMathCloud and do all your programming online, through your browser.

So that’s the practical issues hopefully sorted.

But what about the “so what”? Well, the language you’ll be learning is Python, a widely used programming language that makes it ridiculously easy to do powerful things.

Python cartoon - via https://xkcd.com/353/

But not that easy, perhaps..?!

The environment you’ll be using – Jupyter notebooks – is also a “real world” technology, one that began life as an open source platform for scientific computing but is increasingly being used by journalists (data journalism, anyone?) and educators. It’s also attracted the attention of business, with companies such as IBM supporting the development of a range of interactive dashboard tools and backend service hooks that allow programmes written using the notebooks to be deployed as standalone online interactive dashboards.

The course won’t take you quite that far, but it will get you started, safe in the knowledge that whatever you learn, as well as the environment you’re learning in, can be used directly to support your own data analysis activities at work, or at home as a civically minded open data armchair analyst.

So what are you waiting for? Sign up now and I’ll see you in the comments:-)

Trawling the Companies House API to Generate Co-Director Networks

Somewhen ago (it’s always somewhen ago; most of the world never seems to catch up with what’s already happened! :-() I started dabbling with the OpenCorporates API to generate co-director corporate maps that showed companies linked by multiple directors. It must have been a bad idea because no-one could see any point in it, not even interestingness…  (Which suggests to me that boards made up of directors are similarly meaningless? In which case, how are companies supposed to hold themselves to account?)

I tend to disagree. If I hadn’t been looking at connected companies around food processing firms, I would never have learned that one way meat processors cope with animal fat waste is to feed it into the biodiesel raw material supply chain.

Anyway, if we ever get to see a beneficial ownership register, a similar approach should work to generate maps showing how companies sharing beneficial owners are linked. (The same approach also drives my emergent social positioning Twitter maps and the Wikipedia semantic maps I posted about again recently.)

As a possible precursor to that, I thought I’d try to reimplement the code (in part to see if a better approach came to mind) using data grabbed directly from Companies House via their API. I’d already started dabbling with the API (Chat Sketches with the Companies House API) so it didn’t take much more to get a grapher going…

But first, I realise in that earlier post I’d missed the function for actually calling the API – so here it is:

import urllib2, base64, json
from urllib import urlencode
from urllib2 import HTTPError
from time import sleep

def url_nice_req(url,t=300):
    #Play nice with the rate limited API - back off and retry if we get a 429
    try:
        return urllib2.urlopen(url)
    except HTTPError as e:
        if e.code == 429:
            print("Overloaded API, resting for a bit...")
            sleep(t)
            return url_nice_req(url)
        raise

#Inspired by http://stackoverflow.com/a/2955687/454773
def ch_request(CH_API_TOKEN,url,args=None):
    if args is not None:
        url='{}?{}'.format(url,urlencode(args))
    request = urllib2.Request(url)
    # You need the replace to handle encodestring adding a trailing newline 
    # (https://docs.python.org/2/library/base64.html#base64.encodestring)
    base64string = base64.encodestring('%s:' % (CH_API_TOKEN)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)   
    result = url_nice_req(request)

    return json.loads(result.read())

CH_API_TOKEN='YOUR_API_TOKEN_FROM_COMPANIES_HOUSE'
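The harvester code below also calls a couple of helper functions, ch_getCompanyOfficers (reproduced in the Chat Sketches post further down this page) and ch_getAppointments, which didn’t make it into either post. A minimal sketch of the latter, assuming the Companies House officer appointments endpoint and the same filtering style as the other helpers, might look something like this:

#Hypothetical sketch - the original ch_getAppointments helper isn't reproduced in the posts.
#It appears to accept either a bare officer ID or an appointments link (/officers/{id}/appointments).
def ch_getAppointments(oid, typ='all', role='all'):
    if '/' in oid:
        oid = oid.strip('/').split('/')[1]
    url = 'https://api.companieshouse.gov.uk/officers/{}/appointments'.format(oid)
    o = ch_request(CH_API_TOKEN, url)
    if typ == 'current':
        o['items'] = [i for i in o['items'] if 'resigned_on' not in i]
    elif typ == 'previous':
        o['items'] = [i for i in o['items'] if 'resigned_on' in i]
    if role != 'all':
        o['items'] = [i for i in o['items'] if i.get('officer_role') == role]
    return o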

In the original implementation, I stored the incremental search results in a dict; in the reimplementation, I thought I’d make use of a small SQLite database.

import sqlite3

#Close any previously opened connection if this cell is being re-run
if 'db' in locals() and db is not None:
    db.close()

memDB=":memory:"
tmpDB='example.db'

db = sqlite3.connect(tmpDB)
c = db.cursor()

for drop in ['directorslite','companieslite','codirs','coredirs','singlecos']:
    c.execute('''drop table if exists {}'''.format(drop))
              
c.execute('''create table directorslite
         (dirnum text primary key,
          dirdob integer,
          dirname text)''')

c.execute('''create table companieslite
         (conum text primary key,
          costatus text,
          coname text)''')

c.execute('''create table codirs
         (conum text,
          dirnum text,
          typ text,
          status text)''')

c.execute('''create table coredirs
         (dirnum text)''')

c.execute('''create table singlecos
         (conum text,
          coname text)''')

cosdone=[]
cosparsed=[]
dirsdone=[]
dirsparsed=[]
codirsdone=[]

The code itself runs in two passes. The first pass builds up a seed set of directors from a single company or set of companies using a simple harvester:

def updateOnCo(seed,typ='current',role='director'):
    print('harvesting {}'.format(seed))
    
    #apiNice()
    o=ch_getCompanyOfficers(seed,typ=typ,role=role)['items']
    x=[{'dirnum':p['links']['officer']['appointments'].strip('/').split('/')[1],
          'dirdob':p['date_of_birth']['year'] if 'date_of_birth' in p else None,
          'dirname':p['name']} for p in o]
    z=[]
    for y in x:
        if y['dirnum'] not in dirsdone:
            z.append(y)
            dirsdone.append(y['dirnum'])
        if isinstance(z, dict): z=[z]
    print('Adding {} directors'.format(len(z)))
    c.executemany('INSERT INTO directorslite (dirnum, dirdob,dirname)'
                     'VALUES (:dirnum,:dirdob,:dirname)', z)
    for oo in [i for i in o if i['links']['officer']['appointments'].strip('/').split('/')[1] not in dirsparsed]:
        oid=oo['links']['officer']['appointments'].strip('/').split('/')[1]
        print('New director: {}'.format(oid))
        #apiNice()
        ooo=ch_getAppointments(oid,typ=typ,role=role)
        #apiNice()
        #Play nice with the api
        sleep(0.5)
        #add company details
        x=[{'conum':p['appointed_to']['company_number'],
          'costatus':p['appointed_to']['company_status'] if 'company_status' in p['appointed_to'] else '',
          'coname':p['appointed_to']['company_name'] if 'company_name' in p['appointed_to'] else ''} for p in ooo['items']]
        z=[]
        for y in x:
            if y['conum'] not in cosdone:
                z.append(y)
                cosdone.append(y['conum'])
        if isinstance(z, dict): z=[z]
        print('Adding {} companies'.format(len(z)))
        c.executemany('INSERT INTO companieslite (conum, costatus,coname)'
                     'VALUES (:conum,:costatus,:coname)', z)
        for i in x:cosdone.append(i['conum'])
        #add company director links
        dirnum=ooo['links']['self'].strip('/').split('/')[1]
        x=[{'conum':p['appointed_to']['company_number'],'dirnum':dirnum,
            'typ':'current','status':'director'} for p in ooo['items']]
        c.executemany('INSERT INTO codirs (conum, dirnum,typ,status)'
                     'VALUES (:conum,:dirnum,:typ,:status)', x)
        print('Adding {} company-directorships'.format(len(x)))
        dirsparsed.append(oid)
    cosparsed.append(seed)

The set of seed companies may be companies associated with one or more specified seed directors, for example:

def dirCoSeeds(dirseeds,typ='all',role='all'):
    ''' Find companies associated with dirseeds '''
    coseeds=[]
    for d in dirseeds:
        for c in ch_getAppointments(d,typ=typ,role=role)['items']:
            coseeds.append(c['appointed_to']['company_number'])
    return coseeds

dirseeds=[]
for d in ch_searchOfficers('Bernard Ecclestone',n=10,exact='forename')['items']:
    dirseeds.append(d['links']['self'])
    
coseeds=dirCoSeeds(dirseeds,typ='current',role='director')

Then I call a first pass of the co-directed companies search with the set of company seeds:

typ='current'
#Need to handle director or LLP Designated Member
role='all'
for seed in coseeds:
    updateOnCo(seed,typ=typ,role=role)
c.executemany('INSERT INTO coredirs (dirnum) VALUES (?)', [[d] for d in dirsparsed])

seeder_roles=['Finance Director']
#for dirs in seeded_cos, if dir_role is in seeder_roles then do a second seeding based on their companies
#TO DO

depth=0

Then we go for a crawl for as many steps as required… The approach I’ve taken here is to search through the current database to find the companies heuristically defined as codirected, and then feed these back into the harvester.

seeder=True
oneDirSeed=True
#typ='current'
#role='director'
maxdepth=3
#relaxed=0
while depth<maxdepth:
    print('---------------\nFilling out level - {}...'.format(depth))
    if seeder and depth==0:
        #Another policy would be dive on all companies associated w/ dirs of seed
        #In which case set the above test to depth==0
        tofetch=[u[0] for u in c.execute(''' SELECT DISTINCT conum from codirs''')]
    else:
        duals=c.execute('''SELECT cd1.conum as c1,cd2.conum as c2, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2
                        ON cd1.dirnum = cd2.dirnum
                        WHERE cd1.conum < cd2.conum GROUP BY c1,c2 HAVING COUNT(*)>1
                        ''')
        tofetch=[x for t in duals for x in t[:2]]
        #The above has some issues. eg only 1 director is required, and secretary IDs are unique to company
        #Maybe need to change logic so if two directors OR company just has one director?
        #if relaxed>0:
        #    print('Being relaxed {} at depth {}...'.format(relaxed,depth))
        #    duals=c.execute('''SELECT cd.conum as c1,cl.coname as cn, count(*) FROM codirs as cd JOIN companieslite as cl 
        #                 WHERE cd.conum= cl.conum GROUP BY c1,cn HAVING COUNT(*)=1
        #                ''')
        #    tofetch=tofetch+[x[0] for x in duals]
        #    relaxed=relaxed-1
    if depth==0 and oneDirSeed:
        #add in companies with a single director first time round
        sco=[]
        for u in c.execute('''SELECT DISTINCT cd.conum, cl.coname FROM codirs cd  JOIN companieslite cl ON
                                cd.conum=cl.conum'''):
            #apiNice()
            o=ch_getCompanyOfficers(u[0],typ=typ,role=role)
            if len(o['items'])==1 or u[0]in coseeds:
                sco.append({'conum':u[0],'coname':u[1]})
                tofetch.append(u[0])
        c.executemany('INSERT INTO singlecos (conum,coname) VALUES (:conum,:coname)', sco)
    #TO DO: Another strategy might be to try to find the Finance Director or other named role and seed from them?
    
    #Get undone companies
    print('To fetch: ',[u for u in tofetch if u not in cosparsed])
    for u in [x for x in tofetch if x not in cosparsed]:
            updateOnCo(u,typ=typ,role=role)
            cosparsed.append(u)
            #play nice
            #apiNice()
    depth=depth+1
    #Parse companies

To visualise the data, I opted for Gephi, which meant having to export the data. I started off with a simple CSV edgelist exporter:

data=c.execute('''SELECT cl1.coname as Source,cl2.coname as Target, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN companieslite as cl1 JOIN companieslite as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum GROUP BY Source, Target HAVING COUNT(*) > 1''')
import csv
with open('output1.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
    writer.writerows(data)
    
data= c.execute('''SELECT cl1.coname as c1,cl2.coname as c2 FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN singlecos as cl1 JOIN singlecos as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum''')
with open('output1.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerows(data)

but soon changed that to a proper graph file export, based on a graph built around the codirected companies using the networkx package:

import networkx as nx

G=nx.Graph()

data=c.execute('''SELECT cl.conum as cid, cl.coname as cn, dl.dirnum as did, dl.dirname as dn
FROM codirs AS cd JOIN companieslite as cl JOIN directorslite as dl ON cd.dirnum = dl.dirnum and cd.conum=cl.conum ''')
for d in data:
    G.add_node(d[0], Label=d[1])
    G.add_node(d[2], Label=d[3])
    G.add_edge(d[0],d[2])
nx.write_gexf(G, "test.gexf")

I then load the graph file into Gephi to visualise the data.

Here’s an example of the sort of thing we can get out for a search seeded on companies associated with the Bernie Ecclestone who directs at least one F1 related company:

Gephi_0_9_1_-_Project_2

On the to do list is to automate this a little bit more by adding some network statistics, and possibly a first pass layout, in the networkx step.
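A minimal sketch of what that automation might look like, using standard networkx calls (nothing here is from the original code):

#Annotate nodes with some simple network statistics and a first pass layout before export
deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
pos = nx.spring_layout(G)

for n, d in G.nodes(data=True):
    d['degree_centrality'] = deg[n]
    d['betweenness_centrality'] = btw[n]
    #Store the spring layout co-ordinates as plain node attributes
    d['x'], d['y'] = float(pos[n][0]), float(pos[n][1])

nx.write_gexf(G, "test_annotated.gexf")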

In terms of time required to collect the data, the Companies House API (https://developer.companieshouse.gov.uk/api/docs/index/gettingStarted/rateLimiting.html) is rate limited to allow 600 requests within a five minute period. Many company networks can be mapped within the 600 call limit, but even for larger networks, the trawl doesn’t take too long even if two or three rest periods are required.
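The apiNice() calls commented out in the trawler code above hint at a simple rate limit handler; a minimal sketch of that sort of throttle (not the original implementation) might be:

from time import sleep, time

#Simple throttle: sit out the rest of the five minute window as we approach the call limit
API_CALLS = {'count': 0, 'window_start': time()}

def apiNice(limit=590, window=300):
    if time() - API_CALLS['window_start'] > window:
        #Start a new window
        API_CALLS['count'] = 0
        API_CALLS['window_start'] = time()
    API_CALLS['count'] += 1
    if API_CALLS['count'] >= limit:
        sleep(window - (time() - API_CALLS['window_start']) + 1)
        API_CALLS['count'] = 0
        API_CALLS['window_start'] = time()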

Chat Sketches with the Companies House API, Before the F***kWit UKGov Sell It Off

The ranty title is a gut reaction to news that the Land Registry faces privatisation.

Sketching around similar ideas to my Slack/slash conversational autoresponder around the Parliament data platform API, I thought I’d have a quick play with the UK Companies House API, which provides a simple interface to company registration data, director information and disqualified director information.

Bulk downloads are available for company registration information (here’s a quick howto about working with it; I’ll post a howto showing how to work with it using a containerised database when I get a chance, but for now, here are some clues) and from the API developer forums it looks as if bulk director’s information is available by request.

Working with your own bulk copy of the data is preferable, because it means you can write your own arbitrarily complex queries over any or all of the columns. The Companies House API, on the other hand, gives you a weak search over company and director names, and the ability to look up individual known records. To do any sort of processing, you need to grab a large amount of search data, and/or make lots of individual known item requests to build your own local datastore, and then search or filter across that.

So for example, here’s the first fumblings of my own function for filtering down a list of officers associated with a particular company based on director role or current status (which I called typ for some reason? Bah :-( ):

def ch_getCompanyOfficers(cn,typ='all',role='all'):
    #typ: all, current, previous
    url="https://api.companieshouse.gov.uk/company/{}/officers".format(cn)
    co=ch_request(CH_API_TOKEN,url)
    if typ=='current':
        co['items']=[i for i in co['items'] if 'resigned_on' not in i]
        #should possibly check here that len(co['items'])==co['active_count'] ?
    elif typ=='previous':
        co['items']=[i for i in co['items'] if 'resigned_on' in i]
    if role!='all':
        co['items']=[i for i in co['items'] if role==i['officer_role']]
    return co
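For example (the company number here is just a placeholder):

#Current directors of a particular company
officers = ch_getCompanyOfficers('COMPANY_NUMBER', typ='current', role='director')
for o in officers['items']:
    print('{} ({})'.format(o['name'], o['officer_role']))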

The next function runs a search over officers by name, but then also lets you filter down the responses to show just those directors who also match a particular search string as part of any company name they are associated with.

def ch_searchOfficers(q,n=50,start_index='',company=''):
    url= 'https://api.companieshouse.gov.uk/search/officers'
    properties={'q':q,'items_per_page':n,'start_index':start_index}
    o=ch_request(CH_API_TOKEN,url,properties)
    if company != '':
        for p in o['items']:
            p['items'] = [i for i in ch_getAppointments(p['links']['self'])['items'] if company.lower() in i['appointed_to']['company_name'].lower()]
        o['items'] = [i for i in o['items'] if len(i['items'])]
    return o
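For example (the search terms here are picked purely for illustration):

#Officers called Ecclestone whose appointments include a company with 'formula' in its name
o = ch_searchOfficers('Ecclestone', n=20, company='formula')
for p in o['items']:
    print('{}: {}'.format(p['title'], [i['appointed_to']['company_name'] for i in p['items']]))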

You get the gist, hopefully. Run a crude API call, and then filter down the result according to particular data properties contained within the search result.

Anyway, as far as the chatting goes, here’s what I’ve started playing around with…

First, let’s just ask what companies a director with a particular name is associated with.

Companies_House_API_Bot1

We can take this a bit further by filtering down on the directors associated with a particular company. (Actually, this is simplified now to call the reporting function simply as dirCompanies(c)).

Companies_House_API_Bot

Alternatively, we might try to narrow the search for directors associated with companies in a particular locality. (I’m still trying to get my head round the different logics of this, because companies as well as directors are associated with addresses. I really need to try some specific investigative tasks to get a better feel for how to tune this sort of filter…)

Companies_House_API_Bot2

I’ve also started trying to think around the currency of appointments, for example supporting the ability to filter down based on resigned appointments:

Companies_House_API_Bot3

Associated with this sort of query (in the sense of exploring the past) are filters that let us search around dissolved companies, for example:

Companies_House_API_Bot4

(I should probably also put some time filters in there, for example to search for companies that a particular person was a director of at a particular time…)

We can also search into the disqualified directors register. To try to reduce the sense of turning this into a fishing trip, searching by director name and then filtering by locality feels like it could be handy (though again, this needs some refinement on the way I apply the locality filter.)

Companies_House_API_Bot5

Next step on this particular task is to tidy these up a little bit more and then pop them into a Slack responder.

But there are also some other Companies House goodies to come…such as revisiting the notion of co-director based company maps.

 

Calling an OData Service From Python – UK Parliament Members Data Platform

Whilst having a quick play producing Slack bots and slash commands around the UK Parliament APIs, I noticed (again) that the Members data platform has an OData endpoint.

OData is a data protocol for querying online data services via HTTP requests, although it never really seemed to have caught the popular imagination, possibly because Microsoft thought it up, possibly because it seems really fiddly to use…
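In principle, OData queries are just parameterised HTTP GETs, so you can also poke at a service with nothing more than the requests package; in the following sketch the service root and entity set name are assumptions for illustration rather than anything checked against the Members data platform docs:

import requests

#NOTE: the service root and entity set name below are assumptions, not from the official docs
SERVICE_ROOT = 'http://data.parliament.uk/membersdataplatform/open/OData.svc'

#Standard OData system query options: $top limits the rows, $format asks for JSON
params = {'$top': 5, '$format': 'json'}
r = requests.get('{}/Members'.format(SERVICE_ROOT), params=params)
print(r.json())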

I had a quick look around for a Python client/handler for it, and the closest I came was the pyslet package. I’ve posted a notebook showing my investigations to date here: Handling the UK Parliament Members Data Platform OData Feed, but it seems really clunky and I’m not sure I’ve got it right! (There doesn’t seem to be a lot of tutorial support out there, either?)

Here’s an example of the sort of mess I got myself in:

UK_Parliament_api_test_and_members_data_platform_OData_service_test

To make the Parliament OData service more useful, I think we need a higher level Python wrapper that abstracts a bit further and provides some function calls that make it a tad easier (and more natural) to get at the data. Or maybe I need to step back, have a read of the OData docs, properly get my head around the pyslet OData calls, and try again!

Chatting With ONS Data Via a Simple Slack Bot

A recent post on the ONS Digital blog – Dueling with datasets – describes some of the design decisions taken when putting together the new Office for National Statistics website (such as having a single page for a particular measure that would provide the current figures at the top as well as historical figures further down the page) and some of the challenges still facing the team (such as the language and titling used to describe the statistics).

The emphasis is still very much on publishing the data via a website, however, which promotes two particular sorts of interaction style: browse and search. Via Laura Dewis (Deputy Director, Digital Publishing at Office for National Statistics, and ex- of the OpenLearn parish), I got a peek at some of the popular search terms used on the pre-updated website, which suggest (to me) a mix of vernacular keyword search terms as well as official terms (for example, rpi, baby names, cpi, gdp, retail price index, population, Labour Market Statistics unemployment, inflation, labour force survey).

Over the last couple of years, regular readers will have noticed that I’ve been dabbling with some simple data2text conversions, as well as dipping my toes into some simple custom slackbots (that is, custom slack robots…) capable of responding to simple parameterised queries with texts automatically generated from online data sources (for example, querying the Nomis JSA figures as part of a Slackbot Data Wire, Initial Sketch or my First Steps in a Conversational Slackbot interface to CQC Inspection Data ).

I’m still fumbling around how best to try to put these bots together. On the one hand is trying to work out what sorts of thing we might want to ask of the data, as well as how we might actually ask for it in natural language terms. On the other, is generating queries over the data, and figuring out how to provide the response (creating a canned text around the results from a data query).

But what if there was already a ready source of text interpreting particular datasets that could be used as the response part of a conversational data agent? Then all we’d have to focus on would be parsing queries and matching them to the texts?

A couple of weeks ago, when the new ONS website came out of beta, the human facing web pages were complemented with a data view in the form of JSON feeds that mirrored the HTML text (I don’t know if the HTML is actually generated from the JSON feeds?), as described in More Observations on the ONS JSON Feeds – Returning Bulletin Text as Data. So here we have a ready source of data interpreting text that we may be able to use to provide a backend to a conversational UI to the ONS content. (Whether or not the text is human generated or machine generated is irrelevant – though it does also provide a useful model for developing and testing my own data to text routines!)

So let’s see… it being too wet to go and dig the vegetable patch yesterday, I thought I’d have a quick play trying to put together some simple response rules, in part building on some of the ONS JSON parsing code I started putting together following the ONS website refresh.

Here’s a snapshot of where I’m at…

Firstly, asking for a summary of some popular recent figures:

dtest___OUseful_Slack_1

The latest figures are assumed for some common keyword driven queries. We can also ask for a chart:

dtest___OUseful_Slack_2

The ONS publish different sorts of product that can be filtered against:

rate_-_Search_-_Office_for_National_Statistics

So for example, we can run a search to find what bulletins are available on a particular topic:

dtest___OUseful_Slack_3

(For some reason, the markdown isn’t being interpreted as such?)

We can then go on to ask about a particular bulletin, and get the highlights from it:

dtest___OUseful_Slack_4

(I did wonder about numbering the items in the list, retaining the state of the previous response in the bot, and then allowing an interaction along the lines of “tell me more about item 3”?)

We can also ask about other publication types, but I haven’t checked the JSON yet to see whether it makes sense to handle the response from those slightly differently:

dtest___OUseful_Slack_5

At the moment, it’s all a bit Wizard of Oz, but it’s amazing how fluid you can be in writing queries that are matched by some very simple regular expressions:

dtest___OUseful_Slack_woz
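The rule matching really is as crude as it sounds; here’s a minimal sketch of the sort of thing going on under the hood (the patterns and handler names are made up for illustration, not the actual bot code):

import re

#Hypothetical handlers - in the real bot these would call out to the ONS JSON feeds
def latest_figures(m): return 'Latest {} figures...'.format(m.group('measure'))
def search_bulletins(m): return 'Bulletins about {}...'.format(m.group('topic'))

#Ordered list of (pattern, handler) rules; first match wins
rules = [
    (re.compile(r'(latest|current)\s+(?P<measure>unemployment|inflation|cpi|rpi|gdp)', re.I), latest_figures),
    (re.compile(r'bulletins?\s+(about|on)\s+(?P<topic>.+)', re.I), search_bulletins),
]

def respond(msg):
    for pattern, handler in rules:
        m = pattern.search(msg)
        if m:
            return handler(m)
    return "Sorry, I didn't understand that..."

print(respond('What are the latest unemployment figures?'))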

So not bad for an hour or two’s play… Next steps would require getting a better idea about what sorts of conversation folk might want to have with the data, and what they actually expect to see in return. For example, it would be possible to mix in links to datafiles, or perhaps even upload datafiles to the slack channel?

PS Hmm, thinks.. what would a slack interface to a Jupyter server be like…?

More Observations on the ONS JSON Feeds – Returning Bulletin Text as Data

Whilst starting to sketch out some python functions for grabbing the JSON data feeds from the new ONS website, I also started wondering how I might be able to make use of them in a simple slackbot that could provide a crude conversational interface to some of the ONS stats.

(To this end, it would also be handy to see some ONS search logs to see what sort of things folk search – and how they phrase their searches…)

One of the ways of using the data is as the basis for some simple data2text scripts, that can report the outcomes of some simple canned analyses of the data (comparing the latest figures with those from the previous month, or a year ago, for example). But the ONS also produce commentary on various statistics via their statistical bulletins – and it seems that these, too, are available in JSON form simply by adding /data to the end of the URL path as before:

UK_Labour_Market_-_Office_for_National_Statistics

One thing to note is that whilst the HTML view of bulletins can include a name element to focus the page on a particular element:

http://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/#comparison-between-unemployment-and-the-claimant-count

the name attribute switch doesn’t work to filter the JSON output to that element (though it would be easy enough to script a JSON handler to return that focus) so there’s no point adding it to the JSON feed URL:

http://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/data

One other thing to note about the JSON feed is that it contains cross-linked elements for items such as charts and tables. If you look closely at the above screenshot, you’ll see it contains a reference to an ons-table.

...
sections: [
...
{
title: "Summary of latest labour market statistics",
markdown: "Table A shows the latest estimates, for October to December 2015, for employment, unemployment and economic inactivity. It shows how these estimates compare with the previous quarter (July to September 2015) and the previous year (October to December 2014). Comparing October to December 2015 with July to September 2015 provides the most robust short-term comparison. Making comparisons with earlier data at Section (ii) has more information. <ons-table path="cea716cc" /> Figure A shows a more detailed breakdown of the labour market for October to December 2015. <ons-image path="718d6bbc" />"
},
...
]
...

This resource is then described in detail elsewhere in the data feed linked by the same ID value:

www_ons_gov_uk_employmentandlabourmarket_peopleinwork_employmentandemployeetypes_bulletins_uklabourmarket_february2016_data_comparison-between-unemployment-and-the-claimant-count

...
tables: [
{
title: "Table A: Summary of UK labour market statistics for October to December 2015, seasonally adjusted",
filename: "cea716cc",
uri: "/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/cea716cc"
}
],
...

Images are identified via the ons-image tag, charts via the ons-chart tag, and so on.
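Pulling those pieces together, here’s a minimal sketch of grabbing a bulletin’s JSON and resolving the ons-table cross-references (the URL is the labour market one shown above; the parsing assumes the sections and tables elements sit at the top level of the feed, as in the snippets):

import requests, re

url = 'http://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/data'
bulletin = requests.get(url).json()

#Index the cross-referenced tables by their ID values
tables = {t['filename']: t for t in bulletin.get('tables', [])}

for section in bulletin.get('sections', []):
    print(section['title'])
    #Find any <ons-table path="..."/> references embedded in the markdown
    for ref in re.findall(r'<ons-table path="([^"]+)"', section['markdown']):
        if ref in tables:
            print('  references table: {}'.format(tables[ref]['title']))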

So now I’m thinking – maybe this is the place to start thinking about a simple conversational UI? Something that can handle simple references into different parts of a bulletin, and return the ONS text as the response?