OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘scraperwiki’

To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info

In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)

The data lives in a Scraperwiki database, which we can interrogate via the API:

scraperwiki api

To do this, we’ll use a database query language – SQL.

What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column), and to return just those rows. The SQL command is simply this:

select * from support where support in (select candinit from support)

Note that “support” refers to two things here. In the where support in (…) test it refers to a column (the assentor’s name); in each from support it refers to the table the columns are being pulled from:

select * from support where support in (select candinit from support)

Here’s the result of running the query:

sql select on scraperwiki

We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).

candidates mutual table
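If you want to pull those query results programmatically rather than through the web UI, the Scraperwiki datastore API can also be hit directly. Here’s a minimal sketch – the scraper short name is a placeholder, and the api.scraperwiki.com URL pattern is the one I believe the external datastore API uses:

import urllib, urllib2

#Sketch: run the mutual-support query via the Scraperwiki datastore API and grab the result as CSV.
#YOUR_SCRAPER_SHORTNAME is a placeholder - substitute the short name of the actual scraper.
query = "select * from support where support in (select candinit from support)"
params = urllib.urlencode({'format': 'csv', 'name': 'YOUR_SCRAPER_SHORTNAME', 'query': query})
url = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite?' + params
print urllib2.urlopen(url).read()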

There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:

Tidier intra-candidate support map

Simples:-)

Written by Tony Hirst

May 8, 2013 at 10:50 am

Boundary Files for Electoral Wards Covered by a Particular Geography

A week or so ago, I went looking for boundary lines for UK electoral wards, with half a mind towards trying to do something with them related to this week’s local council elections. One source I came across was the UK 2011 Census (2011 Census geography products for England and Wales), which publishes the appropriate data from across England and Wales in large single shapefiles. For the amateur cartographer (which I very much am!) wanting to work at a local level, this presents something of a challenge: not only do I have to find a way of downloading and opening the large dataset, I also need to find a way of extracting from it the data for my local area, which is what I actually want. (Getting the local data from the national dataset can be blocking, in other words.)

Another source of data is MySociety’s MapIt service, which provides data about various geographies, including electoral wards, covered by other geographical areas:

mapit areas covered

as well as boundary files for each geography in a variety of formats:

Mapit Geometry

Unfortunately, the MapIt API doesn’t (yet?!) support the ability to generate a single file that contains boundary data for all the electoral wards in a single council or local authority area. So here’s a quick hack, posted as a view on Scraperwiki, that generates a single KML file for the electoral wards contained by a council area (including the wider boundary of that council area) – KML Merge Test. (There are probably better ways of doing this?! I wonder if I should try to make sense of the MapIt code to see if I can work out how to submit a proper patch…)

import scraperwiki,simplejson,urllib2
from lxml import etree
from copy import deepcopy


#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
import cgi, os
qstring=os.getenv("QUERY_STRING")

key='65791' #Use the Isle of Wight as a default
typ='UTE'
#typs are:
#CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get: key=get['key']
    if 'typ' in get: typ=get['typ'] 
#---


#Get a stub KML file for the local council level
url='http://mapit.mysociety.org/area/'+str(key)+'.kml'
xmlraw = urllib2.urlopen(url).read()
xml=etree.fromstring(xmlraw)

#Get the list of electoral wards covered by that council area
wards=simplejson.load(urllib2.urlopen('http://mapit.mysociety.org/area/'+str(key)+'/covers?type='+typ))

#Get the KML for each ward, extract the Placemark data, and add it to our comprehensive KML tree
for ward in wards:
    url='http://mapit.mysociety.org/area/'+ward+'.kml'
    xmlraw = scraperwiki.scrape(url)
    xml2=etree.fromstring(xmlraw)
    p= xml2.xpath('//geo:Placemark',namespaces={'geo':'http://www.opengis.net/kml/2.2'})
    xml.append( deepcopy(p[0] ))

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")
print etree.tostring(xml)

The key value is the ID of the council area within which you want to find the electoral wards. So for example the Isle of Wight parliamentary constituency page – http://mapit.mysociety.org/area/65791.html – gives us the ID 65791, which we pass as an argument to the Scraperwiki view. The subdivision we want to grab data for is given by the typ parameter:

CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

So for example, here’s a link to an aggregate KML file for Unitary authority electoral divisions (UTE) on the Isle of Wight – https://views.scraperwiki.com/run/kml_merge_test/?key=65791; and here’s one for Unitary authority wards (UTW) in Milton Keynes: https://views.scraperwiki.com/run/kml_merge_test/?key=2588&typ=UTW

If you save the resulting file as a .kml file (for example, as kml_merge_test_mk.kml) you can then load it into something like Google Fusion tables to view it:

Google fusion table map
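For what it’s worth, here’s a minimal sketch of grabbing and saving one of those merged KML files from the view – the URL is the Milton Keynes example above, and the output filename is just the one used in this example:

import urllib2

#Sketch: fetch the merged KML for Milton Keynes wards (UTW) from the Scraperwiki view
#and save it locally, ready for upload to Google Fusion Tables (or opening in Google Earth).
url = 'https://views.scraperwiki.com/run/kml_merge_test/?key=2588&typ=UTW'
kml = urllib2.urlopen(url).read()
f = open('kml_merge_test_mk.kml', 'w')
f.write(kml)
f.close()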

Note that the MapIt API is rate limited (I think), so be gentle ;-)

Written by Tony Hirst

May 2, 2013 at 9:02 am

By Me, on the Scraperwiki Blog: Glue Logic and Flowable Data

Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:

As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.

Read the whole thing here: Glue Logic and Flowable Data.

PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.

Written by Tony Hirst

March 6, 2013 at 1:13 pm

Posted in elsewhere, Tinkering


Organisations Providing Benefits to All-Party Parliamentary Groups, Part 1

Via a tweet from the author, I came across Rob Fenwick’s post on APPGs – the next Westminster scandal? (APPG = All Party Parliamentary Groups):

APPGs are entitled to a Secretariat. Set aside any images you have of a sensibly dressed person of a certain age mildly taking dictation, the provision of an APPG Secretariat is one of the main routes used by public affairs agencies, charities, and businesses to cosey up to MPs and Peers. These “secretaries” often came up with the idea of setting up the group in the first place, to advance the interests of a client or cause.

The post describes some of the organisations that provide secretariat services to APPGs, and in a couple of cases also takes the next step: “Take the APPG on the Aluminium Industry, the secretarial services of which are provided by Aluminium Federation Ltd which is “a not-for-profit organisation.” That sounds suitably reassuring – if the organisation is not-for-profit what chance can there be of big business buying favoured access? It’s only when you look at the Federation’s website, and examine each of its nine sub-associations in turn, that it becomes clear that this not-for-profit organisation is a membership umbrella for private business. This is above board, within the rules, published, and transparent. Transparent if you’re prepared to invest the time to look, of course.”

It’s worth reading the post in full… Go on…

… Done that? Go on.

Right. Here’s the list of registered All Party Groups. Here’s an example of what’s recorded:

APG form

Conveniently, there’s a Scraperwiki scraper (David Jones / All Party Groups) grabbing this information, so I thought I’d have a play with it.

Looking at the benefits, there is a fair bit of convention in the way benefits are described. For example, we see recurring things of the form:

  • 5000 from CAFOD, 6000 from Christian Aid – that is, [AMOUNT] from [ORGANISATION]
  • Age UK (a charity) acts as the groups secretariat. – that is, [ORGANISATION] {(OPTIONAL_TYPE)} acts as the groups secretariat.

We could parse these things out directly (on the to do list!) but as a short cut, I thought I’d try a couple of content analysis/entity extraction services to see if they could pull out the names of companies and charities from the benefits list. You can find the scraper I used to enhance David Jones’ APPG scraper here: APG Enhancer.
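As a rough sketch of what that direct parsing might look like (the benefit strings below are made up, but follow the two conventions noted above):

import re

#Sketch: pull [AMOUNT] from [ORGANISATION] pairs and secretariat providers out of
#benefit descriptions. The example strings are invented for illustration.
benefits = [
    "5000 from CAFOD, 6000 from Christian Aid",
    "Age UK (a charity) acts as the groups secretariat."
]

amount_from = re.compile(r'(?P<amount>[\d,]+)\s+from\s+(?P<org>[^,.]+)')
secretariat = re.compile(r'^(?P<org>.+?)(?:\s*\([^)]*\))?\s+acts as the groups secretariat', re.I)

for benefit in benefits:
    for m in amount_from.finditer(benefit):
        print m.group('org').strip(), ':', m.group('amount')
    m = secretariat.match(benefit)
    if m:
        print m.group('org').strip(), ': secretariat'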

Here are a couple of sample reports from my scraper:

This gives a first pass attempt at extracting organisation (company and charity) names from the APPG register, and in passing provides a partial directory for looking up companies by APG (partial because the entity extractors aren’t perfect and don’t manage to identify every company, charity, association or other recognised group).

A more thorough way to look up particular companies is to do a site’n’path limited web search: eg
aviva site:http://www.publications.parliament.uk/pa/cm/cmallparty/register/

How might we go further, though? One way would be to look up companies on OpenCorporates, and pull down a list of directors:

opencorposrates - look up directors

And then we can start to walk through the database of directors, looking for other companies that appear to have the same director:

opencorporates - director lookup

(Note: we need to watch out for false positives, whereby one director has the same name as another person who is also a company director. There may be false negatives too, where we don’t find a directorship held by a specific person because a slightly different variation of their name was used on a registration document.)
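Here’s a rough sketch of that company-to-directors-to-other-companies walk using the OpenCorporates API. The company identifier is a made-up placeholder, and the exact shape of the JSON returned (and the officers/search endpoint) is an assumption on my part – check the API documentation and actual responses before relying on it:

import simplejson, urllib

#Sketch only: look up a company on OpenCorporates, list its officers, then search for
#other records registered under the same officer name (false positives are possible).
#The company identifier is a placeholder; the JSON layout is assumed, not verified.
ocid = 'gb/00000000'
company = simplejson.load(urllib.urlopen('http://api.opencorporates.com/companies/' + ocid))
for officer in company['results']['company'].get('officers', []):
    name = officer['officer']['name']
    print name, officer['officer'].get('position', '')
    hits = simplejson.load(urllib.urlopen('http://api.opencorporates.com/officers/search?q=' + urllib.quote(name)))
    for o in hits['results']['officers']:
        print '   also appears as:', o['officer'].get('position', ''), 'at company', o['officer'].get('company_number', '')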

We can also look up charities on OpenCharities to find charity trustees:

OpenCharities charity lookup

If we’re graph walking, we might then look up the trustees on OpenCorporates to see whether or not the trustees are directors of any companies with possible interests in the area, and as a result identify companies who may be trying to influence Parliamentarians through APPGs that benefit from the direct support of a charity, via that charity.

In this way, we can start to build out a wider direct interest graph around a Parliamentary group. I’m not sure how useful or even how meaningful any of this actually is, but it’s increasingly possible, and once the scripted patterns are there, increasingly easy to deploy in other contexts (for example, wherever there is a list of company names, charity names, or names of people who may be directors. I guess a trustee search on OpenCharities may also be available at some point? From a graph linking point of view, I also wonder if any charities share registered addresses with companies, etc…)

PS by the by, here’s a guest post I just wrote on the OpenCorporates blog: Data Sketching With the OpenCorporates API.

Written by Tony Hirst

December 17, 2012 at 2:48 pm

When Machine Readable Data Still Causes “Issues” – Wrangling Dates…

With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give an example of a dataset that is, arguably, released in a machine readable way – as an Excel spreadsheet – but that still requires quite a bit of work to become useful as data; because presumably the intent behind the aforementioned amendment to the FOI Act is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst… And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…

[Note - I'm not meaning to be critical in the sense of "this data is too broken to be useful so don't publish it". My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more "accessible" open data releases published in the future. ]

So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…

The data look like it’ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.

All well and good… almost…

If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012”), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:

#http://stackoverflow.com/a/1112664/454773
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

The next thing we notice is that some of the date column headings actually specify: 1) date ranges, 2) in a variety of styles across the different sheets. For example:

  • 16 – 18/11/2012
  • 16 Nov 12 to 18-NOV-2012
  • 16 to 18-Nov-12

In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:

Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here: if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats into the same form, ideally as a date structure that Python or the Scraperwiki SQLite database can recognise as such.

[For a library that can automatically reshape this sort of hierarchical tabular data arrangement in R, see Automatic Conversion of Tables to LongForm Dataframes]

The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS THE CORRECT THING TO DO**** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value that I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:

dateString=minimalist_xldate_as_datetime(cellValue,book.datemode).date().strftime("%Y-%m-%d")

In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:

import xlrd
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)

and book.datemode is a library call that looks up how dates are being represented in the spreadsheet. If the conversion fails, we default to setting dateString to the original value:
dateString=cellValue

The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I try to split a date range string into its from and to portions, and then normalise each portion into a recognisable date:

import time
from time import mktime
import datetime #dateNormalise below refers to datetime.datetime, so import the module rather than the class

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    for trials in ["%d %b %y",'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
        try:
            dtf =datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
            break
        except: dtf=d
    if type(dtf) is datetime.datetime:
        dtf=dtf.strftime("%Y-%m-%d")
    return dtf

def patchDate(f,t):
    #Grab the month and year elements from the todate, and add in the from day of month number
    tt=t.split('-')
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f) ])
    return fromdate

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)<2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)<2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
            fromdate=daterange
            todate=daterange
        else:
            fromdate=dd2[0]
            todate=dd2[1]
    else:
        fromdate=dd[0]
        todate=dd[1]
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    todate=dateNormalise(todate)
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    fromdate=dateNormalise(fromdate)
    if len(fromdate)<3:
        fromdate=patchDate(fromdate,todate)
    return (fromdate,todate)

#USAGE:
(fromdate,todate)=dateRangeParse(dateString)
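By way of a quick sanity check, here’s the sort of thing the parser should give back for a couple of the heading styles mentioned above (hedged examples – the output assumes the datetime import fix noted earlier and the parsing behaving as described):

print dateRangeParse('9-Nov-2012')       # ('2012-11-09', '2012-11-09')
print dateRangeParse('16 to 18-Nov-12')  # ('2012-11-16', '2012-11-18')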

One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.

XKCD - ISO 8601

The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.

When it comes to using the date, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? [This was written under the assumption it was summed daily values over the period, which the PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sum of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?
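For what it’s worth, here’s the trivial bit of arithmetic the “as-if daily” fudge would involve (a sketch, and only meaningful if the date-range numbers really are summed daily values):

import datetime

def asIfDaily(fromdate, todate, value):
    #Spread a date-range total evenly over the days it covers (a fudge, as noted above)
    f = datetime.datetime.strptime(fromdate, "%Y-%m-%d")
    t = datetime.datetime.strptime(todate, "%Y-%m-%d")
    days = (t - f).days + 1
    return float(value) / days

print asIfDaily('2012-11-16', '2012-11-18', 45) #gives 15.0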

When I get a chance, I’ll do a couple more posts around this dataset:
– one looking at datashaping in general, along with an example of how I shaped the data in this particular case
– one looking at different queries we can run over the shaped data.

PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:

Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped both the weekly and the aggregated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:

If we look at the weekly sheet we see this:

That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?

Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:

If we look into the aggregated sheet, we get:

Which is to say – the weekly report displayed a single date as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?

And here’s another query:

February data??? I thought we were looking at November data?

Hmmm…

PPS If you’re looking for learning outcomes from this post, here are a few, mostly about the ways in which we need to wrangle sense out of dates:

  1. representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is the most useful sort of representation, even if we end up casting dates into string form);
  2. parsing date ranges into pairs of date represented elements (from and to dates);
  3. where a dataset/spreadsheet contains heterogeneous single date and date range columns, how do we interpret the numbers that appear in the date range column?
  4. shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)

Hmmm….

PPPS Another week, another not-quite-right feature:

another date mixup

PPPPS An update on what the numbers actually mean, from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”

PPPPPS Another week, and this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)

Written by Tony Hirst

November 27, 2012 at 5:55 pm

Posted in Data, Infoskills


Trademark Galleries on Scraperwiki, via OpenCorporates

What trademarks – familiar to us all – are registered with which companies? And how often are trademarked brands in larger outlets actually ‘exclusive offerings’, aka “own range” products in weak disguise? In Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine, I demonstrated a recipe for using OpenCorporates.com as a way in to previewing at least some of the trademarks registered to a specified (UK registered?) company. Here’s a recipe using Scraperwiki to do a similar thing…

But first, do you recognise any of these trademarks…

… as belonging to Tesco?

The recipe goes something like this… For a given company name keyword, look up that company on OpenCorporates and get a list of company identifiers back:

import scraperwiki,simplejson,urllib

target='tesco'

rurl='http://opencorporates.com/reconcile/gb?query='+urllib.quote(target)
#note - the opencorporates api also offers a search:  companies/search
entities=simplejson.load(urllib.urlopen(rurl))
ocids=[]
for entity in entities['result']:
    ocids.append(entity['id'].lstrip('/companies/'))

print ocids

For each of those companies, we’ll need to look up the company details on OpenCorporates. To benefit from an increased API limit, it makes sense to use an OpenCorporates API key to do this. The idea of API keys is typically that they are assigned to specific users, which is to say, you’re supposed to keep them secret. But how do we do this on Scraperwiki, an otherwise open environment? Here’s a trick I found on the Scraperwiki blog that allows you to keep things like API keys secret… Make the scraper a protected one, and then hide the keys in the scraper description on the scraper’s homepage using the following convention:

__BEGIN_QSENVVARS__
OC_KEY = XXXXXX
__END_QSENVVARS__

where XXXXXX is your API key value (don’t use quotes around the value – use the actual value).

Here’s how we pick up the key:

import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OC_KEY"]
except:
    ockey=''

To get the trademark data, we need to pull the company data for each company ID of interest, look to see if there are any trademark records associated with it in the OpenCorporates database, and if so, pull those records:

def getOCcompanyData(ocid):
    ocurl='http://api.opencorporates.com/companies/'+ocid+'/data'+'?api_token='+ockey
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata

def getOCtmData(ocid,octmid):
    octmurl='http://api.opencorporates.com/data/'+str(octmid)+'?api_token='+ockey
    data=simplejson.load(urllib.urlopen(octmurl))
    octmdata=data['datum']['attributes']
    print 'tm data',octmdata
    categories=[]
    for category in octmdata['goods_and_services_classifications']:
        if 'en' in octmdata['goods_and_services_classifications'][category]:
            categories.append(octmdata['goods_and_services_classifications'][category]['en']+'('+category+')')
    tmdata={}
    tmdata['ocid']=ocid
    tmdata['ocname']=octmdata['holder_name']
    #tmdata['ocname']=ocnames[ocid]
    tmdata['categories']=" :: ".join(categories)
    tmdata['imgtype']=octmdata['mark_image_type']
    tmdata['marktext']=octmdata['mark_text']
    tmdata['repaddr']=octmdata['representative_address_lines']
    tmdata['repname']=octmdata['representative_name_lines']
    tmdata['regnum']=octmdata['international_registration_number']
    #if an image is trademarked, we can work out its URL on the WIPO site...
    if tmdata['imgtype']=='JPG' or tmdata['imgtype']=='GIF':
        tmdata['imgurl']='http://www.wipo.int/romarin/images/' + tmdata['regnum'][0:2] +'/' + tmdata['regnum'][2:4] + '/' + tmdata['regnum'] + '.'+ tmdata['imgtype'].lower()
    else: tmdata['imgurl']='' 
    print tmdata
    scraperwiki.sqlite.save(unique_keys=['regnum'], table_name='trademarks', data=tmdata)

    return octmdata

def grabOCTrademarks(ocid,ocdata):
    for tm in ocdata['data']:
        if tm['datum']['data_type']=='WipoTrademark':
            octmid=tm['datum']['id']
            octmdata=getOCtmData(ocid,octmid)

for ocid in ocids:
    print 'company data',getOCcompanyData(ocid)
    ocdata=getOCcompanyData(ocid)
    octmdata=grabOCTrademarks(ocid,ocdata)

You can find the actual scraper here: Scraperwiki scraper: OpenCorporates Trademark demo

The code for the view (shown above) can be found here: Scraperwiki View: OpenCorporates Trademark demo (code) and here’s the actual view

Note that I think the OpenCorporates database only has very fragmentary coverage of trademark records for each company. Even keeping up with daily updates relating to trademarks and trademarked images on WIPO looks like it could be quite a major undertaking, which is maybe why companies such as Thomson Reuters recently saw an opportunity to Create Most Comprehensive Collection of Searchable Trademark Data in the World.

That’s not to say I don’t see value in micro-attempts at making sense of the world around us such as this one, though. Trademark brands are pervasive, but it’s often not obvious who actually owns them, and they may even mask a lack of competition with an appearance of competition (for example, when a company owns two different trademark brands that a consumer thinks of as being competitive offerings by different providers, when in fact they are both owned by the same company; or when you mistake a company offshoot for a third party franchise, as I realised when I saw that BP owned the Wild Bean Cafe trademark…)

Written by Tony Hirst

May 9, 2012 at 10:06 am

Posted in Anything you want


The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow

In FOI Signals on Useful Open Data?, I pondered whether we could make use of information about FOI to help identify what sorts of data folk might actually be interested in, by virtue of making Freedom of Information (FOI) requests for that data.

I couldn’t help but start to try working various elements of that idea through, so here’s a simple baby step to begin with – a scraper on Scraperwiki (Scraperwiki scraper: WhatDoTheyKnow requests) that searches for FOI requests made through WhatDoTheyKnow that got one or more Excel/xls spreadsheets back as an attachment.

I’ve also popped up a Scraperwiki view that allows you to view data returned from searches made to local councils or universities.

Clicking through on an FOI request link takes you to the response that contains the data file, which can be downloaded directly or previewed on Zoho:

It strikes me that if I crawled the response pages, I could build my own index of data files, catalogued according to FOI request titles, in effect generating a “fake” data.gov.uk or data.ac.uk opendata catalogue as powered by FOI requests…? (What would be really handy in the local council requests would be if the responses were tagged with appropriate LGSL codes or IPSV terms (indexing on the way out) as a form of useful public metadata that can help put the FOI released data to work…?)
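As a very rough sketch of what that crawl might look like – the table and column names below are made-up placeholders for whatever the WhatDoTheyKnow scraper actually stores, and the attachment test is just a crude suffix check:

import scraperwiki
import lxml.html

#Sketch: for each FOI response page already recorded by the scraper, pull out links
#that look like attached spreadsheets. Table/column names here are placeholders.
for row in scraperwiki.sqlite.select("url, title from swdata"):
    html = scraperwiki.scrape(row['url'])
    page = lxml.html.fromstring(html)
    for href in page.xpath('//a/@href'):
        if href.lower().endswith('.xls') or href.lower().endswith('.xlsx'):
            print row['title'], href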

Insofar as the requests may or may not be useful for signalling particular topic areas as good candidates for “standard” open data releases, I still need to do some text analysis on the request titles. In the meantime, you can enter a keyword/key phrase in the Request text box in order to filter the table results to only show requests whose title contains the keyword/phrase. (The Council drop down list allows you to filter the table so that it only shows requests for a particular university/council.)

PS via a post on HelpMeInvestigate, I came across this list of FOI responses to requests made to the NHS Prescription Pricing Division. From a quick skim, some of the responses have “data” file attachments, though in the form of PDFs rather than spreadsheets/CSV. However, it would be possible to scrape the pages to at least identify ones that do have attachments (which is a clue they may contain data sets?)

So now I’m wondering – what other bodies produce full lists of FOI requests they have received, along with the responses to them?

PPS See also this gov.uk search query on FOI Release publications.

Written by Tony Hirst

April 28, 2012 at 10:42 pm

Posted in Data, opengov

