OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Anything you want’ Category

B[e|ee|ie]rlin

with 2 comments

I went to Berlin for the first time a couple of weeks ago:

FlickrDroid Upload

and had a beer:

I guessI got what I asked for...

Very – apt.

I’m normally a bitter drinker (read that how you will…) but when in, erm, Rome, I figured I should do as the, erm, doughnuts do, and have what presumably translates back here as “a continental beer”:

Wot no proper beer...?!

Good trip though:-)

Written by Tony Hirst

April 12, 2013 at 1:36 pm

Posted in Anything you want

More Open Data Frustrations – Unreadable Documentation from the DfE

leave a comment »

One of the many things I’d like to spend my time doing is tinkering with data journalism doodles relating to local news stories. For example, via our local hyperlocal blog, I saw this post announced today: Isle of Wight has highest percentage of secondary school absentee rates in country. The post included a link to a Department for Education page (Pupil absence in schools in England, including pupil characteristics) containing links to the statistical release and the associated data sets:

dfes absentee data

Here’s what we get in the zipped datafile:

downloaded data

The school level dataset had the following column headings:

Year, country_code, country, GOR, GOR_code, LA, new_LA_code, LA_Name, URN, Estab, LAEstab, School_name, School_type, Academy_Flag, Academy_open_date, enrol_sum, SessionsPossible_sum, OverallAbsence_sum, AuthorisedAbsence_sum, UnauthorisedAbsence_sum, overall_absence_percent, auth_absence_percent, unauth_absence_percent, PA_15_sum, possible_sessions_pa_15_sum, overall_abs_pa_15_sum, auth_abs_pa_15_sum, unauth_abs_pa_15_sum, overall_absence_percent_PA_15, auth_absence_percent_PA_15, unauth_absence_percent_PA_15, sess_auth_illness, sess_auth_appointments, sess_auth_religious, sess_auth_study, sess_auth_traveller, sess_auth_holiday, sess_auth_ext_holiday, sess_auth_excluded, sess_auth_other, sess_auth_totalreasons, sess_auth_unclass, sess_unauth_holiday, sess_unauth_late, sess_unauth_other, sess_unauth_noyet, sess_unauth_totalreasons, sess_unauth_unclass, sess_overall_totalreasons

We can guess at what some of these refer to, but what, for example, do the “PA 15″ columns refer to? In this case, what we really should do is look up the actual definitions, which are described in the metadata description document; a document that just happens to be a Microsoft Word 2007 formatted document…

…a document that doesn’t play nicely either with the copy of Word I have on my Mac:

SOmetimesIPreferPDF

…or the converter that the Google docs uploader uses:

Google docs struggled too

In cases such as this, particularly where there are mathematical equations that often have very specific layout requirements, it can be “safer” to use a document format such as PDF that more reliably captures the appearance of the original page. (If we were really keen on reproducibility, we might also suggest that the equations were made available in an executable form, such as programme code or even as a spreadsheet (I’m not sure “Microsoft equations” are executable?).)

I gave myself a couple of hours to have a quick look through some of the data, but as it is I’ve spent an hour or so looking for ways of reading the metadata description document along with writing up my frustration around not being able to do so… Which is time spent not making sense of the data, or, indeed, its metadata…

PS in passing, I note the publication of the parliamentary Public Accounts Committee 37th report, Whole of Government Accounts 2010-11 again picks up on the way in which government data releases often fall short in terms of their usability (for example, this week MPs call for greater use of Whole of Government Accounts; see also last August Government must do better on transparency, say MPs).

PPS Here’s the solution I used in the end – Skydrive, Microsoft’s online storage/doc viewing play:

equations pt 1

equations pt 2

As it turns out, the equations could easily have been written using simple text strings…

PPPS as to the “15″ columns, the metadata files describes them along the following lines:

PA_15_sum Number of enrolments classed as persistent absentees (threshold of 15 per cent)

possible_session_pa_15_sum Sessions possible for persistent absentees (threshold of 15 per cent)

Which means what exactly?!

Written by Tony Hirst

April 11, 2013 at 5:48 pm

Posted in Anything you want, opengov

Tagged with

Are Universities Open to Business?

leave a comment »

Another post of fragments…

A recent THES feature article – Are universities as open as they should be? – casts an eye over FOI requests in a university context, as well as transparency relating to university decition-making, pay (and bonuses) and the make-up of the student body:

With universities increasingly relying on funding from private sources, some believe that the burden placed on universities by the [Freedom of Information] Act is unfair and unsustainable. “As funding of the sector changes, and a greater proportion of university income comes from non-governmental sources, it is even more arguable whether universities are truly public authorities and whether the Act should therefore apply to them,” the University of Essex told MPs on the Commons Justice Committee when they audited the Act last year.

Universities can be unforthcoming for different reasons. In response to recent FoI requests submitted by Times Higher Education, a number of universities have refused to divulge information because they say it could harm their commercial interests – an exemption permitted by the Act. One THE request prompted this round-robin message from Staffordshire University to other FoI officers: “I have been asked by my student office if we can refuse this using this [commercial interests] exemption, I can’t see how we could – any thoughts?”

With the government pushing the idea of greater competition between institutions, we can expect this exemption to be wheeled out more frequently in the future, Gibbons predicts. “That’s something that may well happen because the government’s got this agenda to make us more commercial.”

Should universities – as free-thinking and occasionally contrarian bodies – pay more heed to the current vogue for transparency? Transparency is fine, says Thomas Docherty, professor of English and comparative literature at the University of Warwick, “as long as it is not confused with truth or even reliable knowledge. Knowledge of what is going on in any institution is not revealed by raw data.

“The demand for transparent data to operate as a kind of substitute for knowledge or truth is part of a culture of ‘immediacy’,” he thinks, which is “anathema to knowledge, and to education, both of which require time, delay, and the mediations of thinking.”

I’ve done a little digging before around university funding (e.g. and First Dabblings with the Gateway to Research API Using OpenRefine), though as mentioned in Public Sector Transparency – Do We Need Open Receipts Data as Well as Open Spending Data?, we could gain even more “transparency” if public bodies also started to publishing receipts data.

University business can also leak out in to other domains… for example, looking through OpenCorporates last night, I noticed that some directors had been added to the FutureLearn listing. My first thought was to see whether any pairing (or larger grouping) of FutureLearn directors were also co-directors of any other companies. Here’s what I found:

OU companies

(We can also look up to see what other companies currently have directors with the same names as directors of the companies listed above (here). Note that there may be some ambiguity here – exact string matches on names doesn’t mean that the names refer to the same person…)

Bookhire Limited was new to me, so I tried a quick search, in turn stumbling on a not recently updated OU FOI page – Subsidiary Companies. Do other universities tend to publish such a page?

Hmmm… maybe that would be a useful thing to collect on data.ac.uk?

PS as well as looking up university registered businesses via co-directorships, we might also try to track down companies based on DNS registrations. For example, here are some DNS nameservers for open.ac.uk and here are some domains registered on an OU nameserver.

Written by Tony Hirst

April 9, 2013 at 12:28 pm

Posted in Anything you want

Revisiting MPs’ Expenses

with one comment

I couldn’t but notice the chatter about Iain Duncan Smith claiming he’d have no problem “living on 53 pounds a dayweek“, which made me wonder not only how many meal catered events he attends each week (and how many of his scheduled meeting also have complementary tea and biscuits (a bellweather of the extent of cuts in many institutions…;-), but also how he fares on the expenses stakes…

For the last couple of years, details about MPs’ expense claims have been published via the Independent Parliamentary Standards Authority (IPSA) website. The data seems to be most easily grabbed as CSV files containing all MPs’ claims for a parliamentary session (or tax year?) – eg 2012/13 or 2011/2012. As you might expect, that means the files are relatively large – 20MB (~100,000 rows) for 12/13, or just over 40MB (~190k rows) for 2011/12.

Files of this size are fine if you’re happy working with files of this size (?!), but can be a pain if you aren’t… So here are a couple of ways getting the data into a more manageable form, starting from raw data files that look something like this…

mp expenses raw

The file is made up of a series of rows, one per expense entry, with common columns. If we loaded this data into a spreadsheet application such as Excel or Google Spreadsheets, we’d see a single sheet containing however many tens of thousands of rows of data. Assuming that the application could cope with loading such a large amount of data of course… which it might not be able to do…which means we may need to make the data file a bit more manageable, somehow…

Let’s start with grabbing data relating to Iain Duncan Smith’s expense claims. On a Linux box, or a Mac, this is easy enough from a Terminal command line. (On Windows, something like cygwin should provide you with equivalents of some of the more useful unix tools.) For example, the grep command let’s us pull just the rows that contain the phrase Iain Duncan Smith:

grep "Iain Duncan Smith" DataDownload_2012.csv > IDS_expenses_2012.csv

This reads along the lines of: search through the file “DataDownload_2012.csv” looking for rows that contain “Iain Duncan Smith”, then copy those rows and only those rows into the file “IDS_expenses_2012.csv”

For what it’s worth, I’ve posted IDS’ expenses data from 2011 and 2012 to a couple of Google Fusion Tables: 2011/12, 2012/13

Another way of wrangling the data into a state we can start to play with it is to load it into RStudio, where we can start applying magical R incantations to it.

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")

mpexpenses2012R

We can then generate a subset of the data containing just IDS’ data:

ids2012=subset(mpExpenses2012, MP.s.Name=="Iain Duncan Smith")

We can also generate a combined data set of IDS’ expense claims from both 2011/12 and 2012/13, for example:

mpExpenses2011 = read.csv("~/Downloads/DataDownload_2011.csv")
ids2011=subset(mpExpenses2011, MP.s.Name=="Iain Duncan Smith")

ids11and12=rbind(ids2011,ids2012)

However, all is not well…

On loading the 2011 data into R, 158320 observations are loaded in. The actual number of rows (including the header – so one more row than the number of “observations”) can be given by running a line count (from the terminal/command line) over the original file:

wc -l DataDownload_2011.csv
187447 DataDownload_2011.csv

That is, 187447 rows…

If we try to pull out a list of MPs’ names using the levels command:

mpNames=data.frame(name=levels(mpExpenses2011$MP.s.Name))

we find that as well as the expected MPs names, there’s some “bad” data:

etl messup

(What we expect to see in the name column is a list of MP names, one unique name per row.)

This is, of course, the way of the world. Folk who publish data rarely if ever, provide a demonstration of how to actually open it cleanly into an application (typically because data publishers think that once they have published the data, it’s bound to be all right and doesn’t need checking. This is, however, rarely true…although, for the 2012/13 data, there are 99071 loaded observations against 99072 (including 1 header) rows in the download file, which does seem to be correct).

What we should do now, of course, is go into a an ETL (or at least, TL) debug mode on the 2011 data and try to figure out what’s going wrong with the simple import… or we just work with the data we have and try to work around the dodgy rows…

…or we limit ourselves to the 2012 data, which does seem okay…

So if we do that, what other sorts of investigation come to mind?

One thing that came to mind after skimming the data…

mp cost of travel

was a “rail travel fares according to MPs’ expenses” lookup table.

So for example, we might start by creating a subset of the data based on expenses categorised as “Travel” and then look to see what sorts of trvel classification falls within that Category:

travel=droplevels(subset(mpExpenses2012,Category=="Travel"))
levels(travel$Expense.Type)

Here’s what we get:

[1] "Car Hire"                       "Car Hire Fuel"                  "Car Hire Fuel MP"              
 [4] "Car Hire Fuel MP Staff"         "Car Hire Insurance MP Staff"    "Car Hire MP"                   
 [7] "Car Hire MP Staff"              "Congest. Zone/Toll Seas Ticket" "Congestion Zone/Toll"          
[10] "Congestion Zone/Toll Dependant" "Congestion Zone/Toll MP"        "Congestion Zone/Toll MP Staff" 
[13] "Food & Drink"                   "Food & Drink @ Parliament"      "Food & Drink @ Parlmnt OFF Est"
[16] "Food & Drink MP"                "Food & Drink MP Staff"          "Hotel Late Sitting"            
[19] "Hotel Late Sitting > 1.00"      "Hotel London Area MP Staff"     "Hotel NOT London Area (Travel)"
[22] "Hotel NOT London Area MP Staff" "Hotel Outside UK"               "Own Bicycle MP"                
[25] "Own Car Dependant"              "Own Car MP"                     "Own Car MP Staff"              
[28] "Own Vehicle Bicycle"            "Own Vehicle Bicycle MP Staff"   "Own Vehicle Car"               
[31] "Own Vehicle Car Dependant"      "Own Vehicle Car MP Staff"       "Own Vehicle Mot Cycle MP Staff"
[34] "Parking"                        "Parking Dependant"              "Parking MP Staff"              
[37] "Parking Season Ticket"          "Public Tr AIR"                  "Public Tr AIR Dependant"       
[40] "Public Tr AIR MP Staff"         "Public Tr BUS"                  "Public Tr BUS MP Staff"        
[43] "Public Tr COACH"                "Public Tr COACH MP Staff"       "Public Tr FERRY"               
[46] "Public Tr FERRY MP Staff"       "Public Tr OTHER"                "Public Tr OTHER Dependant"     
[49] "Public Tr OTHER MP Staff"       "Public Tr RAIL - RTN"           "Public Tr RAIL - SGL"          
[52] "Public Tr RAIL Dependant - RTN" "Public Tr RAIL Dependant - SGL" "Public Tr RAIL Foreign"        
[55] "Public Tr RAIL MP Staff - RTN"  "Public Tr RAIL MP Staff - SGL"  "Public Tr RAIL Other"          
[58] "Public Tr RAIL Other Dependant" "Public Tr RAIL Other MP Staff"  "Public Tr RAIL Railcard"       
[61] "Public Tr RAIL Railcd MP Staff" "Public Tr RAIL Sleeper Suppl"   "Public Tr Season Ticket"       
[64] "Public Tr UND"                  "Public Tr UND Dependant"        "Public Tr UND MP Staff"        
[67] "Public Tr Underground MP"       "Taxi"                           "Taxi After Late Sitting"       
[70] "Taxi after Late Sitting 11pm"   "Taxi Dependant"                 "Taxi MP"                       
[73] "Taxi MP Staff"                  "Taxi Working Late After 9pm"    "Taxi Working Late Before 9pm"

We might further pull out just the rows relating to rail travel (almost 10,000 rows from the 2012/13 dataset):

rail=droplevels(subset(travel,grepl("Rail",Expense.Type,ignore.case=TRUE)))

and then we might start looking to see who’s travelling First vs. who’s travelling Standard, as well as building up a database of rail fares between locations as claimed on expenses. But that’s for another day…

Written by Tony Hirst

April 2, 2013 at 11:20 pm

Simple Mapmaking With Google Maps Engine Lite? Not for me…

leave a comment »

I’m increasingly reluctant to plug Google services because of the way I’ve started to personally frame their actions as I (cynically) attempt to understand the sorts of process the business appears to use when developing, running and retiring services (if it’s not selling you as audience, Google-as-extractive-industry needs you to be providing it with training signals and data resource to feed its own data-collection tools and algorithms. If there’s no sales, training or data provision to be had, you’re surplus to requirements.). On the other hand, the familiarity of Google can make it an easier sell when trying to persuade people to try out a new application.

But for what it’s worth, here’s a new Google service for helping you get your data onto a map in a relatively straightforward way – Google Maps Engine Lite (launch announcement). (I’m guessing that this is a either a ratcheting sales pitch for the commercial Google Maps Engine (having tried the Lite service out, and decide you may want to make “formal” use of it, you read the Terms and Conditions/license agreement and find you legally can’t…), a way to foster WOM pressure on organisations to adopt Maps Engine (cf. bring-your-own-device – workers or customers tell an organisation “I can do this myself at home why can’t I/you do it here?”), or a way of getting more structured geo-data into Google’s tar pit.)

google maps engine lite

As a sample data set (the linked to example data set in the help notes is a broken link (missing “http://&#8221; relativises it… <- pile of crap WordPress trying to linkify something that isn't a link how do I disable it? Or do I just say: “missing h t t p colon double slash”?), I downloaded (as bp.csv) a data grab I'd made on OpenCorporates of companies co-directed by directors of a sample set of BP companies registered in the UK that includes registered addresses:

corporates scraped data

To simplify the data (and because I don’t want to give Google more data than I have to, to create a map of companies by registered address), I just grabbed the company name and registration address columns using this handy CSVfilter commandline tool: cat bp.csv | csvfilter -f 0,4 > testdata.csv

(See also: CSVkit)

The 0,4 refer the the 0th and 4th columns in the original data file to give me a datafile that looks like this:

geotest data

The fact that lots of companies have the same registered address will let me test how the maps engine handles overlaid markers…

The import is then easy enough…

google maps engine lite import

…ish:

hit a limit already..

Erm…okay… So we’re limited on the number of markers in each layer, in the “free” plan at least…

map engine sales

(I guess I could try to split my file into 100 row segments and then upload each of them individually. On a Linux/Unix/Mac terminal command line, the command split -l 100 testdata.csv will split the testdata.csv file into separate files 100 lines long (except the last chunk, probably…;-). The filenames are automatically generated by default – inspect the directory listing of the folder you run the command in and sort by most recent date to see the new files that are created.)

chunked data

We then get walked through identifying columns for locations (more training data for the algos):

import col place

and then marker labels:

add label col

And then let Google do the geocoding…

hmmm broken

Erm… buggritt; does it only work in the US?

craptastic

The word I use for this is craptastic.

Ho hum, I’ve better things to do than faff around with this any more today…

Written by Tony Hirst

March 28, 2013 at 11:18 am

Posted in Anything you want

Next Steps Taken for data.ac.uk…

with 7 comments

One of the problems with doing “data stuff” in a particular sector is finding data from across the sector. data.ac.uk seeks to help simplify the discovery of (and maybe even normalised access to) data published across the UK Higher Education sector.

data.ac.uk homepage

A new unveiling this week was the HE equipment register (which I think grew out of the Uniquip equipment and facility sharing project?), and which is intended to provide a single point of access for looking up access to research facilities and equipment.

data.ac.uk equipment register

(I think the research councils increasingly require universities to have a plan for making funded research equipment available to businesses, and providing a catalogue to look up such equipment facilitates that.) I’m not sure about the coverage of this catalogue at the moment, or how it relates (not least in a data sharing way) to research equipment sharing consortia such as the N8 Research Partnership (Durham, Lancaster, Leeds, Liverpool, Manchester, Newcastle, Sheffield and York) or the M5 Group (Birmingham, Leicester, Loughborough, Nottingham, Warwick, and Aston). (There’s also fragmentary evidence of an S5 grouping (Cambridge, Imperial, Oxford, UCL and Southampton) but I haven’t found a public website for them?)

Another area of the data.ac.uk provides a handy link to “administrative” information relating to HEIs – Learning Providers data, although I’m not sure to what extent this overlaps with the data contained in the JISC Monitoring Unit (JISC MU) database?

learning providers

One problem with open data sites having national or sector coverage is that, whilst we might hope that individual locations will submit data to the datastore, its often more likely the case that a dataset will need curating and collecting together by a dedicated and interested (obsessive?) individual. To date, Chris Gutteridge has been doing a lot of the work on data.ac.uk, but he doesn’t necessarily scale!;-)

Architecturally, the site is designed to support what I guess we could describe as federated management. Subdomains are used to identify different topic or category areas, with a top bar menu providing navigation to other areas of data.ac.uk. In principle, anyone could propose, host, and curate data from across the sector relating to a particular topic. Unlike sites such as OpenlyLocal, the model does not (yet?) support pages built around the opendata offerings of a particular institution, though I guess someone could generate something like quickdata.ac.uk/university-name to provide a summary page for each university on a “quickdata” subdomain?

There is possibly an issue regarding the “status” of data.ac.uk in respect of the extent to which it provides a single point of access to normalised data within a topic area, compared to linking out to locally hosted versions of data relating to particular items (for example, we might imagine foi.data.ac.uk linking to FOI homepages on each university website, or orgcharts.data.ac.uk linking to data source pages on university websites, ordered by university). On the other hand, there are pre-exisiting “national datasets” such as the data collated by JISC MU, or the research council funding data that looks as if it’ll be made available via the Research Councils UK Gateway to research. For these national colletions, the data.ac.uk model would “allow” for sites like Gateway to Research to take over something like the gtr.data.ac.uk subdomain, and add the data.uk.uk top bar to their site, though I could see all sorts of issues with that relating to perceived ownership! One possible way around this would be to provide a button that “partner” sites could include that would identify a site as being part of the data.ac.uk federation and then popping up the top bar if folk wanted to explore other data.ac.uk federation sites? URLs such as gtr.data.ac.uk would then simply act as redirects into sites with independent branding/look and feel, but a data.ac.uk federation member button on them somewhere?

In other news, the Linked Up Challenge also launched this week “promoting the innovative use of linked and open data in an educational context”.

linkededucation data endpoint

The competition will be making available data drawn from across the European HE sector and published as Linked Data:

linked up challenge

For some reason, this springs to mind…

Hmmm…

Written by Tony Hirst

March 21, 2013 at 11:23 am

Posted in Anything you want

Tagged with

via OER-DISCUSS – Notes on Copyright

with one comment

I though this was handy on the OER-DISCUSS mailing list:

Our copyright officer writes:

… US Copyright ‘Fair Use’ or S29 copying for non-commercial research and private study which allows copying but the key word here is ‘private’. i.e. the provisos are that you don’t make the work or copies available to anyone else.

Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.

S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works i.e. books journals etc and
b) negated becauses licences are now available S.36 (3)

S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broacasts or sound recordings.

The only educational exception students can rely on is s.32(3) for Examination athough this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award and any further dealing with the work after the examination process, becomes infringement.

I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)

There is also exception s.30 for Criticism or Review, which is a general exception for all, and the copying is necessary for a genuine critique or review of it.

If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons), would be highly recommended.

Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library

Written by Tony Hirst

March 16, 2013 at 11:50 am

First Dabblings with the Gateway to Research API Using OpenRefine

with one comment

Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over project data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using Open Refine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.

So let’s do a search for the Open University, and limit by organisation:

Gtr Open university search

Here’s the OU’s organisation “homepage” on GtR:

gtr ou homepage

A unique ID associated with the OU, combined with the organisation path element defines the URL for the organisation page:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24

Just add a .json or .xml suffix to get a link to a machine readable version of the data:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json

(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? As Chris Gutteridge would probably argue for, why don’t the HTML pages also support the autodiscovery of machine friendly versions of the page?)

Noting that there are of the order 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy from the browser address/location bar a URL to pull back 100 results, and tweak it to grab the JSON):

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=1&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100

(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)

To grab additional pages, simply increase the page count.

In OpenRefine, we can load the JSON in from four URLs:

OpenRefine import data

Yep – looks like the data has been grabbed okay…

open refine - the data has been grabbed

We can now select which blocks of data we want to import as unique rows…

open refine  - identify the import records

OpenRefine nows shows a preview – looks okay to me…

Open Refine - check the preview

Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:

Open refine Cast to date

We can then use a time facet to explore the distribution of bids with a particular start date, for example:

Open Refine - timeline facet

Hmmm – no successful bids through much of 2006, and none of 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted to the OU and the projects from the missing period are associated with that?)

What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:

open refine funding facet

One of the nice features of Open Refine is that we can use the numeric facet sliders to limit the displayed rows to show only projects with start date in a particular range, for example, and/or project funding values within a particular range.

How about the funders – and maybe the number of projects funder? A text facet can help us there:

open refine text facet

Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.

So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.

If we return to the Gateway to Research, we notice that each project has it’s own page too (we might also have noticed this from the URL column in the data we downloaded):

gtr - project data

The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.

Open Refine - add col by URL

We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page:

open refine scraper downloader

Data down:-) Now we need to parse it and create columns based on the project data download:

open refine - data down - need to parse it

One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):

gtr project json

Here’s how we can pull out the Grant Reference:

open refine parse json

Let’s briefly unpick that:

value.parseJson()['projectComposition']['project']['grantReference']

The first part – value.parseJson() – gets the text string represented as a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)

Here are a few more incantations we can cast to grab particular data elements out:

  • the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
  • the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']

open refine new cols

We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:

forEach(filter(value.parseJson()['projectComposition']['projectPerson'],v,v['projectRole'][0]=='PRINCIPAL_INVESTIGATOR'),x,x['name']).join('|')

We can also pull out a list of the Organisations associated with the project:

forEach(value.parseJson()['projectComposition']['organisation'],v,v['name']).join('|')

open refine - more data extracts

Noting that some projects have more than one organisation associated with it, we might want to get a form of the data in which each row only lists a single organisation, the other column values being duplicated for the same project.

The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.

open refine split multi valued cells

We need to split on the | delimiter I specified in the organisations related expression above.

open refine split separated

The result of the split are new rows – but lots of empty columns…

open refien the result of split

We can fill in the values by using “Fill Down” on columns we want to be populated:

open refine fill dwon

Magic:

open refine filled

I can use the same Fill Down trick to populate blank rows in the Grant Reference, PI and abstract columns, for example.

The final step is to export out data, using a custom export format:

open refine cusotm export

This exporter allows us to select – and order – the columns we want to export:

open refine exporter col select

We can also select the output format:

open refine export format select

Once downloaded, we can import the data into other tools. Here’s a peak at loading it into RStudio:

Importing into R

and then viewing it:

data in R

NOTE: I haven’t checked any of this, just blog walked through it..!;-)

Written by Tony Hirst

March 14, 2013 at 5:43 pm

Joining the Open Knowledge Foundation for a Day a Week…

with 6 comments

And so it goes… I’m in the process of going down to 0.8FTE at the OU, and taking up a day a week with the Open Knowledge Foundation [announcement] :-)

I’ll be working with them on the School of Data and no doubt be inflicting crazy hacks and data-related tales on a wider audience through the School of Data blog.

I’m also looking forward to working with potential but currently “non-consuming” open data users to get a better feel for what tools, techniques and training would actually be useful to folk…

Most of my work time will remain with the OU, although one thing I am keen to explore is how initiatives such as OpenLearn, as well as formal education programmes, can be used to complement data education activities offered by initiatives such as the School of Data. (If you have any ideas – please get in touch;-) I also want to start looking in more detail at data-related tooling and workflows, particularly in the context of research skills development, to try to get a feel for what sort of practice researchers engage in around data use…

I’ll try to use context to make it clear what capacity any posts that appear here – or elsewhere – relate to, or what hat I’m wearing whenever I do a presentation. Regular readers may already have noticed that I’ve started linking out to posts that I’ve written elsewhere, and that’s one way I’ll try to separate concerns. As a result, this blog may become even more personal notebook like, so you have been warned!

Written by Tony Hirst

March 12, 2013 at 2:46 pm

Posted in Anything you want

What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events

with 2 comments

Following a chat with @andypryke, I thought I’d try out a simple bit of feature detection around approximated follower acquisition charts (e.g. Estimated Follower Accession Charts for Twitter) to see if I could detect dates around which there were spikes in follower acquisition.

So for example, here’s the follower acquistion chart for Seem Malhotra:

seemaMalhotra

We see a spike in follower count about 440 days ago, with an increased daily follower acquisition rate thereafter. WHat happened 440 days or so ago? We can easily look this up on something like Wolfram Alpha (query on /440 days ago/) or directly in R:

as.Date(Sys.time())-440
[1] "2011-12-20"

So what happened in December 2011? A quick search on /”Seema Malhotra” December 2011/ turns up the news that she won a by-election on 16 December 2011. The spike in followers matches the by-election date well, and the increased rate in daily follower acquisition since then is presumably related to the fact that Seema Malhotra is now an MP.

So what’s the new line on the chart (the black, stepped line along the bottom)? It’s actually a 5 point moving average of the first difference in follower count over time (that is, sort of a smoothed version of a crude approximation to the gradient of the follower acquisition curve; the firstdiff curve is normalised by finding the difference in accumulated follower count between consecutive time samples divided the number of days between samples. So it’s a sort of gradient rather than first difference. If the samples were all a day apart, it would be a first difference…). I also filter the line to only show days on which there was a “significant jump” in follower count, arbitrarily set at a 5 sample moving average of more than 50 new followers per day. Note that scaling of the moving average values too – the numerical y-axis scale is 1:1 for the cumulative follower number, but 10x the moving average value. The numerical value labels that annotate the line chart correspond to the number of days ago (relative to the date the chart was generated) that the peak corresponds to. For any chart critics out there – this is a “working chart”, rather than a polished presentation graphic;-)

#Process Twitter user data file
processUsersData=function(data){
  data$tz=as.POSIXct(data$created_at)
  data$days=as.integer(difftime(Sys.time(),data$tz,units='days'))
  data=data[rev(rownames(data)),]
  data$acc=1:length(data$days)
  data$recency=cummin(data$days)
  data$frfo=data$friends_count/data$followers_count
  data$stfo=data$statuses_count/data$followers_count
  data$foperday=data$followers_count/data$days
  data$stperday=data$statuses_count/data$days
  data$fost=data$followers_count/(1+data$statuses_count)
  
  data
}

#The TTR library includes various moving average functions
require(TTR)

differ_a=function(d){
  d=processUsersData(d)

  #Find the users who are used to approximate the accession date
  d2=subset(d,days==recency)
  #Dedupe these rows (need to check if I grab the first of the last...)
  d3=d2[!duplicated(d2$recency),]

  #First difference
  d3$accdiff=c(0,diff(d3$acc))
  d3$daysdiff=c(0,-diff(d3$days))
  d3$firstdiff=d3$accdiff/d3$daysdiff

  #First difference smoothed over 5 values - note we do dodgy things against time here - just look for signal!
  d3$SMA5=SMA(d3$firstdiff,5)

  #Second difference
  d3$fdd=c(0,diff(d3$firstdiff))
  d3$seconddiff=d3$fdd/d3$daysdiff
  
  d3
}

#An example plotter - sm is the user data
g= ggplot(processUsersData(sm))
g=g+geom_point(aes(x=-days,y=acc),size=1) #The black acc/days dots
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1) #The red acc/days  acquisition date estimate dots
g=g+geom_line(data=differ_a(sm),aes(x=-days,y=10*SMA5)) #The firstdiff moving average line
g=g+geom_text(data=subset(differ_a(sm),SMA5>50),aes(x=-days,y=10*SMA5,label=days),size=3) #Feature label
g=g+ggtitle("Seema Malhotra") #Chart title

Here’s Chris Pincher:

chrispincher

This account got hit about 79 days ago (December 15th 2012) – we need to ignore the width of the moving average curve and just focus on the leading edge, as a zoom into the chart, with a barchart depicting firstdiff replacing the first diff moving average line, demonstrates.

#Got a rogue datapoint in there somehow?
ggplot(subset(processUsersData(cpmp),days&lt;5000))
g=g+geom_point(aes(x=-days,y=acc),size=1)
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1)
g=g+geom_bar(data=subset(differ_a(cpmp),days50 &amp; days&lt;5000),aes(x=-days,y=firstdiff,label=days),size=3)
g=g+ggtitle(&quot;Chris Pincher&quot;)+xlim(-200,-25)

chrispincherzoom

The spam followers that were signed up to the account look like they were created in batches several months prior to what I presume was an attack? COuld this have been in response to his Speaking Out about the Collapse of Drive Assist on Thursday, December 13th, 2012, his Huffpo post on the 11th, or his vote against the Human Rights Act as reported on the 5th?

Who else has an odd follower acquisition chart? How about Aidan Burley?

AidanBurley

219 days ago – 28th July 2012…

aidanBurleycrap

I guess that caused something of a Twitter storm, and a resulting growth in his follower count… Diane Abbott’s racist tweet row from December 2012 also grew her twitter following… Top tip for follower acquisition, there;-)

Nadine Dorries’ outspoken comments in May 2012 around David Cameron’s party leadership, and then same sex marriage, was good for her Twitter follower count, which received another push when she joined I’m a Celebrity and was suspended from the Parliamentary Conservative party.

Showing your emotions in Parliament looks like a handy trick too…Here’s a spike around about October 20th, 2011…

alisonMcgovern

(There also looks to be a gradient change around 200 days ago maybe? The second diff calculations might pull this out?)

Chris Bryant’s speech on the phone hacking saga in July 2011 showed that publicly well-received parliamentary speeches can be good for numbers too; not surprisingly, the phone hacking scandal was also good for Tom Watson’s follower count around the end of July 2011. Election victories can be good too: Andy Sawford got a jump in followers when he was announced as a PPC (10th August 2012) and then when he won his seat (November 7th 2012); Ben Bradshaw’s numbers also jumped around the time of his May 2010 election victory, as did Lynne Featherstone’s, particularly with her appointment to a government position. Jesse Norman appeared to get a bump after the Prime Minister confronted him on July 11th 2012; Nick de Bois saw a leap in followers following the riots in his constituency in early August 2011, and the riots also seem to have bumped David Lammy’s and Diane Abbott’s numbers up.

A tragedy on September 17th looks like it may have pushed Peter Hain’s numbers, but he was in the news a reasonable amount around that time – maybe getting your name in the press for several days in a row is good for Twitter follower counts? Steve Rotherham also benefited from another recalled tragedy, the Hillsborough distaster, when, in October 2011, he called the ex-Sun’s editor out over it’s original coverage; he seems to have received another boost in followers when he lead a debate on internet trolls in September 2012.

Personal misfortune didn’t do Michael Fabricant any harm – his speeding conviction colourful Twitter baiting in October 2012 caused his follower count to fly and achieve an elevated rate of daily growth it’s maintained ever since.

A Dispatches special on ticket touts got a bounce in followers for Sharon Hodgson, who was sponsoring a private member’s bill on ticket touts at the time; winning a social media award seemed to do Kevin Brennan a favour in terms of his daily follower acquisition rate, as this ramp increase around the start of December 2010 shows:

kevinBrennan

So there we have it; political life as seen through the lens of Twitter follower acquisition bursts:-)

But what now? I guess one thing to do would be to have a go at estimating the daily growth rates of the various twittering MPs, and see if thy have any bearing to things like ministerial (or Shadow Minister) responsiblity? Where rates seem to change (sustained kinks in the curve), it might be worth looking to see whether we can identify any signs of changes in tweeting behaviour – or maybe news stories that come to associate the MP with Twitter in some way?

Written by Tony Hirst

March 4, 2013 at 9:42 pm

Follow

Get every new post delivered to your Inbox.

Join 344 other followers