OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Infoskills’ Category

MOOC Platforms and the A/B Testing of Course Materials

[The following is my *personal* opinion only. I know as much about FutureLearn as Google does. Much of the substance of this post was circulated internally within the OU prior to posting here.]

In common with other MOOC platforms, one of the possible ways of positioning FutureLearn is as a marketing platform for universities. Another might see it as a tool for delivering informal versions of courses to learners who are not currently registered with a particular institution. [A third might position it in some way around the notion of "learning analytics", eg as described in a post today by Simon Buckingham Shum: The emerging MOOC data/analytics ecosystem] If I understand it correctly, “quality of the learning experience” will be at the heart of the FutureLearn offering. But what of innovation? In the same way that there is often a “public benefit feelgood” effect for participants in medical trials, could FutureLearn provide a way of engaging, at least to a limited extent, in “learning trials”?

This need not be onerous, but could simply relate to trialling different exercises or wording or media use (video vs image vs interactive) in particular parts of a course. In the same way that Google may be running dozens of different experiments on its homepage in different combinations at any one time, could FutureLearn provide universities with a platform for trying out differing learning experiments whilst running their MOOCs?

The platform need not be too complex – at first. Google Analytics provides a mechanism for running A/B tests and “experiments” across users who have not disabled Google Analytics cookies, and as such may be appropriate for initial trialling of learning content A/B tests. The aim? Deciding on metrics is likely to prove a challenge, but we could start with simple things to try out – does the ordering or wording of resource lists affect click-through or download rates for linked resources, for example? (And what should we do about those links that never get clicked and those resources that are never downloaded?) Does offering a worked through exercise before an interactive quiz improve success rates on the quiz? And so on.
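By way of a crude illustration of the sort of metric comparison I mean – nothing FutureLearn or Google Analytics specific, just invented click counts – here's a minimal sketch of a two-proportion test on click-through rates for two variants of a resource list:

#Minimal two-proportion z-test sketch for comparing click-through rates (invented numbers)
from math import sqrt, erf

def ab_clickthrough(clicksA, viewsA, clicksB, viewsB):
    pA, pB = clicksA/float(viewsA), clicksB/float(viewsB)   #observed click-through rates
    p = (clicksA+clicksB)/float(viewsA+viewsB)              #pooled rate under "no difference"
    z = (pA-pB)/sqrt(p*(1-p)*(1.0/viewsA+1.0/viewsB))
    pval = 2*(1-0.5*(1+erf(abs(z)/sqrt(2))))                #two-sided normal approximation
    return pA, pB, z, pval

#eg variant A: 120 clicks from 1000 views; variant B: 150 clicks from 1000 views
print(ab_clickthrough(120, 1000, 150, 1000))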

The OU has traditionally been cautious when running learning experiments, delivering fee-waived pilots rather than testing innovations as part of A/B testing on live courses with large populations. In part this may be through a desire to be ‘equitable’ and not jeopardise the learning experience for any particular student by providing them with a lesser quality offering than we otherwise could*. (At the same time, the OU celebrates the diversity and range of skills and abilities of OU students, which makes treating them all in exactly the same way seem rather incongruous?)

* Medical trials face similar challenges. But it must be remembered that we wouldn’t trial a resource we thought stood a good chance of being /less/ effective than one we were already running… For a brief overview of the broken worlds of medical trials and medical academic publishing, as well as how they could operate, see Ben Goldacre’s Bad Pharma for an intro.

FutureLearn could start to change that, and open up a pathway for experimentally testing innovations in online learning as well as at a more micro-level, tuning images and text in order to optimise content for its anticipated use. By providing course publishers with a means of trialling slightly different versions of their course materials, FutureLearn could provide an effective environment for trialling e-learning innovations. Branding FutureLearn not only as a platform for quality learning, but also as a platform for “doing” innovation in learning, gives it a unique point of difference. Organisations trialling on the platform do not face the threat of challenges made about them delivering different learning experiences to students on formally offered courses, but participants in courses are made aware that they may be presented with slightly different variants of the course materials to each other. (Or they aren’t told… if an experiment is based on success in reading a diagram where the labels are presented in different fonts or slightly different positions, or with or without arrows, and so on, does that really matter if the students aren’t told?)

Consultancy opportunities are also likely to arise in the design and analysis of trials and new interventions. The OU is also provided with both an opportunity to act according to its beacon status as far as communicating innovative adult online learning/pedagogy goes, as well as gaining access to large trial populations.

Note that what I’m proposing is not some sort of magical, shiny learning analytics dashboard; it’d be a procedural, could-have-been-doing-it-for-years application of web analytics that makes use of online learning cohorts that are at least a magnitude or two larger than is typical in a traditional university course setting. Numbers that are maybe big enough to spot patterns of behaviour in (either positive, or avoidant).

There are ethical challenges and educational challenges in following such a course of action, of course. But in the same way that doctors might randomly prescribe between two equally good (as far as they know) treatments, or systematically use one particular treatment over another that is equally good, I know that folk who create learning materials also pick particular pedagogical treatments “just because”. So why shouldn’t we start trialling on a platform that is branded as such?

Once again, note that I am not part of the FutureLearn project team and my knowledge of it is largely limited to what I have found on Google.

See also: Treating MOOC Platforms as Websites to be Optimised, Pure and Simple…. For some very old “course analytics” ideas about using Google Analytics, see Online Course Analytics, which resulted in the OUseful blog archive: “course analytics”. Note that these experiments never got as far as content optimisation, A/B testing, search log analysis etc. The approach I started to follow with the Library Analytics series had a little more success, but still never really got past the starting post and into a useful analyse/adapt cycle. Google Analytics has moved on since then of course… If I were to start over, I’d probably focus on creating custom dashboards to illustrate very particular use cases, as well as REDACTED.

Written by Tony Hirst

January 31, 2013 at 4:53 pm

Posted in Analytics, Infoskills, OU2.0


All I Did Was Go to a Carol Service…

Christmas Tree day, and though it’s not decorated yet, at least it’s up. The screws in the trusty metal base had rusted a little since Twelfth Night last year, but a pair of pliers and a dab of engine oil from the mower dipstick seemed to do the trick in loosening them; then it was time to go off to the “Lights of Love” Carol Service in aid of the local hospice at the Church in our old parish.

(A similar service in the local pub missed last week.)

The drive, a little bit longer than usual: temporary traffic lights set up around a hole in the road – gas leak, apparently.

I’ve never visited the Island’s hospice (Earl Mountbatten was the governor, then first Lord Lieutenant, years ago), but it seems they’ve recently opened a wifi enabled café; must check it out some time…

Carols sung, community choir, coffee and a mince pie. Shop bought, not Mothers’ Union home made. Illness put the refreshments in doubt, homebaking too? Had I known, maybe I should have brought some of mine. Or maybe not.

Chat with the town Mayor (sounds grand, doesn’t it?! Longstanding friend). Remembrance. Surprise declared about my lack of engagement, on civil liberties grounds, about a recent action by the Isle of Wight Council, the police, and representatives of the DWP – Department for Work and Pensions – involving the stopping of cars at commuter time and drivers (presumably just drivers?) consequently “quizzed about their National Insurance numbers, who their employers were and who they lived with” [src]. (The story so far… Crackdown on bad driving and aftershock ‘Big Brother’ criticism of operation. The council leader also had a response: Tories hit back at criticism of benefits operation/Cllr Pugh: IW Conservatives support Police/DWP Benefit vehicle stops).

Home. Remains of chicken from yesterday’s roast, fried with bacon. Melted butter, flour, roux. Milk. White sauce. Add the meat, almost cooked farfalle, seasoning. Sorted.

All I did was go to a carol service. Now this: department work pensions question suspected fraud

Interesting – the Power of FOI: [p]lease would you be able to provide me with a copy of the procedure and guidance followed by the DWP fraud investigators where there is suspected fraud.

dwp fraud docs

Not everything.

too much fraud

Fraudsters driving? Clampdown?

fraud drive

Joint working.

joint working

Sections 46 and 47 of Welfare Reform Act 2007. Legislation.

Explanatory Notes are often a good place to start. Almost readable.

welfare reform act 2007 notes

Rat hole… Section 110A. Administration Act. Which Administration Act?

administration act

Ah – this one:

Original form

Original form only. No amended section (paragraph?) 110. No 110A.

Google. /Social Security Administration 110A/

amended - no original

Amendments to 110A. But no 110A?

Changes.

changes

Searching…

searching for changes

(Ooh – RSS change feed. Could be handy…)

Scroll.

amended

Thanks, John. (John’s bringing legislation up to date and on to the web. Give the man an honour.)

Click the links, in anticipation of the 110A, as introduced. No joy.

Google. Again. Desperate. Loose search. /site:legislation.gov.uk 110A/

fragment

Enough of a clue.

SI footnote

Cmd-f on a Mac (ctrl-f on Windows). Footnote. Small print.

footnote

That looks like a likely candidate…

gone

Gone. Amended out of existence. Replaced.

Maybe that’s why I could only find amendments to 110A. It may not be current, but I’m intrigued. How was it originally enacted?

original enactment

All I did was go to a carol service. All I want to do is be informed about what powers Local Authorities and the DWP have with respect to “quizzing” citizens stopped apparently at random. I just want to be an informed citizen: what powers were available to the Isle of Wight Council and the DWP a week or two ago?

So I’ve tracked down the original 110A, but so what? It’s not currently enacted, is it? That was a sidetrack. An aside. What does 110A allow today, bearing in mind it’s due for repeal (and possibly, prior to that, subject to as yet uncommenced further amendments)?

I guess I could pay a subscription to a legal database to more directly look up the state of the law. (Only I probably wouldn’t have to pay – .ac.uk credential and all that. Member of an academic library and the attendant privileges that go with it. Lessig found that, in a medical context. 11 minutes in. Because. BECAUSE. $435 to you. But then… Table 4. That’s with the privilege of .edu or .ac.uk library access. That’s with the unpaid work of academics running the journals, providing the editorial review services, handing over copyrights (that they possibly handed over to their institutions anyway…) to publishers for free – only not; at public expense, for publicly funded academics and researchers. And for the not-insignificant profit line of the (commercial) academic publishers. As Monbiot suggests.)

But that wouldn’t be right, would it? Ignorance of the law may not be a defence, but it can’t be right that to find out the law I need to pay for access? The legislation.gov.uk team are doing a great job, but as I’m starting to find, the law is oh, so messy, and it needs to be posted right. But I believe that they will get it up to date and they *will* get it all there. (At least give the man an honour when it’s there…)

So where was I? 110A. Going round in circles, or is it a spiral..? Back at s. 46 of the Welfare Reform Act 2007 (sections 46 and 47 of the Welfare Reform Act, as mentioned in the guidance notes on the National Fraud Partnership Agreement).

So what does the legislation actually say?

welfare reform 2007

Right – so now I’m totally confused. This has been repealed… but the repeal has not yet commenced? What’s this s. 109A rathole? And what’s the Welfare Reform Act 2012 all about?

All I did was go to a carol service. And all I want to find out is the bit of legislation that describes the powers the local council and the DWP were acting under when they were “quizzing” motorists a couple of weeks ago.

So 109A – where can I find 109A? Ah – is this an “as currently stands” copy of the Social Security Administration Act 1992 (as amended)?

update

In which case…

109A

And more…

109A-2

But I’m too tired to read it and my battery is about to die. So I’m not really any the wiser.

Nine Lessons from this? Sheesh…

All I did was go to a carol service.

PS why not make a donation to the Earl Mountbatten Hospice? Or your local hospice. In advance.

PPS Via @onthewight, a comment link. PACE (Police and Criminal Evidence Act) 1984, s. 4. Road checks.

road check

S.163 of the Road Traffic Act 1988 appears to be the regulation that requires motorists to stop if so directed by a uniformed police or traffic officer.

Now I’m also wondering: as well as the powers available to the DWP and the local council, by what right and under what conditions were cars being stopped by the police and how were they being selected?

Written by Tony Hirst

December 10, 2012 at 1:31 am

The Chart Equivalent of Comic Sans..?

Whilst looking at the apparently conflicting results from a couple of recent polls by YouGov on press regulation (reviewed in a piece by me over on OpenLearn: Two can play at that game: When polls collide in support of a package on the OU/BBC co-produced Radio 4 programme, More Or Less), my eye was also drawn to the different ways in which the survey results were presented graphically.

The polls were commissioned by The Sun newspaper on the one hand, and the Media Standards Trust/Hacked Off on the other. If you look at the poll data (The Sun/YouGov [PDF] and Media Standards Trust/YouGov [PDF] respectively), you’ll see that it’s reported in a standard format. (I couldn’t find actual data releases, but the survey reports look as if they are generated in a templated way, so getting the core of a generic scraper together for them shouldn’t be too difficult…) But how was that represented to readers (text based headlines and commentary aside)?

Here are a couple of grabs from the Sun’s story (State-run watchdog ‘will gag free press’):

Pie-charts in 3D, with a tilt… gorgeous… erm, not… And the colour choice for the bar chart inner-column text is a bit low on contrast compared to the background, isn’t it?

It looks a bit like the writer took a photo of the print edition of the story on their phone, uploaded it and popped it into the story, doesn’t it?

I guess credit should be given for keeping the risk responses separate in the second image, when they could have just gone for the headline figures as pulled out in the YouGov report:

So what I’m wondering now is the extent to which a chart’s “theme” or style reflects the authority or formal weight we might ascribe to it, in much the same way as different fonts carry different associations? Anyone remember the slating that CERN got for using Comic Sans in their Higgs-Boson discovery announcement (eg here, here or here)?

Things could hardly have been more critical if they had used CrappyGraphs or an XKCD style chart generator (as for example described in Style your R charts like the Economist, Tableau … or XKCD ; or alternatively, XKCD-style for matplotlib).

XKCD - Science It works [XKCD]

Oh, hang on a minute, it almost looks like they did!
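(As an aside, if you actually want that hand-drawn look, matplotlib now bundles an xkcd mode of its own, so it’s only a context manager away – a minimal sketch here, with invented numbers, assuming a reasonably recent matplotlib:)

#Minimal sketch of matplotlib's built-in xkcd styling (made-up poll numbers, purely illustrative)
import matplotlib.pyplot as plt

with plt.xkcd():
    fig, ax = plt.subplots()
    ax.bar(["Support", "Oppose", "Don't know"], [45, 35, 20])
    ax.set_ylabel('% of respondents')
    ax.set_title('A very serious poll result')
    fig.savefig('xkcd_style_poll.png')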

Anyway – back to the polls. The Media Standards Trust reported on their poll using charts that had a more formal look about them:

The chart annotations are also rather clearer to read.

So what, if anything, do we learn from this? That maybe you need to think about chart style, in the same way you might consider your font selection. From the R charts like the Economist, Tableau … or XKCD post, we also see that some of the different applications we might use to generate charts have their own very distinctive, and recognisable, style (as do many Javascript charting libraries). A question therefore arises about the extent to which you should try to come up with your own distinctive (but still clear) style that fits the tone of your communication, as well as its context and in sympathy with any necessary branding or house styling.

PS with respect to the Sun’s copyright/syndication notice, and my use of the images above:

I haven’t approached the copyright holders seeking permission to reproduce the charts here, but I would argue that this piece is just working up to being research into the way numerical data is reported, as well as hinting at criticism and review. So there…

PPS As far as bad charts go, misrepresentations and underhand attempts at persuasion, graphic style, are also possible, as SimplyStatistics describes: “The statisticians at Fox News use classic and novel graphical techniques to lead with data”. See also: OpenLearn – Cheating with Charts.

Written by Tony Hirst

December 3, 2012 at 10:24 am

Quick Shiny Demo – Exploring NHS Winter Sit Rep Data

Having spent a chunk of the weekend and a piece of yesterday trying to pull NHS Winter sitrep data into some sort of shape in Scraperwiki (described, in part, here: When Machine Readable Data Still Causes “Issues” – Wrangling Dates…), I couldn’t help myself last night and had a quick go at using RStudio’s Shiny tooling to put together a quick, minimal explorer for it:

For proof of concept, I just pulled in data relating to the Isle of Wight NHS Trust, but it should be possible to build a more generic explorer: Isle of Wight NHS Sit Rep Explorer Demo.

Three files are used to create the app – a script to define the user interface (ui.R), a script to define the server that responds to UI actions and displays the charts (server.R), and a supporting file that creates variables and functions that are globally available to both the server and UI scripts (global.R).

##wightsitrep2/global.R

#Loading in CSV directly from https seems to cause problems but this workaround seems okay
floader=function(fn){
  temporaryFile <- tempfile()
  download.file(fn,destfile=temporaryFile, method="curl")
  read.csv(temporaryFile)
}

#This is the data source - a scraperwiki API call
#It would make sense to abstract this further, eg allowing the creation of the URL based around a passed in a select statement
u="https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=csv&name=nhs_sit_reps&query=select%20SHA%2CName%2C%20fromDateStr%2CtoDateStr%2C%20tableName%2CfacetB%2Cvalue%20from%20fulltable%20%20where%20Name%20like%20'%25WIGH%25'"

#Load the data and do a bit typecasting, just in case...
d=floader(u)
d$fdate=as.Date(d$fromDateStr)
d$tdate=as.Date(d$toDateStr)
d$val=as.integer(d$value)

##wightsitrep2/ui.R

library(shiny)

tList=levels(d$tableName)
names(tList) = tList

# Define UI for application that plots random distributions 
shinyUI(pageWithSidebar(
  
  
  # Application title
  headerPanel("IW NHS Trust Sit Rep Explorer"),
  
  sidebarPanel(
    #Just a single selector here - which table do you want to view?
    selectInput("tbl", "Report:",tList),
    
    div("This demo provides a crude graphical view over data extracted from",
        a(href='http://transparency.dh.gov.uk/2012/10/26/winter-pressures-daily-situation-reports-2012-13/',
          "NHS Winter pressures daily situation reports"),
        "relating to the Isle of Wight NHS Trust."),
    div("The data is pulled in from a scraped version of the data stored on Scraperwiki",
        a(href="https://scraperwiki.com/scrapers/nhs_sit_reps/","NHS Sit Reps"),".")
    
 ),
  
  #The main panel is where the "results" charts are plotted
  mainPanel(
    plotOutput("testPlot"),
    tableOutput("view")
    
  )
))

##wightsitrep2/server.R

library(shiny)
library(ggplot2)

# Define server logic
shinyServer(function(input, output) {
  
  #Do a simple barchart of data in the selected table.
  #Where there are "subtables", display these using the faceted view
  output$testPlot = reactivePlot(function() {
    g=ggplot(subset(d,fdate>as.Date('2012-11-01') & tableName==input$tbl))
    g=g+geom_bar(aes(x=fdate,y=val),stat='identity')+facet_wrap(~tableName+facetB)
    g=g+theme(axis.text.x=element_text(angle=-90),legend.position="none")+labs(title="Isle of Wight NHS Trust")
    #g=g+scale_y_discrete(breaks=0:10)
    print(g)
  })
  
  #It would probably make sense to reshape the data presented in this table
  #For example, define columns based on facetB values, so we have one row per date range
  #I also need to sort the table by date
  output$view = reactiveTable(function() {
    head(subset(d,tableName==input$tbl,select=c('Name','fromDateStr','toDateStr','tableName','facetB','value')),n=100)
  })
  
})

I get the feeling that it shouldn’t be too hard to create quite complex Shiny apps relatively quickly, drawing on things like Scraperwiki as a remote data source. One thing I haven’t tried is to use googleVis components, which would support, in the first instance at least, a sortable table view… Hmmm…

PS for an extended version of this app, see NHS Winter Situation Reports Shiny Viewer v2

Written by Tony Hirst

November 28, 2012 at 10:32 am

Posted in Data, Infoskills, Rstats


When Machine Readable Data Still Causes “Issues” – Wrangling Dates…

With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give an example of a dataset that is, arguably, released in a machine readable way – as an Excel spreadsheet – but that still requires quite a bit of work to become useful as data; because presumably the intent behind the aforementioned amendment to the FOI Act is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst… And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…

[Note - I'm not meaning to be critical in the sense of "this data is too broken to be useful so don't publish it". My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more "accessible" open data releases published in the future. ]

So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…

The data looks like it’ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.

All well and good… almost…

If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012”), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:

#http://stackoverflow.com/a/1112664/454773
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

The next thing we notice is that some of the date column headings actually specify date ranges, and do so in a variety of styles across the different sheets. For example:

  • 16 – 18/11/2012
  • 16 Nov 12 to 18-NOV-2012
  • 16 to 18-Nov-12

In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:

Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here that if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats in the same way, ideally as a date structure that Python or the Scraperwiki SQLlite database can recognise as such.

[For a library that can automatically reshape this sort of hierarchical tabular data arrangement in R, see Automatic Conversion of Tables to LongForm Dataframes]

The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS THE CORRECT THING TO DO*** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value that I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:

dateString=minimalist_xldate_as_datetime(cellValue,book.datemode).date().strftime("%Y-%m-%d")

In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:

import xlrd
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)

and book.datemode is an attribute of the workbook that records how dates are being represented in the spreadsheet. If the conversion fails, we default to setting dateString to the original value:
dateString=cellValue

The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I look to see if we can split a date range heading into its from and to parts, and then normalise each part as a date:

import time
from time import mktime
import datetime #the module import (rather than 'from datetime import datetime') is needed so that datetime.datetime and the type check below resolve

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    for trials in ["%d %b %y",'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
        try:
            dtf =datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
            break
        except: dtf=d
    if type(dtf) is datetime.datetime:
        dtf=dtf.strftime("%Y-%m-%d")
    return dtf

def patchDate(f,t):
    #Grab the month and year elements from the todate, and add in the from day of month number
    tt=t.split('-')
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f) ])
    return fromdate

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)<2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)<2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
            fromdate=daterange
            todate=daterange
        else:
            fromdate=dd2[0]
            todate=dd2[1]
    else:
        fromdate=dd[0]
        todate=dd[1]
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    todate=dateNormalise(todate)
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    fromdate=dateNormalise(fromdate)
    if len(fromdate)<3:
        fromdate=patchDate(fromdate,todate)
    return (fromdate,todate)

#USAGE:
(fromdate,todate)=dateRangeParse(dateString)
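A quick sanity check against a couple of the header styles listed above (assuming the functions as defined; note that the en-dash in the “16 – 18/11/2012” style means that variant currently falls through untouched unless ‘ – ’ is added to the split patterns):

#Quick check against two of the observed header styles
print(dateRangeParse('16 Nov 12 to 18-NOV-2012'))   #('2012-11-16', '2012-11-18')
print(dateRangeParse('16 to 18-Nov-12'))            #('2012-11-16', '2012-11-18')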

One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.

XKCD - ISO 8601

The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.

When it comes to using the date, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? [This was written under the assumption it was summed daily values over the period, which the PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sum of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?
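For what it’s worth, here’s the sort of helper I have in mind for the “as-if daily” route – a minimal sketch that assumes the ranged value really is a sum over the period (which, as the PS below suggests, is only true for some of the measures):

#Sketch: spread a date-ranged total evenly across the days it covers
#Assumes the value is a sum over the period - NOT true for all measures (see PS below)
from datetime import date, timedelta

def as_if_daily(fromdate, todate, value):
    #fromdate/todate as 'YYYY-MM-DD' strings, as produced above
    f = date(*map(int, fromdate.split('-')))
    t = date(*map(int, todate.split('-')))
    ndays = (t - f).days + 1
    return [(str(f + timedelta(days=i)), value/float(ndays)) for i in range(ndays)]

#eg a weekend total of 12 reported against 16-18 November
print(as_if_daily('2012-11-16', '2012-11-18', 12))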

When I get a chance, I’ll do a couple more posts around this dataset:
- one looking at datashaping in general, along with an example of how I shaped the data in this particular case
- one looking at different queries we can run over the shaped data.

PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:

Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped both the weekly and the aggregated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:

If we look at the weekly sheet we see this:

That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?

Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:

If we look into the aggregated sheet, we get:

Which is to say – the weekly report displayed a single date as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?

And here’s another query:

February data??? I thought we were looking at November data?

Hmmm…

PPS If you’re looking for learning outcomes from this post, here are a few, mostly around the ways in which we need to wrangle sense out of dates:

  1. representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is the most useful sort of representation, even if we end up casting dates into string form);
  2. parsing date ranges into pairs of date represented elements (from and to dates);
  3. where a dataset/spreadsheet contains heterogeneous single date and date range columns, how do we interpret the numbers that appear in the date range column?
  4. shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)

Hmmm….

PPPS Another week, another not-quite-right feature:

another date mixup

PPPPS An update on what the numbers actually mean, from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”

PPPPPS Another week, and this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)

Written by Tony Hirst

November 27, 2012 at 5:55 pm

Posted in Data, Infoskills


More Notes on School Examinations Data

In an earlier post on awarding body market share in UK school examinations, I described an OfQual dataset that listed the number of certificates awarded by certificate name and qualification level by the various awarding bodies. We can use that sort of data to see market share by certificates awarded, but the dataset does not give us any insight into the grades awarded by the different bodies, which might allow us to ask a range of other questions: for example, do any exams appear to be “easier” or “harder” than others, simply based on percentages awarded at each grade by different bodies or within different subject areas (that is the distribution of grades; note that statistical assumptions may be used to tweak grade boundaries, so we need to be careful here about what questions we even think we might be able to ask…).

The Joint Council for Qualifications (JCQ) does publish data relating to the distribution of grades by subject (eg A’levels Summer 2012, GCSE Summer 2012) but as PDF documents rather than data.

Just a quick aside here: WhatDoTheyKnow suggests that the JCQ is not FOIable, although OfQual is, and routinely uses “publicly available information” from the JCQ in the formulation of its own reports; similar data is also published by the Awarding bodies, although again in the form of informally structured tabular data within PDF files. In response to an FOI request I made to OfQual, the following statement appears:

Because this information is already accesible (sic) to you from JCQ and Awarding Organisations it is exempt from disclosure under Section 21 of the Act because the information is already in the public domain.

So, to clarify:

  1. OfQual is an FOIable body that has published some data as information; I don’t know whether it holds the information as data
  2. The data I requested is available as information in the form of PDF documents from two classes of non-FOIable body: the JCQ, a charity; the Awarding Bodies, commercial companies.

Under FOI 2000 s. 11 (as amended by the Protection of Freedoms Act 2012), if you request all or part of a dataset in an electronic form “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.” I don’t know if there are any cases out there arguing the toss about how to interpret this (PDF doc, CSV or SQL dump good, etc? If you know of any, please add a comment…) but I’d argue that the data is not available in that form. So a question that naturally follows is: does this affect the reading of Section 21 of the Act – “Information which is reasonably accessible to the applicant … is exempt information.”? (BTW, this looks handy – FOI Wiki, though I’m not sure it’s being maintained?) Similarly, if a public body publishes a dataset in the form of a PDF document and not as data, can I FOI a request for that information as data notwithstanding that the information is available in a different, less accessible form? Or will they throw s. 21 back at me? [Via a tweet, @paulbradshaw suggests that a request for a machine readable/data version of info released as PDF will typically be satisfied.]

Now where was I..?!

Oh yes… the JCQ data as PDF… well it just so happens that the data is available as data from the Guardian Datastore: GCSE results 2012: exam breakdown by subject, gender and area [data] (I’m not sure if they scraped the A’level results too?). However, the breakdown does not go as far as distributions by award board, and the linkage between subject areas and the certificate titles used in the OfQual dataset is not obvious (there may be mappings in the data documentation/explanatory notes? I haven’t looked.)

Another aside: could the FOIable body point to a scraped dataset published by a third party as evidence that the information is available in a reusable form, even if the reusable format was not published directly by the original FOIable body? That is, if a council publishes data as PDF, and someone scrapes it using Scraperwiki, making it available “as data”, could the council point to the Scraperwiki database as evidence of “[i]nformation which is reasonably accessible to the applicant”? How would they know the data was valid? How about the concrete case here of the JCQ PDF data being scraped by the Guardian Datastore folk and republished as a Google Spreadsheet? And here’s another thought: if I were known to be a demon PDF hacker, would that affect the interpretation of “reasonably accessible”?

If we really wanted to look at distributions of grades by certificate and Awarding Body, we’d probably need to go to the horse’s mouth. So for example, EdExcel grade statistics, AQA results statistics, OCR results stats, CEA Statistics. But again, this data is only available in PDF form, and the companies that publish it aren’t FOIable. (If you’re running – or know of – scrapers grabbing this data, please let me know via the comments). Note that if this “source” data were available, we should be able to check it against the original OfQual data (at least, we should be able to check totals by award board and certificate).

Of course, I could possibly go straight to the OfQual annual market report [PDF] to see market segment breakdowns; but I think that was where I started (the pie charts immediately started to put me off!) – and it’s not really the datajunkie way, is it, seeing reports containing tables and charts and not being able to recreate them?;-)

So what DDJ lessons do we learn from all this? One thing may be that as data goes along a publishing chain, it tends to get summarised, which then limits the sorts of questions you can ask of it. By unpicking the chain, and getting access to ever finer grained data, we get ourselves into a position whereby we should in principle be able to regenerate the summary reports from the next level down; but we may also be faced with trying to reconcile the data or fit it into the categories that are referred to in the original reports. For transparency as reproducibility, what we need is for reports that publish summary data to also publish two other things: 1) the full set of data that was summarised; 2) the formulae used to generate the summaries from that full data set. Of course, it may be that there are multiple summary steps in the chain (report A generates summaries of dataset B, which itself summarises or represents a particular view over a dataset C). In the current example, OfQual publishes data about certificates awarded by each Awarding Body but no grades; JCQ has grade data across awards but no awarding body data (though in some cases we may be able to recreate that – eg where only a single awarding body offers certificates in a particular area); the awarding bodies publish the finest grained data of all – grade distributions by certificate (and rather obviously, this data is at the level of a particular awarding body).

Written by Tony Hirst

November 18, 2012 at 1:46 pm

Posted in Infoskills, Policy

Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix

If you haven’t already seen it, one of the breakthrough visualisations of the US elections was the New York Times Paths to the Election scenario builder. With the F1 drivers’ championship in the balance this weekend, I wondered what the chances were of VET claiming the title this weekend. The only contender is ALO, who is currently ten points behind.

A quick Python script shows the outcome depending on the relative classification of ALO and VET at the end of today’s race. (If the drivers are 25 points apart, and ALO then wins in Brazil with VET out of the points, I think VET will win on countback based on having won more races.)

#The current points standings
vetPoints=255
aloPoints=245

#The points awarded for each place in the top 10; 0 points otherwise
points=[25,18,15,12,10,8,6,4,2,1,0]

#Print a header row (there's probably a more elegant way of doing this...;-)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''

#I'm going to construct a grid, VET's position down the rows, ALO across the columns
for i in range(len(points)):
	#Build up each row - start with VET's classification
	row=[str(i+1)]
	#Now for the columns - that is, ALO's classification
	for j in range(len(points)):
		#Work out the points if VET is placed i+1  and ALO at j+1 (i and j start at 0)
		#Find the difference between the points scores
		#If the difference is >= 25 (the biggest points diff ALO could achieve in Brazil), VET wins
		if ((vetPoints+points[i])-(aloPoints+points[j])>=25):
			row.append("VET")
		else: row.append("?")
	#Print the row a slightly tidier way...
	print '\t'.join(row)

(Now I wonder – how would I write that script in R?)

And the result?

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	?	?	?	?	VET	VET	VET	VET	VET	VET	VET
2	?	?	?	?	?	?	?	?	VET	VET	VET
3	?	?	?	?	?	?	?	?	?	?	VET
4	?	?	?	?	?	?	?	?	?	?	?
5	?	?	?	?	?	?	?	?	?	?	?
6	?	?	?	?	?	?	?	?	?	?	?
7	?	?	?	?	?	?	?	?	?	?	?
8	?	?	?	?	?	?	?	?	?	?	?
9	?	?	?	?	?	?	?	?	?	?	?
10	?	?	?	?	?	?	?	?	?	?	?
11	?	?	?	?	?	?	?	?	?	?	?

Which is to say, VET wins if:

  • VET wins the race and ALO is placed 5th or lower;
  • VET is second in the race and ALO is placed 9th or lower;
  • VET is third in the race and ALO is out of the points (11th or lower)

We can also look at the points differences (define a row2 as row, then use row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))):
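Something along these lines, reusing vetPoints, aloPoints and points from the first script, generates the table below:

#Variant of the grid that tabulates points differences rather than VET/? flags
#(assumes vetPoints, aloPoints and points as defined in the script above)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''
for i in range(len(points)):
	row2=[str(i+1)]
	for j in range(len(points)):
		row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))
	print '\t'.join(row2)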

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	10	17	20	23	25	27	29	31	33	34	35
2	3	10	13	16	18	20	22	24	26	27	28
3	0	7	10	13	15	17	19	21	23	24	25
4	-3	4	7	10	12	14	16	18	20	21	22
5	-5	2	5	8	10	12	14	16	18	19	20
6	-7	0	3	6	8	10	12	14	16	17	18
7	-9	-2	1	4	6	8	10	12	14	15	16
8	-11	-4	-1	2	4	6	8	10	12	13	14
9	-13	-6	-3	0	2	4	6	8	10	11	12
10	-14	-7	-4	-1	1	3	5	7	9	10	11
11	-15	-8	-5	-2	0	2	4	6	8	9	10

We could then do a similar exercise for the Brazil race, and essentially get all the information we need to do a scenario builder like the New York Times election scenario builder… Which I would try to do, but I’ve had enough screen time for the weekend already…:-(

PS FWIW, here’s a quick table showing the awarded points difference between two drivers depending on their relative classification in a race:

A\B	1	2	3	4	5	6	7	8	9	10	11+
1	X	7	10	13	15	17	19	21	23	24	25
2	-7	X	3	6	8	10	12	14	16	17	18
3	-10	-3	X	3	5	7	9	11	13	14	15
4	-13	-6	-3	X	2	4	6	8	10	11	12
5	-15	-8	-5	-2	X	2	4	6	8	9	10
6	-17	-10	-7	-4	-2	X	2	4	6	7	8
7	-19	-12	-9	-6	-4	-2	X	2	4	5	6
8	-21	-14	-11	-8	-6	-4	-2	X	2	3	4
9	-23	-16	-13	-10	-8	-6	-4	-2	X	1	2
10	-24	-17	-14	-11	-9	-7	-5	-3	-1	X	1
11	-25	-18	-15	-12	-10	-8	-6	-4	-2	-1	X

Here’s how to use this chart in association with the previous. Looking at the previous chart, if VET finishes second and ALO third, the points difference is 13 in favour of VET. Looking at the chart immediately above, if we let VET = A and ALO = B, then the columns correspond to ALO’s placement, and the rows to VET. VET (A) needs to lose 14 or more points to lose the championship (that is, we’re looking for values of -14 or less). In particular, ALO (B, columns) needs to finish 1st with VET (A) 5th or worse, 2nd with A 8th or worse, or 3rd with VET 10th or worse.

And the script:

print '\t'.join(['A\B','1','2','3','4','5','6','7','8','9','10','11+'])
#points is the points-per-position list defined in the first script
for i in range(len(points)):
	row=[str(i+1)]
	for j in range(len(points)):
		if i!=j:row.append(str(points[i]-points[j]))
		else: row.append('X')
	print '\t'.join(row)

And now for the rest of the weekend…

Written by Tony Hirst

November 18, 2012 at 12:59 pm

Posted in Infoskills, Tinkering


Using Google Fusion Tables for a Quick Look at GCSE/A’Level Certificate Awards Market Share by Examination Board

On my to do list for some time has been a quick peek at market share in the school exams market – does any one awarding body dominate at a particular level, for example, or within a particular subject area? Or how about dominating a particular subject at a particular level? (If you think this might have anything to do with my idle thought around plurality, you wouldn’t be far wrong…;-) For additional context, see eg House of Commons Education Committee – The administration of examinations for 15–19 year olds in England.)

And so, a couple of months ago, I posted an FOI request to OfQual asking for a copy of relevant data. The request was politely declined at the time on the grounds that the data would soon be available in public anyway. The following question did come to mind though: if the data is public but in PDF (rather than machine readable dataset) form, and I specifically requested machine readable form, presumably an “It’s already publicly available” response wouldn’t wash, given the Protection of Freedoms amendment to FOI that enshrines the right to data in data form? The FOI response also gave a link to the Joint Council for Qualifications (JCQ), although the URL provided doesn’t seem to work now/any more – I’m guessing this: http://www.jcq.org.uk/examination-results/a-levels is the sort of thing they were trying to refer me to? Which is a PDF doc with a load of data tables… Hmm… Looking through that data, I started to wonder about the existence of a more refined dataset, specifically one that, for each qualification body, shows the number of people who took a particular qualification at a particular level and the breakdown of grades awarded. In this way, we could look to see whether one board appeared to be “easier” than another in terms of the distribution of grades awarded within a particular qualification by a particular board.

Anyway… it now being after October 19th, the date by which the data was due to be released, I went to the OfQual site to find the data. The site appears to have had a redesign and I couldn’t find the data anywhere… Using the URL I’d discovered on the old site and included in my FOI request – http://www.ofqual.gov.uk/standards/statistics/general-data/ – I found the following “no but, yes but”, blink and you’ll take it for a 404, page, which has the page title (that appears in a browser tab) of Page not found - OfQual:

If you click through and visit the page on the “old” site – http://www2.ofqual.gov.uk/standards/statistics/general-data/ – you can get to a copy of the data… I also popped a copy onto Google Fusion Tables.

(If you can work out where on the new site this data is, along with a protocol/strategy for finding it from: 1) the OfQual homepage, and 2) Google, I’d appreciate it if you’d post a hint or two in the comments;-)

Using the “new look” interface to Google Fusion Tables, here are a few examples of different views we might generate over the data to get a feel for relative market shares.

To start with, the data looks something like this:

The first thing I did was to duplicate the data, and then collapse the results columns relating to years other than 2012 (for now, I’m just interested in 2012 numbers to get a feel for current market shares):

Here’s the result:

We’re not really interested in rows where no certificates were awarded in 2012, so we can filter those rows out:

This is just like selecting a numeric facet for a column in OpenRefine. The Fusion Tables panel that pops up, though, is perhaps not quite as, erm “refined” as the panel in OpenRefine (which shows a range slider). Instead, in Fusion Tables, we are presented with two limit boxes – the one on the left sets an inclusive lower boundary on the value displayed (a “greater than or equal to” limit) and the one on the right an inclusive upper bound (a “less than or equal to” limit). To view rows where the certs2012 value is greater than 0, we put a 1 in the lower bound box and leave the upper bound empty (which is to say, we filter to allow through values >=1).

So where are we at? We now have a set of data that describes for 2012 how many awards were made by each board for each GCE/GCSE certificate they offer. So what? What I was interested in was the market share for each board. The Summary view provides us with a straightforward way of doing this:

What we want to do is sum up the certificates awarded by each board:

This gives a report of the form:

So for example, we can see from the summary table that is generated that AQA awarded (?) almost three and a half million certificates, and OCR just over 1.5 million.

Knowing the total number of certificates awarded by each board is one thing, but the resolution isn’t great because it mixes levels – if one board dominated A’levels and another GCSEs, the order(s) of magnitude more people taking GCSEs would mask the dominant share at A’level, where far fewer certificates are awarded.

To summarise the certificates awarded by each board at each qualification level, we can refine the Summary view:

Here’s what this particular summary view looks like – note that we can sort the rows according to the values in a particular column:

If you are more interested in looking at market share across a particular subject area, we can use a Filter to limit the search results:

The filter panel contains the different factor levels (R), or text facets (OpenRefine) of the elements contained within the selected column.

As well as filtering by a particular facet value, we can also filter results based on a full text string match (I don’t think Boolean search is possible – just a literal string match on whatever appears in the search box):

To look at data for a particular qualification, we can bring in another filter from the Filter menu, or we can “find” particular values:

This is very much like the text facet view in Google Refine – here’s what the filtered, summarised and found view looks like:

As far as working out dominant market share, we haven’t really got very far – the above data conversation suggests that there is a whole load of context we need to be clear about when we count up the number of certificates awarded by each body and then compare them (are we making subject based comparisons, level based comparisons, subject and level based comparisons, etc.) What we do have, though, is a conversational strategy for starting to ask particular questions of the data. For example, how do the boards compare in the award of Math (string fragment that should match both Maths and Mathematics in a certificate title) across the levels:

And at A’level?

This is fine insofar as it goes, but it would be a bit laborious trying to get a macroscopic view for market shares over all subject areas and levels separately… To do that, I’d probably opt for another tool with powerful support for grouping and visualisation. R maybe…?;-) But that’ll have to wait for another post…
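(As a placeholder for that post, and purely to give a flavour of the sort of grouping I have in mind, here’s a sketch in Python/pandas – R would do just as well – where the column names other than certs2012 are my guesses at headers rather than the actual ones, and the filename is hypothetical:)

#Sketch of the same summarisation off-platform (guessed column names, hypothetical filename)
import pandas as pd

df = pd.read_csv('ofqual_certificates.csv')      #a local copy of the OfQual dataset
df = df[df['certs2012'] >= 1]                    #drop rows with no 2012 awards

#Total certificates awarded by each board, then by board within each qualification level
print(df.groupby('Board')['certs2012'].sum())
print(df.groupby(['Board', 'Level'])['certs2012'].sum())

#Market share within certificates whose title mentions "Math"
maths = df[df['Certificate'].str.contains('Math', case=False, na=False)]
print(maths.groupby(['Board', 'Level'])['certs2012'].sum())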

PS You’ll notice that I haven’t actually made any comments about which boards have what market share…And that’s part of the point of OUseful.info as a “howto” resource – it’s a place for coming up with questions and strategies for starting to answer them, as well as sharing process ideas rather than any particular outcomes from applying those processes…

Written by Tony Hirst

November 16, 2012 at 2:07 pm

Sketching With OpenCorporates – Fragmentary Notes in Context of Thames Water Corporate Sprawl

The Observer newspaper today leads with news of how the UK’s water companies appear to be as, if not more, concerned with running tax efficient financial engines as they are maintaining the water and sewerage network. Using a recipe that’s probably run its course (which is to say – I have some thoughts on how to address some of its many deficiencies) – Corporate Sprawl mapping – I ran a search on OpenCorporates for mentions of “Thames Water” and then plotted the network of companies as connected by directors identified through director dealings also indexed by OpenCorporates:

With the release of the new version 0.2 of the OpenCorporates API, I notice that information regarding directors is now addressable, which means that we should be able to pivot from one company, to its directors, to other companies associated with that director…

To get a feel for what may be possible, let’s run a search on /Thames Water/, and then click through on one of the director links – we can see (through the search interface), records for individual corporate officers, along with sidebar links to similarly named officers (with different officer IDs):

(At this point, I don’t know the extent to which the API reconciles these individual references, if at all – I’m still working my way through the web search interface…)

Let’s assume for a moment that the similarly named individuals in the sidebar are the same as the person whose officer record we are looking at. We notice that as well as Thames Water companies, other companies are listed that would not be discovered by a simple search for /Thames Water/ – INNOVA PARK MANAGEMENT COMPANY LIMITED, for example. (Note that we can also see dates the directorial appointments were valid, which means we may be able to track the career of a particular director; FWIW, offering tools to support ordering directors by date of appointment, or using something resembling a Gantt chart layout, may help patterns jump out of this data…?)

Innova Park Management Company Ltd may or may not have anything to do with Thames Water of course, but there are a couple of bits of evidence we can look for to see whether it is likely that it is part of the Thames Water corporate sprawl using just OpenCorporates data: firstly, we might look to see if this company concurrently shares several directors with Thames Water companies; secondly, we might check its registered address:

(In this case, we also note that /Thames Water/ appears in the previous name of Innova Park Management Company Ltd (something I think that the OpenCorporates search does pick up on?).)

One of the things I’ve mentioned to Chris Taggart before is how geocoding company addresses might give us a good way in to finding colocated companies. One reason why this might be useful is that it might be able to show how companies evolve through different times and yet remain registered at the same address. It also provides a possible way in to sprawl mapping if many of the sprawl companies are registered at the same address at the same time (though there may be other reasons for companies being registered at the same address: companies may be registered by an accountancy or legal firm, for example, that offers registered address services; or be co-located in a particular building. But for investigations, this may also be useful, for example in cases of tracking down small companies serviced by, erm, creative accountants…)

(By the by, this Google Refine/OpenRefine tutorial contains a cunning trick – geocode addresses using Google maps to get lat/long coordinates, then use a scatterplot facet to view lat/long grid and select rectangular regions within it – that is, it gives you an ad hoc spatial search function… very cunning;-)

Note to self: I think I’ve pondered this before – certainty factors around the extent to which two companies are part of the same sprawl, or two similarly named directors are the same person. Something along the lines of:

- corporate sprawl: certainty = f( number_of_shared_directors, shared_address, similar_words_in_company_name, ...)
- same person (X, Y): certainty related to number of companies both X and Y are directors of that share other directors, share same address, share similar company name.
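Purely as an illustration of the shape such a function might take (the weights and the 0–1 scaling below are plucked out of the air, not derived from anything):

```python
# Toy 'certainty factor' sketches - the weights are invented for illustration only.
def sprawl_certainty(shared_directors, shared_address, name_overlap):
    """Rough confidence that two companies belong to the same sprawl.

    shared_directors: number of concurrently shared directors
    shared_address:   True if both are registered at the same address
    name_overlap:     fraction of significant words shared by the company names
    """
    score = min(shared_directors, 3) * 0.25   # cap the director contribution
    score += 0.15 if shared_address else 0.0
    score += 0.10 * name_overlap
    return min(score, 1.0)

def same_person_certainty(shared_linked_companies, shared_address, name_overlap):
    """Rough confidence that two similarly named officers are the same person.

    shared_linked_companies: companies both are directors of that also share
                             other directors
    """
    score = 0.3 * min(shared_linked_companies, 2)
    score += 0.2 if shared_address else 0.0
    score += 0.2 * name_overlap
    return min(score, 1.0)
```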

If we quickly look at the new OpenCorporates API, we see that there are a couple of officer-related calls: GET officers/search and GET officers/:id.

Based on the above ‘note to self’, I wonder if it’d be useful to code up a recipe that takes an officer ID, fetches the name of the director, runs a name search, then tries to assign a likelihood that each person in the returned set of search results is the same as the person whose ID was supplied in the original lookup? This is the sort of thing that Google Refine reconciliation API services offer, so maybe this is already available via the OpenCorporates reconciliation API?
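For what it's worth, here's a rough sketch of how that recipe might look against the officers calls mentioned above. Again, the response key names ("results"/"officer", "results"/"officers") and the officer fields ("id", "name", "position") are my assumptions about the v0.2 format, and the scoring is just a placeholder for something like the certainty factors sketched earlier:

```python
# Sketch: given an officer ID, search on the officer's name and crudely score
# how likely each hit is to be the same person. Key names are assumptions.
import requests

API = "https://api.opencorporates.com/v0.2"

def get_officer(officer_id):
    r = requests.get("%s/officers/%s" % (API, officer_id))
    r.raise_for_status()
    return r.json()["results"]["officer"]

def search_officers(name):
    r = requests.get(API + "/officers/search", params={"q": name})
    r.raise_for_status()
    return [hit["officer"] for hit in r.json()["results"]["officers"]]

def candidate_matches(officer_id):
    """Return (candidate, crude_score) pairs, best guesses first."""
    seed = get_officer(officer_id)
    scored = []
    for candidate in search_officers(seed["name"]):
        if candidate["id"] == seed["id"]:
            continue  # skip the record we started from
        score = 0.0
        if candidate["name"].lower() == seed["name"].lower():
            score += 0.5  # exact name match
        if candidate.get("position") == seed.get("position"):
            score += 0.2  # same role is a weak supporting signal
        scored.append((candidate, score))
    return sorted(scored, key=lambda pair: -pair[1])
```

A proper version would fold in the shared-company and shared-address evidence, which is essentially what a reconciliation service would be doing under the hood.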

PS I use the phrase “corporate sprawl” to refer to something similar to what OpenCorporates’ user-curated corporate_groupings capture. One thing that interests me is the extent to which we can build tools to automatically make suggestions about corporate_grouping membership.

PPS Running the scraper, I noticed that Scraperwiki have a job opening for a “data scientist”.

Written by Tony Hirst

November 11, 2012 at 1:36 pm

Posted in Infoskills


Sketched Thoughts On Data Journalism Technologies and Practice

Over the last year or two, I’ve given a handful of talks to postgrad and undergrad students broadly on the topic of “technology for data driven journalism”. The presentations are typically uncompromising, which is to say I assume a lot. There are many risks in taking such an approach, of course, as waves of confusion spread out across the room… But it is, in part, a deliberate strategy intended to shock people into an awareness of some of the things that are possible with tools that are freely available for use in the desktop and browser based sheds of today’s digital tinkerers… Having delivered one such presentation yesterday, at UCA, Farnham, here are some reflections on the whole topic of “#ddj”. Needless to say, they do not necessarily reflect even my opinions, let alone those of anybody else;-)

The data-driven journalism thing is being made up as we go along. There is a fine tradition of computer assisted journalism, database journalism, and so on, but the notion of “data driven journalism” appears to have rather more popular appeal. Before attempting a definition, what are some of the things we associate with ddj that might explain the recent upsurge of interest around it?

  • access to data: this must surely be a part of it. In one version of the story we might tell, the arrival of Google Maps, and the reverse engineering of an API to it by Paul Rademacher for his April 2005 “Housing Maps” mashup, opened up people’s eyes to the possibility of map-based mashups; a short while later, in May 2005, Adrian Holovaty’s Chicago Crime map showed how the same mashup idea could be used as an example of “live”, automated and geographically contextualised reporting of crime data. Mashups were all about appropriating web technologies and web content, building new “stuff” from pre-existing “stuff” that was already out there. And as an idea, mashups became all the rage way back then, offering as they did the potential for appropriating, combining and re-presenting elements of different web applications and publications without the need for (further) programming.
    In March 2006, a year or so after the first demonstration of the Housing Maps mashup, and in part as a response to the difficulty of getting hold of the latitude and longitude data required to build Google Maps mashups around British locations, the Guardian Technology supplement (remember that? It had Kakuro puzzles and everything?!;-) launched the “Free Our Data” campaign (history). This campaign called for the free release of data collected at public expense, such as the data that gave the latitude and longitude for UK postcodes.
    The early promise of, and popular interest in, “mashups” waxed, and then waned; but there was a new tide rising in the information system that is the web: access to data. The mashups had shown the way forward in terms of some of the things you could do if you could wire different applications together, but despite the promise of no programming it was still too techie, too geeky, too damned hard and fiddly for most people; and despite what the geeks said, it was still programming, and there often still was coding involved. So the focus changed. Awareness grew about the sorts of “mashup” that were possible, so now you could ask a developer to build you “something like that”, as you pointed to an appropriate example. The stumbling block now was access to the data to power an app that looked like that, but did the same thing for this.
    For some reason, the notion of “open” public data hit a policy nerve, and in the UK, as elsewhere, started to receive cross-party support. (A brief history of open public data in a UK context is illustrated in the first part of Open Standards and Open Data.) The data started to flow, or at least, started to become both published (through mandated transparency initiatives, such as the release of public accounting data) and requestable (for example, via an extension to FOI by the Protection of Freedoms Act 2012).
    We’ve now got access in principle and in practice to increasing amounts of data, we’ve seen some of the ways in which it can be displayed and, to a certain extent, started to explore some of the ways in which we can use it as a source for news stories. So the time is right in data terms for data driven journalism, right?
  • access to visualisation technologies: it wasn’t very long ago that it was still really hard to display data on screen using anything other than canned chart types – pie charts, line charts, bar charts (that is, the charts you were introduced to in primary school. How many chart types have you learned to read, or create, since then?). Spreadsheets offer a range of grab-and-display chart generating wizards, of course, but they’re not ideal when working with large datasets, and they’re typically geared towards generating charts for reports, rather than being used analytically. The visual analysis mantra – overview first, zoom and filter, then details-on-demand – (coined in Ben Shneiderman’s 1996 paper The Eyes Have It, I think?) arguably requires fast computers and big screens to achieve the levels of responsiveness needed for interactive use, and we have those now…

There are, however, still some considerable barriers to access:

  • access to clean data: you might think I’m repeating myself here, but access to data and access to clean data are two separate considerations. A lot of the data that’s out there and published is still not directly usable (you can’t just load it into a spreadsheet and work on it directly); things that are supposed to match often don’t (we might know that Open Uni, OU and Open University refer to the same thing, but why should a spreadsheet?); number columns often contain things that aren’t numbers (such as commas or other punctuation); dates are provided in a wide variety of formats that we can recognise as such, but a computer can’t – at least, not unless we give it a bit of help; data gets misplaced across columns; character encodings used by different applications and operating systems don’t play nicely; typos proliferate; and so on. So whose job is it to clean the data before it can be inspected or analysed? (A minimal example of this sort of cleaning appears after this list.)
  • access to skills and workflows: engineering practice tends to have a separation between the notion of “engineer” and “technician”. Over-generalising and trivialising matters somewhat, engineers have academic training, and typically come at problems from a theory dominated direction; technicians (or technical engineers) have the practical skills that can be used to enact the solutions produced by the engineers. (Of course, technicians can often suggest additional, or alternative, solutions, in part reflecting a better, or more immediate, knowledge about the practical considerations involved in taking one course of action compared to another.) At the moment, the demarcation of roles (and skills required at each step of the way) in a workflow based around data discovery, preparation, analysis and reporting is still confused.
  • what questions should we ask? If you think of data as a source, with a story to tell: how do you set about finding that source? Why do you even think you want to talk to that source? What sorts of questions should you ask that source, and what sorts of answer might you reasonably expect it to provide you with? How can you tell if that source is misleading you, lying to you, hiding something from you, or is just plain wrong? To what extent do you, or should you, trust a data source? Remember, every cell in a spreadsheet is a fact. If you have a spreadsheet containing a million data cells, that’s a lot of fact checking to do…
  • low or misplaced expectations: we don’t necessarily expect journalism students to know how to drive a spreadsheet, let alone run or apply complex statistics, or even have a great grasp on “the application of number”; but should they? I’m not totally convinced we need to get them up to speed with yesterday’s tools and techniques… As a tool builder/tool user, I keep looking for tools and ways of using tools that may be thought of as emerging “professional” tools for people who work with data on a day-to-day basis, but wouldn’t class themselves as data scientists, or data researchers; tools for technicians, maybe. When presenting tools to students, I try showing the tools that are likely to be found on a technician’s workbench. As such, they may look a little bit more technical than tools developed for home use (compare a socket set from a trade supplier with a £3.50 tool-roll bargain offer from your local garage), but that’s because they’re quality tools that are fit for purpose. And, being quality tools, it may take a bit of care, training and effort to learn how to use them. But I thought the point was to expose students to “industry-strength” ideas and applications? And in an area where tools are developing quite quickly, students are exactly the sort of people we need to start engaging with these tools: 1) at the level of raising awareness about what the tools can do; 2) as a vector for knowledge and technology transfer, getting the tools (or at least, ideas about what they can do) out into industry; 3) for students so inclined, recruiting them for the further development of the tools, recruiting power users to help drive requirements for future iterations, and so on. If the journalism students are going to be the “engineers” to the data wrangler technicians, it’ll be good for them to know the sorts of things they can reasonably ask their technicians to help them to do… Which is to say, the journalists need exposing to the data wrangling factory floor.
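(As promised above, here's a minimal illustration of the sort of cleaning chores the "clean data" item describes, sketched in Python/pandas; the column names and values are made up for the example:)

```python
# Illustrative cleaning steps with pandas; the data here is invented.
import pandas as pd

df = pd.DataFrame({
    "Supplier": ["Open Uni", "OU", "Open University"],
    "Amount":   ["1,200", "950", "2,500"],
    "Date":     ["01/03/2012", "2012-03-05", "5 March 2012"],
})

# Number columns containing things that aren't numbers: strip the commas, convert
df["Amount"] = pd.to_numeric(df["Amount"].str.replace(",", ""))

# Dates in a variety of formats: let the parser try; anything it can't handle
# becomes NaT rather than silently wrong
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")

# Things that should match but don't: map name variants to a canonical form
df["Supplier"] = df["Supplier"].replace({"OU": "Open University",
                                         "Open Uni": "Open University"})

print(df)
```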

Although a lot of the #ddj posts on this OUseful.info blog relate to tools, the subtext is all about recognising data as a medium, the form particular datasets take, and the way in which different tools can be used to work with these forms. In part this leads to a consideration of the process questions that can be asked of a data source based on identifying natural representations that may be contained within it (albeit in hidden form). For example, a list of MPs hints at a list of constituencies, which have locations, and therefore may benefit from representation in a geographical, map based form; a collection of emails might hint at a timeline based reconstruction, or network analysis showing who corresponded with whom (and in what order), maybe?

And finally, something that I think is still lacking in the formulation of data journalism as a practice is an articulation of the process of discovering the stories from data: I like the notion of “conversations with data” and this is something I’ll try to develop over forthcoming blog posts.

PS see also @dkernohan’s The campaigning academic?. At the risk of spoiling the punchline (you should nevertheless go and read the whole thing), David writes: “There is a space – in the gap between academia and journalism, somewhere in the vicinity of the digital humanities movement – for what I would call the “campaigning academic”, someone who is supported (in a similar way to traditional research funding) to investigate issues of interest and to report back in a variety of accessible media. Maybe this “reporting back” could build up into equivalence to an academic reward, maybe not.

These would be cross-disciplinary scholars, not tied to a particular critical perspective or methodology. And they would likely be highly networked, linking in both to the interested and the involved in any particular area – at times becoming both. They might have a high media profile and an accessible style (Ben Goldacre comes to mind). Or they might be an anonymous but fascinating blogger (whoever it is that does the wonderful Public Policy and The Past). Or anything in between.

But they would campaign, they would investigate, they would expose and they would analyse. Bringing together academic and old-school journalistic standards of integrity and verifiability.”

Mixed up in my head – and I think in David’s – is the question of “public accounting”, as well as sensemaking around current events and trends, and the extent to which it’s the role of “the media” or “the academic” to perform such a function. I think there’s much to be said for reimagining how we inform and educate in a network-centric web-based world, and it’s yet another of those things on my list of things I intend to ponder further… See also: From Academic Privilege to Consultations as Peer Review.

Written by Tony Hirst

November 6, 2012 at 2:39 pm

Posted in Infoskills, onlinejournalismblog

