Enter the Market – Course Data

I’m not at Dev8Ed this week, though I probably should be, but here’s what I’d have probably tinkered with had I gone – a recipe for creating a class of XCRI (course marketing data) powered websites to support course choice, themed in a variety of ways, that could be used to ruthlessly and shamelessly exploit any and every opportunity for segmenting audiences and fragmenting different parts of the market for highly targeted marketing campaigns. So for example:

  • let’s start with something easy and obvious: russelgroupunis.com (sic;-), maybe? Search for courses from Russell Group (research intensive) universities on a conservatively branded site, lots of links to research inspired resources, pre-emptively posted reading lists (with Amazon affiliate codes attached); then bring in a little competition, and set this site up as a Waitrose to the Sainsburys of 1994andallthat.com, a course choice site based around the 1994 Group Universities (hmmm: seems like some of the 1994 Group members are deserting and heading off to join the Russell Group?); worthamillionplus.com takes the Tesco ads for the Million+ group, maybe, and unireliance.com (University Alliance) the Morrisons(?) traffic. (I have no idea if these uni group-supermarket mappings work? What would similarly tongue-in-cheek broadsheet/tabloid mappings be I wonder?!). If creative arts are more your thing, there could be artswayforward.com for the UKIAD folk, perhaps?
  • there are other ways of segmenting the market, of course. University groupings organise universities from the inside, looking out, but how about groupings based on consumers looking in? At fiveAgrades.com, you know where the barrier is set, as you do with 9kQuality.com, whereas cheapestunifees.com could be good for bottom-of-the-market SEO. wetakeanyone.com could help at clearing time (courses could be identified by looking at grade mappings in course data feeds), as could the slightly more upmarket universityclearingcourses.com. And so on.
  • National Student Survey data could also play a part in automatically partitioning universities into different verticals, maybe in support of FTSE-30 like regimes where only courses from universities in the top 30 according to some ranking scheme or other are included. NSS data could also power rankings of course. (Hmm… did I start to explore this for Course Detective? I don’t remember…Hmmm…)

The intention would be to find a way of aggregating course data from different universities onto a common platform, and then to explore ways of generating a range of sites, with different branding, and targeted at different markets, using different views over the same aggregated data set but similar mechanics to drive the sites.

PS For a little inspiration about building course comparison websites based around XCRI data, NSS data and KIS data, it may be worth looking at how the NHS does it (another UK institution that’s hurtling towards privatisation…): for example, check out NHS Choices hospitals near you service, or alternatively compare GPs.

PPS If anyone did start to build out a rash of different course comparison sites on a commercial basis, you can bet that as well as seeking affiliate fees for things like lead generation (prospectuses downloaded/mailed, open day visits booked (in exchange for some sort of ‘discount’ to the potential student if they actually turn up to the open day), registrations/course applications made, etc.), advertising would play a major role in generating site revenue. If a single operator was running a suite of course choice sites, it would make sense for them to look at how cross-site exploitation of user data could be used to track users across sites and tune offerings for them. I suspect we’d also see the use of paid placement on some sites (putting results at the top of a search results listing based on payment rather than a more quality-driven ranking algorithm), recreating some of the confusion of the early days of web search engines.

I suspect there’d also be the opportunity for points-make-prizes competitions, and other giveaways…

Or like this maybe?

Ahem…

[Disclaimer: the opinions posted herein are, of course, barely even my own, let alone those of my employer.]

Dashboard Views as Data Source Directories: Open Data Communities

Publishing open data is one thing, reusing it quite another. Firstly, you’re faced with a discovery problem – finding a reliable source of the data you need. Secondly, you need to actually find a way of getting a copy of the data you need into the application or tool you want to use it with. Whilst playing around with the Open Data Communities Local Authority Dashboard, a recently launched user facing view over a wealth of Linked Data published by the Department for Communities and Local Government (DCLG) on the OpenDataCommunities website (New PublishMyData Features: Parameterised and Named Queries), I noticed that they provide a link to the data source for each “fact” on the dashboard:

One of the ideas I keep returning to is that it should be possible to “View Source” on a chart or data report to see the route back, via a query, to the dataset from whence it came:

So it’s great to see the Local Authority Dashboard doing just this by exposing the SPARQL query used to return the data from the Open Data Communities datastore:

You can also run the query to preview its output:

Conveniently, a permalink is also provided to the query:

http://opendatacommunities.org/sparql/spend-per-category-per-household?authority=http%3A%2F%2Fopendatacommunities.org%2Fid%2Funitary-authority%2Fisle-of-wight&service_code=490

This is actually an example of a “Named Query” that the platform provides in the form of a parameterised ‘shortcut’ URL – changing the authority name and/or service code allows you to use the same base URL pattern to get back finance data relating, in this case, to other authorities and/or service codes as required.

The query view is also editable, which means you can use the exposed query as a basis for writing your own queries. Once customised, queries can be called programmatically via a GET request of the form

http://opendatacommunities.org/sparql.format?query=URL-encoded-SPARQL-query

Custom queries can also support user defined parameter values by including %{tokens} in the original SPARQL queries, and providing values for the tokens on the URL query string:

http://opendatacommunities.org/sparql.format?query=URL-encoded-SPARQL-query&token1=value-for-token1&token2=value-for-token2

As well as previewing the output of a query, we can generate a variety of output formats from a tweak to the URL (add .suffix before the ?), including JSON:

{
  "head": {
    "vars": [ "spend_per_household" ]
  } ,
  "results": {
    "bindings": [
      {
        "spend_per_household": { "datatype": "http://www.w3.org/2001/XMLSchema#decimal" , "type": "typed-literal" , "value": "115.838709677419354838709677" }
      }
    ]
  }
}

XML:

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="spend_per_household"/>
  </head>
  <results>
    <result>
      <binding name="spend_per_household">
        <literal datatype="http://www.w3.org/2001/XMLSchema#decimal">115.838709677419354838709677</literal>
      </binding>
    </result>
  </results>
</sparql>

and CSV:

spend_per_household
115.838709677419354838709677
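
For reference, here’s a minimal Python sketch (using the requests library) of pulling the named query results programmatically. It assumes the same .suffix convention applies to the named query permalink as to the main SPARQL endpoint, and it reuses the authority and service code from the Isle of Wight example above:

import requests

# Named query permalink from above, with a .json suffix added before the ?
# (assumption: .csv and .xml should select the other formats in the same way).
url = "http://opendatacommunities.org/sparql/spend-per-category-per-household.json"
params = {
    "authority": "http://opendatacommunities.org/id/unitary-authority/isle-of-wight",
    "service_code": "490",
}

resp = requests.get(url, params=params)
resp.raise_for_status()

# Standard SPARQL JSON results layout: head.vars names the columns,
# results.bindings holds one dict per result row.
data = resp.json()
for row in data["results"]["bindings"]:
    print(row["spend_per_household"]["value"])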

Having access to the data in this form means we can then pull it into something like a Google Spreadsheet. For example, we can use the =importData(URL) formula to pull in CSV data from the linked query URL:

And here’s the result:

Note: it might be quite handy to be able to suppress the header in the returned CSV so that we could use the =importData() formula directly to pull actual values into particular cells, as described, for example, in Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet and Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula). This loss of metadata from the query response is potentially risky, although I would argue that the loss of context about what the data relates to is mitigated if we treat the “unpacked” named query (i.e. the SPARQL query it aliases) and the returned data as a single unit.

This ability to see the data, then get the data (or “See the data in context – then get the data you need”) is really powerful I think, and offers a way of providing direct access to data via a contextualised view fed from a trusted source.

Using Aggregated Local Council Spending Data for Reverse Spending (Payments to) Lookups

Knowing how much a council spent on this or that activity may contribute to transparency, but it also provides us with an opportunity for tracking the extent to which a council may provide services to other councils in exchange for payment (a “reverse spend”, if you will…)

In Inter-Council Payments and the Google Fusion Tables Network Graph, I demonstrated a recipe for graphing the extent to which county councils made payments to each other using data scraped from OpenlyLocal. But how about if we stick with a tabular view, and just work directly with the data contained in the Scraperwiki database?

One of the nice things about Scraperwiki is that it provides us with API access to the datatables in each scraper. Here’s the API for my “local spend” scraper:

We can use the API explorer to generate URLs to HTML tables containing the results of a query, or JSON or CSV feeds of the results. We can also preview the result of a query:

So what sorts of query can we run? As the database behind Scraperwiki scrapers is SQLite, the way in is through SQL queries. SQL (Structured Query Language) is a query language for making very powerful searches over database tables (it also provides the basis for queries that treat Google spreadsheets or Google Fusion Tables as a database).

Here’s a sample of some queries we can run over the local spending data in my Scraperwiki scraper. The table I’m querying (think of a table as being like a particular worksheet in a spreadsheet) is publicMesh, where I have aggregated data from OpenlyLocal relating to the total spend made to various public bodies from other public bodies. supplier is the entity that received a sum from another body; payer and supplyingTo both relate to the entity that paid another body. (I guess “supplier” is not necessarily always the right term, e.g. when a payer is making a grant or award payment?)

The queries are structured as follows: select oneThing, anotherThing from 'table' says “grab the columns ‘oneThing’ and ‘anotherThing’ from the database table ‘table'”; where item relation condition lets us limit the results to those where the value of the ‘item’ column meets some condition (such as total > 1000, which only selects results where the corresponding value in the total column is greater than 1000; or thing like ‘%term%’, which searches for rows where the contents of the ‘thing’ column contain ‘term’ (the % acts as a wildcard character)). There’s a minimal Python sketch showing how to run one of these queries via the Scraperwiki API after the list below.

  • select supplier,supplyingTo,total from `publicMesh` where normsupplier like '%county hampshire%' and normsupplyingTo like '%arts%' order by total desc – look (crudely) for Arts Council payments to Hampshire County Council (result)
  • select supplier,supplyingTo,total from `publicMesh` where normsupplier like '%county hampshire%' and normsupplyingTo like '%county%' order by total desc – see which other County Councils Hampshire is providing service to (result)
  • select supplier,supplyingTo,total from `publicMesh` where normsupplier like '%county hampshire%' and normsupplyingTo like '%borough%' order by total desc – alternatively, what councils with ‘Borough’ in their name has Hampshire County Council received funding from? (result)
  • select supplyingTo,sum(total) as amount,count(payer) from `publicMesh` where normsupplier like '%county hampshire%' group by payer order by amount desc – if you click through on the “arts” link above, you’ll see that the Arts Council makes various payments to different entities associated with Hampshire County Council (Hampshire County Council; Arts Service, Hampshire County Council; Hampshire County Council – Schools Landscape Programme; Hampshire County Council Music Service). It’s possible to aggregate the separate totals for each of these under a single “Hampshire County Council” banner (though note that it may not always make sense to do this sort of grouping operation). The sum operator adds together (sums) all the totals from items that are grouped together by the group by payer statement, which bundles together items with the same payer; count(payer) counts just how many lines from the same payer are grouped together (result)
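
By way of a worked example, here’s a minimal Python sketch of running one of the queries above against the Scraperwiki datastore API. The endpoint URL and the scraper short name below are placeholders on my part – copy the exact values from the API explorer for the scraper you’re interested in:

import requests

# Scraperwiki datastore API endpoint (assumed - check the API explorer).
API_URL = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"

# Which other county councils is Hampshire County Council providing services to?
query = """
select supplier, supplyingTo, total from `publicMesh`
where normsupplier like '%county hampshire%'
  and normsupplyingTo like '%county%'
order by total desc
"""

resp = requests.get(API_URL, params={
    "format": "jsondict",   # one dict per result row
    "name": "local_spend",  # scraper short name (placeholder)
    "query": query,
})
resp.raise_for_status()

for row in resp.json():
    print(row["supplier"], row["supplyingTo"], row["total"])

Swapping the query string for any of the other queries in the list gives you the corresponding JSON feed (or CSV/HTML, via the format parameter).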

Hmm…methinks there may be an opportunity here for a tutorial on writing SQL queries…? Maybe this would be a good context for a pathway on SocialLearn…?

Sketching Substantial Council Spending Flows to Serco Using OpenlyLocal Aggregated Spending Data

An article in today’s Guardian (Serco investigated over claims of ‘unsafe’ out-of-hours GP service) about services provided by Serco to various NHS Trusts got me thinking about how much local councils spend with Serco companies. OpenlyLocal provides a patchy(?) aggregating service over local council spending data (I don’t think there’s an equivalent aggregator for NHS organisations’ spending, or police authority spending?) so I thought I’d have a quick peek at how the money flows from councils to Serco.

If we search the OpenlyLocal Spending Dashboard, we can get a summary of spend with various Serco companies from local councils whose spending data has been logged by the site:

Using the local spend on corporates scraper I used to produce Inter-Council Payments Network Graph, I grabbed details of payments to companies returned by a search on OpenlyLocal for suppliers containing the keyword serco, and then generated a directed graph with edges defined: a) from council nodes to company nodes; b) from company nodes to canonical company nodes. (Where possible, OpenlyLocal tries to reconcile companies identified for payment by councils with canonical company identifiers so that we can start to get a feeling for how different councils make payments to the same companies.)

I then exported the graph as a json node/edge list so that it could be displayed by Mike Bostock’s d3.js Sankey diagram library:

(Note that I’ve filtered the edges to only show ones above a certain payment amount (£10k).)
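
In case it’s useful, here’s roughly the shape of the code that builds the node/edge list – a sketch only, with made-up rows standing in for the scraped payments data, producing the nodes/links JSON layout that Mike Bostock’s Sankey example consumes:

import json

# Illustrative rows: (council, company as named in the spending data,
# canonical/reconciled company, amount). The real data comes from the scraper.
payments = [
    ("Council A", "Serco Ltd",   "SERCO LIMITED", 250000),
    ("Council B", "Serco Group", "SERCO LIMITED", 12000),
    ("Council B", "Serco Ltd",   "SERCO LIMITED", 8000),  # below threshold, filtered out
]

MIN_AMOUNT = 10000  # only keep edges above a certain payment amount

nodes, index = [], {}
def node(name):
    # Register a node the first time we see it and return its list position.
    if name not in index:
        index[name] = len(nodes)
        nodes.append({"name": name})
    return index[name]

links = []
for council, company, canonical, amount in payments:
    if amount < MIN_AMOUNT:
        continue
    # a) council -> company as named in the spending data
    links.append({"source": node(council), "target": node(company), "value": amount})
    # b) company -> canonical (reconciled) company
    links.append({"source": node(company), "target": node(canonical), "value": amount})

print(json.dumps({"nodes": nodes, "links": links}, indent=2))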

As a presentation graphic, it’s really tatty, doesn’t include amount labels (though they could be added) and so on. But as a sketch, it provides an easy to digest view over the data as a starting point for a deeper conversation with the data. We might also be able to use the diagram as a starting point for a data quality improvement process, by identifying the companies that we really should try to reconcile.

Here are flows associated with spend to G4S-identified companies:

I also had a quick peek at which councils were spending £3,500 and up (in total) with the OU…

Digging into OpenlyLocal spending data a little more deeply, it seems we can get a breakdown of how total payments from council to supplier are made up, such as by spending department.

Which suggests to me that we could introduce another “column” in the Sankey diagram that joins councils with payees via spending department (I suspect the Category column would result in data that’s a bit too fine grained).

See also: University Funding – A Wider View

Cognitive Waste and the Project Funding Bind

As I tweeted earlier today: “A problem with project funding is that you’re expected to know what you’re going to do in advance – rather than discover what can be done..”

This was prompted by reading a JISC ITT (Invitation to Tender) around coursedata: Making the most of course information – xcri-cap feed use demonstrators. Here’s an excerpt from the final call:

JISC is seeking to fund around 6-10 small, rapid innovation projects to create innovative, engaging examples that demonstrate the use of the #coursedata xcri-cap feeds (either directly, or via the JISC Aggregator API). These innovative examples will be shared openly through the JISC web site and events to promote the good practice that has been adopted.
13. The demonstrators could use additional data sources such as geolocation data to provide a mash-up, or may focus on using a single institutional feed to meet a specific need.
14. The demonstrators should include a clear and compelling use case and usage scenario.
15. The range of demonstrators commissioned will cover a number of different approaches and is likely to include examples of:
• an online prospectus, such as a specialist courses directory;
• a mobile app, such as a course finder for a specific geographical area;
• a VLE block or module, such as a moodle block that identifies additional learning opportunities offered by the host institution;
• an information dashboard, such as a course statistics dashboard for managers providing an analysis of the courses your institution offers mashed up with search trends from the institutional website;
• a lightweight service or interface, such as an online study group that finds peers based on course description;
• a widget for a common platform, such as a Google Gadget that identifies online courses, and pushes updates to the users iGoogle page.
16. All demonstrators should be working code and must be available under an open source licence or reusable with full documentation. Project deliverables can build on proprietary components but wherever possible the final deliverables should be open source. If possible, a community-based approach to working with open source code should be taken rather than just making the final deliverables available under an open source licence.
17. The demonstrators should be rapidly developed and be ready to use within 4 months. It is expected most projects would not require more than 30 – 40 chargeable person days.

In addition:

23. Funding will not be allocated to allow a simple continuation of an existing project or activity. The end deliverable must address a specific need that is accepted by the community for which it is intended and produce deliverables within the duration of the project funding.
24. There should be no expectation that future funding will be available to these projects. The grants allocated under this call are allocated on a finite basis. Ideally, the end deliverables should be sustainable in their own right as a result of providing a useful solution into a community of practice.

The call appears to be open to all comers (for example, sole traders) and represents a way of spending money on bootstrapping innovation around course data feeds using HEFCE funding, in a similar way to how the Technology Strategy Board disburses money (more understandably?) to commercial enterprises, SMEs, and so on. (Although JISC isn’t a legal entity – yet – maybe we’ll start to see JISC trying to find ways in which it can start to act as a vehicle that generates returns from which it can benefit financially, eg as a venture funder, or as a generator of demonstrable financial growth?)

As with many JISC calls, the intention is that something “sustainable” will result:

22. Without formal service level agreements, dependency on third party systems can limit the shelf life of deliverables. For these types of projects, long term sustainability although always desirable, is not an expected outcome. However making the project deliverables available for at least one year after the end of the project is essential so opportunities are realised and lessons can be learned.

24. There should be no expectation that future funding will be available to these projects. The grants allocated under this call are allocated on a finite basis. Ideally, the end deliverables should be sustainable in their own right as a result of providing a useful solution into a community of practice.

All well and good. Having spent a shedload (technical term ;-) on getting institutions to open up their course data, the funders now need some uptake. (That there aren’t more apps around course data to date is partly my fault. The TSO Open Up Competition prize I won secured a certain amount of TSO resource to build something around course code scaffolding data as held by UCAS (my proposal was more to do with seeing this data opened up as enabling data, rather than actually pitching a specific application…). As it turned out, UCAS (a charity operated by the HEIs, I think) were (still are?) too precious over the data to release it as open data for unspecified uses, so the prize went nowhere… Instead, HEFCE spent millions through JISC to get universities to open up course data (albeit data that is probably more comprehensive than the UCAS data)…and now there’s an unspecified amount for startups and businesses to build services around the XCRI data. (Note to self: are UCAS using XCRI as an import format or not? If not, is HEFCE/JISC also paying the HEIs to maintain/develop systems that publish XCRI data as well as systems that publish data in an alternative way to UCAS?)

I think TSO actually did some work aggregating datasets around a, erm, model of the UCAS course data; so if they want a return on that work, they could probably pitch an idea for something they’ve already prepped and try to get HEFCE to pay for it, 9 months on from when I was talking to them at their expense…

Which brings me in part back to my tweet earlier today (“A problem with project funding is that you’re expected to know what you’re going to do in advance – rather than discover what can be done..”), as well as the mantra I was taught way back when I was a research student, that the route to successful research bids was to bid to do work you had already done (in part because then you could propose to deliver what you knew you could already deliver, or could clearly see how to deliver…)

This is fine if you know what you’re pitching to do (essentially, doing something you know how to do), as opposed to setting out to discover what sorts of things might be possible if you set about playing with them. Funders don’t like the play of course, because it smacks of frivolity and undirectedness, even though it may be a deeply focussed and highly goal directed activity, albeit one where the goal emerges during the course of the activity rather than being specified in advance.

As it is, funders tend to fund projects. They tell bidders what they want, and bidders tell funders back how they’ll do it: either something they’ve already done (= guaranteed deliverable, paid for post hoc), or something they *think* they intend to do (couched in project management and risk assessment speak to mask the fact that they don’t really know what’ll happen when they try to execute the plan; but that doesn’t really matter, because at the end of the day they have a plan and a set of deliverables against which they can measure (lack of) progress). In the play world, you generally do or deliver something because that’s the point – you are deeply engaged in and highly focussed on whatever it is that you’re doing (you are typically intrinsically motivated, and maybe also extrinsically motivated by whatever constraints or goals you have adopted as defining the play context/play world). During play, you work hard to play well. And then there’s the project world. In the project world, you deliver or you don’t. So what.

Projects also have overheads associated with them. From preparing, issuing, marking, awarding, tracking and reporting on proposals and funded projects on the funders’ side, to preparing, submitting, and managing the project on the other (aside from actually doing the project work – or at least, writing up what has previously been done in an appropriate way;-).

And then there’s the waste.

Clay Shirky popularised the notion of cognitive surplus to characterise creative (and often collaborative creative) acts done in folks’ free time. Things like Wikipedia. I’d characterise this use of cognitive surplus capacity as a form of play – in part because it’s intrinsically motivated, but also because it is typically based around creative acts.

But what about cognitive waste, such as arises from time spent putting together project proposals that are unfunded and then thrown away (why aren’t these bids, along with the successful ones, made open as a matter of course, particularly when the application is for public money from an applicant funded by public money?). (Or the cognitive waste associated with maintaining a regular blog… erm… oops…)

I’ve seen bids containing literature reviews that rival anything in the (for-fee, paywall-protected, subscription-required, author/institution copyright-waivered) academic press, as well as proposals that could be taken up, maybe in partnership, by SMEs for useful purposes rather than by academic partners for conference papers; and I’ve seen time spent pursuing project processes, milestones and deliverables for the sole reason that they are in the plan – a plan defined before the space the project was pitched into had been properly explored through engaging with it – rather than because they continue to make sense (if indeed they ever did). (And yes, I know that the unenlightened project manager who sees more merit in trying to stick to the project plan and original deliverables, rather than pivoting if a far more productive, valuable or useful opportunity reveals itself, is a mythical beast…;-).

Maybe the waste is important. Evolution is by definition a wasteful process, and maybe the route to quality is through a similar sort of process. Maybe the time, thought and effort that goes into unsuccessful bids really is cognitive waste, bad ideas that don’t deserve to be shared (and more than that, shouldn’t be shared because they are dangerously wrong). But then, I’m not sure how that fits in with project funding schemes where over-subscribed calls see even highly rated proposals (that would ordinarily receive funding) rejected, whereas in an undersubscribed call (maybe because it is mis-positioned or even irrelevant), weak bids (that ordinarily wouldn’t be considered) get funding.

Or maybe cognitive waste arises from a broken system and broken processes, and really is something valuable that is being wasted in the sense of squandered?

Right – rant over, (no) (late)lunchtime over… back to the “work” thing, I guess…

PS via @raycorrigan: “Newton, Galileo, Maxwell, Faraday, Einstein, Bohr, to name but a few; evidence of paradigm shifting power of ‘cognitive waste'” – which is another sense of “waste” I hadn’t considered, which is waste (as in loss, or loss to an organisation) of good ideas through rejecting or not supporting the development of a particular proposal or idea..?

F1 Championship Points as a d3.js Powered Sankey Diagram

d3.js crossed my path a couple of times yesterday: firstly, in the form of an enquiry about whether I’d be interested in writing a book on d3.js (I’m not sure I’m qualified: as I responded, I’m more of a script kiddie who sees things I can reuse, rather than have any understanding at all about how d3.js does what it does…); secondly, via a link to d3.js creator Mike Bostock’s new demo of Sankey diagrams built using d3.js:

Hmm… Sankey diagrams are good for visualising flow, so to see if I could plug-and-play with the component myself, I needed an appropriate data set. F1 related data is usually my first thought as far as testbed data goes (no confidences to break, the STEM/innovation outreach/tech transfer context, etc etc), so what things flow in F1? What quantities are conserved whilst being passed between different classes of entity? How about points… points are awarded on a per race basis to drivers who are members of teams. It’s also a championship sport, run over several races: the Drivers’ Championship is a competition between drivers to accumulate the most points over the course of the season, and the Constructors’ Championship is a battle between teams. Which suggests to me that a Sankey plot of points from races to drivers and then constructors might work?

So what do we need to do? First up, look at the source code for the demo using View Source. Here’s the relevant bit:

Data is being pulled in from a relatively addressed file, energy.json. Let’s see what it looks like:

Okay – a node list and an edge list. From previous experience, I know that there is a d3.js JSON exporter built into the Python networkx library, so maybe we can generate the data file from a network representation of the data in networkx?

Here we are: node_link_data(G) “[r]eturn data in node-link format that is suitable for JSON serialization and use in Javascript documents.”

Next step – getting the data. I’ve already done a demo of visualising F1 championship points as a treemap (but not blogged it? Hmmm… must fix that) that draws on a JSON data feed constructed from data extracted from the Ergast motor racing API, so I can clone that code and use it as the basis for constructing a directed graph that represents points allocations: race nodes are linked to driver nodes with edges weighted by the points scored in that race, and driver nodes are connected to teams by edges weighted according to the total number of points the driver has earned so far. (Hmm, that gives me an idea for a better way of coding the weight for that edge…)
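
Here’s a minimal sketch of the sort of thing I mean, with made-up points values: build a directed graph in networkx and serialise it with json_graph.node_link_data. (Depending on your networkx version, the links may reference node ids rather than list indices, and the edge attribute may need renaming to value, which is what the Sankey plugin expects, so a little massaging may be required.)

import json
import networkx as nx
from networkx.readwrite import json_graph

G = nx.DiGraph()

# Race -> driver edges, weighted by points scored in that race (illustrative values).
race_points = [
    ("Australia", "HAM", 12), ("Australia", "BUT", 25),
    ("Malaysia",  "HAM", 15), ("Malaysia",  "BUT", 0),
]
for race, driver, points in race_points:
    if points:
        G.add_edge(race, driver, value=points)

# Driver -> team edges, weighted by the driver's running points total.
driver_teams = {"HAM": "McLaren", "BUT": "McLaren"}
for driver, team in driver_teams.items():
    total = sum(p for _, d, p in race_points if d == driver)
    if total:
        G.add_edge(driver, team, value=total)

# Node-link format: a "nodes" list and a "links" list, as the d3.js Sankey demo expects.
data = json_graph.node_link_data(G)
print(json.dumps(data, indent=2))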

I don’t have time to blog the how to of the code right now – train and boat to catch – but will do so later. If you want to look at the code, it’s here: Ergast Championship nodelist. And here’s the result – F1 Championship 2012 Points as a Sankey Diagram:

See what I mean about being a cut and paste script kiddie?!;-)

And So It Begins… The Disinteroperability of the Web (Or Just a Harmless Bug…?)

When does a keyboard shortcut *not* do the same thing as the menu command it shortcuts? When it’s a Google docs copy command in Google Chrome, maybe?

I know that I, and I suspect many of the readers of this blog, use keyboard shortcuts unconsciously, intuitively, on a regular basis: ctrl/cmd-f for within-page search, -c for copy, -x for cut, and -v for paste. But I also suspect that keyboard shortcuts are alien to many, and that a more likely route to these everyday operations is through the file menu:

or (more unlikely?) via a right-click contextual pop-up menu:

As keyboard shortcut users, we assume that the keyboard shortcuts and the menu based operations do the same thing. But whether a bug or not, I noticed today in the course of using Google docs in Google Chrome that when I tried to copy a highlighted text selection using either the file menu Copy option, or the contextual menu copy option, I was presented with this:

(The -c route to copying still worked fine.)

With Chrome well on its way to becoming the world’s most popular browser, allowing Google to dominate not just our searchable view over the web, but also to intermediate our direct connection to the web through the desktop client we use to gain access to it, this makes me twitchy… Firstly, because it suggests that whilst the keyboard shortcut is still routing copied content via my clipboard, the menu option is routing it through the browser, or maybe even the cloud where an online connection is present? Secondly, because in prompting me to extend my browser, I realised I have no real idea of what sorts of updates Google is regularly pushing to me through Chrome’s silent updating behaviour (I’m on version 19.0.1084.46 at the moment, it seems…).

A lot of Google’s activities are driven by technical decisions based on good technical reasons for “improving” how applications work and interoperate with each other. But it seems to me that Google is also closing in on itself and potentially adopting technical solutions that either break interoperability, or include a Google subsystem or process at every step (introducing an alternative de facto operating system onto our desktops by a thousand tiny updates and extensions). So for example, whilst I haven’t installed the Chrome copy extension, I wonder: if I had, would a menu-based copy from a Google doc allow me to then paste the content into a Word doc running as a Microsoft Office desktop application, or paste it into my online WordPress editor? And if so, would Chrome be caching that copied content via the extension?

Maybe this is something and nothing. Maybe I’m just confused about how the cut-and-paste thing works at all. Or maybe Google is starting to overstep its mark and is opening up an attack on host operating system functions from its installed browser base. Which, as the upcoming most popular browser in the world, is not a bad beachhead to have…

PS At least Google Public DNS isn’t forced onto Chrome users as the default way of resolving an entered domain name or clicked-on link to the IP address that is actually used to connect the browser to the website…

Inter-Council Payments and the Google Fusion Tables Network Graph

One of the great things about aggregating local spending data from different councils in the same place – such as on OpenlyLocal – is that you can start to explore structural relations in the way different public bodies of a similar type spend money with each other.

On the local spend with corporates scraper on Scraperwiki, which I set up to scrape how different councils spent money with particular suppliers, I realised I could also use the scraper to search for how councils spent money with other councils, by searching for suppliers containing phrases such as “district council” or “town council”. (We could also generate views to see how councils were spending money with different police authorities, for example.)

(The OpenlyLocal API doesn’t seem to work with the search, so I scraped the search results HTML pages instead. Results are paged, with 30 results per page, and what seems like a maximum of 1500 (50 pages) of results possible.)

The publicmesh table on the scraper captures spend going to a range of councils (not parish councils) from other councils. I also uploaded the data to Google Fusion tables (public mesh spending data), and then started to explore it using the new network graph view (via the Experiment menu). So for example, we can get a quick view over how the various county councils make payments to each other:

Hovering over a node highlights the other nodes it’s connected to (though it would be good if the text labels of the connected nodes were highlighted and labels for unconnected nodes were greyed out?)

(I think a Graphviz visualisation would actually be better, eg using Canviz, because it can clearly show edges from A to B as well as B to A…)

As with many exploratory visualisations, this view helps us identify some more specific questions we might want to ask of the data, rather than presenting a “finished product”.

As well as the experimental network graph view, I also noticed there’s a new Experimental View for Google Fusion Tables. As well as the normal tabular view, we also get a record view, and (where geo data is identified?) a map view:

What I’d quite like to see is a merging of map and network graph views…

One thing I noticed whilst playing with Google Fusion Tables is that getting different aggregate views is rather clunky and relies on column order in the table. So for example, here’s an aggregated view of how different county councils supply other councils:

In order to aggregate by supplied council, we need to reorder the columns (the aggregate view aggregates columns as they appear, from left to right, in the table view). From the Edit column, Modify Table:

(In my browser, I then had to reload the page for the updated schema to be reflected in the view). Then we can get the count aggregation:

It would be so much easier if the aggregation view allowed you to order the columns there…
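
As an aside (and not a Fusion Tables feature – this is me reaching for a different tool), the same sort of aggregation is straightforward if you pull the data out as CSV and load it into the Python pandas library; a sketch, assuming a CSV export with the publicMesh column names described earlier (the filename is a placeholder):

import pandas as pd

# Load the inter-council payments data, e.g. a CSV export of the publicMesh table.
df = pd.read_csv("publicMesh.csv")

# For each supplier council: how many distinct payers, and how much received in total?
summary = df.groupby("supplier").agg({"payer": "nunique", "total": "sum"})
summary = summary.sort_values("total", ascending=False)

print(summary.head(10))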

PS no time to blog this properly right now, but there are a couple of new javascript libraries that are worth mentioning in the datawrangling context.

In part coming out of the Guardian stable, Misoproject is “an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content”. The initial dataset library provides a set of routines for: loading data into the browser from a variety of sources (CSV, Google spreadsheets, JSON), including regular polling; creating and managing data tables and views of those tables within the browser, including column operations such as grouping, statistical operations (min, max, mean, moving average etc); playing nicely with a variety of client side graphics libraries (eg d3.js, Highcharts, Rickshaw and other JQuery graphics plugins).

Recline.js is a library from Max Ogden and the Open Knowledge Foundation that, if its name is anything to go by, is positioning itself as an alternative (or complement?) to Google Refine. To my mind, though, it’s more akin to a Google Fusion Tables style user interface (“classic” version) wherever you need it, via a Javascript library. The data explorer allows you to import and preview CSV, Excel, Google Spreadsheet and ElasticSearch data from a URL, as well as via file upload (so for example, you can try it with the public spend mesh data CSV from Scraperwiki). Data can be sorted, filtered and viewed by facet, and there’s a set of integrated graphical tools for previewing and displaying data too. Recline.js views can also be shared and embedded, which makes this an ideal tool for data publishers to embed in their sites as a way of facilitating engagement with data on-site, as I expect we’ll see on the Data Hub before too long.

More reviews of these two libraries later…

PPS These are also worth a look in respect of generating visualisations based on data stored in Google spreadsheets: DataWrapper and Freedive (like my old Guardian Datastore explorer, but done properly… Wizard led UI that helps you create your own searchable and embeddable database view direct from a Google Spreadsheet).

Reflections on (Government) (Big) Data Use…

Some thoughts scribbled down on my way home from a Policy Exchange workshop on “Big Data in Gov” earlier today, in which I start trying to unpack some of the confusion I have about what the open data and data driven government thing is all about…

When asked about challenges around use of personal data for government or commercial use, it’s easy to fall into the trap of putting privacy concerns at the top of the list and leave it at that. So here are some of the assumptions and beliefs I tend to bundle into the “privacy concerns and fears” bucket:

confidentiality: when folk talk about breaches of online privacy, I suspect they’re actually concerned about a loss of confidentiality;

– associated with confidentiality is selective revelation, or the belief that we should not have to divulge certain sorts of information to anyone who asks for it, or that if we do, it will be in confidence and subject to informed consent about how that data will be used.

Relating to these on social networks in particular are notions surrounding recovery from inappropriate disclosure (such as deletion of content), whether on the first part (someone posts something they want to retract), the second part (a “friend” makes a disclosure the first party would prefer had not been made, such as wishing them a happy birthday and revealing their birthdate) or the third part (where someone who isn’t a “friend” of the first party makes a disclosure about the first party) (see for example Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There).

In part, I suspect there is often a tacit assumption that there are safeguards on how data is collected and shared (e.g. as regulated by the Data Protection Act (for a quick overview, see ICO guidance on DPA)), but that the majority of folk (myself included!) are actually more than a little hazy about what the law actually stipulates… I also suspect that folk do not generally know what data large companies have collected about them, or the purposes to which that data is put. Add to this concerns about the buying, selling, aggregation and disaggregation of personal data as part of the business of going concerns, or, for example, as companies themselves are bought and sold, maybe even for the data they hold.

loss of anonymity: privacy as anonymity, or at least, the right to limit knowledge of your actions to a specific, limited public or with confidence that you will not be recognised outside that limited public. When different data sets can be cross-referenced or reconciled with each other, that data graph can become an unexpected public witness (“with the graph as my witness”!)

“creepiness” (or, “how did they know that….?”): this may be thought of as a form of invasion of privacy, in which your personal data may be processed in such a way that it triggers an action that appears to you to breach a confidence you did not knowingly or intentionally share.

“potential for evil”: to what extent might your data be used against you (i.e. to your detriment rather than your benefit)? In part, this may relate to uninformed consent, or use of data without, or against, consent, but it also admits of the ways in which data released for one purpose may come to be cross-referenced with other datasets which in turn and as a result may then be used against you.

equitability: if, as in Norway, your tax affairs are made public (via the skatteliste (“tax list”)) but the tax affairs of your neighbour aren’t, you might feel as if that were rather unfair. And if personal tax affairs are public, should corporate tax dealings be public to the same extent too (for example, if we were to compare the dealings of a sole trader operating under a personal tax regime with those of their neighbour who set up a limited company or similar to operate a similar business under a “private” corporate tax regime)?

One of the other things we discussed was the extent to which personalisation might feature in the way government deals with its citizens. Part of the brief was to try to pay heed to how waste could be reduced and fraud either detected or prevented. Though not having much evidence to hand to base this on, it seems to me that part of the role of personalisation might be to identify services and benefits that meet the needs of a particular user profile, and try to more efficiently allocate services or resources to people who are eligible for them (at least in the sense of making the citizen aware of their entitlements). In the case of tax, it struck me that a good accountant essentially personalises the tax affairs of their client to maximise the benefit to the client, which got me wondering about the extent to which a personalised HMRC dashboard might tell me how to fill in my tax return for maximum efficiency…! And that if such a dashboard was pointing out to citizens the various loopholes and workarounds they could employ to minimise their tax spend, those loopholes would presumably get fixed pretty quickly…

As far as waste goes, making sure people claim things to which they are entitled, rather than things to which they are not entitled, presumably saves time and cost in processing those ineligible requests and minimises opportunities for misallocation through that route. Reducing friction (such as reducing the number of times or number of places in which a user needs to enter the same personal data), and increasing fluidity (for example, by allowing government services to share data elements, such as the DVLA “borrowing” (with your permission) your photograph from the Passport Office for your driving licence) can also serve to reduce duplicated processes and the potential for error (and hence cost, as well as the opportunity for fraud) that occurs in such cases.

In terms of fraud, this may in part be seen as a deliberate attempt to create a profile that is not a true one but that is eligible for benefits or services that are not directed at the true profile. One way of mitigating such attempts at fraud might then be to find means by which creating false profiles for the purposes of fraud triggers graph conflicts that can be used to signal the fraud or deception.

As far as big data in government goes, I’m not sure we touched on it much at all. I do wonder, though, about the extent to which government could – or should – buy big data from corporates. Census data may be plugged in to databases held by companies such as Experian, but how much does it add? And how much richer, and more current, are Experian’s datasets than the census reports? Google famously used search behaviours to identify flu trends, and I suspect that the supermarkets have a pretty good model of how calories, food groups and medicines are purchased, if not consumed, at a local level, which presumably feeds (no pun intended) into local health trends over a variety of timescales. And as far as traffic monitoring goes? I suspect that the mobile phone network operators have access to far more comprehensive, up-to-date and even realtime models of pedestrian, as well as traffic, flows than the government does…

One of the things that big data, wherever it’s produced, can benefit from is some form of scaffolding that provides a basis around which normalisation can occur. In the education sector, getting a normalised catalogue of course codes is one example, although this is something that UCAS still appears unwilling to release as open data, let alone free open data.

And as far as open data goes, I think a good reason for opening up data is that it allows innovations from the inside, outside. Which is to say, developers working inside government may be tied to using legacy systems and processes, but if the data is open and public, there is nothing to stop them building more efficient implementations outside government, demonstrating their benefits, and then bringing them back within government….

PEERing at Education…

I just had a “doh!” moment in the context of OERs – Open Educational Resources, typically so called because they are Resources produced by an Educator under an Open content license (which to all intents and purposes is a copyright waiver). One of the things that appeals to me about OERs is that there is no reason for them not to be publicly discoverable which makes them the ideal focus for PEER – Public Engagement with Educational Resources. Which is what the OU traditionally offered through 6am TV broadcasts of not-quite-lectures…

Or how about this one?

And which the OU is now doing through iTunesU and several Youtube Channels, such as OU Learn:


(Also check out some of the other OU playlists…or OU/BBC co-pros currently on iPlayer;-)

PS It also seems to me that users tend not to get too hung up about how things are licensed, particularly educational ones, because education is about public benefit and putting constraints on education is just plain stoopid. Discovery is nine tenths of the law, as it were. The important thing about having something licensed as an OER is that no-one can stop you from sharing it… (which, even if you’re the creator of a resource, you may not be able to do; academics, for example, often hand over the copyright of their teaching materials to their employer, and their employer’s copyright over their research output (similarly transferred as a condition of employment) to commercial publishers, who then sell the content back to their employers).