From Paywalls and Attention Walls to Data Disclosure Walls and Survey Walls

Is it really only a couple years since the latest, widely quoted, iteration of the idea that “If you are not paying for it, you’re not the customer; you’re the product being sold” was first posted about web economics?

[Notes for folk visiting this site from a referral thread on metafilter]

Prompted by the recent release of a new Google product that presents site visitors with a paid-for, revenue-generating survey before they can see the site’s content, here are a few observations around that idea…

First, let’s just consider the paywall for a moment. Paywalls on the web prevent you from accessing content without payment or some other form of financial subscription. I’m guessing the term was originally coined as a corruption of the term “firewall”, which in a network sense is a component that either allows or prevents network traffic from passing from one device to another based on a set of rules. For example, a firewall might block traffic from a .xxx domain or particular IP address. [OpenLearn: What are firewalls?]

If a user can be tracked across pageviews within a single visit to a site, or across multiple visits to the site, the paywall may be configured to allow the user to see so many items for free per visit, or per month, before they are required to pay.
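
As a minimal sketch of that metering logic (the class name, the allowance and the idea of keying on a cookie-style user ID are all assumptions made for illustration, not how any particular publisher does it):

```python
from collections import defaultdict

FREE_ITEMS_PER_MONTH = 10  # hypothetical allowance


class MeteredPaywall:
    """Counts item views per (user, month) and blocks once the free quota is used."""

    def __init__(self, free_items=FREE_ITEMS_PER_MONTH):
        self.free_items = free_items
        self.views = defaultdict(int)  # (user_id, "YYYY-MM") -> views so far

    def can_view(self, user_id, month, has_paid=False):
        if has_paid:
            return True
        key = (user_id, month)
        if self.views[key] >= self.free_items:
            return False  # quota used up: show the paywall instead
        self.views[key] += 1
        return True


wall = MeteredPaywall(free_items=2)
results = [wall.can_view("cookie-123", "2012-04") for _ in range(3)]
print(results)  # [True, True, False]
```

The meter only works for as long as the user can be tracked: clear the cookie and the count starts again.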

Paywalls can come in a literal form – you pays your money and you gets your content – or at one step remove: you hand over your data, and the publisher makes money from it, either by charging an advertiser a premium rate for showing ads to you as a known entity, or by selling your data to a third party. This is the sense in which you are the product. So how does it work?

If you’ve watched an online video recently, whether on a site such as YouTube, or a (commercial) watch again TV service such as ITV Player or 4oD, you may well have been exposed to a pre-roll advert before the video you want to watch begins. Many commercial media websites, too, load first with an ad-containing lightbox that overlays the article you actually want to read, often with a “Skip Ad” action required if you want to bail out of the ad early.

These ads are one of the ways these sites generate income, of course, income that at the end of the day helps pay to keep the site running.

The price paid for these ads typically depends on the size and the “quality” or specificity of the audience the site delivers to the advertiser (that is, the audience segment: [OpenLearn: Market segmentation and targeting]). Sites (and magazines, and TV programmes) all have audiences with particular demographics and sets of interests, and these specialist or well defined audience groups are what the publisher sells to the advertiser.

(From years ago, I remember a bid briefing for a science outreach funding programme where we were told we would be marked down severely if we said the intended audience for our particular projects was “the general public”. What they wanted to know was what audience we were specifically going to hit, and how we were going to tune our projects to engage and inform that particular audience. Same story.)

At the end of the day, adverts are used to persuade audiences to purchase product. So you give data to a publisher, they use that to charge an advertiser a higher rate for being able to put ads in front of particular audiences who are presumably likely to buy the advertiser’s wares if nudged appropriately, and you buy the product. With cash that pays the advertiser who bought the ad from the publisher who sold your details to them. So you still paid to access that content. With a “free gift” in the form of the goods you bought from the advertiser who bought the ads from the publisher that were placed in front of a particular audience on the basis of the data you gave to the publisher.

Let’s reconsider the paywall mediated sites, for a moment, where for example you get 10 free articles a month, 20 if you register, unlimited if you pay. The second option requires that you register some personal information with the site, such as an email address, or date of birth. You get +x article views on the site “for free” in exchange for your giving the website y pieces of data. In exchange for those free views, you have had to give something in return. You have bought those extra “free” views with your data. The money the site would have got from you if you had paid with cash is replaced by income generated from your data. For example, if the publisher sells adverts at a high price to audiences in the 17-25 range, and you are in that age range, the disclosure of your birthdate allows you to be put into that audience group which is sold to advertisers as such. If you handed over your email address, that can also be sold on to email marketers; if you had to verify that address by clicking on a link emailed to it, it becomes more valuable because it’s more likely to be a legitimate email address. More value can be added to the email address if it is sold as a verified email address belonging to a 17-25 year old, and so on.
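
To make the exchange concrete, here is a rough sketch of how a disclosed birthdate and a verified email address might turn into a sellable audience label. The allowances come from the example above; the tier names, the segment labels and the function names are made up for illustration.

```python
from datetime import date

# Monthly allowances from the example in the text; None means unlimited.
ALLOWANCE = {"anonymous": 10, "registered": 20, "subscriber": None}


def age_on(birthdate, today):
    """Whole years between birthdate and today."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def audience_segment(birthdate, today, email_verified=False):
    """Label a registered reader for sale to advertisers (labels are hypothetical)."""
    age = age_on(birthdate, today)
    band = "17-25" if 17 <= age <= 25 else "other"
    return {"age_band": band, "verified_email": email_verified}


# What registration buys the reader, and what it gives the publisher.
extra_free_views = ALLOWANCE["registered"] - ALLOWANCE["anonymous"]
segment = audience_segment(date(1990, 6, 1), date(2012, 4, 1), email_verified=True)
print(extra_free_views, segment)
```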

Under the assumption that by paying attention to an ad you become more likely to buy a product, or tell someone about the product who is likely to buy it, the paywall essentially becomes replaced by an “attention based, indirect paywall”.

A new initiative by Google ramps up the data-exchange based paywall even further: Google Consumer Surveys. Marketing magazine describes it as follows (Google’s new survey tool: DIY research tool and pay wall alternative):

‘Google Consumer Surveys’ is a survey tool which blocks sections of webpages or articles until the reader answers a question, paying the website owner five cents per response when they do. The service is being billed as an alternative revenue model for publishers considering a pay wall strategy, launching with a handful of news partners last week.

The service works as a DIY research tool, charging users 10 cents per response to questions of the their choice. Buyers of the research have the option to pay an extra 40 cents per response to target sub-populations based on gender, age and location and can target more specific audiences, such as dog owners, with a screening and follow-up question option that costs an additional 50 cents per response.

So let’s unpick that: rather than running ads, the publisher runs a survey. They essentially get paid (via Google) for running the survey by someone who pays Google to run the survey. You hand over your data to the survey company who pays Google who pays the publisher for delivering you, the survey subject. Rather than targeting ads at you, Google targets you as a survey subject, mediated by the publisher who delivers a particular audience demographic. (Rather than using sites to target particular audiences, I guess Google will end up using knowledge about audiences to ensure that surveys are displayed to a wide range of subjects, thus ensuring a fair sample. Which means, as Marketing mag suggests, “the questions [will] potentially having nothing to do with the site’s content…”.) Rather than trying to influence you as a purchaser by presenting you with an ad, in the hope that you will return cash to the person who originally paid for the ad by buying their wares, disclosure about your beliefs is now the currency. (I need to check: a) the extent to which Google can in principle and in fact reconcile survey results with a user ID; b) the extent to which Google provides detailed information back to the survey commissioner about the demographics and identity of the survey subjects. Marketing mag suggests “[t]o pre-empt any privacy fears, the search giant is emphasising that all surveys will be completely anonymous and that Google will not use any data collected for its own ad targeting.” So that’s all right then. But Google will presumably know that it has served you x ads and y surveys, if not what answers you gave to survey questions.)
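
The per-response arithmetic from the Marketing magazine description can be sketched as follows. The prices are the quoted ones; that the extras simply add to the base price, and that the publisher payout stays at five cents whatever the buyer pays, are my assumptions.

```python
# All amounts in US cents, taken from the Marketing magazine description.
BASE_PRICE = 10        # per response, paid by the survey buyer
TARGETING_EXTRA = 40   # gender/age/location targeting
SCREENING_EXTRA = 50   # screening plus follow-up question
PUBLISHER_PAYOUT = 5   # per response, paid to the site owner


def price_per_response(targeted=False, screened=False):
    """Assumes the extras are simply added to the base price."""
    price = BASE_PRICE
    if targeted:
        price += TARGETING_EXTRA
    if screened:
        price += SCREENING_EXTRA
    return price


def survey_cashflow(responses, targeted=False, screened=False):
    """Who pays what for a batch of responses, in cents."""
    buyer_pays = responses * price_per_response(targeted, screened)
    publisher_gets = responses * PUBLISHER_PAYOUT
    return {
        "buyer_pays": buyer_pays,
        "publisher_gets": publisher_gets,
        "google_keeps": buyer_pays - publisher_gets,
    }


print(survey_cashflow(1000))
print(survey_cashflow(1000, targeted=True))
```

On those assumptions, a thousand untargeted responses cost the buyer 100 dollars, of which the publisher sees 50.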

As well as productising yourself, as sold by publishers to advertisers, by virtue of handing over your data, you’ve also paid in a couple of other senses – with your attention and with your time. Your attention and your demographic details (that is, your propensity to buy and, at the end of the day, your purchasing power, i.e. your cash) are what you exchange for the “free” content; if your time represents your ability to use that time generating your own income, there may also be an opportunity cost to you (that is, you have not generated 1 hour’s income doing paid-for work because you have spent 1 hour watching ads). The cost to you is a loss of income you may otherwise have earned by using that time for paid work.

A couple of the missing links in advertising, of course, are reliable feedback about: 1) whether anyone actually pays attention to your ad; 2) whether they act on it. Google cracked part of the action puzzle, at least in terms of ad payments, by coming up with an advertising charging model that got advertisers to bid for ad placements and then only pay if someone clicked through on the ad (pay-per-click, or PPC, advertising) rather than using the original display oriented, “impression based” advertising, where advertisers would pay for so many impressions of their advert (CPM, cost per mille, i.e. cost per thousand impressions).
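
The difference between the two charging models is easy to see with some made-up numbers. This is a minimal sketch: the rates and the click-through rate are arbitrary, and the bidding for placements is not modelled at all.

```python
def cpm_cost(impressions, cpm_rate):
    """Impression-based charging: pay per thousand impressions served."""
    return impressions / 1000.0 * cpm_rate


def ppc_cost(clicks, cost_per_click):
    """Pay-per-click charging: pay only when someone clicks through."""
    return clicks * cost_per_click


impressions = 200000
clicks = 400  # a 0.2% click-through rate, chosen arbitrarily

print(cpm_cost(impressions, 2.50))  # 500.0
print(ppc_cost(clicks, 0.75))       # 300.0
```

Under CPM the advertiser pays the same whether or not anyone acts on the ad; under PPC the bill tracks the action.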

It seems that Google are now trying to put CPM based metrics on a firmer footing with a newly announced metric, Active View (Making the Web Work for Brand Marketers).

Advertisers have long looked for insight into whether consumers saw an ad on page 145 of a magazine, or switched the channel during a TV commercial break. It’s similar online, so we’re rolling out a technology [Active View], … that can count “viewed” impressions (as defined by the IAB’s proposed standard, this is a display ad that is at least 50% viewable on the screen for at least one second).

… Active View data will be immediately actionable — advertisers will be able to pay only for for viewed impressions.
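
Going by the definition quoted above (at least 50% viewable for at least one second), a viewed-impression filter might look something like this. How the visible fraction and duration actually get measured is not described, so the input pairs here are simply assumed to be available.

```python
MIN_VISIBLE_FRACTION = 0.5  # "at least 50% viewable"
MIN_VISIBLE_SECONDS = 1.0   # "for at least one second"


def is_viewed(visible_fraction, visible_seconds):
    """True if an impression meets the quoted 'viewed' threshold."""
    return (visible_fraction >= MIN_VISIBLE_FRACTION
            and visible_seconds >= MIN_VISIBLE_SECONDS)


def billable_impressions(impressions):
    """Count only the impressions that meet the viewed threshold."""
    return sum(1 for fraction, seconds in impressions if is_viewed(fraction, seconds))


# Made-up (visible fraction, seconds visible) pairs for four served ads.
served = [(1.0, 3.2), (0.4, 5.0), (0.8, 0.5), (0.5, 1.0)]
print(billable_impressions(served), "of", len(served))  # 2 of 4
```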

They’re also looking to improve feedback on the demographics of users who actually view an advert:

Active GRP: GRP, or a gross rating point, is at the heart of offline media measurement. For example, when a fashion brand wants their TV campaign to reach 2 million women with two ads each, they use GRP to measure that. We’re introducing a new version of this for the web: Active GRP. …

… Active GRP is calculated by a statistical model that combines aggregated panel data and anonymous user data (either inferred or user-provided), and will work in conjunction with Active View to measure viewed impressions. This approach overcomes problems of potential panel skewing and reliance on a single data source. This approach also has the advantage of never using personally identifiable information, not sharing user data with third parties, and enabling users, through Google’s Ads Preferences Manager, to opt-out.
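
For reference, the conventional offline GRP sum is just reach (as a percentage of the target population) multiplied by average frequency. The sketch below works through the fashion brand example; the target population of 10 million is an invented figure, and none of this is meant to represent the statistical model behind Active GRP.

```python
def gross_rating_points(people_reached, average_frequency, target_population):
    """Conventional GRP: reach as a percentage of the target, times frequency."""
    reach_percent = 100.0 * people_reached / target_population
    return reach_percent * average_frequency


# 2 million women each seeing two ads, out of an assumed target of 10 million.
print(gross_rating_points(2000000, 2, 10000000))  # 40.0
```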

Both these announcements were made in the context of Google’s Brand Activate initiative.

Facebook, too, is looking to improve its reporting – and maybe its ad targeting? – to advertisers. Although I can’t offhand find an original Facebook source, TechCrunch (Facebook Ads Can Now Be Optimized To Drive Any On-Facebook Action, Such As In-App Purchases, Shares, Offer Claims), Mashable (Facebook’s Analytics Tool for Ads Will Soon Measure Actions Other Than ‘Likes’) et al are reporting on a Facebook briefing that described how advertisers will be able to view reports describing the downstream actions taken by people who have viewed a particular advert. The Facebook article also suggests that the likelihood of a user performing a particular action might form part of the targeting criteria (“today Facebook begins allowing advertisers using its API to ask it to show their ads to people most likely to take any specific post-click action on the social network, such as sharing a brand’s content to the news feed, buying virtual goods in their apps, or redeeming one of the new Facebook Offers at a local brick-and-mortar store”).

So now, it seems that the you that is the product may well soon include your (likely) actions…

See also: Expectations Matter, Even If You’re Not ‘A Customer’ which links in to a discussion about what reasonable expectations you might have as a user of a “free” service.

And this: Contextual Content Delivery on Higher Ed Websites Using Ad Servers, on something of Google’s ad targeting capacity as of a couple of years ago…

[Notes: I would reply in the thread but I don’t want to have to pay cash for the, erm, privilege of doing so… I also appreciate that none of these ideas are necessarily original, and I recognise that the model applies to TV, radio, print or whatever other content carrier and container you care to talk about… I suspect that Blue Beetle isn’t actually the source of the “you are the product” slogan this time round, anyway, (in recent months, Wired probably is) although many search engines lead that way. (So for example, it’s easy to find earlier, similarly pithy, expressions of the same sentiment in the web context all over the place… For example, this 2009 post; or this one). And not that you’ll care, this blog is my notebook, and these notes are just me scribbling down some context around the Google survey product (the post construction/writing style reflects that) #trollFeeding PS Since everybody knows that 1+1=2, I figure we probably don’t need to teach it anymore #deadHorseFlogging #gettingChildishNow #justLikeAMetaFilterThread]

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates, of how some of Tesco’s various corporate holdings are related, based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key private - read it from the query string
import os, cgi
qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING") or ""))

#note - the opencorporates api also offers a search:  companies/search

def getOCcompanyData(ocid):
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
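    # Assumption: ddata is the row built for this filing, with 'fid' as its unique key
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)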

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            director=desc.replace('DIRECTOR APPOINTED ','')
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')

for entity in entities['result']:
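
The director name extraction used in logDirectors can be tried out on its own. This is a standalone sketch: the filing codes and description prefixes are the ones that appear in the scraper, but the flat dict shape and the sample filings (names included) are made up for illustration rather than taken from the API.

```python
# Prefixes to strip from the filing description, keyed by filing code.
PREFIXES = {
    "AP01": ["DIRECTOR APPOINTED "],
    "TM01": ["APPOINTMENT TERMINATED, DIRECTOR ", "APPOINTMENT TERMINATED, "],
}


def director_change(filing):
    """Return (change, director_name), or None for other filing types.

    `filing` is a dict with 'filing_code' and 'description' keys; that shape
    is an assumption for this sketch, not the documented API response.
    """
    code = filing.get("filing_code")
    if code not in PREFIXES:
        return None
    name = filing.get("description", "")
    for prefix in PREFIXES[code]:
        name = name.replace(prefix, "")
    change = "appointed" if code == "AP01" else "terminated"
    return change, name.strip()


samples = [  # hypothetical filings
    {"filing_code": "AP01", "description": "DIRECTOR APPOINTED MS JANE EXAMPLE"},
    {"filing_code": "TM01", "description": "APPOINTMENT TERMINATED, DIRECTOR JOHN SAMPLE"},
    {"filing_code": "AA", "description": "FULL ACCOUNTS MADE UP TO 26/02/11"},
]
for sample in samples:
    print(director_change(sample))
```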

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

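scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)

# Assumption: the graph building below is reconstructed from the description
# above (a bipartite graph linking directors to companies)
G = nx.Graph()
directors = []
companies = []

for row in data:
    if row['fdirector'] not in directors:
        directors.append(row['fdirector'])
        G.add_node(row['fdirector'], label=row['fdirector'])
    if row['ocname'] not in companies:
        companies.append(row['ocname'])
        G.add_node(row['ocname'], label=row['ocname'])
    G.add_edge(row['fdirector'], row['ocname'])

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

writer = gf.GEXFWriter()
writer.add_graph(G)

print tostring(writer.xml)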

Saving the output of the view as a gexf file means it can be loaded directly into Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note that parts of the whole graph are disconnected.
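
The disconnected parts can also be found without opening Gephi. Here is a minimal pure Python sketch that builds the director-company adjacency from rows shaped like the directors table (fdirector and ocname are the column names the view uses; the rows themselves are invented) and counts the connected components.

```python
from collections import defaultdict


def build_bipartite(rows):
    """Adjacency sets linking ('director', name) and ('company', name) nodes."""
    graph = defaultdict(set)
    for row in rows:
        director = ("director", row["fdirector"])
        company = ("company", row["ocname"])
        graph[director].add(company)
        graph[company].add(director)
    return graph


def connected_components(graph):
    """Depth-first search from each unvisited node; returns a list of node sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


rows = [  # made-up rows using the column names from the view above
    {"fdirector": "A SMITH", "ocname": "COMPANY ONE LTD"},
    {"fdirector": "A SMITH", "ocname": "COMPANY TWO LTD"},
    {"fdirector": "B JONES", "ocname": "COMPANY THREE LTD"},
]
components = connected_components(build_bipartite(rows))
print(len(components))  # 2
```

Tagging each node with its type keeps a director and a company that happen to share a name from collapsing into one node.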

In the fragment below, we see Tesco Property Nominees are only loosely linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is it arbitrary?

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});

    google.setOnLoadCallback(drawTable);

    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?

    var formatter = new google.visualization.PatternFormat('<a href="{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'
      }
    });

    var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);
    }


The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for some time whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL=''+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)
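
Pulled out of the scraper, the matching logic looks something like the following Python 3 sketch. The 0.7 threshold and the ASCII-only sanitising come from the fragment above; the function names, the sample response and the domain check (the follow-up idea floated above) are illustrative assumptions.

```python
from urllib.parse import quote_plus, urlparse

SCORE_THRESHOLD = 0.7  # the cut-off used in the scraper fragment


def sanitise(name):
    """Drop non-ASCII characters, then URL-encode the name for use in a query."""
    return quote_plus("".join(ch for ch in name if ord(ch) < 128))


def best_match(rec_data, threshold=SCORE_THRESHOLD):
    """Return the first result if it clears the threshold, else None."""
    results = rec_data.get("result", [])
    if results and results[0]["score"] >= threshold:
        return results[0]
    return None


def same_domain(url_a, url_b):
    """Compare hostnames, ignoring a leading 'www.'."""
    def host(url):
        netloc = urlparse(url).netloc.lower()
        return netloc[4:] if netloc.startswith("www.") else netloc
    return host(url_a) == host(url_b)


# A made-up response in the general shape the fragment relies on.
rec_data = {"result": [{"name": "EXAMPLE WIDGETS LIMITED", "score": 0.82},
                       {"name": "EXAMPLE WIDGETS (UK) LTD", "score": 0.41}]}

print(sanitise("Café Example & Co"))
print(best_match(rec_data)["name"])
print(same_domain("http://www.example.com/about", "https://example.com/"))
```

Taking only the first result relies on the results arriving in descending score order, which, as noted, I haven’t confirmed is guaranteed.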

As @stuartbrown suggested in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn Learning Outcomes Search

Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API

In Visualising Networks in Gephi via a Scraperwiki Exported GEXF File I gave an example of how we can publish arbitrary serialised output file formats from Scraperwiki using the GEXF XML file format as a specific example. Of more general use, however, may be the ability to export Scraperwiki data using the Google visualisation API DataTable format. Muddling around the Google site last night, I noticed the Google Data Source Python Library that makes it easy to generate appropriately formatted JSON data that can be consumed by the (client side) Google visualisation library. (This library provides support for generating line charts, bar charts, sortable tables, etc, as well as interactive dashboards.) A tweet to @frabcus questioning whether the gviz_api Python library was available as a third party library on Scraperwiki resulted in him installing it (thanks, Francis:-), so this post is by way of thanks…

Anyway, here are a couple of examples of how to use the library. The first is a self-contained example (using code pinched from here) that transforms the data into the Google format and then drops it into an HTML page template that can consume the data, in this case displaying it as a sortable table (GViz API on scraperwiki – self-contained sortable table view [code]):

Of possibly more use in the general case is a JSONP exporter (example JSON output (code)):

Here’s the code for the JSON feed example:

import scraperwiki
import gviz_api

#Example of:
## how to use the Google gviz Python library to cast Scraperwiki data into the Gviz format and export it as JSON

#Based on the code example at:

scraperwiki.sqlite.attach( 'openlearn-units' )
q = 'parentCourseCode,name,topic,unitcode FROM "swdata" LIMIT 20'
data = scraperwiki.sqlite.select(q)

description = {"parentCourseCode": ("string", "Parent Course"),"name": ("string", "Unit name"),"unitcode": ("string", "Unit Code"),"topic":("string","Topic")}

data_table = gviz_api.DataTable(description)
# Load the rows into the table before serialising it
data_table.LoadData(data)

json = data_table.ToJSon(columns_order=("unitcode","name", "topic","parentCourseCode" ),order_by="unitcode")

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
print 'ousefulHack('+json+')'

I hardcoded the wraparound function name (ousefulHack), which then got me wondering: is there a safe/trusted/approved way of grabbing arguments out of the URL in Scraperwiki so this could be set via a calling URL?
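
One possible approach, sketched in Python 3 and not tested on Scraperwiki itself: read a callback argument out of the query string (the Tesco scraper gets at it via the QUERY_STRING environment variable) and only accept values that look like a plain JavaScript identifier. The argument name callback and the helper names are my own choices.

```python
import re
from urllib.parse import parse_qsl

DEFAULT_CALLBACK = "ousefulHack"
SAFE_NAME = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*$")


def callback_name(query_string, default=DEFAULT_CALLBACK):
    """Pick the JSONP wrapper name from a 'callback' URL argument.

    Falls back to the default unless the value looks like a plain
    JavaScript identifier, so arbitrary script cannot be injected.
    """
    args = dict(parse_qsl(query_string or ""))
    name = args.get("callback", default)
    return name if SAFE_NAME.match(name) else default


def jsonp(json_text, query_string):
    """Wrap a JSON string in the requested (or default) callback function."""
    return callback_name(query_string) + "(" + json_text + ")"


print(jsonp('{"rows": []}', "callback=drawChart"))
print(jsonp('{"rows": []}', "callback=alert(1);//"))
```

In a view, the query string would come from something like os.getenv("QUERY_STRING") rather than a literal.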

Anyway, what this shows (hopefully) is an easy way of getting data from Scraperwiki into the Google visualisation API data format and then consuming either via a Scraperwiki view using an HTML page template, or publishing it as a Google visualisation API JSONP feed that can be consumed by an arbitrary web page and used directly to drive Google visualisation API chart widgets.

PS as well as noting that the gviz python library “can be used to create a google.visualization.DataTable usable by visualizations built on the Google Visualization API” (sourcecode), it seems that we can also use it to generate a range of output formats: Google viz API JSON (.ToJSon), a simple JSON response (.ToJSonResponse), Javascript (“JS Code”) (.ToJSCode), CSV (.ToCsv), TSV (.ToTsvExcel) or an HTML table (.ToHtml). A ToResponse method (ToResponse(self, columns_order=None, order_by=(), tqx="")) can also be used to select the output response type based on the tqx parameter value (out:json, out:csv, out:html, out:tsv-excel).
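
To show how a tqx value might drive that choice, here is a small standalone sketch. It doesn’t call gviz_api at all: it just parses a tqx string (semicolon separated key:value pairs) and looks up the name of the method that, on my reading of the list above, produces each output type.

```python
def parse_tqx(tqx):
    """Split a tqx string such as 'reqId:0;out:csv' into a dict."""
    params = {}
    for pair in tqx.split(";"):
        if ":" in pair:
            key, value = pair.split(":", 1)
            params[key.strip()] = value.strip()
    return params


# Output types listed above, mapped to the DataTable method assumed to produce each.
METHOD_FOR_OUT = {
    "json": "ToJSonResponse",
    "csv": "ToCsv",
    "html": "ToHtml",
    "tsv-excel": "ToTsvExcel",
}


def method_for(tqx, default="json"):
    """Name of the output method to use for a given tqx value."""
    out = parse_tqx(tqx).get("out", default)
    return METHOD_FOR_OUT.get(out, METHOD_FOR_OUT[default])


print(parse_tqx("reqId:0;out:csv"))
print(method_for("out:html"))
print(method_for(""))
```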

PPS looking at eg which can be pulled into a javascript google.visualization.Query(), it seems we get the following returned:
google.visualization.Query.setResponse({"version":"0.6","status":"ok","sig":"1664774139","table":{ "cols":[ ... ], "rows":[ ... ] }})
I think google.visualization.Query.setResponse can be a user defined callback function name; maybe worth trying to implement this one day?

Visualising Networks in Gephi via a Scraperwiki Exported GEXF File

How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.

A quick bit of backstory first…

A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue.

So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:

The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).

As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:

The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).

From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, networkx will create it for you.)

I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry a newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)

To keep Gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)

Here’s how I built the graph:

import scraperwiki
import urllib
import networkx as nx

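scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)

# Assumption: node/edge details reconstructed from the description above
G = nx.Graph()
topics = []

for row in data:
    topic = row['topic']
    if topic not in topics:
        topics.append(topic)
    # Topic names make messy IDs, so use a short minted ID instead
    topicID = 'topic_%d' % topics.index(topic)
    G.add_node(topicID, label=topic)
    G.add_node(row['parentCourseCode'], label=row['parentCourseCode'])
    G.add_node(row['unitcode'], label=row['unitcode'], name=row['name'])
    G.add_edge(topicID, row['parentCourseCode'])
    G.add_edge(row['parentCourseCode'], row['unitcode'])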

Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for scraperwiki, we want to just print out a representation of the file, not actually save the printed representation of the graph to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.

Here’s the code I hacked out as a result of that little investigation:

import networkx.readwrite.gexf as gf

# G is the networkx graph built above
writer = gf.GEXFWriter()
writer.add_graph(G)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

Note the use of the scraperwiki.utils.httpresponseheader to set the MIMEtype of the view. If we don’t do this, scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.

Here’s the full code for the view.

And here’s the GEXF view:

Save this file with a .gexf suffix and you can then open the file directly into Gephi.
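
For a sense of what such a file contains, here is a hand-rolled Python 3 sketch that writes a tiny GEXF document using only the standard library. It is not what networkx emits: the 1.2draft namespace, the helper name and the three example nodes are my own choices.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """Serialise nodes {id: label} and edges [(source, target)] as a GEXF string."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "undirected", "mode": "static"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(edges_el, "edge", {"id": str(index), "source": source, "target": target})
    return ET.tostring(gexf, encoding="unicode")


# A made-up topic -> parent course -> unit chain.
nodes = {"topic_0": "Example topic", "X100": "X100", "X100_1": "X100_1"}
edges = [("topic_0", "X100"), ("X100", "X100_1")]
gexf_text = to_gexf(nodes, edges)
print(gexf_text)
```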

Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.

PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.

University Funding – A Wider View

A post on the Guardian Datablog yesterday (Higher education funding: which institutions will be affected?) alerted me to the release of HEFCE’s “provisional allocations of recurrent funding for teaching and research, and the setting of student number control limits for institutions, for academic year 2012-13” (funding data).

Here are the OU figures for teaching:

  • Funding for old-regime students (mainstream): 59,046,659
  • Funding for old-regime students (co-funding): 0
  • High cost funding for new-regime students: 2,637,827
  • Widening participation: 23,273,796
  • Teaching enhancement and student success: 17,277,704
  • Other targeted allocations: 22,619,320
  • Other recurrent teaching grants: 3,991,473
  • Total teaching funding: 128,846,779

HEFCE preliminary teaching funding allocations to the Open University, 2012-13

Of the research funding for 2012-13, mainstream funding was 8,030,807, the RDP supervision fund came in at 1,282,371, along with 604,103 “other”, making up the full 9,917,281 research allocation.

Adding Higher Education Innovation Funding of 950,000, the OU’s total allocation was 139,714,060.
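As a quick sanity check, the totals quoted above can be recomputed from their components. This is just the arithmetic, using the figures as given in this post:

```python
# Teaching allocations to the OU, in the order listed above
teaching = [59046659, 0, 2637827, 23273796, 17277704, 22619320, 3991473]
# Research: mainstream, RDP supervision fund, other
research = [8030807, 1282371, 604103]
# Higher Education Innovation Funding
heif = 950000

teaching_total = sum(teaching)
research_total = sum(research)
grand_total = teaching_total + research_total + heif

print(teaching_total, research_total, grand_total)
# prints 128846779 9917281 139714060
```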

So what other funding comes into the universities from public funds?

Open Spending publishes data relating to spend by government departments to named organisations, so we can search that for money spent by government departments with the universities (for example, here is a search for “open university”):

Given the amounts spent by public bodies on consultancy (try searching OpenCorporates for mentions of PriceWaterhouseCoopers, or any of EDS, Capita, Accenture, Deloitte, McKinsey, BT’s consulting arm, IBM, Booz Allen, PA, KPMG (h/t @loveitloveit)), university based consultancy may come in reasonably cheaply?

The universities also receive funding for research via the UK research councils (EPSRC, ESRC, AHRC, MRC, BBSRC, NERC, STFC) along with innovation funding from JISC. Unpicking the research council funding awards to universities can be a bit of a chore, but scrapers are appearing on Scraperwiki that make for easier access to individual grant awards data:

  • AHRC funding scraper; [grab data using queries of the form select * from `swdata` where organisation like "%open university%" on scraper arts-humanities-research-council-grants]
  • EPSRC funding scraper; [grab data using queries of the form select * from `grants` where department_id in (select distinct id as department_id from `departments` where organisation_id in (select id from `organisations` where name like "%open university%")) on scraper epsrc_grants_1]
  • ESRC funding scraper; [grab data using queries of the form select * from `grantdata` where institution like "%open university%" on scraper esrc_research_grants]
  • BBSRC funding [broken?] scraper;
  • NERC funding [broken?] scraper;
  • STFC funding scraper; [grab data using queries of the form select * from `swdata` where institution like "%open university%" on scraper stfc-institution-data]
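The EPSRC query is the fiddly one, because it hops across three tables. To see how the nested select works, here’s a sketch that runs the same query shape against an in-memory SQLite database. The table names and the id, department_id, organisation_id and name columns come from the query above; the other columns and all the rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy tables mirroring the names used in the EPSRC query; the data is invented
conn.executescript("""
CREATE TABLE organisations (id INTEGER, name TEXT);
CREATE TABLE departments (id INTEGER, organisation_id INTEGER, name TEXT);
CREATE TABLE grants (id INTEGER, department_id INTEGER, title TEXT);
INSERT INTO organisations VALUES (1, 'The Open University'), (2, 'Another University');
INSERT INTO departments VALUES (10, 1, 'Computing'), (20, 2, 'Physics');
INSERT INTO grants VALUES (100, 10, 'Grant A'), (101, 20, 'Grant B');
""")

# Innermost select finds matching organisations, the middle one their
# departments, and the outer one the grants awarded to those departments
query = """
SELECT * FROM grants WHERE department_id IN (
    SELECT DISTINCT id FROM departments WHERE organisation_id IN (
        SELECT id FROM organisations WHERE name LIKE ?))
"""
rows = conn.execute(query, ("%open university%",)).fetchall()
print(rows)
```

SQLite’s LIKE is case-insensitive for ASCII text by default, which is why the lower case pattern still matches.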

In order to get a unified view over the detailed funding of the institutions from these different sources, the data needs to be reconciled. There are several ID schemes for identifying universities (eg UCAS or HESA codes; see for example GetTheData: Universities by Mission Group) but even official data releases tend not to make use of these, preferring instead to rely solely on institution names, as for example in the case of the recent HEFCE provisional funding data release [D’oh! This is not the case – identifiers are there, apparently (I have to admit, I didn’t check and was being a little hasty…) See the contribution/correction from David Kernohan in the comments to this post…]:

For some time, I’ve been trying to put my finger on why data releases like this are so hard to work with, and I think I’ve twigged it… even when released in a spreadsheet form, the data often still isn’t immediately “database-ready” data. Getting data from a spreadsheet into a database often requires an element of hands-on crafting – coping with rows that contain irregular comment data, as well as handling columns or rows with multicolumn and multirow labels. So here are a couple of things that would make life easier in the short term, though they maybe don’t represent best practice in the longer term…:

1) release data as simple CSV files (odd as it may seem), because these can be easily loaded into applications that can actually work on the data as data. (I haven’t started to think too much yet about pragmatic ways of dealing with spreadsheets where cell values are generated by formulae, because they provide an audit trail from one data set to derived views generated from that data.)

2) have a column containing regular identifiers using a known identification scheme, for example, HESA or UCAS codes for HEIs. If the data set is a bit messy, and you can only partially fill the ID column, then only partially fill it; it’ll make life easier joining those rows at least to other related datasets…
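To illustrate why even a partially filled ID column helps, here’s a rough sketch of joining two CSV files on an identifier column. The column names, the identifiers and the values are all invented placeholders, not real HESA codes or real figures:

```python
import csv
import io

# Two toy CSV "releases"; the second funding row has no identifier filled in
funding_csv = """hesa_id,institution,funding
H001,Example University,100
,Unmatched College,50
"""
postcodes_csv = """hesa_id,postcode
H001,AB1 2CD
H002,EF3 4GH
"""

# Index the second dataset by identifier
postcodes = {row["hesa_id"]: row["postcode"]
             for row in csv.DictReader(io.StringIO(postcodes_csv))}

# Join: rows with an identifier pick up a postcode, rows without are left blank
joined = []
for row in csv.DictReader(io.StringIO(funding_csv)):
    row["postcode"] = postcodes.get(row["hesa_id"]) if row["hesa_id"] else None
    joined.append(row)

for row in joined:
    print(row["institution"], row["postcode"])
```

The rows with an ID join cleanly; only the ones without need hand reconciliation by name.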

As far as UK HE goes, the JISC monitoring unit/JISCMU has an API over various administrative data elements relating to UK HEIs (eg GetTheData: Postcode data for HE and FE institutes), but I don’t think it offers a Google Refine reconciliation service (ideally with some sort of optional string similarity service)…? Yet?! ;-) Maybe that’d make for a good rapid innovation project???

PS I’m reminded of a couple of related things: 1) Test Your RESTful API With YQL, a corollary to the idea that you can check your data at least works by trying to use it (eg generate a simple chart from it) mapped to the world of APIs: if you can’t easily generate a YQL table/wrapper for it, it’s maybe not that easy to use? 2) the scraperwiki/okf post from @frabcus and @rufuspollock on the need for data management systems not content management systems.

PPS Looking at the actual Guardian figures reveals all sorts of market levers appearing… Via @dkernohan, FT: A quiet Big Bang in universities

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data:

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form, which we can generate as follows:

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the representative_name_lines (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and, most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!)

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

  • value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
  • value.parseJson().datum.attributes.mark_text
  • value.parseJson().datum.attributes.representative_address_lines
  • value.parseJson().datum.attributes.representative_name_lines
  • value.parseJson().datum.attributes.international_registration_number
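For comparison, the same dot-notation navigation can be sketched in Python. The JSON below is a made-up, heavily trimmed stand-in for an API response (the real thing carries many more fields), and attribute is just a hypothetical helper name:

```python
import json

# Made-up response, trimmed to a few of the fields named above
value = json.dumps({"datum": {"attributes": {
    "mark_image_type": "JPG",
    "mark_text": "EXAMPLE",
    "international_registration_number": "123456",
}}})

def attribute(value, name):
    """Rough Python counterpart of value.parseJson().datum.attributes.<name>."""
    # Parse the cell contents as JSON, then walk down to the attribute we want;
    # a missing attribute comes back as None
    return json.loads(value)["datum"]["attributes"].get(name)

print(attribute(value, "mark_image_type"))
print(attribute(value, "representative_name_lines"))
```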

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if( or(value.parseJson().datum.attributes.mark_image_type==’JPG’, value.parseJson().datum.attributes.mark_image_type==’GIF’), ‘; + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + ‘/’ + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + ‘/’ + value.parseJson().datum.attributes.international_registration_number + ‘.’ + toLowercase (value.parseJson().datum.attributes.mark_image_type), ”)

The first part checks that we have a GIF or JPG image type identified; if we do, we construct the URL path, casting the filetype to lower case; otherwise we return an empty string.
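The same URL-building logic reads a little more easily in Python. In this sketch BASE_URL is a placeholder I’ve made up (it is not the real image server address), and the function name is hypothetical; the slicing mirrors the two splitByLengths calls:

```python
# Placeholder only: substitute the real image server address
BASE_URL = "http://example.org/images/"

def trademark_image_url(reg_number, image_type, base_url=BASE_URL):
    """Build base/XX/YY/XXYYNN.typ from a registration number and image type."""
    # Only JPG and GIF marks have an image we can link to
    if image_type not in ("JPG", "GIF"):
        return ""
    # First two characters, next two characters, then the full number
    return (base_url + reg_number[0:2] + "/" + reg_number[2:4] + "/"
            + reg_number + "." + image_type.lower())

print(trademark_image_url("123456", "JPG"))
# prints http://example.org/images/12/34/123456.jpg
```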

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the file in a browser, or I can cut and paste the html into this page…

[UPDATE: images look like they now have the form: ? The IDs may also have changed…]

null null
null null
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”
“[\”Murgitroyd & Company\”]” “[\”Scotland House,\”,\”165-169 Scotland Street\”,\”Glasgow G5 8PL\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”
“[\”BP Group Trade Marks\”]” “[\”20 Canada Square, Canary Wharf\”,\”London E14 5NJ\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”BP Group Trade Marks\”]” “[\”20 Canada Square, Canary Wharf\”,\”London E14 5NJ\”]”
“[\”ROBERT WILLIAM BOAD\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move on to code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere (so for example, could all the steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer or virtual machine for the script to make use of?)

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.

Journalist Filters on Twitter – The Reuters View

It seems that Reuters has a new product out – Reuters Social Pulse. As well as highlighting “the stories being talked about by the newsmakers we follow”, there is an area highlighting “the Reuters & Klout 50 where we rank America’s most social CEOs.” Of note here is that this list is ordered by Klout score. Reuters don’t own Klout (yet?!) do they?!

The offering also includes a view of the world through the tweets of Reuters’ own staff. Apparently, “Reuters has over 3,000 journalists around the world, many of whom are doing amazing work on Twitter. That is too many to keep up with on a Twitter list, so we created a directory [Reuters Twitter Directory] that shows you our best tweeters by topic. It let’s you find our reporters, bloggers and editors by category and location so you can drill down to business journalists in India, if you so choose, or tech writers in the UK.”

If you view the source of Reuters Twitter directory page, you can find a Javascript object that lists all(?) the folk in the Reuters Twitter directory and the tags they are associated with… Hmm, I thought… Hmmm…

If we grab that object, and pop it into Python, it’s easy enough to create a bipartite network that links journalists to the categories they are associated with:

import simplejson
import networkx as nx
from networkx.algorithms import bipartite

g = nx.Graph()

#need to bring in reutersJournalistList (the Javascript object from the page source, as a JSON string)
data = simplejson.loads(reutersJournalistList)

#Keep track of the two node sets (journalists and tags) as we go
users = set()
tags = set()

#I had some 'issues' with the parsing for some reason? Required this hack in the end...
for user in data:
	for x in user:
		if x=='users':
			name = user[x][0]['twitter_screen_name']
			print('user: ' + name)
			users.add(name)
			for topic in user[x][0]['topics']:
				print('- topic: ' + topic)
				tags.add(topic)
				#Add edges from journalist name to each tag they are associated with
				g.add_edge(name, topic)
#print(bipartite.is_bipartite(g))
#print(bipartite.sets(g))

#Save a graph file we can visualise in Gephi corresponding to bipartite graph
nx.write_graphml(g, "usertags.graphml")

#We can find the sets of names/tags associated with the disjoint sets in the graph
#(bipartite.sets(g) would also give these, but doesn't say which set is which)

#Collapse the bipartite graph to a graph of journalists connected via a common tag
ugraph= bipartite.projected_graph(g, users)
nx.write_graphml(ugraph, "users.graphml")

#Collapse the bipartite graph to a set of tags connected via a common journalist
tgraph= bipartite.projected_graph(g, tags)
nx.write_graphml(tgraph, "tags.graphml")

#Dump a list of the journalists Twitter IDs (the output filename is arbitrary)
with open("users.txt", "w") as f:
	for uo in users: f.write(uo+'\n')

Having generated graph files, we can then look to see how the tags cluster as a result of how they were applied to journalists associated with several tags:

Reuters journalists twitter directory cotags

Alternatively, we can look to see which journalists are connected by virtue of being associated with similar tags (hmm, I wonder if edge weight carries information about how many tags each connected pair may be associated through? [UPDATE: there is a projection that will calculate this – bipartite.projection.weighted_projected_graph]). In this case, I size the nodes by betweenness centrality to try to highlight journalists that bridge topic areas:

Reuters twitter journalists list via cotags, sized by betweenness centrality
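To check what the weighted projection mentioned in the update actually calculates, here’s a small sketch using networkx’s bipartite.weighted_projected_graph on a toy journalist/tag graph. The names and tags are invented:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: made-up journalists linked to made-up tags
g = nx.Graph()
g.add_edges_from([
    ("alice", "tech"), ("alice", "uk"),
    ("bob", "tech"), ("bob", "uk"),
    ("carol", "uk"),
])
journalists = {"alice", "bob", "carol"}

# Project onto the journalists; each edge weight counts the tags a pair shares
wgraph = bipartite.weighted_projected_graph(g, journalists)

for u, v, d in wgraph.edges(data=True):
    print(u, v, d["weight"])
```

Here alice and bob share two tags, so their edge gets weight 2, while carol is tied to each of them through the single uk tag.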

Association through shared tags (as applied by Reuters) is one thing, but there is also structure arising from friendship networks…So to what extent do the Reuters Twitter List journalists follow each other (again, sizing by betweenness centrality):

Reuters twitter journalists friend connections sized by betweenness centrality

Finally, here’s a quick look at folk followed by 15 or more of the folk in the Reuters Twitter journalists list: this is the common source area on Twitter for the journalists on the list. This time, I size nodes by eigenvector centrality.

Folk followed by 15 or more of folk on Reuters twitter journalists list, sized by eigenvector centrality

So why bother with this? Because journalists provide a filter onto the way the world is reported to us through the media, and as a result the perspective we have of the world as portrayed through the media. If we see journalists as providing independent fairwitness services, then having some sort of idea about the extent to which they are sourcing their information severally, or from a common pool, can be handy. In the above diagram, for example, I try to highlight common sources (folk followed by at least 15 of the journalists on the Twitter list). But I could equally have got a feeling for the range of sources by producing a much larger and sparser graph, such as all the folk followed by journalists on the list, or folk followed by only 1 person on the list (40,000 people or so in all – see below), or by 2 to 5 people on the list…

The twitterverse as directly and publicly followed by folk on the Reuters Journalists twitter list
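The “followed by N or more” filter itself is simple enough to sketch. Assuming the friends lists have already been grabbed into a dict keyed by journalist (the function name and the toy data are mine, not part of the original recipe), a Counter does the work:

```python
from collections import Counter

def commonly_followed(friends_by_journalist, threshold):
    """Return accounts followed by at least `threshold` journalists, with counts."""
    counts = Counter()
    for friends in friends_by_journalist.values():
        # set() guards against an account appearing twice in one friends list
        counts.update(set(friends))
    return {account: n for account, n in counts.items() if n >= threshold}

# Toy data: three journalists and the accounts they follow
friends = {"j1": ["a", "b"], "j2": ["a"], "j3": ["a", "b", "c"]}
print(commonly_followed(friends, 2))
```

Setting the threshold to 15 gives the common source area; a threshold of 1 gives the full, much sparser, graph.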

Friends lists are one sort of filter every Twitter user has onto the content being shared on Twitter, and something that’s easy to map. There are other views of course – the list of people mentioning a user is readily available to every Twitter user, and it’s easy enough to set up views around particular hashtags or search terms. Grabbing the journalists associated with one or more particular tags, and then mapping their friends (or, indeed, followers) is also possible, as is grabbing the follower lists for one or more journalists and then looking to see who the friends of the followers are, thus positioning the journalist in the social media environment as perceived by their followers.

I’m not sure what value Reuters sees in the stream of tweets from the folk on its Twitter journalists lists, or the Twitter networks they have built up, but the friend lenses at least we can try to map out. And via the bipartite user/tag graph, it also becomes trivial for us to find journalists with interests in Facebook and advertising, for example…

PS for associated techniques related to the emergent social positioning of hashtags and shared links on Twitter, see Socially Positioning #Sherlock and Dr John Watson’s Blog… and Social Media Interest Maps of Newsnight and BBCQT Twitterers. For a view over @skynews Twitter friends, and how they connect, see Visualising How @skynews’ Twitter Friends Connect.

Different Speeches? Digital Skills Aren’t just About Coding…

Secretary of State for Education, Michael Gove, gave a speech yesterday on rethinking the ICT curriculum in UK schools. You can read a copy of the speech variously on the Department for Education website, or, err, on the Guardian website.

Seeing these two copies of what is apparently the same speech, I started wondering:

a) which is the “best” source to reference?
b) how come the Guardian doesn’t add a disclaimer about the provenance of, and link to, the DfE version? [Note the disclaimer in the DfE version – “Please note that the text below may not always reflect the exact words used by the speaker.”]
c) is the Guardian version an actual transcript, maybe? That is, does the Guardian reprint the “exact words” used by the speaker?

And that made me think I should do a diff… About which, more below…

Before that, however, here’s a quick piece of reflection on how these two things – the reinvention of the IT curriculum, and the provenance of, and value added to, content published on news and tech industry blog sites – collide in my mind…

So for example, I’ve been pondering what the role of journalism is, lately, in part because I’m trying to clarify in my own mind what I think the practice and role of data journalism are (maybe I should apply for a Nieman-Berkman Fellowship in Journalism Innovation to work on this properly?!). It seems to me that “communication” is one important part (raising awareness of particular issues, events, or decisions), and holding governments and companies to account is another. (Actually, I think Paul Bradshaw has called me out on that, before, suggesting it was more to do with providing an evidence base through verification and triangulation, as well as comment, against which governments and companies could be held to account (err, I think? As an unjournalist, I don’t have notes or a verbatim quote against which to check that statement, and I’m too lazy to email/DM/phone Paul to clarify what he may or may not have said… (The extent of my checking is typically limited to what I can find on the web or in personal archives… which appear to be lacking on this point…)))

Another thing I’ve been mulling over recently in a couple of contexts relates to the notion of what are variously referred to as digital or information skills.

The first context is “data journalism”, and the extent to which data journalists need to be able to do programming (in the sense of identifying the steps in a process that can be automated and how they should be sequenced or organised) versus writing code. (I can’t write code for toffee, but I can read it well enough to copy, paste and change bits that other people have written. That is, I can appropriate and reuse other people’s code, but can’t write it from scratch very well… Partly because I can’t ever remember the syntax and low level function names. I can also use tools such as Yahoo Pipes and Google Refine to do coding like things…) Then there’s the question of what to call things like URL hacking or (search engine) query building?

The second context is geeky computer techie stuff in schools, the sort of thing covered by Michael Gove’s speech at the BETT show on the national ICT curriculum (or lack thereof), and about which the educational digerati were all over on Twitter yesterday. Over the weekend, houseclearing my way through various “archives”, I came across all manner of press clippings from 2000-2005 or so about the activities of the OU Robotics Outreach Group, of which I was a co-founder (the web presence has only recently been shut down, in part because of the retirement of the sys admin on whose server the websites resided.) This group ran an annual open meeting every November for several years hosting talks from the educational robotics community in the UK (from primary school to HE level). The group also co-ordinated the RoboCup Junior competition in the UK, ran outreach events, developed various support materials and activities for use with Lego Mindstorms, and led the EPSRC/AHRC Creative Robotics Research Network.

At every robotics event, we’d try to involve kids and/or adults in elements of problem solving, mechanical design, programming (not really coding…) based around some sort of themed challenge: a robot fashion show, for example, or a treasure hunt (both variants on edge following/line following;-) Or a robot rescue mission, as used in a day long activity in the “Engineering: An Active Introduction” (TXR120) OU residential school, or the 3 hour “Robot Theme Park” team building activity in the Masters level “Team Engineering” (T885) weekend school. [If you’re interested, we may be able to take bookings to run these events at your institution. We can make them work at a variety of difficulty levels from KS3-4 and up;-)]

Given that working at the bits-atoms interface is where a lot of the not-purely-theoretical-or-hardcore-engineering innovation and application development is likely to take place over the next few years, any mandate to drop the “boring” Windows training ICT stuff in favour of programming (which I suspect can be taught in not only a really tedious way, but a really confusing and badly delivered way too) is probably Not the Best Plan.

Slightly better, and something that I know is currently being mooted for reigniting interest in computing, is the Raspberry Pi, a cheap, self-contained, programmable computer on a board (good for British industry, just like the BBC Micro was…;-) that allows you to work at the interface between the real world of atoms and the virtual world of bits that exists inside the computer. (See also things like the OU Senseboard, as used on the OU course “My Digital Life” (TU100).)

If schools were actually being encouraged to make a financial investment on a par with the level of investment around the introduction of the BBC Micro, back in the day, I’d suggest a 3D printer would have more of the wow factor…(I’ll doodle more on the rationale behind this in another post…) The financial climate may not allow for that (but I bet budget will manage to get spent anyway…) but whatever the case, I think Gove needs to be wary about consigning kids to lessons of coding hell. And maybe take a look at programming in a wider creative context, such as robotics (the word “robotics” is one of the reasons why I think it’s seen as a very specialised, niche subject; we need a better phrase, such as “Creative Technologies”, which could combine elements of robotics, games programming, photoshop, and, yes, Powerpoint too…). Hmm… thinks.. the OU has a couple of courses that have just come to the end of their life that between them provide a couple of hundred hours of content and activity on robotics (T184) and games programming (T151), and that we delivered, in part, to 6th formers under the OU’s Young Applicants in Schools Scheme.

Anyway, that’s all as maybe… Because there are plenty of digital skills that let you do coding like things without having to write code. Such as finding out whether there are any differences between the text in the DfE copy of Gove’s BETT speech, and the Guardian copy.

Copy the text from each page into a separate text file, and save it. (You’ll need a text editor for that..) Then, if you haven’t already got one, find yourself a good text editor. I use Text Wrangler on a Mac. (Actually, I think MS Word may have a diff function?)

Finding diffs between txt docs in Text Wrangler

The differences all tend to be in the characters used for quotation marks (character encodings are one of the things that can make all sorts of programs fall over, or misbehave. Just being aware that they may cause a problem, as well as how and why, would be a great step in improving the baseline level of understanding of folk IT). Some of the line breaks don’t quite match up either, but other than that, the text is the same.
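If you’d rather not rely on a text editor, Python’s standard difflib module can do a similar job. The sketch below is not what I actually did (I used Text Wrangler); it compares two texts line by line, and shows how swapping curly quotes for straight ones first makes that class of difference disappear. The sample sentences are invented:

```python
import difflib

# Map curly quotation marks onto their plain ASCII equivalents
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def normalise(text):
    """Replace curly quotes with straight ones."""
    for curly, straight in QUOTES.items():
        text = text.replace(curly, straight)
    return text

def changed_lines(a, b):
    """Return only the lines that differ between two texts."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    # ndiff marks removed lines with '- ' and added lines with '+ '
    return [line for line in diff if line.startswith(("- ", "+ "))]

dfe = "He said \u201chello\u201d.\nSame line"
guardian = 'He said "hello".\nSame line'

print(changed_lines(dfe, guardian))                        # quote differences show up
print(changed_lines(normalise(dfe), normalise(guardian)))  # nothing left after normalising
```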

Now, this may be because Gove was a good little minister and read out the words exactly as they had been prepared. Or it may be the case that the Guardian just reprinted the speech without mentioning provenance, or the disclaimer that he may not actually have read the words of that speech (I have vague memories of an episode of Yes, Minister, here…;-)

Whatever the case, if you know: a) that it’s even possible to compare two documents to see if they are different (a handy piece of folk IT knowledge); and b) know a tool that does it (or how to find a tool that does it, or a person that may have a tool that can do it), then you can compare the texts for yourself. And along the way, maybe learn that churnalism, in a variety of forms, is endemic in the media. Or maybe just demonstrate to yourself when the media is acting in a purely comms, rather than journalistic, role?

PS other phrases in the area: “computational thinking”. Hear, for example: A conversation with Jeannette Wing about computational thinking

PPS I just remembered – there’s a data journalism hook around this story too… from a tweet exchange last night that I was reminded of by an RT:

josiefraser: RT @grmcall: Of the 28,000 new teachers last year in the UK, 3 had a computer-related degree. Not 3000, just 3.
dlivingstone: @josiefraser Source??? Not found it yet. RT @grmcall: 28000 new UK teachers last year, 3 had a computer-related degree. Not 3000, just 3
josiefraser: That ICT qualification teacher stat RT @grmcall: Source is the Guardian

I did a little digging and found the following document on the General Teaching Council of England website – Annual digest of statistics 2010–11 – Profiles of registered teachers in England [PDF] – that contains demographic stats, amongst others, for UK teachers. But no stats relating to subject areas of degree level qualifications held, which is presumably the data referred to in the tweet. So I’m thinking: this is partly where the role of data journalist comes in… They may not be able to verify the numbers by checking independent sources, but they may be able to shed some light on where the numbers came from and how they were arrived at, and maybe even secure their release (albeit as a single point source?)

Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list or the list of likes someone has made; replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL has the form which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how long to wait between calls. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in the data for my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what rate the Facebook API is happy with: if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…
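For anyone wanting to do the same sort of throttled harvesting outside Refine, a minimal Python sketch might look like the following. This is not how Refine implements its throttle, just the same idea of pausing between calls; fetch_throttled and the stand-in fetcher are hypothetical names, and the fetcher is deliberately faked so nothing touches the network:

```python
import time

def fetch_throttled(urls, fetch, delay_ms=500, sleep=time.sleep):
    """Call fetch(url) for each URL in turn, pausing delay_ms between calls."""
    results = []
    for n, url in enumerate(urls):
        if n > 0:
            sleep(delay_ms / 1000.0)  # wait before every call after the first
        results.append(fetch(url))
    return results

# Stand-in fetcher (an assumption for the demo): returns an empty Likes list
fake_fetch = lambda url: '{"data": []}'
pages = fetch_throttled(["url-1", "url-2", "url-3"], fake_fetch, delay_ms=10)
print(len(pages))
```

At a 500ms delay, a couple of hundred calls spend around 100 seconds just waiting, before any download time is counted.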

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

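import csv, json

# Assumed file names: the Templating export saved from Refine, and the CSV to write
fn, out_fn = 'friends-likes.json', 'friends-likes.csv'

with open(fn, encoding='utf-8') as f, open(out_fn, 'w', newline='', encoding='utf-8') as out:
	data = json.load(f)
	writer = csv.writer(out)
	# Number the friends from 1 to give each a (new) unique identifier
	for friend_id, d in enumerate(data['rows'], start=1):
		#'interests' is the column name containing the Likes data,
		# assumed to be exported as a string of JSON; skip friends with no data
		if not d.get('interests'):
			continue
		interests = json.loads(d['interests'])
		for i in interests.get('data', []):
			print(friend_id, i['name'], i['category'])
			writer.writerow([friend_id, i['name']])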

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends
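Before going anywhere near Gephi, a quick way to see which Likes are shared is simply to count them. Here is a minimal sketch that works on (friend identifier, Like name) pairs of the sort held in the two column data file; common_likes and the sample pairs are made up for illustration:

```python
from collections import Counter

def common_likes(pairs, min_friends=2):
    """pairs: iterable of (friend_id, like_name).
    Return the likes shared by at least min_friends friends, with counts."""
    # Use a set so a friend listing the same like twice only counts once
    counts = Counter(like for _, like in set(pairs))
    return {like: n for like, n in counts.items() if n >= min_friends}

# Invented sample data, not real Likes
pairs = [(1, "Gephi"), (2, "Gephi"), (2, "R"), (3, "Gephi"), (3, "R"), (1, "Python")]
print(common_likes(pairs))
```

Likes that only one friend has made drop out, which is roughly what filtering the network diagram down to the commonly liked nodes achieves.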

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
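In Python terms, the pattern I have in mind is something like the following sketch (explode and the column names are hypothetical, and the cell being iterated over is assumed to hold a list):

```python
def explode(rows, source_col, new_col):
    """For each item in row[source_col], yield a copy of the row
    with new_col set to that item."""
    for row in rows:
        for item in row[source_col]:
            new_row = dict(row)      # 1) duplicate the original data row
            new_row[new_col] = item  # 2) and 3) add the new column and populate it
            yield new_row

# Invented sample rows: friend B has no likes, so generates no new rows
rows = [{"name": "A", "likes": ["x", "y"]}, {"name": "B", "likes": []}]
for r in explode(rows, "likes", "like"):
    print(r["name"], r["like"])
```

Each original row generates as many new rows as there are items in the chosen cell.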

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)
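As I understand it, the reshaping works along the lines of the following Python sketch. This is only a rough analogue and not Refine's own algorithm; columnize and the column names are made up, and it assumes each row carries an explicit common key:

```python
def columnize(rows, id_col, key_col, value_col):
    """Merge rows sharing the same id_col value into one record,
    turning each distinct key into a column of its own."""
    records = {}
    for row in rows:
        record = records.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[key_col]] = row[value_col]
    return list(records.values())

# Invented sample data in long key/value form
rows = [
    {"id": 1, "key": "name", "value": "A"},
    {"id": 1, "key": "like", "value": "Gephi"},
    {"id": 2, "key": "name", "value": "B"},
]
print(columnize(rows, "id", "key", "value"))
```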

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et al. suggests it’s possible to profile people based on their Likes. It would be interesting to see how robust that profiling is compared to profiles based on the common Likes of a person’s followers, or the common Likes of folk in the same Facebook groups as an individual.