OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Confused About “The Right to be Forgotten”. Was: Is Web Search is Getting Harder?

with one comment

[Seems I forgot to post this, though I started drafting it on May 19th... Anyway, things seem to have moved on a bit...]

A search related story in the news last week reported on a ruling by the European Union Court of Justice that got wide billing as a “right to be forgotten” (eg BBC News: EU court backs ‘right to be forgotten’ in Google case).

Here’s another example of “censorship”? WordPress not allowing me to link to a URL because it insists on rewriting the & characters in it – here’s the actual link:



For stories like this, I try to look at the original ruling but also tend to turn to law blogs such as my colleague Ray Corrigan’s B2fxxx (though he hasn’t posted on this particular story yet?) or Pinsent Masons’ Out-law (eg Out-law: Google data protection ruling has implications for multi-faceted global businesses) to find out what was actually said.

Here’s the gist of the rulings:

  • “the activity of a search engine consisting in finding information published or placed on the internet by third parties, indexing it automatically, storing it temporarily and, finally, making it available to internet users according to a particular order of preference must be classified as ‘processing of personal data’ within the meaning of Article 2(b) when that information contains personal data and, second, the operator of the search engine must be regarded as the ‘controller’ in respect of that processing, within the meaning of Article 2(d).” So what? A person has a right to object to a data controller about the way their data is processed and can obtain “the rectification, erasure or blocking of data” because of its “incomplete or inaccurate nature”.
  • “processing of personal data is carried out in the context of the activities of an establishment of the controller on the territory of a Member State, within the meaning of that provision, when the operator of a search engine sets up in a Member State a branch or subsidiary which is intended to promote and sell advertising space offered by that engine and which orientates its activity towards the inhabitants of that Member State.” So Google was found to be “established” in EU member state territories. Are there any implications from that ruling regards tax situation, I wonder?
  • Insofar as the processing of personal data that has been subject to a successful objection goes, “the operator of a search engine is obliged to remove from the list of results displayed following a search made on the basis of a person’s name links to web pages, published by third parties and containing information relating to that person, also in a case where that name or information is not erased beforehand or simultaneously from those web pages, and even, as the case may be, when its publication in itself on those pages is lawful.” Note there are limits on this in the case of legitimate general public interest.
  • The final ruling seems to at least admit the possibility that folk can request data be taken down without them having to demonstrate that it is prejudicial to them? “when appraising the conditions for the application of those provisions, it should inter alia be examined whether the data subject has a right that the information in question relating to him personally should, at this point in time, no longer be linked to his name by a list of results displayed following a search made on the basis of his name, without it being necessary in order to find such a right that the inclusion of the information in question in that list causes prejudice to the data subject. As the data subject may, in the light of his fundamental rights under Articles 7 and 8 of the Charter, request that the information in question no longer be made available to the general public on account of its inclusion in such a list of results, those rights override, as a rule, not only the economic interest of the operator of the search engine but also the interest of the general public in having access to that information upon a search relating to the data subject’s name. “However, that would not be the case if it appeared, for particular reasons, such as the role played by the data subject in public life, that the interference with his fundamental rights is justified by the preponderant interest of the general public in having, on account of its inclusion in the list of results, access to the information in question.”.

[Update, July 2nd, 2014]
It seems as if things have moved on – Google is publishing notices in the google.co.uk territory at least to the effect that “Some results may have been removed under data protection law in Europe” [my emphasis].


The FAQ describes what’s happening thus:

When you search for a name, you may see a notice that says that results may have been modified in accordance with data protection law in Europe. We’re showing this notice in Europe when a user searches for most names, not just pages that have been affected by a removal.

The media are getting uppity about it of course, eg Peston completely misses the point, as well as getting it wrong?


In fact, it seems as if the BBC themselves are doing a much better job of obliviating Peston from their own search results…


What all the hype this time around seems to be missing – as with the reporting around the original ruling – is the interpretation that the court ruled on about the behaviour of the search engines insofar as they are deemed to be processors of “personal data”. (Of course, these companies also process personal data as part of the business of operating user accounts, but the business functions – of managing user accounts versus operating a search index and returning search results from public queries applied to it – are presumably sandboxed as far as data protection legislation goes.)

If Google is deemed to be a data controller of personal data that is processed as a result of the way it operates its search index, it presumably means that I can make a subject access request about the data that the Google search index holds about me (as well as the subject access requests I can make to the bit of Google that operates the various Google accounts that I have registered).

As far as the loss of the “right to discover” that the hacks are banging on about as a consequence of “the right to be forgotten” goes, does this mean that Google is the start and end point of their research activity? (And also putting aside the point that most folk: a) don’t look past the first few results; b) are rubbish at searching. As far as search engine ranking algorithms go – erm, what sort of “truth” do you think they reveal? How do you think Google ranks results? And how do you think it comparatively ranks content generated years ago (when links were more persistent than a brief appearance in Twitter feeds and Facebook streams) to content generated more recently (that doesn’t set up persistent link structures)?)

Don’t they use things like Nexis UK?

Or if anything other than Google is too hard, they can just edit the URL to use google.com rather than google.co.uk

This is where it probably also starts to make sense to look back to the original ruling and spend some time reading it more closely. Is LexisNexis a data controller, subject to data protection legislation, based on its index of news media content? Are the indices it operates around court cases similarly covered?

Written by Tony Hirst

July 3, 2014 at 1:35 pm

Posted in Policy

F1 Doing the Data Visualisation Competition Thing With Tata?

with one comment

Sort of via @jottevanger, it seems that Tata Communications announces the first challenge in the F1® Connectivity Innovation Prize to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014”? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Innovation Connectivity Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

– Position
– Car number
– Driver’s name
– Fastest lap time
– Gap to the leader’s fastest lap time
– Sector 1 time for the current lap
– Sector 2 time for the current lap
– Sector 3 time for the current lap
– Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…


A technical manual is also provided to help make sense of the data files.


Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>

Then qualifying:

<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>

Then the race:

<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>


We can parse the data files in Python using an approach something like the following:

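from lxml import etree

xml_doc = 'data/F1 Practice.txt'

#Each line of the data file is a self-contained <transaction> element
for xml in open(xml_doc, 'r'):
    p = etree.fromstring(xml)
    #Attributes of the transaction wrapper, for example:
    #{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}
    print(dict(p.attrib))
    #Attributes of the data element it contains, for example:
    #{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}
    print(dict(p[0].attrib))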

A few things are worth mentioning about this format… Firstly, the identifier is an identifier of the message type, rather than the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.

Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.

It’s not even obvious how we can construct a row based update, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the race timing data above for timestamp="14:57:11.681" and row="3". If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame .align() method.

[I think I'm thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now... Should have looked more closely the first time... Doh!]
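Here's a minimal sketch of that cursor idea, run against the race fragment quoted above. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs. The function name bundle_by_cursor and the shape of the bundles are my own inventions for illustration, and the rule "start a new bundle when the column number for a row fails to increase" is an assumption about the feed, not something taken from the technical manual.

```python
import xml.etree.ElementTree as ET

# Race fragment quoted earlier in this post
RACE_SAMPLE = """\
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>
"""

def bundle_by_cursor(lines, identifier="101"):
    """Group cell updates into per-row bundles (hypothetical helper).

    Assumption: updates to a row arrive in increasing column order, so a
    column number that fails to increase marks the start of a new bundle.
    """
    cursor = {}    # row -> last column updated
    current = {}   # row -> bundle currently being filled
    bundles = []
    for line in lines:
        if not line.strip():
            continue
        transaction = ET.fromstring(line)
        if transaction.get("identifier") != identifier:
            continue  # a different screen (or the session clock)
        cell = transaction[0]
        if cell.get("row") is None or cell.get("column") is None:
            continue  # not a cell update
        row, column = int(cell.get("row")), int(cell.get("column"))
        if row not in cursor or column <= cursor[row]:
            current[row] = {"row": row,
                            "start": transaction.get("timestamp"),
                            "cells": {}}
            bundles.append(current[row])
        current[row]["cells"][column] = cell.get("value")
        cursor[row] = column
    return bundles

bundles = bundle_by_cursor(RACE_SAMPLE.splitlines())
for bundle in bundles:
    print(bundle["row"], bundle["start"], sorted(bundle["cells"]))
```

On this fragment the six row 3 updates end up in a single bundle even though the last one carries a slightly later timestamp, which is the behaviour I was after. Whether the column ordering holds across a whole session (pit stops and so on) is something I haven't checked.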

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be so, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth that all changes to a row are likely to occur in, we could perhaps round times for the basis of the join?
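As a rough sketch of what merging on timestamp and row might look like, the following groups the row 3 messages from the race fragment on a (time bucket, row) key. It uses the standard library parser rather than lxml; the names to_ms and group_cells, and the width_ms parameter, are made up for this illustration. With width_ms=1 the key is the exact timestamp; a wider bucket rounds times down before the join.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Row 3 messages from the race fragment quoted earlier in this post
ROW3_SAMPLE = """\
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>
"""

def to_ms(timestamp):
    """Convert an HH:MM:SS.mmm timestamp to milliseconds since midnight."""
    hms, millis = timestamp.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)

def group_cells(lines, width_ms=1, identifier="101"):
    """Collect column values under a (time bucket, row) key (hypothetical helper)."""
    groups = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue
        transaction = ET.fromstring(line)
        if transaction.get("identifier") != identifier:
            continue
        cell = transaction[0]
        if cell.get("row") is None or cell.get("column") is None:
            continue
        bucket = to_ms(transaction.get("timestamp")) // width_ms
        groups[(bucket, int(cell.get("row")))][int(cell.get("column"))] = cell.get("value")
    return dict(groups)

exact = group_cells(ROW3_SAMPLE.splitlines())                  # exact timestamps
rounded = group_cells(ROW3_SAMPLE.splitlines(), width_ms=100)  # 0.1 s buckets
print(len(exact), len(rounded))
```

Fixed-width buckets have an obvious weakness: two updates a few milliseconds apart can still fall either side of a bucket boundary, so this is only a first approximation.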

Even with a bundling, we still don’t have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

#Hacky load and parse of each row in the datafile
for xml in open('data/F1 Practice.txt', 'r'):

#Dataframe for current state timing screen
    "timestamp", "time",
    "classpos",  "classpos_colour",

#Column mappings

def parse_practice(p,df_practice_pos):
    if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
        if p[0].attrib['column'] not in ['10','21','22','23']:
            df_practice_pos.ix[row]['time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
    return df_practice_pos

for p in pl[:2850]:

(See the notebook.)
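For a stripped down view of the same idea without pandas, here's a sketch that replays cell updates into a plain dict keyed by (row, column) and stops at a given time. It is not the notebook code: screen_at is a hypothetical name, the standard library parser stands in for lxml, and it assumes the feed is in time order with fixed width HH:MM:SS.mmm timestamps so that they can be compared as strings.

```python
import xml.etree.ElementTree as ET

# Practice fragment quoted earlier in this post
PRACTICE_SAMPLE = """\
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
"""

def screen_at(lines, until, identifier="101"):
    """Replay cell updates for one screen up to and including `until`.

    Returns {(row, column): (value, colour)}. Cells that were never
    updated in the replayed part of the feed are simply absent.
    """
    screen = {}
    for line in lines:
        if not line.strip():
            continue
        transaction = ET.fromstring(line)
        # Assumption: fixed width timestamps, feed in time order
        if transaction.get("timestamp") > until:
            break
        if transaction.get("identifier") != identifier:
            continue
        cell = transaction[0]
        if cell.get("row") is None or cell.get("column") is None:
            continue
        key = (int(cell.get("row")), int(cell.get("column")))
        screen[key] = (cell.get("value"), cell.get("colour"))
    return screen

early = screen_at(PRACTICE_SAMPLE.splitlines(), "10:53:14.165")
later = screen_at(PRACTICE_SAMPLE.splitlines(), "10:53:15.000")
print(early)
print(sorted(later))
```

The point to notice is that the later snapshot is only as complete as the updates seen so far: to get a full row we really do need to replay from the start of the session.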

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:
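The data for such a chart can be collected with something like the following sketch. It assumes that column 2 of the identifier 101 screen holds the car number (which is how the sample fragments read, with the driver name following in column 3); driver_positions is a made up name, and the standard library parser is used in place of lxml.

```python
import xml.etree.ElementTree as ET

# Practice fragment quoted earlier in this post (identifier 101 messages)
PRACTICE_SAMPLE = """\
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
"""

def driver_positions(lines, car_number, identifier="101", number_column="2"):
    """Return (timestamp, row) pairs for each time a car number is written.

    Assumption: the row is the classification position and the car number
    appears in `number_column` of the screen with the given identifier.
    """
    trace = []
    for line in lines:
        if not line.strip():
            continue
        transaction = ET.fromstring(line)
        if transaction.get("identifier") != identifier:
            continue
        cell = transaction[0]
        if cell.get("column") == number_column and cell.get("value") == car_number:
            trace.append((transaction.get("timestamp"), int(cell.get("row"))))
    return trace

trace_14 = driver_positions(PRACTICE_SAMPLE.splitlines(), "14")
print(trace_14)
```

Plotting the row against the timestamp for each car number then gives a position trace over the session, with the caveat that we only get a point when the car number cell is rewritten.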


A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
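To give a flavour of the half second grouping without needing a MongoDB server, here's a pure Python stand-in (so not the notebook code). The updates are transcribed by hand from the race fragment above as (timestamp, row, column, value) tuples; group_by_gap and window_ms are names I've made up. A new group is started for a row whenever the gap since the previous update to that row exceeds the window, and writing into the group's dict plays the role of the repeated $set.

```python
def to_ms(timestamp):
    """Convert an HH:MM:SS.mmm timestamp to milliseconds since midnight."""
    hms, millis = timestamp.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)

# (timestamp, row, column, value), transcribed from the race fragment
UPDATES = [
    ("14:57:10.878", 1, 23, "31.6"),
    ("14:57:11.681", 3, 2, "77"),
    ("14:57:11.681", 3, 3, "V. BOTTAS"),
    ("14:57:11.681", 3, 4, "17.7"),
    ("14:57:11.681", 3, 5, "14.6"),
    ("14:57:11.681", 3, 6, "1:33.201"),
    ("14:57:11.686", 3, 9, "35.4"),
]

def group_by_gap(updates, window_ms=500):
    """Group time ordered updates per row, splitting on gaps > window_ms."""
    latest = {}   # row -> most recent group for that row
    groups = []
    for timestamp, row, column, value in updates:
        now = to_ms(timestamp)
        group = latest.get(row)
        if group is None or now - group["last_ms"] > window_ms:
            group = {"row": row, "start": timestamp, "last_ms": now, "cells": {}}
            latest[row] = group
            groups.append(group)
        group["cells"][column] = value   # analogous to a $set on the group
        group["last_ms"] = now
    return groups

groups = group_by_gap(UPDATES)
for group in groups:
    print(group["row"], group["start"], sorted(group["cells"]))
```

Because the window slides with each update, a long drawn out sequence of updates to one row would all land in one group; measuring the gap from the start of the group instead would be an easy variation.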

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… When I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview…

Written by Tony Hirst

July 2, 2014 at 10:38 pm

Posted in f1stats, Rstats


AP Business Wire Service Takes on Algowriters

with 2 comments

Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).

The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.

Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012”) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
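By way of a toy illustration of the second point, here's a minimal sketch of turning a pair of numbers into that sort of sentence. The function name describe_change and the figures passed to it are invented for the example; a real generator would also need to handle seasonal adjustment, revisions and so on.

```python
def describe_change(label, previous, current, previous_period,
                    current_period, comparison="month on month"):
    """Render a change between two values as a short searchable sentence."""
    if current > previous:
        direction = "went up"
    elif current < previous:
        direction = "went down"
    else:
        direction = "was unchanged"
    sentence = (f"{label} {direction} {comparison} "
                f"from {previous_period} to {current_period}")
    if previous != 0:
        # Percentage change relative to the earlier value
        change = 100 * (current - previous) / previous
        sentence += f" ({previous} to {current}, {change:+.1f}%)"
    return sentence

# Illustrative made up figures, not taken from any statistical release
example = describe_change("x", 100, 110, "Sept 2013", "Oct 2013")
print(example)
```

Because the direction words come from a small fixed vocabulary, a plain text search for "went down year on year" does the job of a query with an inequality in it.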

In this respect, something that’s been on my to do list for way too long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).

Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.

For the longer term, on my “to ponder” list is what might something like “The Grammar of Graphics” be for data textualisation? (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)

For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?

I think I perhaps need to mock up some examples to see if anything comes to mind about what the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!

Written by Tony Hirst

July 2, 2014 at 10:00 am

Confused about “Confused About…” Posts

with 2 comments

Longtime readers will know that every so often I publish a post whose title starts “Confused About”. The point of these posts is to put a marker down in this public notebook about words, phrases, ideas or stories that seem to make sense to everyone else but that I really don’t get.

They’re me putting my hand up and asking the possibly stupid question, then trying to explain my confusion and the ways in which I’m trying to make sense of the idea.

As educators, we’re forever telling learners not to be afraid of asking the question (“if you have a question, ask it: someone even less comfortable than you with asking questions probably has the same question too, so you’ll be doing everyone a favour”), not to be afraid of volunteering an answer.

Of course, as academics, we can’t ever be seen to be wrong or to not know the answer, which means we can’t be expected to admit to being confused or not understanding something. Which is b*****ks of course.

We also can’t be seen to write down anything that might be wrong, because stuff that’s written down takes on the mantle of some sort of eternal truthiness. Which is also b*****ks. (This blog is a searchable, persistent, though mutable by edits, notebook of stuff that I was thinking at the time it was written. As the disclaimer could go, it does not necessarily represent even my own ideas or beliefs…)

It’s easy enough to take countermeasures to avoid citation of course – never publish in formal literature; if it’s a presentation that’s being recorded, put some music in it whose rights owners are litigious, or some pictures of Disney characters. Or swear…. Then people won’t link to you.

Anyway, back to being confused… I think that’s why I post these posts…

I also like to think they’re an example of open practice…

Written by Tony Hirst

July 2, 2014 at 9:03 am

Posted in Anything you want


Anscombe’s Quartet – IPython Notebook

leave a comment »

Anyone who’s seen one of my talks that even touches on data and visualisation will probably know how I like to use Anscombe’s Quartet as a demonstration of why it makes sense to look at data, as well as to illustrate the notion of a macroscope, albeit one applied to a case of N=all where all is small…

Some time ago I posted a small R demo – The Visual Difference – R and Anscombe’s Quartet. For the new OU course I’m working on (TM351 – “The Data Course”), our focus is on using IPython Notebooks. And as there’s a chunk in the course about dataviz, I feel more or less obliged to bring Anscombe’s Quartet in:-)

As we’re still finding our way about how to make use of IPython Notebooks as part of an online distance education course, I’m keen to collect feedback on some of the ways we’re considering using the notebooks.

The Anscombe’s Quartet notebook has quite a simple design – we’re essentially just using the cells as computed reveals – but I’d still be keen to hear any comments about how well folk think it might work as a piece of standalone teaching material, particularly in a distance education setting.

The notebook itself is on github (ou-tm351), along with sample data, and a preview of the unexecuted notebook can be viewed on nbviewer: Anscombe’s Quartet – IPython Notebook.

Just by the by, the notebook also demonstrates the use of pandas for reshaping the dataset (as well as linking out to a demonstration of how to reshape the data using OpenRefine) and the ŷhat ggplot python library (docs, code) for visualising the dataset.
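For anyone who hasn't met the quartet before, here's a quick standalone sketch (not taken from the notebook, and using only the standard library rather than pandas) of why the numbers alone are misleading. The values are the standard published quartet, typed in here rather than read from the sample data file in the repo; pearson is a small helper written for the example.

```python
from statistics import mean

# Anscombe's quartet: sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}
for name, (mean_x, mean_y, r) in summary.items():
    print(f"{name}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```

The four sets agree on these summary statistics to a couple of decimal places, yet look nothing like each other when charted, which is the whole point of the demo.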

Please feel free to post comments here or as issues on the github repo.

Written by Tony Hirst

June 30, 2014 at 1:54 pm

Posted in OU2.0


It’s Not a Watch, It’s an Electronic Monitoring Tag

with one comment

In the closing line of his Observer Networker column on Sunday 23 March 2014 – Eisenhower’s military-industrial warning rings truer than ever – John Naughton/@jjn1 concluded: “we’re witnessing the evolution of a military-information complex”. John also tweeted the story:

Google’s corporate Code of Conduct may begin with Don’t be evil, but I think a precautionary principle of considering the potential for evil should also be applied when trying to think through the possible implications of what Google, and other information companies of the same ilk, could do…

This isn’t about paranoid tin foil hat wearing fantasy – it’s about thinking through how cool it would be to try stuff out… and then oopsing. Or someone realising that whatever can make shed loads of money, and surely it can’t hurt. Or a government forcing the company to do whatever. Or another company with an evilness agnostic motto (such as “maximise short term profits”) buying the company.

“Just” and “all you need to do” are often phrases that unpack badly in the tech world (“just” can be really hard, with multiple dependencies; “all you have to do” might mean you have to do pretty much everything). On the other hand “sure, we can do that” can cover things that are flick of a switch possible, but tend not to be done for policy reasons.

Geeks are optimists – “just” can take hours, days, weeks.. “Sure” can be dangerous. “Erm, well we can, but…” can mean game over when don’t be evil becomes a realised potential for evil.

What would we think if G4S, purveyors of monitoring technologies that have nothing to do with wearables or watches (although your “smart watch” is a tag, right?) bought Skybox Imaging?

What would we think if Securitas had bought Boston Dynamics, to keep up with G4S’ claimed use of surveillance drones?

What if Google, who sell advertising to influence you, or Amazon, who try to directly influence you to buy stuff, had been running #psyops experiments testing their ability to manipulate your emotions? [They do, of course...] Like advertising company Facebook did – Experimental evidence of massive-scale emotional contagion through social networks:

The experiment manipulated the extent to which people (N = 689,003) were exposed to emotional expressions in their News Feed. This tested whether exposure to emotions led people to change their own posting behaviors, in particular whether exposure to emotional content led people to post content that was consistent with the exposure—thereby testing whether exposure to verbal affective expressions leads to similar verbal expressions, a form of emotional contagion. People who viewed Facebook in English were qualified for selection into the experiment.

[The experiment] was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research. [my emphasis]

(For a defense of the experiment, see: In Defense of Facebook by Tal Yarkoni; I think one reason the chatterati are upset is because they remember undergrad psychology experiments and mention of ethics committees…)

Experian collect shed loads of personal data – including personal data they have privileged access to by virtue of being a credit reference checking agency – about you, me, all of us. Then we’re classed, demographically. And this information is sold, presumably to whoever’s buying, (local) government as well as commerce. Do we twitch if Google buys them? Do we care if GCHQ buys data from them?

What about Dunnhumby, processors of supermarket loyalty data among other things? Do we twitch if Amazon buys them?

Tesco Bank just started to offer a current account. Lots more lovely transaction data there:-)

I don’t know how to think through any of this. If raised in conversation, it always comes across as paranoid fantasy. But if you had access to the data, and all those toys? Well, you’d be tempted, wouldn’t you? To see if the “just” is a “just”, or it’s actually a “can’t”, or whether indeed it’s a straightforward “sure”…

Written by Tony Hirst

June 29, 2014 at 6:28 pm

Posted in Anything you want

Data Journalism – Conversations With Data Sources

with one comment

Annotated slides from my opening talk at the University of Lincoln Journalism dept. research day – Data Journalism – Having Conversations with Data:

Written by Tony Hirst

June 28, 2014 at 11:40 am

Posted in Presentation


