OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Tracking Anonymous Wikipedia Edits From Specific IP Ranges

with one comment

Via @davewiner’s blog, I spotted a link to @congressedits, “a bot that tweets anonymous Wikipedia edits that are made from IP addresses in the US Congress”. (For more info, see why @congressedits?, /via @ostephens.) I didn’t follow the link to the home page for that account (doh!), but in response to a question about whether white label code was available, @superglaze pointed me to https://github.com/edsu/anon, a script that “will watch Wikipedia for edits from a set of named IP ranges and will tweet when it notices one”.
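Just to sketch the shape of the idea in Python (the anon script itself is Node.js, so the polling approach, example IP range and output below are my own assumptions, not its actual implementation):

import time
import requests
from ipaddress import ip_address, ip_network

#Ranges to watch - swap in the ranges for whatever organisation you're interested in
WATCHED=[ip_network('137.108.0.0/16')]

def anon_edits():
    #The MediaWiki API can filter recent changes down to anonymous edits;
    #for anonymous edits, the 'user' field holds the editor's IP address
    r=requests.get('https://en.wikipedia.org/w/api.php', params={
        'action':'query','list':'recentchanges','rcshow':'anon',
        'rctype':'edit','rcprop':'title|user|timestamp','format':'json'})
    return r.json()['query']['recentchanges']

while True:
    for rc in anon_edits():
        if any(ip_address(rc['user']) in net for net in WATCHED):
            print(rc['timestamp'], rc['user'], rc['title']) #or tweet it...
    time.sleep(60) #naive polling - no de-duplication between polls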

It turns out the script was inspired by @parliamentedits, a bot built by @tomscott using IFTTT that “tracks edits to Wikipedia made from Parliamentary IP addresses”, possibly drawing on a list of IP ranges operated by the House of Commons gleaned from this FOI request?

Nice…

My immediate thought was to set up something to track edits made to Wikipedia from OU IP addresses; then I idly wondered whether a set of feeds for tracking edits from HEIs in general might also be useful (something to add to the UK University Web Observatory, for example?)

To the extent that Wikipedia represents an authoritative source of information, for some definition of authoritative(?!), it could be interesting to track the “impact” of our foolish universities in terms of contributing to the sum of human knowledge as represented by Wikipedia.

It’d also be interesting to track the sorts of edits made by anonymous and named editors from HEI IP ranges. I wonder what classes they might fall into?

  1. edits from the marketing and comms folk?
  2. ego and peer ego edits, eg from academics keeping the web pages of other academics in their field up to date?
  3. research topic edits – academics maintaining pages that relate to their research areas or areas of scholarly interest?
  4. teaching topic edits – academics maintaining pages that relate to their teaching activities?
  5. library edits – edits made from the library?
  6. student edits – edits made by students as part of a course?
  7. “personal” edits – edits made by folk who class themselves as Wikimedians and just happen to make edits while they are on an HEI network?

My second thought was to wonder to what extent might news and media organisations be maintaining – or tweaking – Wikipedia pages? The BBC, for example, who have made widespread use of Wikipedia in their Linked Data driven music and wildlife pages.

Hmmm… news… reminds me: wasn’t a civil servant sacked recently for making abusive edits to a Wikipedia page? Ah, yes: Civil servant fired after Telegraph investigation into Hillsborough Wikipedia slurs, as my OU colleague Andrew Smith suggested might happen.

Or how about other cultural organisations – museums and galleries for example?

Or charities?

Or particular brands? Hmm…

So I wonder: could we try to identify areas of expertise on, or attempted/potential influence over, particular topics by doing reverse IP lookups on the addresses behind anonymous edits to pages focussed on those topics? This sort of mapping activity pivots the idea of visualising related entries in Wikipedia towards mapping IP ranges, and perhaps from those the locations and individuals associated with maintaining a set of resources around a particular topic area (cf. Visualising Delicious Tag Communities).
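For what it’s worth, the reverse lookup step itself is easy enough – a minimal sketch, using nothing more than a reverse DNS query (which will only resolve where the owning organisation has set up reverse records):

import socket

def reverse_lookup(ip):
    #Map an IP address back to a hostname - eg something.open.ac.uk -
    #as a rough guess at the owning organisation
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return None

The harder part is getting from topic pages to candidate IP addresses in the first place.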

I think I started looking at how we might start to map IP ranges for organisations once….? Erm… maybe not, actually: it was looking up domains a company owned from its nameservers.

Hmm.. thinks… webstats show IP ranges of incoming requests – can we create maps from those? In fact, are there maps/indexes that give IP ranges for eg companies or universities?

I’m rambling again…

PS Related: Repository Googalytics – Visits from HEIs which briefly reviews the idea of tracking visits to HEI repositories from other HEIs…

Written by Tony Hirst

July 11, 2014 at 7:45 am

Posted in Anything you want

Foolish is as foolish does… fragments…

with one comment

It’s been that CDSA paperwork time of year again, and if nothing else it has forced me to start pulling together some fragments of ideas backed up by other peoples’ (or people’s – I never can remember) words…

So here are some fragments that are, I think, aligned to some of the things that I’ve thought for a long time and have been thinking again more recently, things that resonated with me as I read them just now…

The way in to this, this time round, was part of a talk by Prof Richard Keeble at the University of Lincoln School of Journalism Research Symposium last week, in which he mentioned the role of the fool as a safety valve or regulator in the classical court…

First up, from “Class Clown and Court Jester”, David Chevreau, MA thesis, UBC, April 1994:

p10-11
The fool can fill a number of specific roles in the group. He represents the rejected values, lost causes, fiascoes, and incompetencies of the larger gathering. His lowly yet valued position in the office of scapegoat and butt of humour gives him license to depart from the group’s accepted social norms with a unique impunity.
Footnote 10 – The fool is a rebel, outcast, prophet and whipping boy, and his office is a well defined social phenomenon.

As a scapegoat, you can speak the truth and people can choose to listen or not. The truth can be ignored if spoken by the fool, because the fool said it…

p13
Unperturbed by misplaced authority, the Class Clown seizes an opportunity either the naivete of the natural innocent or the insight of the wizened fool.

I’m often confused…

p16
The Fool exists on the fringe of social convention, and thus has a license which frees him from responsibility and consequence.

Scruffy hippy…

p18
It is difficult to accuse a Fool of meaning anything, for his foolish words may be nothing more than the babbling of the idiot or a disguise which reveals hidden truth to some but appears to be senseless chatter to others.

WTF is he on about?

p22
Transgressing the bounds of propriety in his failure to cope with convention the Fool does not suffer the usual loss of dignity associated with social failure.

FFS… swearing again…

p22
Standing at the fringe, the Fool may be a disinterested truth-teller whose apparent madness masks his breadth of perspective. The Fool is a detached observer who lives at the boundary not just between order and chaos but also between what is and what appears to be, and is often confused with the silly and deluded.

So, I deserve a Chair, right?!:-)

p23
The Fool perceives that the world has more to do with seeming than with seeing, for too much of our world is actually unknown and irrational. To take the imposition of order too seriously is the height of folly, and so the Fool, standing aside, sees through the illusion.

Ooh, big data… and squillionty billionty pounds will be made from open data… innit.

So that was that one…

Here’s another take… As I read (/red/) this, I also thought about my own role within the university… As above, so below…? “Institutional Heterogeneity and Change: The University as Fool”, Donncha Kavanagh, Organization, Volume 16(4): 575–595, ISSN 1350-5084, 2009.

First, the scene is set…

p577 “Detailed study of the history of the University suggests that it is an institution that acts and has a role akin to the Fool in the royal court of medieval times.”

The paper then explores this idea in narrative form…

Fool as normative narrator:
p586 “The Fool is a story-teller, but its stories are always embedded in a framework of norms and values that connect the moment into longer conversations over time and space.”

There is a context to what we do, and a tradition that informs it, in both form and in content…

p587 “Akin to the medieval fool, who is not there to merely tell stories, the University is expected to provide a normative narrative or a critical interpretation of the world. … the University’s long tradition of academic freedom mirrors the Fool’s position as the Sovereign’s independent critic. … The university does not just (re-)tell stories, parables, and proverbs. Its power also comes about from its material ability to sort things out (Bowker and Star, 1999 [Bowker, G. and Star, S. L. (1999) Sorting Things Out: Classification and its Consequences. London: MIT Press.]); it is a sorter par excellence.”

The university helps make sense of the world… it can do this by putting things into perspective, or ordering/organising them, in a particular way (that is, it can “sort” them, as you might sort a sock drawer, albeit one that doesn’t necessarily contain any actual pairs of socks…)

p588 “Through these twin processes of normative narrating and sorting the university constructs and maintains what I term the semiotic nexus. The semiotic nexus gives meaning to an institution — be it the University, its sovereign or one of the other institutions in the realm—through telling a multi-part, compelling, value-laden tale about the institution and its place in the world. The university is not the only institution engaged in this process of ‘making meaning’—narrating is a form of theorizing that everyone engages in—but it plays a central role in determining what counts as knowledge, as well as defining what is valuable, peripheral, obscene, sacred, profane, reputable, opinion, fact, etc. The University, like the Fool, personifies truth and reason, in that it is required to tell the truth, to abolish myth, and to distinguish fact from mere opinion. In other words, the University’s normative story-telling ability allied to its sorting practices and technologies are basic to how the University realizes its imagined community of academics, how it at once becomes an institution itself, and also how it maintains and sustains the semiotic nexus underpinning other institutions. In other words, these practices play a significant role in the process of institutionalization.”

I’ve noticed that people find it very hard to play… I can play all day… Erm… I can only play?!

p589 Play in the fool
“The Fool is a ludic spirit within the institutional complex, and play—a free activity standing outside of and opposed to the seriousness of ordinary life (Huizinga, 1955)—is its modus operandi. As with the child, the Fool is allowed, expected and given time and space to play. Through playing with language the Fool sparks a new (yet old) understanding of the here and now. This incandescent quality at once makes events alive—giving them immediate meaning—while simultaneously framing them within a longer temporal structure or longue durée that articulates the empirical with a transcendent truth. Each ‘play’ then endures as a new mental creation, to be repeated and retained in memory, echoing older refrains of truth and tradition. Following Huizinga, play is primordial and because of its close links with the sacred, it works to keep old norms and beliefs alive. The Fool as playmaker extraordinaire is central to this continual process of institutional re-creation through which an institution breathes, lives and renews itself.
Yet, because it takes work to create order within play, play always (subliminally) reminds us that the world is fundamentally chaotic and that any meaning within this chaos is always provisional and artificial. The Fool’s work of play then is to institutionalize order and at once to open up order to de-institutionalization. Through its role as playmaker, the Fool puts an institution ‘into play’, which means that work must be done to either recreate or de-stabilize the institution. In this way, the Fool’s ability and license to play is paradoxically central to both institutionalization and de-institutionalization.”

FWIW, seeing that mention of Huizinga, I’m reminded of how play is a serious business… see for example Getting Philosophical About Games. The Magic Circle applies similarly to the little closed off worlds we lock ourselves into when doing a research project. Don’t let anyone ever tell you that research is anything more than play (though it’s often less…). Note also that that Digital Worlds blog post was itself an ‘output’ from an uncourse I played with creating a few years ago. The material ended up being used by an actual OU course that followed on. I don’t think anyone in the OU really got it. Then MOOC hype shite came along and nothing really changed.

Playing the fool is a responsible job. If you aren’t responsible, you can step beyond the bounds of playful foolishness and start “stirring”, or trying to use the cloak of foolishness to cause trouble directly…

“Another transgression occurs when the Fool cannot see beyond the play-making; i.e. the Fool becomes a Trickster, a Lucifer figure working solely to undermine and destroy order. This happens when the Fool forgets that part of the Fool’s role is sustaining order in the institutional complex.”

Beware Anansi taking over, in other words…

The “Emperor’s New Clothes” is one of my favourite stories. The boy is portrayed as foolish in his innocence, but he speaks a truth as a naif, or innocent. We see how corollaries to that story can be played out by the wise fool, rather than the truth-telling innocent…

p590 fool as educator
“Pursuing the metaphor of the Fool presents an interesting perspective on the University as an educational institution. While the Fool is an educator of sorts, she does not really ‘own’ knowledge that she ‘passes on’ as per our conventional understanding of pedagogy. Unlike the teacher who is usually cast as the learner’s caring coach, the Fool is an irritant, a provocateur, whose modus operandi is to provoke new wisdom in others. The Fool’s approach is, quite literally, to play the fool, acting as a lucid and ludic lens through which others perceive and recognize profound truths, truths that indeed may be lost in the conventions of learning and scholarship. The fool (like the child) is not expected to ‘know’ anything and is therefore free to act the fool, because she cannot, by definition, ‘know any better’. Paradoxically, this epistemic vacuum is also a potential source of great wisdom, which is why the idea of the ‘wise fool’ has such a long tradition. Moreover, the oxymoron ‘wise fool’ is also reversible: he that believes himself to be wise is necessarily foolish. For the Fool also reminds us that knowledge of the mystery of life is always beyond even the wise; at best we can only know that there is much of which we are and can only be ignorant.

[The university] must be the institutional manifestation of an oxymoron, remembering that this word comes from the Greek oxumōron, meaning ‘pointedly foolish’.”

– Fin

Written by Tony Hirst

July 10, 2014 at 6:11 pm

Posted in Anything you want


Swipe-ify Next and Previous Links?

with 2 comments

I’ve just been looking at course materials in the OU’s Moodle VLE, which have things like this in them (my highlighting):

[Screenshot: ‘previous’ and ‘next’ links from Unit 1 Overview of H818, section 1.3, “Introducing the idea of an ‘open studio’”]

That is, previous and next links, in quite small type. Increasing amounts of our materials are presented as HTML docs, with sections automatically segmented into separate HTML pages. So for example, here’s the navigation (i.e. page chunking) for a randomly selected unit of a randomly selected course:

[Screenshot: page-by-page navigation listing for a course unit]

The same materials are also made available in a variety of document formats:

[Screenshot: the list of alternative document formats]

One of the disadvantages of the HTML page link click model is that it requires mouse cursor movement and click actions. I’m not sure how quickly you can tab to the previous and next links, or whether keyboard shortcuts are available. (If they do exist: a) what are they; b) where would I learn about them as a student?)

On a tablet, keyboard shortcuts aren’t really relevant – what might be useful instead would be being able to swipe left or right for the previous/next actions. Maybe the VLE supports that already? Or maybe the browser ties swipes to the forward and back history operations, which would make overriding them for previous and next link actions problematic (so maybe use an upswipe and downswipe, or a diagonal swipe, instead?)

I guess what I’m really wondering is: is there a progressive enhancement library that allows swipe gestures to be tied to click actions? And if so, if it were implemented in the VLE (assuming the VLE doesn’t already provide a mobile theme that supports this sort of action), what would it actually feel like to use?

Written by Tony Hirst

July 4, 2014 at 3:34 pm

Posted in OU2.0


Lazyweb Request – Node-RED & F1 Timing Data

leave a comment »

A lazyweb request, because I’m rushing for a boat, going to be away from reliable network connections for getting on for a week, and would like to be able to play from a running start when I get back next week…

In the context of the Tata/F1 timing data competition, I’d like to be able to have a play with the data in Node-RED. A feed-based, flow/pipes-like environment, Node-RED has been on my “should play with” list for some time, and this provides a good opportunity.

The data as provided looks something like:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

as a text file. (In the wild, it would be a real time data feed over http or https.)

What I’d like as a crib to work from is a Node-RED demo that has:

1) a file reader that reads the data in from the data file and plays it in as a stream in “real time” according to the timestamps, given a dummy start time;

2) an example of handling state – eg keeping track of driver number. (The row is effectively race position. Looking at column 2 (driverNumber), we can see what position a driver is in. Keep track of (row, driverNumber) pairs and, if a driver changes position, flag it along with what the previous position was);

3) an example of appending the result to a flat file – for example, building up a list of statements “Driver number x has moved from position M to position N” over time.

Shouldn’t be that hard, right? And it would provide a good starting point for other people to be able to have a play without hassling over how to do the input/output bits?
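As a seed for the first item, here’s a rough Python sketch of the timed replay idea (not Node-RED – the Node-RED equivalent is exactly what I’m asking for – and the speedup parameter is just my own convenience):

import time
from lxml import etree

def replay(path, speedup=1.0):
    #Yield each transaction element in turn, sleeping between yields
    #according to the gaps between message timestamps (HH:MM:SS.mmm)
    prev=None
    for line in open(path):
        line=line.strip()
        if not line.startswith('<transaction'):
            continue
        el=etree.fromstring(line)
        h,m,s=el.attrib['timestamp'].split(':')
        t=int(h)*3600+int(m)*60+float(s)
        if prev is not None and t>prev:
            time.sleep((t-prev)/speedup)
        prev=t
        yield el

#for el in replay('data/F1 Race.txt', speedup=10): print(el.attrib)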

Written by Tony Hirst

July 4, 2014 at 6:15 am

Posted in Tinkering


Confused About “The Right to be Forgotten”. Was: Is Web Search Getting Harder?

with one comment

[Seems I forgot to post this, though I started drafting it on May 19th... Anyway, things seem to have moved on a bit...]

A search related story in the news last week reported on a ruling by the European Union Court of Justice that got wide billing as a “right to be forgotten” (eg BBC News: EU court backs ‘right to be forgotten’ in Google case).

Here’s another example of “censorship”? WordPress not allowing me to link to a URL because it insists on rewriting the & characters in it – here’s the actual link:

http://curia.europa.eu/juris/document/document.jsf?text=&docid=152065&pageIndex=0&doclang=EN&mode=req&dir=&occ=first&part=1&cid=250828

[Screenshot: the CURIA ruling document]

For stories like this, I try to look at the original ruling but also tend to turn to law blogs such as my colleague Ray Corrigan’s B2fxxx (though he hasn’t posted on this particular story yet?) or Pinsent Mason’s Out-law (eg Out-law: Google data protection ruling has implications for multi-faceted global businesses) to find out what was actually said.

Here’s the gist of the rulings:

  • “the activity of a search engine consisting in finding information published or placed on the internet by third parties, indexing it automatically, storing it temporarily and, finally, making it available to internet users according to a particular order of preference must be classified as ‘processing of personal data’ within the meaning of Article 2(b) when that information contains personal data and, second, the operator of the search engine must be regarded as the ‘controller’ in respect of that processing, within the meaning of Article 2(d).” So what? A person has a right to object to a data controller about the way their data is processed and can obtain “the rectification, erasure or blocking of data” because of its “incomplete or inaccurate nature”.
  • “processing of personal data is carried out in the context of the activities of an establishment of the controller on the territory of a Member State, within the meaning of that provision, when the operator of a search engine sets up in a Member State a branch or subsidiary which is intended to promote and sell advertising space offered by that engine and which orientates its activity towards the inhabitants of that Member State.” So Google was found to be “established” in EU member state territories. Are there any implications from that ruling as regards the tax situation, I wonder?
  • Insofar as the processing of personal data that has been subject to a successful objection goes, “the operator of a search engine is obliged to remove from the list of results displayed following a search made on the basis of a person’s name links to web pages, published by third parties and containing information relating to that person, also in a case where that name or information is not erased beforehand or simultaneously from those web pages, and even, as the case may be, when its publication in itself on those pages is lawful.” Note there are limits on this in the case of legitimate general public interest.
  • The final ruling seems to at least admit the possibility that folk can request data be taken down without them having to demonstrate that it is prejudicial to them? “when appraising the conditions for the application of those provisions, it should inter alia be examined whether the data subject has a right that the information in question relating to him personally should, at this point in time, no longer be linked to his name by a list of results displayed following a search made on the basis of his name, without it being necessary in order to find such a right that the inclusion of the information in question in that list causes prejudice to the data subject. As the data subject may, in the light of his fundamental rights under Articles 7 and 8 of the Charter, request that the information in question no longer be made available to the general public on account of its inclusion in such a list of results, those rights override, as a rule, not only the economic interest of the operator of the search engine but also the interest of the general public in having access to that information upon a search relating to the data subject’s name.” “However, that would not be the case if it appeared, for particular reasons, such as the role played by the data subject in public life, that the interference with his fundamental rights is justified by the preponderant interest of the general public in having, on account of its inclusion in the list of results, access to the information in question.”

[Update, July 2nd, 2014]
It seems as if things have moved on – Google is publishing notices, in the google.co.uk territory at least, to the effect that “Some results may have been removed under data protection law in Europe” [my emphasis].

[Screenshot: Google search results for a name, showing the “Some results may have been removed” notice]

The FAQ describes what’s happening thus:

When you search for a name, you may see a notice that says that results may have been modified in accordance with data protection law in Europe. We’re showing this notice in Europe when a user searches for most names, not just pages that have been affected by a removal.

The media are getting uppity about it of course, eg Peston completely misses the point, as well as getting it wrong?

[Screenshot: Google search results for the BBC News story “Why has Google cast me into oblivion?”]

In fact, it seems as if the BBC themselves are doing a much better job of obliviating Peston from their own search results…

[Screenshot: BBC site search failing to surface the Peston article]

What all the hype this time around seems to be missing – as with the reporting around the original ruling – is the interpretation that the court ruled on about the behaviour of the search engines insofar as they are deemed to be processors of “personal data”. (Of course, these companies also process personal data as part of the business of operating user accounts, but the business functions – of managing user accounts versus operating a search index and returning search results from public queries applied to it – are presumably sandboxed as far as data protection legislation goes.)

If Google is deemed to be a data controller of personal data that is processed as a result of the way it operates its search index, it presumably means that I can make a subject access request about the data that the Google search index holds about me (as well as the subject access requests I can make to the bit of Google that operates the various Google accounts that I have registered).

As far as the loss of the “right to discover” that the hacks are banging on about as a consequence of “the right to be forgotten” goes, does this mean that Google is the start and end point of their research activity? (And that’s putting aside the point that most folk: a) don’t look past the first few results; b) are rubbish at searching. As far as search engine ranking algorithms go – erm, what sort of “truth” do you think they reveal? How do you think Google ranks results? And how do you think it comparatively ranks content generated years ago (when links were more persistent than a brief appearance in Twitter feeds and Facebook streams) against content generated more recently (that doesn’t set up persistent link structures)?)

Don’t they use things like Nexis UK?

Or, if anything other than Google is too hard, they can just edit the URL to use google.com rather than google.co.uk.

This is where it probably also starts to make sense to look back at the original ruling and spend some time reading it more closely. Is LexisNexis a data controller, subject to data protection legislation, based on its index of news media content? Are the indices it operates around court cases similarly covered?

Written by Tony Hirst

July 3, 2014 at 1:35 pm

Posted in Policy

F1 Doing the Data Visualisation Competition Thing With Tata?

leave a comment »

Sort of via @jottevanger, it seems that Tata Communications has announced the first challenge in the F1® Connectivity Innovation Prize, to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post, Tata launches F1® Connectivity Innovation Prize, dated “10 Jun 2014”? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Connectivity Innovation Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

- Position
- Car number
- Driver’s name
- Fastest lap time
- Gap to the leader’s fastest lap time
- Sector 1 time for the current lap
- Sector 2 time for the current lap
- Sector 3 time for the current lap
- Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

[Screenshot: an example timing screen, from the F1 Connectivity Innovation Prize Challenge 1 brief]

A technical manual is also provided to help make sense of the data files.

[Screenshot: Basic Timing Data Protocol Overview (PDF, 15 pages)]

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

...
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
...

Then qualifying:

...
<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>
...

Then the race:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

We can parse the data files in Python using an approach something like the following:

from lxml import etree

xml_doc='data/F1 Practice.txt' #path to one of the sample data files

#Each line of the file is a separate XML transaction element
pl=[]
for xml in open(xml_doc, 'r'):
    pl.append(etree.fromstring(xml))

pl[100].attrib
#{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}

pl[100][0].attrib
#{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}

A few things are worth mentioning about this format… Firstly, the identifier identifies the message type, rather than the message: each transaction message appears instead to be uniquely identified by its messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.

Secondly, within a session, several screen views are possible, each associated with data labelled with a particular identifier. This means the data feed is effectively powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.

It’s not obvious how we can construct a row-based transaction update, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for timestamp=”14:57:11.681″ and row=”3″. If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the pandas DataFrame .align() method.

[I think I'm thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now... Should have looked more closely the first time... Doh!]

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect they are likely to be, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth within which all changes to a row are likely to occur, we could perhaps round times as the basis for the join?
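By way of a quick sketch of the bundling idea (working from the pl list parsed above, and bundling on exact timestamp and row rather than doing any rounding):

import pandas as pd

#One record per cell update
cells=pd.DataFrame([{'timestamp':el.attrib['timestamp'],
                     'row':el[0].attrib['row'],
                     'column':el[0].attrib['column'],
                     'value':el[0].attrib['value']}
                    for el in pl if len(el) and 'row' in el[0].attrib])

#One bundle per (timestamp, row) pair: a dict of column->value updates
bundles=cells.groupby(['timestamp','row']).apply(
    lambda g: dict(zip(g['column'],g['value'])))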

Even with a bundling, we still don’t have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point in time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

import datetime
import pandas as pd
from lxml import etree

#Hacky load and parse of each row in the datafile
pl=[]
for xml in open('data/F1 Practice.txt', 'r'):
    pl.append(etree.fromstring(xml))

#Column mappings
practiceMap={
    '1':'classpos',
    '2':'racingNumber',
    '3':'name',
    '4':'laptime',
    '5':'gap',
    '6':'sector1',
    '7':'sector2',
    '8':'sector3',
    '9':'laps',
    '21':'sector1_best',
    '22':'sector2_best',
    '23':'sector3_best'
}

#Dataframe for current state timing screen - one column per mapped
#timing screen cell, plus a column for each cell's colour
df_practice_pos=pd.DataFrame(columns=['timestamp','time']
    +[c+s for c in practiceMap.values() for s in ('','_colour')],
    index=range(50))

def parse_practice(p,df_practice_pos):
    #Type 101 transactions carry the practice classification updates
    if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
        if p[0].attrib['column'] not in ['10','21','22','23']:
            colname=practiceMap[p[0].attrib['column']]
            row=int(p[0].attrib['row'])-1
            #Use .ix[row,col] rather than chained .ix[row][col] so the
            #assignment actually updates the dataframe
            df_practice_pos.ix[row,'timestamp']=p.attrib['timestamp']
            tt=p.attrib['timestamp'].replace('.',':').split(':')
            df_practice_pos.ix[row,'time']=datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
            df_practice_pos.ix[row,colname]=p[0].attrib['value']
            df_practice_pos.ix[row,colname+'_colour']=p[0].attrib['colour']
    return df_practice_pos

for p in pl[:2850]:
    df_practice_pos=parse_practice(p,df_practice_pos)
df_practice_pos

(See the notebook.)

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:

[Chart: race classification position over time, by driver number]
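The trace behind that sort of chart is easy enough to grab. For example, assuming pl contains the parsed race feed elements, here’s a fragment pulling out (timestamp, position) pairs for car number 77:

trace=[(el.attrib['timestamp'], int(el[0].attrib['row']))
       for el in pl
       if len(el)
       and el.attrib['identifier']=='101'
       and el[0].attrib.get('column')=='2'
       and el[0].attrib.get('value')=='77']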

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
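The shape of that approach, as a minimal pymongo sketch (the database and collection names are made up for the example):

from pymongo import MongoClient

db=MongoClient()['f1timing']

def upsert_cell(el):
    #Key each document by (timestamp, row) and $set each cell update
    #against it, so updates from separate feed elements that share a
    #timestamp and row accumulate in the same document
    d=el[0].attrib
    key='{}_{}'.format(el.attrib['timestamp'], d['row'])
    db.race.update_one({'_id':key},
                       {'$set':{'col_'+d['column']:d['value']}},
                       upsert=True)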

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to the wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… When I get a couple of clear hours, I’ll try to push the stuff that’s there out onto Leanpub as a preview…

Written by Tony Hirst

July 2, 2014 at 10:38 pm

Posted in f1stats, Rstats


AP Business Wire Service Takes on Algowriters

with 2 comments

Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).

The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.

Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012”) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
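As a toy example of the sort of thing I mean (made up data, and a deliberately naive sentence template):

def mom_sentence(name, series):
    #series: a list of (month, value) pairs, most recent last
    (m1,v1),(m2,v2)=series[-2:]
    if v2==v1:
        return '{} was unchanged month on month at {} in {}'.format(name, v2, m2)
    direction='up' if v2>v1 else 'down'
    return '{} went {} month on month, from {} in {} to {} in {}'.format(
        name, direction, v1, m1, v2, m2)

print(mom_sentence('Unemployment rate', [('Sep 2013', 7.6), ('Oct 2013', 7.4)]))
#Unemployment rate went down month on month, from 7.6 in Sep 2013 to 7.4 in Oct 2013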

In this respect, something that’s been on my to do list for way too long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).

Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.

For the longer term, on my “to ponder” list is what something like “The Grammar of Graphics” might look like for data textualisation. (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)

For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?

I think I perhaps need to mock-up some examples to see if anything comes to mind – what the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!

Written by Tony Hirst

July 2, 2014 at 10:00 am
