Archive for the ‘Infoskills’ Category
Reading through another wonderful post on the FullFact blog last night (Full Fact sources index: where to find the information you need), I noticed that the resources linked to from that post were being redirected via a Google URL:
A tweet confirmed that this wasn’t intentional, so what had happened? I gather the workflow used to generate the post was to write it in Google docs, and then copy and paste the rich/HTML text into a rich text editor in Drupal, although I couldn’t recreate this effect (and nor could FullFact). However, suitably suspicious, I started having a play, writing a simple test document in Google docs:
The Google doc automatically links the test URL I added to the document. (This is often referred to as “linkification” – if a piece of text is recognised as something that looks like a URL or web link, it gets rewritten as a clickable link. Typically, you might assume that the link you’ll now be clicking on is the link that was recognised. This may be a bad assumption to make…) If you hover over the URL as written in the document, you get a tooltip that suggests the link is to the same URL. However, if you hover over the tooltip listed URL, (or click on it) you can see from the indicator in the bottom left hand corner of the browser what the actual URL you’re clicking on is. Like this:
In this case, the link you’ll actually click on is a referral to the original link via a Google URL. This one, in fact:
What this means is that if I click on the link, Google tracks the fact that the link was clicked on. From the value of the usg variable (in this case, AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A) it presumably also knows the source document containing the link and whatever follows from that.
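If you want to check links like this in bulk, here's a minimal Python sketch for unpicking one. The wrapped URL is made up: I'm assuming the usual Google redirect shape, in which the destination is carried in a q parameter alongside the usg value, and the example.com target is purely hypothetical. Only the usg value comes from my test document.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_redirect(url):
    """Return (target, usg) for a Google redirect URL, or (url, None) otherwise."""
    parsed = urlparse(url)
    # Assumption: redirects look like https://www.google.com/url?q=<target>&usg=<token>
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        params = parse_qs(parsed.query)
        target = params.get("q", [url])[0]
        usg = params.get("usg", [None])[0]
        return target, usg
    return url, None

# Hypothetical wrapped link; the q target is invented for illustration
wrapped = ("https://www.google.com/url?q=http%3A%2F%2Fexample.com%2Ftest"
           "&sa=D&usg=AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A")
target, usg = unwrap_google_redirect(wrapped)
print(target)
print(usg)
```

Links that don't match the assumed redirect pattern are passed through untouched, so the function can be run over every link in an exported document.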
Hmmm… If I publish the document, the Google rewrite appears to be removed:
There are also several export options associated with the document:
So what links get exported?
Here’s the Word export:
That seems okay – no tracking. How about odt?
That looks okay too. RTF and HTML export also seem to publish the “clean” link.
What about PDF?
Hmm… so tracking is included here. So if you write a doc in Google docs that contains links that are autolinked, then you export that doc as PDF and share it with other folk, Google will know when folk click on that link from a copy of that PDF document and (presumably) the originally authored Google docs document (and all that that entails…)
How about if we email a doc as a PDF attachment to someone from within Google docs:
So that seems okay (untracked).
What’s the story then? FullFact claimed they cut and paste rich HTML from Google docs into a rich text editor and the Google redirection attack was inserted into the link. I couldn’t recreate that, and nor could the FullFact folk, so either there are some Google “experiments” going on, or the workflow was misremembered.
In my own experiments, I got a Google redirection from clicking links within my original document, and from the exported PDF, but not from any other formats?
So what do we learn? I guess this at least: be aware that when Google linkifies links for you, it may be redirecting clicks on those links through Google tracking servers. And that these tracked links may be propagated to exported and/or otherwise shared versions of the document.
PS see also Google/Feedburner Link Pollution or More Link Pollution – This Time from WordPress.com for more of the same, and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There for a quick overview of what might happen when you actually land on a page…
Link rewriters are, of course, to be found in lots of other places too…
Twitter, for example, actually wraps all shared links in its t.co wrapper:
Delicious (which I’ve stopped using – I moved over to Pinboard) also uses its own proxy for clicked on stored bookmarks…
If you have any other examples, particularly of link rewriting/annotation/pollution where you wouldn’t expect it, please let me know via the comments…
I thought this was handy on the OER-DISCUSS mailing list:
Our copyright officer writes:
… US Copyright ‘Fair Use’ or S29 copying for non-commercial research and private study which allows copying but the key word here is ‘private’. i.e. the provisos are that you don’t make the work or copies available to anyone else.
Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.
S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works i.e. books journals etc and
b) negated because licences are now available S.36 (3)
S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broadcasts or sound recordings.
The only educational exception students can rely on is s.32(3) for Examination although this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award and any further dealing with the work after the examination process, becomes infringement.
I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)
There is also exception s.30 for Criticism or Review, which is a general exception for all, and the copying is necessary for a genuine critique or review of it.
If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons), would be highly recommended.
Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library
Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx
I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.
What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:
In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.
It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:
This is (sort of) how I’m representing data I’ve pulled from OpenCorporates, data that starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature: exact string matching on the director search, by putting the director name in double quotes…;-), and then captures which of those companies share at least two directors with the original company.
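The last step of that trawl (keeping only companies that share at least two directors with the seed) can be sketched in plain Python. This is an illustration rather than the scraper's actual code: the function name is my own invention, the pairs are the toy C/D labels from the sketch network, and the OpenCorporates lookups that would produce real pairs aren't shown.

```python
from collections import defaultdict

def companies_sharing_directors(pairs, seed, min_shared=2):
    """Find companies sharing at least min_shared directors with the seed company."""
    directors_of = defaultdict(set)
    for company, director in pairs:
        directors_of[company].add(director)
    seed_directors = directors_of[seed]
    shared = {}
    for company, directors in directors_of.items():
        if company == seed:
            continue
        common = directors & seed_directors
        if len(common) >= min_shared:
            shared[company] = common
    return shared

# Toy "company-director" relations (hypothetical labels, not OpenCorporates data)
pairs = [("C1", "D1"), ("C2", "D2"), ("C2", "D3"), ("C3", "D2"), ("C3", "D3")]
result = companies_sharing_directors(pairs, seed="C2")
print(result)
```

With C2 as the seed, only C3 survives the filter, because it is the only other company with both D2 and D3 on its board.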
In order to turn the data, which looks like this:
into a map that resembles something like this (this is actually a view over a different dataset):
we need to do a couple of things. Working backwards, these are:
- use some sort of tool to generate a pretty picture from the data;
- get the data out of the table into the tool using an appropriate exchange format.
Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:
- either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or
- the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.
The dataflow I’m working towards looks something like this:
networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js), and a JSON graph representation (as used by d3.js and maybe sigma.js (example plugin?)).
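By way of a rough sketch of what that looks like in practice, here's a two node graph serialised both ways. The node labels and the "kind" attribute name are hypothetical choices of mine, not something networkx or the trawler requires.

```python
import json

import networkx as nx
from networkx.readwrite import json_graph

G = nx.Graph()
# Hypothetical nodes; "kind" is just an illustrative attribute name
G.add_node("C1", kind="company")
G.add_node("D1", kind="director")
G.add_edge("C1", "D1")

# GEXF as a string (nx.write_gexf(G, "out.gexf") would write a file instead)
gexf_text = "\n".join(nx.generate_gexf(G))

# Node-link style JSON representation of the same graph
json_text = json.dumps(json_graph.node_link_data(G))

print(gexf_text[:60])
print(json_text)
```

The exact key names in the node-link JSON depend on the networkx version installed, so it's worth eyeballing the output before wiring it in to a particular JavaScript library.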
As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as a gexf file (default) or as a JSON file using URLs of the form:
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
This view can therefore be used to easily export data from my OpenCorporates trawler as a gexf file that can be used to easily import data into the Gephi desktop tool, or provide a URL to some JSON data that can be visualised using a JavaScript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog. If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-).
We can also use networkx to publish data based on processing the network. The example graph above shows a network with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:
or those companies that are connected by virtue of sharing common directors:
(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)
The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view:
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=companies gives the view over companies connected by virtue of the fact that they share one or more common director;
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=officers&output=json gives a view over how directors are connected based on being co-directors of one or more of the same companies.
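For anyone wanting to try this sort of reduction locally, here's a minimal sketch using the weighted projection function in networkx's bipartite module. I'm not claiming this is exactly what the scraperwiki view calls; the graph is the toy C/D example rather than real trawl data.

```python
import networkx as nx
from networkx.algorithms import bipartite

companies = ["C1", "C2", "C3"]
directors = ["D1", "D2", "D3"]

# Bipartite graph: edges only ever run between a company and a director
B = nx.Graph()
B.add_nodes_from(companies, bipartite=0)
B.add_nodes_from(directors, bipartite=1)
B.add_edges_from([("C1", "D1"), ("C2", "D2"), ("C2", "D3"),
                  ("C3", "D2"), ("C3", "D3")])

# Companies linked by shared directors; weight = number of directors in common
company_view = bipartite.weighted_projected_graph(B, companies)
# Directors linked by shared companies; weight = number of companies in common
director_view = bipartite.weighted_projected_graph(B, directors)

print(list(company_view.edges(data=True)))
print(list(director_view.edges(data=True)))
```

The weight on each projected edge is the count that could be mapped onto line thickness in a visualisation tool.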
As the view is parameterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)
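To make the “glue logic” idea a little more concrete, here's a sketch of how the query string on one of those URLs might be unpacked into options. The function name and the idea of treating a missing reduce parameter as "no reduction" are my assumptions; the gexf default follows the behaviour described for the view.

```python
from urllib.parse import urlparse, parse_qs

def view_options(url):
    """Pull the key, reduce and output settings out of a view URL."""
    params = parse_qs(urlparse(url).query)
    return {
        "key": params.get("key", [None])[0],
        # Assumption: no reduce parameter means the full company-director graph
        "reduce": params.get("reduce", [None])[0],
        # gexf is the default output format
        "output": params.get("output", ["gexf"])[0],
    }

opts = view_options(
    "https://views.scraperwiki.com/run/opencorporates_trawler_gexf/"
    "?key=compassFood2_2&reduce=officers&output=json"
)
print(opts)
```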
PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.
I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…
I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).
So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:
We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:
We can also calculate ratios and then plot them – followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):
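As a sketch of how those two ratios might be calculated (the function and argument names are mine, and the figures in the example are made up rather than taken from any real account):

```python
from datetime import datetime, timezone

def follower_ratios(followers, statuses, created_at, now=None):
    """Return (followers per day, followers per status update) for an account."""
    now = now or datetime.now(timezone.utc)
    # Treat brand new accounts as one day old to avoid dividing by zero
    days = max((now - created_at).days, 1)
    per_day = followers / days
    # Accounts that have never tweeted have no meaningful per-status ratio
    per_status = followers / statuses if statuses else None
    return per_day, per_status

# Made-up account: 5480 followers, 1000 tweets, registered at the start of 2010
created = datetime(2010, 1, 1, tzinfo=timezone.utc)
snapshot = datetime(2013, 1, 1, tzinfo=timezone.utc)
per_day, per_status = follower_ratios(5480, 1000, created, now=snapshot)
print(per_day, per_status)
```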
This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.
As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.
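The “lesser of all followers or 10,000” rule is easy enough to sketch. This is an illustration under my own naming, not the code I actually ran, and the follower IDs are just a range of integers standing in for real ones.

```python
import random

def sample_followers(follower_ids, cap=10000, seed=None):
    """Return every follower ID if there are at most cap, else a random sample of cap."""
    ids = list(follower_ids)
    if len(ids) <= cap:
        return ids
    # A seed is optional; setting one makes the sample repeatable
    return random.Random(seed).sample(ids, cap)

small = sample_followers([1, 2, 3])
large = sample_followers(range(25000), seed=1)
print(len(small), len(large))
```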
So for example, things like this – a scatterplot of friends/follower counts similar to the one above:
…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.
How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?
Alternatively, we can do counts by number of days the followers have been on Twitter:
The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:
Sometimes, you get a distribution that is very, very wrong… If we do a histogram that has bins along the x-axis specifying that a follower had 0-100 followers of their own, or 500-600 followers etc, and then for all the followers of a particular account, pop them into a corresponding bin given the number of their followers, counting the number of people in each bin once we have allocated them all, we might normally expect to see something like this:
However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:
There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
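The binning described above might be sketched like this, assuming bins that are 100 followers wide and labelled by their lower edge. The follower counts are invented for the example.

```python
from collections import Counter

def follower_histogram(follower_counts, bin_width=100):
    """Count how many followers fall into each bin of their own follower counts."""
    bins = Counter((count // bin_width) * bin_width for count in follower_counts)
    # Sort by bin so the result reads left to right along the x-axis
    return dict(sorted(bins.items()))

# Made-up follower counts for the followers of some account
counts = [0, 3, 99, 100, 150, 560, 2500]
hist = follower_histogram(counts)
print(hist)
```

An account followed mainly by accounts with next to no followers of their own would show up as a very tall first bin compared to the rest.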
When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats – but which ones does it make sense to look at?
Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:
Spot the outlier ;-)
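The two medians in that plot could be calculated along the following lines. The dictionary keys are hypothetical stand-ins for whatever fields the user details lookup returns, and the three followers are made up.

```python
from statistics import median

def follower_summary(followers):
    """Median followers-per-status and friend/follower ratios over a set of followers."""
    # Skip followers where the ratio is undefined (no tweets, or no followers)
    per_status = [f["followers"] / f["statuses"] for f in followers if f["statuses"]]
    friend_ratio = [f["friends"] / f["followers"] for f in followers if f["followers"]]
    return {
        "median_followers_per_status": median(per_status) if per_status else None,
        "median_friend_follower_ratio": median(friend_ratio) if friend_ratio else None,
    }

# Invented follower records for illustration
sample = [
    {"followers": 100, "friends": 200, "statuses": 50},
    {"followers": 10, "friends": 40, "statuses": 10},
    {"followers": 0, "friends": 500, "statuses": 0},
]
summary = follower_summary(sample)
print(summary)
```

How to treat the zero cases is a judgement call: here they are simply dropped, which may hide exactly the sort of empty accounts that are of interest.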
So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:
It’s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(
[In the next post in this series I'll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable "events" surrounding the person being followed...]
As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use EU horse import/export data to produce an interactive map of the European trade in horsemeat.
(*One for the to do list – a round up of “#ddj” stories around the episode.)
(The article describes the source of the data as the European Union Eurostat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)
The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:
One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.
Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardian map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:
(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)
So how did I get to the diagram from the data?
As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.
So how can we create the graph from the data?
The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted edge list from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.
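For anyone who'd rather stay in Python for the reshaping step, here's a rough sketch of the matrix to edge list transformation using just the standard library. It assumes a tab separated file with exporting countries down the rows (under a COUNTRY header) and importing countries across the columns; the function name and the tiny sample table are mine, and the figures are invented.

```python
import csv
import io

def matrix_to_edge_list(text, delimiter="\t"):
    """Turn an adjacency matrix into (exporter, importer, amount) rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    importers = header[1:]  # first header cell labels the exporter column
    edges = []
    for row in reader:
        if not row:
            continue
        exporter = row[0]
        for importer, cell in zip(importers, row[1:]):
            # Strip thousands separators; skip empty cells and zero flows
            cell = cell.replace(",", "").strip()
            if cell and int(cell) > 0:
                edges.append((exporter, importer, int(cell)))
    return edges

# Made-up matrix: rows are exporters, columns are importers
sample = "COUNTRY\tAUSTRIA\tBELGIUM\nAUSTRIA\t\t1,200\nBELGIUM\t300\t0\n"
edges = matrix_to_edge_list(sample)
print(edges)
```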
So how can we do that?
One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:
So how do we get from one to the other?
Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:
#R code
horseexportsEU <- read.delim("~/Downloads/horseexportsEU.txt")
require(reshape)
#Get a "long" edge list
x=melt(horseexportsEU,id='COUNTRY')
#Turn the numbers into numbers by removing the comma, then casting to an integer
x$value2=as.integer(as.character(gsub(",", "", x$value, fixed = TRUE) ))
#If we have an NA (null/empty) value, make it -1
x$value2[ is.na(x$value2) ] = -1
#Column names with countries that originally contained spaces convert spaces to dots. Undo that.
x$variable=gsub(".", " ", x$variable, fixed = TRUE)
#I want to export a subset of the data
xt=subset(x,value2>0,select=c('COUNTRY','variable','value2'))
#Generate a text file containing the edge list
write.table(xt, file="foo.csv", row.names=FALSE, col.names=FALSE, sep=",")
(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have got to show the “melt” method to reshaping data;-)
Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:
#python code
import csv
import io
import json

import networkx as nx
from networkx.readwrite import json_graph

#Bring in the edge list explicitly
rawdata = '''"SLOVENIA","AUSTRIA",1200
"AUSTRIA","BELGIUM",134600
"BULGARIA","BELGIUM",181900
"CYPRUS","BELGIUM",200600
"ITALY","UNITED KINGDOM",12800
"POLAND","UNITED KINGDOM",129100'''
#... etc (the remaining rows of the edge list are elided from this listing)

#We convert the rawdata string into a filestream
#(io.StringIO takes the place of the Python 2 StringIO module)
f = io.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')

def gNodeAdd(DG,nodelist,name):
    #Nodes are numbered in the order they are added; the country name is kept as an attribute
    node=len(nodelist)
    DG.add_node(node,name=name)
    nodelist.append(name)
    return DG,nodelist

nodelist=[]
DG = nx.DiGraph()

#Here's where we build the graph
for item in reader:
    #Even though import and export countries have the same name, we create a unique version depending on
    # whether the country is the importer or the exporter.
    importTo=item[1]+'.'
    exportFrom=item[0]
    #Cast the amount to a number so that it is serialised as a number
    amount=int(item[2])
    if importTo not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,importTo)
    if exportFrom not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
    DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)

#Use a name other than "json" for the result so that the json module is not overwritten
jsondata = json.dumps(json_graph.node_link_data(DG))
#The "jsondata" serialisation can then be passed to a d3js containing web page...
Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: EU Horse imports Sankey Diagram.
What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation without too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.
PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:
So it seems that the more I look, the more places seem to be making data that appears in reports available as data…
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers, was a post by Alex Singleton of the University of Liverpool on an Open Atlas around the 2011 Census for England and Wales. Alex has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
– the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
– a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
– a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing with a quick tweak, separate reports could be generated for the different areas within a single Local Authority).
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. There is also an increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns…again, something where the work can be done centrally, looking for meaningful datasets and combining it with the original data set.)
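As a very rough sketch of what a scripted story template might look like, here's one built on Python's string templates. Everything in it is hypothetical: the wording, the field names, the place name and the figures are all invented for illustration and are not taken from the Fairer Gambling data.

```python
from string import Template

# Centrally written story shell with slots for local figures
STORY = Template(
    "There are $shops betting shops in $area, "
    "with an estimated spend of £$per_head per head of population."
)

def localise(template, record):
    """Fill the story shell from one area's row of the central dataset."""
    per_head = record["spend"] / record["population"]
    return template.substitute(
        area=record["area"],
        shops=record["shops"],
        per_head=f"{per_head:.2f}",
    )

# Made-up record for a made-up area
record = {"area": "Exampleshire", "shops": 12, "spend": 500000, "population": 100000}
story = localise(STORY, record)
print(story)
```

The local journalist would then top and tail the generated shell with quotes and local colour.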
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals. (On my to do list is packaging up the recently announced UCAS 2013 entries data.)).
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or how to read from them whatever stories they might tell; rather, it is the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
Slides without commentary from a presentation I gave to undergrads on data journalism at the University of Lincoln yesterday…
My plan is to write some words around this deck, (or maybe even to record (and perhaps transcribe), some sort of narration…) just not right now…
I’m happy to give variants of this presentation elsewhere, if you can cover costs…