Archive for the ‘Infoskills’ Category
Reading through another wonderful post on the FullFact blog last night (Full Fact sources index: where to find the information you need), I noticed that the linked to resources from that post were being redirected via Google URL:
A tweet confirmed that this wasn’t intentional, so what had happened? I gather the workflow used to generate the post was to write it in Google docs, and then copy and paste the rich/HTML text into a rich text editor in Drupal, although I couldn’t recreate this effect (and nor could FullFact). However, suitably suspicious, I started having a play, writing a simple test document in Google docs:
The Google doc automatically links the test URL I added to the document. (This is often referred to as “linkification” – if a piece of text is recognised as something that looks like a URL or web link, it gets rewritten as a clickable link. Typically, you might assume that the link you’ll now be clicking on is the link that was recognised. This may be a bad assumption to make…) If you hover over the URL as written in the document, you get a tooltip that suggests the link is to the same URL. However, if you hover over the tooltip listed URL, (or click on it) you can see from the indicator in the bottom left hand corner of the browser what the actual URL you’re clicking on is. Like this:
In this case, the link you’ll actually click on is referral to the original link via a Google URL. This one, in fact:
What this means is that if I click on the link, Google tracks the fact that the link was clicked on. From the value of the usg variable (in this case, AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A) it presumably also knows the source document containing the link and whatever follows from that.
Hmmm… If I publish the document, the Google rewrite appears to be removed:
There are also several export options associated with the document:
So what links get exported?
Here’s the Word export:
That seems okay – no tracking. How about odt?
That looks okay too. RTF and and HTML export also seem to publish the “clean” link.
What about PDF?
Hmm… so tracking is included here. So if you write a doc in Google docs that contains links that are autolinked, then you export that doc as PDF and share it with other folk, Google will know when folk click on that link from a copy of that PDF document and (presumably) the originally authored Google docs document (and all that that entails…)
How about if we email a doc as a PDF attachment to someone from within Google docs:
So that seems okay (untracked).
What’s the story then? FullFact claimed they cut and paste rich HTML from Google docs into a rich text editor and the Google redirection attack was inserted into the link. I couldn’t recreate that, and nor could the FullFact folk, so either there are some Google “experiments” going on, or the workflow was misremembered.
In my own experiments, I got a Google redirection from clicking links within my original document, and from the exported PDF, but not from any other formats?
So what do we learn? I guess this at least: be aware that when Google linkifies links for you, it may be redirecting clicks on those links through Google tracking servers. And that these tracked links may be propagated to exported and/or otherwise shared versions of the document.
PS see also Google/Feedburner Link Pollution or More Link Pollution – This Time from WordPress.com for more of the same, and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There for a quick overview of what might happen when you actually land on a page…
Link rewriters are, of course, to be find in lots of other places too…
Twitter, for example, actually wraps all shared links in it’s t.co wrapper:
Delicious (which I’ve stopped using – I moved over to Pinboard) also uses it’s own proxy for clicked on stored bookmarks…
If you have any other examples, particularly of link rewriting/annotation/pollution where you wouldn’t expect it, please let me know via the comments…
I though this was handy on the OER-DISCUSS mailing list:
Our copyright officer writes:
… US Copyright ‘Fair Use’ or S29 copying for non-commercial research and private study which allows copying but the key word here is ‘private’. i.e. the provisos are that you don’t make the work or copies available to anyone else.
Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.
S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works i.e. books journals etc and
b) negated becauses licences are now available S.36 (3)
S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broacasts or sound recordings.
The only educational exception students can rely on is s.32(3) for Examination athough this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award and any further dealing with the work after the examination process, becomes infringement.
I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)
There is also exception s.30 for Criticism or Review, which is a general exception for all, and the copying is necessary for a genuine critique or review of it.
If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons), would be highly recommended.
Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library
Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx
I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.
What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:
In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.
It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:
This is (sort of) how I’m representing data I’ve pulled from OpenCorporates, data that starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature – exact string matching on the director search (put the direction name in double quotes…;-), and then captures which of those companies share are least two directors with the original company.
In order to turn the data, which looks like this:
into a map that resembles something like this (this is actually a view over a different dataset):
we need to do a couple of things. Working backwards, these are:
- use some sort of tool to generate a pretty picture from the data;
- get the data out of the table into the tool using an appropriate exchange format.
Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:
- either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or
- the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.
The dataflow I’m working towards looks something like this:
networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js), and a JSON graph representation (as used by d3.js and maybe sigma.js (example plugin?).
As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as gexf file (default) or as a JSON file using URLs of the form:
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
This view can therefore be used to easily export data from my OpenCorporates trawler as a gexf file that can be used to easily import data into the Gephi desktop tool, or provide a URL to some JSON data that can be visualised using a Javscript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog. If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-).
We can also use networks to publish data based on processing the network. The example graph above shows a netwrok with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:
or those companies that are connected by virtue of sharing common directors:
(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)
The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view:
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=companies gives the view over companies connected by virtue of the fact that they share one or more common director;
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=officers&output=json gives a view over how directors are connected based on being co-directors of one or more of the same companies.
As the view is paramaterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)
PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.
I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…
I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).
So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:
We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:
We can also calculate ratios and then plot them – followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):
This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.
As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.
So for example, things like this – a scatterplot of friends/follower counts similar to the one above:
…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.
How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?
Alternatively, we can do counts by number of days the followers have been on Twitter:
The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:
Sometimes, you get a distribution that is very, very wrong… If we do a histogram that has bins along the x-axis specifying that a follower had 0-100 followers of their own, or 500-600 followers etc, and then for all the followers of a particular account, pop them into a corresponding bin given the number of their followers, counting the number of people in each bin once we have allocated them all, we might normally expect to see something like this:
However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:
There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats – but which ones does it make sense to look at?
Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:
Spot the outlier ;-)
So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:
It’s just a shame that my whitelisted Twitter API key is probably going to die in few weeks:-(
[In the next post in this series I'll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable "events" surrounding the person being followed...]
As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use EU horse import/export data to produce interactive map of the European trade in horsemeat
*One for the to do list – a round up of “#ddj” stories around the episode.)
(The article describes the source of the data as the Eurpoean Union Unistat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)
The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:
One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.
Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardan map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:
(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)
So how did I get to the diagram from the data?
As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.
So how can we create the graph from the data?
The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.
So how can we do that?
One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:
So how do we get from one to the other?
Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:
#R code horseexportsEU <- read.delim("~/Downloads/horseexportsEU.txt") require(reshape) #Get a "long" edge list x=melt(horseexportsEU,id='COUNTRY') #Turn the numbers into numbers by removing the comma, then casting to an integer x$value2=as.integer(as.character(gsub(",", "", x$value, fixed = TRUE) )) #If we have an NA (null/empty) value, make it -1 x$value2[ is.na(x$value2) ] = -1 #Column names with countries that originally contained spaces convert spaces dots. Undo that. x$variable=gsub(".", " ", x$variable, fixed = TRUE) #I want to export a subset of the data xt=subset(x,value2>0,select=c('COUNTRY','variable','value2')) #Generate a text file containing the edge list write.table(xt, file="foo.csv", row.names=FALSE, col.names=FALSE, sep=",")
(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have go to show the “melt” method to reshaping data;-)
Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:
#python code import networkx as nx import StringIO import csv #Bring in the edge list explicitly #rawdata = '''"SLOVENIA","AUSTRIA",1200 #"AUSTRIA","BELGIUM",134600 #"BULGARIA","BELGIUM",181900 #"CYPRUS","BELGIUM",200600 #... etc #"ITALY","UNITED KINGDOM",12800 #"POLAND","UNITED KINGDOM",129100''' #We convert the rawdata string into a filestream f = StringIO.StringIO(rawdata) #..and then read it in as if it were a CSV file.. reader = csv.reader(f, delimiter=',') def gNodeAdd(DG,nodelist,name): node=len(nodelist) DG.add_node(node,name=name) #DG.add_node(node,name=name) nodelist.append(name) return DG,nodelist nodelist= DG = nx.DiGraph() #Here's where we build the graph for item in reader: #Even though import and export countries have the same name, we create a unique version depending on # whether the country is the importer or the exporter. importTo=item+'.' exportFrom=item amount=item if importTo not in nodelist: DG,nodelist=gNodeAdd(DG,nodelist,importTo) if exportFrom not in nodelist: DG,nodelist=gNodeAdd(DG,nodelist,exportFrom) DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount) json = json.dumps(json_graph.node_link_data(DG)) #The "json" serialisation can then be passed to a d3js containing web page...
Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: EU Horse imports Sankey Diagram.
What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation with too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.
PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:
So it seems that the more I look, the more and more places seems to making data that appears in reports available as data…
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers, a post by Alex Singleton of the University of Liverpool: an Open Atlas around the 2011 Census for England and Wales, who has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
- the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
- a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on this machines?
- a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing with a quick tweak,separate reports could be generated for the different areas within a single Local Authority).
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produced localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. With the increasing number of workflows available around R, (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gaming data, maybe, and pick up on the Guardian idea of also adding in additional columns…again, something where the work can be done centrally, looking for meaningful datasets and combining it with the original data set.)
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications either to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals. (On my to do list is packaging up the recently announced UCAS 2013 entries data.)).
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or what read from them whatever stories they might tell; rather, the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
Slides without commentary from a presentation I gave to undergrads on data journalism at the University of Lincoln yesterday…
My plan is to write some words around this deck, (or maybe even to record (and perhaps transcribe), some sort of narration…) just not right now…
I’m happy to give variants of this presentation elsewhere, if you can cover costs…
One of my most successful posts in terms of lifetime traffic numbers has been a recipe for scraping data from a Wikipedia page, pulling it into a Google spreadsheet, publishing it as CSV, pulling it into a Yahoo Pipe, geo-coding it, publishing it as a KML file, displaying the KML in Google maps and embedding the map in another page (which could in principle be the original WIkipedia page): Data Scraping Wikipedia with Google Spreadsheets.
A variant of this recipe in other words:
Running the hack now on a new source web page, we get the following data pulled into the spreadsheet:
And the following appearing in the pipe (I am trying to replace the first line with my own headers):
The CSV file appears to be misbehaving… Downloading the CSV data and looking at it in TextWrangler, a text editor, we start to see what’s wrong:
The text editor creates line numbers for things it sees as separate, well formed rows in the CSV data. We see that the header, which should be a single row, is actually spread over four rows. In addition, the London data is split over two rows. The line for Greater Manchester behaves correctly…: if you look at the line numbers, you can see line 7 overflows in the editor (the … in the line number count shows the CSV line (a separate dataflow) has overflowed the width of the editor and been wrapped round in the editor view).
If I tell the editor to stop “soft wrapping” each line of data in the CSV file, the editor displays each line of the CSV file on a single line in the editor:
So… where does this get us in fixing the pipe? Not too far. We can skip the first 5 lines of the file that we import into the pipe, and that gets around all the messed up line breaks at the top of the file, but we lose the row containing the data for London. In the short term, this is probably the pragmatic thing to do.
Next up, we might look to the Wikipedia file and see how the elements that appear to be breaking the the CSV file might be fixed to unbreak them. Finally, we could go to the Google Spreadsheets forums and complain about the pile of crap broken CSV generation that the Googlers appear to have implemented…
PS MartinH suggests workaround in the comments, wrapping a QUERY round the import and renaming the columns…
[The following is my *personal* opinion only. I know as much about FutureLearn as Google does. Much of the substance of this post was circulated internally within the OU prior to posting here.]
In common with other MOOC platforms, one of the possible ways of positioning FutureLearn is as a marketing platform for universities. Another might see it as a tool for delivering informal versions of courses to learners who are not currently registered with a particular institution. [A third might position it in some way around the notion of "learning analytics", eg as described in a post today by Simon Buckingham Shum: The emerging MOOC data/analytics ecosystem] If I understand it correctly, “quality of the learning experience” will be at the heart of the FutureLearn offering. But what of innovation? In the same way that there is often a “public benefit feelgood” effect for participants in medical trials, could FutureLearn provide a way of engaging, at least to a limited extent, in “learning trials”.
This need not be onerous, but could simply relate to trialling different exercises or wording or media use (video vs image vs interactive) in particular parts of a course. In the same way that Google may be running dozens of different experiments on its homepage in different combinations at any one time, could FutureLearn provide universities with a platform for trying out differing learning experiments whilst running their MOOCs?
The platform need not be too complex – at first. Google Analytics provides a mechanism for running A/B tests and “experiments” across users who have not disabled Google Analytics cookies, and as such may be appropriate for initial trialling of learning content A/B tests. The aim? Deciding on metrics is likely to prove a challenge, but we could start with simple things to try out – does the ordering or wording of resource lists affect click-through or download rates for linked resources, for example? (And what should we do about those links that never get clicked and those resources that are never downloaded?) Does offering a worked through exercise before an interactive quiz improve success rates on the quiz, and so on.
The OU has traditionally been cautious when running learning experiments, delivering fee-waived pilots rather than testing innovations as part of A/B testing on live courses with large populations. In part this may be through a desire to be ‘equitable’ and not jeopardise the learning experience for any particular student by providing them with a lesser quality offering than we could*. (At the same time, the OU celebrates the diversity and range of skills and abilities of OU students, which makes treating them all in exactly the same way seem rather incongruous?)
* Medical trials face similar challenges. But it must be remembered that we wouldn’t trial a resource we thought stood a good chance of being /less/ effective than one we were already running… For a brief overview of the broken worlds of medical trials and medical academic publishing, as well as how they could operate, see Ben Goldacre’s Bad Pharma for an intro.
FutureLearn could start to change that, and open up a pathway for experimentally testing innovations in online learning as well as at a more micro-level, tuning images and text in order to optimise content for its anticipated use. By providing course publishers with a means of trialling slightly different versions of their course materials, FutureLearn could provide an effective environment for trialling e-learning innovations. Branding FutureLearn not only as a platform for quality learning, but also as a platform for “doing” innovation in learning, gives it a unique point of difference. Organisations trialling on the platform do not face the threat of challenges made about them delivering different learning experiences to students on formally offered courses, but participants in courses are made aware that they may be presented with slightly different variants of the course materials to each other. (Or they aren’t told… if an experiment is based on success in reading a diagram where the labels are presented in different fonts or slightly different positions, or with or without arrows, and so on, does that really matter if the students aren’t told?)
Consultancy opportunities are also likely to arise in the design and analysis of trials and new interventions. The OU is also provided with both an opportunity to act according to it’s beacon status as far communicating innovative adult online learning/pedagogy goes, as well as gaining access to large trial populations.
Note that what I’m not proposing is not some sort of magical, shiny learning analytics dashboard, it’d be a procedural, could have been doing it for years, application of web analytics that makes use of online learning cohorts that are at least a magnitude or two larger than is typical in a traditional university course setting. Numbers that are maybe big enough to spot patterns of behaviour in (either positive, or avoidant).
There are ethical challenges and educational challenges in following such a course of action, of course. But in the same way that doctors might randomly prescribe between two equally good (as far as they know) treatments, or who systematically use one particular treatment over another that is equally good, I know that folk who create learning materials also pick particular pedagogical treatments “just because”. So why shouldn’t we start trialling on a platform that is branded as such?
Once again, note that I am not part of the FutureLearn project team and my knowledge of it is largely limited to what I have found on Google.
See also: Treating MOOC Platforms as Websites to be Optimised, Pure and Simple…. For some very old “course analytics” ideas about using Google Analytics, see Online Course Analytics, which resulted in OUseful blogarchive: “course analytics”. Note that these experiments never got as far as content optimisation, A/B testing, search log analysis etc. The approach I started to follow with the Library Analytics series had a little more success, but still never really got past the starting post and into a useful analyse/adapt cycle. Google Analytics has moved on since then of course… If I were to start over, I;d probably focus on creating custom dashboards to illustrate very particular use cases, as well as
Christmas Tree day, and though it’s not decorated yet, at least it’s up. The screws in the trusty metal base had rusted a little since Twelfth Night last year, but a pair of pliers and a dab of engine from the mower dipstick seemed to do the trick in loosening them; then it was time to go off to the “Lights of Love” Carol Service in aid of the local hospice at the Church in our old parish.
(A similar service in the local pub missed last week.)
The drive, a little bit longer than usual: temporary traffic lights set up around a hole in the road – gas leak, apparently.
I’ve never visited the Island’s hospice (Earl Mountbatten was the governor, then first Lord Lieutenant, years ago), but it seems they’ve recently opened a wifi enable café; must check it out some time…
Carols sung, community choir, coffee and a mince pie. Shop bought, not Mothers’ Union home made. Illness put the refreshments in doubt, homebaking too? Had I known, maybe I should have brought some of mine. Or maybe not.
Chat with the town Mayor, (sounds grand, doesn’t it?! Longstanding friend.). Remembrance. Surprise declared about my lack of engagement, on civil liberties grounds, about a recent action by the Isle of Wight Council, the police, and representatives of the DWP – Department of Work and Pensions – involving the stopping of cars at commuter time and drivers (presumably just drivers?) consequently “quizzed about their National Insurance numbers, who their employers were and who they lived” [src]. (The story so far… Crackdown on bad driving and aftershock ‘Big Brother’ criticism of operation. The council leader also had a response: Tories hit back at criticism of benefits operation/Cllr Pugh: IW Conservatives support Police/DWP Benefit vehicle stops).
Home. Remains of chicken from yesterday’s roast, fried with bacon. Melted butter, flour, roux. Milk. White sauce. Add the meat, almost cooked farfalle, seasoning. Sorted.
All I did was go to a carol service. Now this: department work pensions question suspected fraud
Interesting – the Power of FOI: [p]lease would you be able to provide me with a copy of the procedure and guidance followed by the DWP fraud investigators where there is suspected fraud.
Fraudsters driving? Clampdown?
Explanatory Notes are often a good place to start. Almost readable.
Rat hole… Section 110A. Administration Act. Which Administration Act?
Ah – this one:
Original form only. No amended section (paragraph?) 110. No 110A.
Google. /Social Security Administration 110A/
Amendments to 110A. But no 110A?
(Ooh – RSS change feed. Could be handy…)
@psychemedia am. == amended == modified; prosp == prospective == the effect is not yet in force.
— johnlsheridan (@johnlsheridan) December 9, 2012
Thanks, John. (John’s bringing legislation up to date and on to the web. Give the man an honour.)
Click the links, in anticipation of the 110A, as introduced. No joy.
Google. Again. Desperate. Loose search. /site:legislation.gov.uk 110A/
Enough of a clue.
Cmd-f on a Mac (ctrl-f on Windows). Footnote. Small print.
That looks like a likely candidate…
Gone. Amended out of existence. Replaced.
Maybe that’s why I could only find amendments to 110A. It may not be current, but I’m intrigued. How was it originally enacted?
All I did was go to a carol service. All I want to do is be informed about what powers Local Authorities and the DWP have with respect to “quizzing” citizens stopped apparently at random. I just want to be an informed citizen: what powers were available to the Isle of Wight Council and the DWP a week or two ago?
So I’ve tracked down the original 110A, but so what? It’s not currently enacted, is it? That was a sidetrack. An aside. What does 110A allow today, bearing in mind it’s due for repeal (and possibly, prior to that, subject to as yet uncommenced further amendments)?
I guess I could pay a subscription to a legal database to more directly look up the state of the law. (Only I probably wouldn’t have to pay – .ac.uk credential and all that. Member of an academic library and the attendant privileges that go with it. Lessig found that, in a medical context. 11 minutes in. Because. BECAUSE. $435 to you. But then… Table 4. That’s with the privilege of .edu or .ac.uk library access. That’s with the unpaid work of academics running the journals, providing the editorial review services, handing over copyrights (that they possibly handed over to their institutions anyway…) to publishers for free – only not; at public expense, for publicly funded academics and researchers. And for the not-insignificant profit line of the (commercial) academic publishers. As Monbiot suggests.)
But that wouldn’t be right, would it? Ignorance of the law may not be a defence, but it can’t be right that to find out the law I need to pay for access? The legislation.gov.uk team are doing a great job, but as I’m starting to find, the law is oh, so messy, and it needs to be posted right. But I believe that they will get it up to date and they *will* get it all there. (At least give the man an honour when it’s there…)
So where was I? 110A. Going round in circles, or is it a spiral..? Back at s. 46 of the Welfare Reform Act 2007 (sections 46 and 47 of the Welfare Reform Act, as mentioned in the guidance notes on the National Fraud Partnership Agreement).
So what does the legislation actually say?
Right – so now I’m totally confused. This has been repealed.. but the repeal has not yet commenced? What’s this s. 109A rathole? And what’s Welfare Reform Act 2012 all about?
All I did was go to a carol service. And all I want to find out is the bit of legislation that describes the powers the local council and the DWP were acting under when they were “quizzing” motorists a couple of weeks ago.
So 109A – where can I find 109A? Ah – is this an ‘as currently stands” copy of the Social Security Administration Act 1992 (as amended)?
In which case…
But I’m too tired to read it and my battery is about to die. So I’m not really any the wiser.
Nine Lessons form this? Sheesh…
All I did was go to a carol service.
PS why not make a donation to the Earl Mountbatten Hospice? Or your local hospice. In advance.
S.163 of the Road Traffic Act 1988 appears to be the regulation that requires motorists to stop if so directed by a uniformed police or traffic officer.
Now I’m also wondering: as well as the powers available to the DWP and the local council, by what right and under what conditions were cars being stopped by the police and how were they being selected?