Archive for the ‘Infoskills’ Category
Every so often I do a round up of job openings in different areas. This is particularly true around year end, as I look at my dwindling salary (no more increments, ever, and no hope of promotion, …) and my overall lack of direction, and try to come up with some sort of resolution to play with during the first few weeks of the year.
The data journalism phrase has been around for some time now (was it really three and a half years ago that the Data Driven Journalism Round Table was held at the European Journalism Centre? FFS, what have I been doing for the last three years?! :-( ) and it seems to be maturing a little. We’ve had the period of shiny, shiny web apps requiring multiskilled development teams and designers working with the hacks to produce often confusing, “wtf am I supposed to be doing now?!” interactives, and things seem to be becoming a little more embedded… Perhaps…
My reading (as an outsider) is that there is now more of a move towards developing some sort of data skillbase that allows journalists to do “investigative” sorts of things with data, often using very small data sets or concise summary datasets. To complement this, there seems to be some sort of hope that visually appealing charts can be used to hook eyeballs into a story (rather than pushing eyeballs away) – Trinity Mirror’s Ampp3d (as led by Martin Belam) is a good example of this, as is the increasing(?) use of the DataWrapper library.
From working with the School of Data, as well as a couple of bits of data journalism not-really-training with some of the big news groups, I’ve come to realise there is probably some really basic, foundational work to be done in the way people think (or don’t think) about data. For example, I don’t believe that people in general read charts. I think they may glance at them, but they don’t try to read them. They have no idea what story they tell. Given a line chart that plots some figure over time, how many people ride the line to get a feel for how it really changed?
Hans Rosling famously brings data alive with his narrative commentary around animated development data charts, including bar charts…
But if you watch the video with the sound off, or just look at the final chart, do you have the feeling of being told the same story? Can you even retell yourself the story by looking at the chart? And how about if you look at another bar chart? Can you use any of Hans Rosling’s narrative or rhetorical tricks to help you read through those?
(The rhetoric of data (and the malevolent arts of persuasion) is something I want to ponder in more depth next year, along with the notion of data aesthetics and the theory of forms given a data twist.)
Another great example of narrated data storytelling comes from Kurt Vonnegut as he describes the shapes of stories:
Is that how you read a line chart when you see one?
One thing about the data narration technique is that it is based around the construction of a data trace. There is a sense of anticipation about where the line will go next, and uncertainty as to what sort of event will cause the line to move one way or another. Looking back at a completed data chart, what points do we pick from it that we want to use as events in our narration or reading of it? (The lines just connect the points – they are processional in the way they move us from one point of interest to the next, although the gradient of the line may provide us with ideas for embellishing or decorating the story a little.)
It’s important to make art because the people that get the most out of art are the ones that make it. It’s not … You know there’s this idea that you go to a wonderful art gallery and it’s good for you and it makes you a better person and it informs your soul, but actually the person who’s getting the most out of any artistic activity is the person who makes it because they’re sort of expressing themselves and enjoying it, and they’re in the zone and you know it’s a nice thing to do. [Grayson Perry, Reith Lectures 2013, Lecture 2, Q&A [transcript, PDF], audio]
In the same way, the person who gets the most out of a chart is the person who constructed it. They know what they left in and what they left out. They know why the axes are selected as they are, why elements are coloured or sized as they are. They know the question that led up to the chart and the answers it provides to those questions. They know where to look. Like an art critic who reads their way round a painting, they know how to read one or many different stories from the chart.
The interactives that appeared during the data journalism wave from a couple of years ago sought to provide a playground for people to play with data and tell their own stories with it. But they didn’t. In part because they didn’t know how to play with data, didn’t know how to use it in a constructive way as part of a narrative (even a made up, playful narrative). And in part this comes back to not knowing how to read – that is, recover stories from – a chart.
It is often said that a picture saves a thousand words, but if the picture tells a thousand word story, how many of us try to read that thousand word story from each picture or chart? Maybe we need to use a thousand words as well as the chart? (How many words does Hans Rosling use? How many, Kurt Vonnegut?)
When producing a chart that essentially represents a summary of a conversation we have had with a dataset, it’s important to remember that for someone looking at the final chart it might not make as much sense in the absence of the narrative that was used to construct it. Edward de Bono’s constructed illustrations help us read the final image by recalling his narrative. But if we just look at a “completed” sketch from one of his talks, it will probably be meaningless.
One of the ideas that works for me when I reflect on my own playing with data is that it is a conversation. Meaning is constructed through the conversation I have with a dataset, and the things it reveals when I pose particular questions to it. In many cases, these questions are based on filtering a dataset, although the result may be displayed in many ways. The answers I get to a question inform the next question I want to ask. Questions take the form of constructing this chart as opposed to that chart, though I am free to ask the same question in many slightly different ways if the answers don’t appear to be revealing of anything.
It is in this direction – of seeing data as a source that can be interviewed and coaxed into telling stories – that I sense elements of the data journalism thang are developing. This leads naturally into seeing data journalism skills as core investigative style skills that all journalists would benefit from. (Seeing things as data allows you to ask particular sorts of question in very particular ways. Being able to cast things into a data form – as for example in Creating Data from Text – Regular Expressions in OpenRefine – so that they become amenable to data-style queries, is the next idea I think we need to get across…)
So what are the jobs that are out at the moment? Here’s a quick round-up of some that I’ve spotted…
- Data editor (Guardian): “develop and implement a clear strategy for the Data team and the use of data, numbers and statistics to generate news stories, analysis pieces, blogs and fact checks for The Guardian and The Observer.
You will take responsibility for commissioning and editing content for the Guardian and Observer data blogs as well as managing the research needs of the graphics department and home and foreign news desks. With day-to-day managerial responsibility for a team of three reporters / researchers working on the data blog, you will also be responsible for data analysis and visualisation: using a variety of specialist software and online tools, including Tableau, ARCGis, Google Fusion, Microsoft Access and Excel”
Perpetuating the “recent trad take” on data journalism, viewed as gonzo journalist hacker:
- Data Journalist [Telegraph Media Group]: “[S]ource, sift and surface data to find and generate stories, assist with storytelling and to support interactive team in delivering data projects.
“The Data Journalist will sit within the Interactive Data team, and will work with a team of designers, web developers and journalists on data-led stories and in developing innovative interactive infographics, visualisations and news applications. They will need to think and work fast to tackle on-going news stories and tight deadlines.”
One of the most exciting opportunities that I can see around data related publishing is in new workflows and minimising the gap between investigative tools and published outputs. This seems to me a bit risky in that it seems so conservative when it comes to getting data outputs actually published?
- Designer [Trinity Mirror]: “Trinity Mirror’s pioneering data unit is looking for a first-class designer to work across online and print titles. … You will be a whizz with design software – such as Illustrator, Photoshop and InDesign – and understand the principles of designing infographics, charts and interactives for the web. You will also be able to design simple graphical templates for re-purposing around the group.
“You should have a keen interest in current affairs and sport, and be familiar with – and committed to – the role data journalism can play in a modern newsroom.”
- [Trinity Mirror]: Can you take an API feed and turn it into a compelling gadget which will get the whole country talking?
“Trinity Mirror’s pioneering data unit is looking for a coder/developer to help it take the next step in helping shape the future of journalism. …
“You will be able to create tools which automatically grab the latest data and use them to create interactive, dynamically-updated charts, maps and gadgets across a huge range of subjects – everything from crime to football. …
“The successful candidate will have a thorough knowledge of scraping techniques, ability to manage a database using SQL, and demonstrable ability in at least one programming language.”
But there is hope about the embedding of data skills as part of everyday journalistic practice:
- Culture reporter (Guardian): “We are looking for a Culture Reporter to generate stories and cover breaking news relating to Popular Culture, Film and Music … Applicants should also have expertise with digital tools including blogging, social media, data journalism and mobile publishing.”
- Investigations Correspondent [BBC Newsnight]: “Reporting to the Editor, the Investigations Correspondent will contribute to Newsnight by producing long term investigations as well as sometimes contributing to big ongoing stories. Some investigations will take months, but there will also be times when we’ll need to dig up new lines on moving the stories in days.
“We want a first rate reporter with a proven track record of breaking big stories who can comfortably work across all subject areas from politics to sport. You will be an established investigative journalist with a wide range of contacts and sources as well as having experience with a range of different investigative approaches including data journalism, Freedom Of Information (FOI) and undercover reporting.”
- News Reporter, GP [Haymarket Medical Media]: “GP is part of Haymarket Medical Media, which also produces MIMS, Medeconomics, Inside Commissioning, and mycme.com, and delivers a wide range of medical education projects. …
“Ideally you will also have some experience of data journalism, understand how social media can be used to enhance news coverage and have some knowledge of multimedia journalism, including video and blogs.”
- Reporter, ENDS Report [Haymarket]: “We are looking for someone who has excellent reporting and writing skills, is enthusiastic and able to digest and summarise in depth documents and analysis. You will also need to be comfortable with dealing with numbers and statistics and prepared to sift through data to find the story that no one else spots.
“Ideally you will have some experience of data journalism, understand how social media can be used to enhance news coverage.”
Are there any other current ones I’m missing?
I think the biggest shift we need is to get folk treating data as a source that responds to a particular style of questioning. Learning how to make the source comfortable and get it into a state where you can start to ask it questions is one key skill. Knowing how to frame questions so that you discover the answers you need for a story is another. Choosing which bits of the conversation you use in a report (if any – maybe the conversation is akin to a background chat?) is yet another.
Treating data as a source also helps us think about how we need to take care with it – how not to ask leading questions, how not to get it to say things it doesn’t mean. (On the other hand, some folk will undoubtedly force the data to say things it never intended to say…)
“If you torture the data enough, nature will always confess” [Ronald Coase]
[Disclaimer: I started looking at some medical data for Haymarket.]
Via @wilm, I notice that it’s time again for someone (this time at the Wall Street Journal) to have written about the scariness that is your Google personal web history (the sort of thing you probably have to opt out of if you sign up for a new Google account, if other recent opt-in by defaults are anything to go by…)
It may not sound like much, but if you do have a Google account, and your web history collection is not disabled, you may find your emotional response to seeing months or years of your web/search history archived in one place surprising… Your Google web history.
Not mentioned in the WSJ article were some of the games that the Chrome browser gets up to. @tim_hunt tipped me off to a nice (if technically detailed, in places) review by Ilya Grigorik of some of the design features of the Chrome browser, and some of the tools built in to it: High Performance Networking in Chrome. I’ve got various pre-fetching tools switched off in my version of Chrome (tools that allow Chrome to pre-emptively look up web addresses and even download pages pre-emptively*) so those tools didn’t work for me… but looking at chrome://predictors/ was interesting to see what keystrokes I type are good predictors of web pages I visit…
* By the by, I started to wonder whether webstats get messed up to any significant effect by Chrome pre-emptively prefetching pages that folk never actually look at…?
In further relation to the tracking of traffic we generate from our browsing habits, as we access more and more web/internet services through satellite TV boxes, smart TVs, and catchup TV boxes such as Roku or NowTV, have you ever wondered about how that activity is tracked? LG Smart TVs logging USB filenames and viewing info to LG servers describes not only how LG TVs appear to log the things you do view, but also the personal media you might view, and in principle can phone that information home (because the home for your data is a database run by whatever service you happen to be using – your data is midata is their data).
there is an option in the system settings called “Collection of watching info:” which is set ON by default. This setting requires the user to scroll down to see it and, unlike most other settings, contains no “balloon help” to describe what it does.
At this point, I decided to do some traffic analysis to see what was being sent. It turns out that viewing information appears to be being sent regardless of whether this option is set to On or Off.
you can clearly see that a unique device ID is transmitted, along with the Channel name … and a unique device ID.
This information appears to be sent back unencrypted and in the clear to LG every time you change channel, even if you have gone to the trouble of changing the setting above to switch collection of viewing information off.
It was at this point, I made an even more disturbing find within the packet data dumps. I noticed filenames were being posted to LG’s servers and that these filenames were ones stored on my external USB hard drive.
Hmmm… maybe it’s time I switched out my BT homehub for a proper hardware firewalled router with a good set of logging tools…?
PS FWIW, I can’t really get my head round how evil on the one hand, or damp squib on the other, the whole midata thing is turning out to be in the short term, and what sorts of involvement – and data – the partners have with the project. I did notice that a midata innovation lab report has just become available, though to you and me it’ll cost 1500 squidlly diddlies so I haven’t read it: The midata Innovation Opportunity. Note to self: has anyone got any good stories to say about TSB supporting innovation in micro-businesses…?
PPS And finally, something else from the Ilya Grigorik article:
The HTTP Archive project tracks how the web is built, and it can help us answer this question. Instead of crawling the web for the content, it periodically crawls the most popular sites to record and aggregate analytics on the number of used resources, content types, headers, and other metadata for each individual destination. The stats, as of January 2013, may surprise you. An average page, amongst the top 300,000 destinations on the web is:
– 1280 KB in size
– composed of 88 resources
– connects to 15+ distinct hosts
Is it any wonder that pages take so long to load on a mobile phone off the 3G network, and that you can soon eat up your monthly bandwidth allowance?
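To put some rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. The page size and resource count are the HTTP Archive figures quoted above; the 500 MB monthly allowance is purely a made-up figure for illustration, and caching, compression and the like are ignored.

```python
# Figures quoted from the HTTP Archive stats (January 2013).
PAGE_KB = 1280   # average page size
RESOURCES = 88   # average number of resources per page

# Hypothetical monthly mobile data allowance (an assumption, not from the article).
ALLOWANCE_MB = 500

# How many "average" pages fit into the allowance?
pages_per_month = ALLOWANCE_MB * 1024 // PAGE_KB

# Average size of each resource on the page, in KB.
avg_resource_kb = PAGE_KB / RESOURCES

print(pages_per_month)            # 400
print(round(avg_resource_kb, 1))  # 14.5
```

On those assumptions, that is only around a dozen or so average pages a day before the allowance runs out.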
There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners, we can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.
To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).
The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog, which was then shaped and cleaned in OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.
To query the data, I’ve popped up a simple query form on Scraperwiki: Google Spreadsheet Explorer
To use the explorer, you need to:
- provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
- preview the table headings;
- construct a query using the column letters;
- select the output format;
- run the query.
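If you would rather script those steps than use a form, the following is a minimal sketch of how the query URL might be put together in Python. The endpoint pattern, the `tq`/`tqx`/`gid` parameter handling and the `build_query_url` helper are my assumptions about one way of calling the Google query interface, not a description of how the explorer itself works; older spreadsheet keys like the one above may need a different endpoint.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern for the Google Visualization API query interface;
# the explorer form described in this post may build its requests differently.
GVIZ_URL = "https://docs.google.com/spreadsheets/d/{key}/gviz/tq"


def build_query_url(key, query, out="csv", gid=None):
    """Return a URL asking a public spreadsheet to run a query (hypothetical helper)."""
    params = {"tq": query, "tqx": "out:" + out}
    if gid is not None:
        params["gid"] = gid  # optional sheet identifier
    return GVIZ_URL.format(key=key) + "?" + urlencode(params)


KEY = "0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc"
url = build_query_url(KEY, "SELECT A,D,E,F")
print(url)
```

The sketch only builds the URL; fetching it (and checking that the spreadsheet is shared publicly) is left to whatever HTTP tool you prefer.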
So what sort of questions might we want to ask of the data? Let’s build some up.
We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F
We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC
Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').
How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC
Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to and including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the number for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results (FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'
Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.
How about the three electoral divisions with the lowest turn out? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…)
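One way of avoiding clause order problems is to let a little helper put the pieces together. The sketch below is just an illustration: the `assemble_query` function is hypothetical, and the clause order it uses is my reading of the order the Google query language expects (select, where, group by, pivot, order by, limit, offset, label, format, options).

```python
# Clause order as I understand the Google Chart Tools Query Language to require it.
CLAUSE_ORDER = ["select", "where", "group by", "pivot", "order by",
                "limit", "offset", "label", "format", "options"]


def assemble_query(**clauses):
    """Join clauses in the required order; keyword names use underscores (order_by)."""
    given = {name.replace("_", " "): value for name, value in clauses.items()}
    unknown = set(given) - set(CLAUSE_ORDER)
    if unknown:
        raise ValueError("Unknown clause(s): " + ", ".join(sorted(unknown)))
    parts = [name.upper() + " " + str(given[name])
             for name in CLAUSE_ORDER if name in given]
    return " ".join(parts)


# The clauses are deliberately passed in a jumbled order here.
query = assemble_query(
    format="100*F/B '#,#0.0'",
    limit=3,
    label="100*F/B 'Percentage'",
    order_by="100*F/B ASC",
    where="D CONTAINS 'Total'",
    select="A, 100*F/B",
)
print(query)
```

However the pieces are supplied, the query that comes out has the same shape as the lowest turnout query above.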
The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A
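To convince myself that the SUM(F)*COUNT(B)/SUM(B) trick does what I think it does, here is a minimal sketch that runs the same sort of aggregate over a toy table using Python and SQLite. The ward names and numbers are made up for illustration (they are not the Isle of Wight results), the column letters simply mirror the spreadsheet, and SQL's LIKE stands in for the CONTAINS operator.

```python
import sqlite3

# Made-up rows: (A division, B electoral roll, D description, E party, F votes).
toy_rows = [
    ("Ward X", 1000, "Total votes cast", "NA", 400),
    ("Ward X", 1000, "Papers spoilt", "NA", 10),
    ("Ward X", 1000, "Candidate 1", "Party P", 250),
    ("Ward Y", 2000, "Total votes cast", "NA", 500),
    ("Ward Y", 2000, "Papers spoilt", "NA", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (A TEXT, B INTEGER, D TEXT, E TEXT, F INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", toy_rows)

# SUM(B)/COUNT(B) recovers the roll size, because B is repeated on each summed row.
# The leading 100.0 forces floating point arithmetic and gives a percentage.
rows = conn.execute("""
    SELECT A, 100.0 * SUM(F) * COUNT(B) / SUM(B) AS turnout
    FROM results
    WHERE D LIKE '%Total%' OR D LIKE '%spoil%'
    GROUP BY A
    ORDER BY A
""").fetchall()

print(rows)  # [('Ward X', 41.0), ('Ward Y', 26.0)]
```

For Ward X that is (400 + 10) votes out of a roll of 1000, or 41%, which is what we would expect.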
In terms of popularity, who were the top 5 candidates in terms of people receiving the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5
How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5
Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC
How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC
To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.
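As an illustration of the sort of thing nesting buys you in ordinary SQL, here is a minimal sketch (again using Python, SQLite and a made-up toy table rather than the real results) that finds the candidate with the most votes in each electoral division by nesting one SELECT inside another.

```python
import sqlite3

# Made-up candidate rows: (A division, D candidate, E party, F votes).
toy_rows = [
    ("Ward X", "Candidate 1", "Party P", 250),
    ("Ward X", "Candidate 2", "Party Q", 150),
    ("Ward Y", "Candidate 3", "Party P", 200),
    ("Ward Y", "Candidate 4", "Party Q", 300),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (A TEXT, D TEXT, E TEXT, F INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", toy_rows)

# The inner SELECT finds the top vote count in the outer row's division.
winners = conn.execute("""
    SELECT A, D, F
    FROM results AS r
    WHERE E != 'NA'
      AND F = (SELECT MAX(F) FROM results WHERE A = r.A AND E != 'NA')
    ORDER BY A
""").fetchall()

print(winners)  # [('Ward X', 'Candidate 1', 250), ('Ward Y', 'Candidate 4', 300)]
```

With the spreadsheet query language, a question like this would presumably need to be answered in two steps, with the second query built from the results of the first.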
A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)
Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.
Since I discovered the presentation, I’ve been recommending it to various folk, particularly slides 20-22? that refer to the educational example. Trying to find the slidedeck today, a websearch failed to turn it up so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:
The Wayback machine had grabbed bits and pieces of text, but not the actual slides…
Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(
For what it’s worth, here’s the commentary:
Introduction to Narrative Science — Presentation Transcript
We Transform Data Into Stories and Insight… In Seconds
Automatically, Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products, Improve Decision-Making, Optimize Customer Interactions
Customer Types Media and Data Business Publishing Companies Reporting
How Does It Work? The Data The Facts The Angles The Structure Stats Tests Calls The Narrative Language Completed Text Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data In – Transforming Data into Stories
Sports Journalism: Big Ten Network – Narrative
Financial Journalism: Forbes – Data In
Financial Journalism: Forbes – Narrative
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview North America Consumer Services Short Interest Update There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the company’s stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the company’s shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – Overview Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY: The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P500 was up 0.4%. Here are a few Healthcare stocks that bucked the sector’s downward trend. MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRK’s volume was 86.1% lower than usual with 2.5 million shares trading hands. Today’s gains still leave the stock about 11.1% lower than its price three months ago. LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volume was 34,155. Today’s gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago. Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data In
Real Estate: Hanley Wood – Narrative
BI: Leading Fast Food Chain – Data In
BI: Leading Fast Food Chain – Store Level Report January Promotion Falling Behind Region The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match Sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year.
Education: Standardized Testing – Data In
Education: Standardized Testing – Study Recommendations
Marketing Services & Digital Media: Data In
Marketing Services & Digital Media: Narrative
PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?
Reading through another wonderful post on the FullFact blog last night (Full Fact sources index: where to find the information you need), I noticed that the linked to resources from that post were being redirected via a Google URL:
A tweet confirmed that this wasn’t intentional, so what had happened? I gather the workflow used to generate the post was to write it in Google docs, and then copy and paste the rich/HTML text into a rich text editor in Drupal, although I couldn’t recreate this effect (and nor could FullFact). However, suitably suspicious, I started having a play, writing a simple test document in Google docs:
The Google doc automatically links the test URL I added to the document. (This is often referred to as “linkification” – if a piece of text is recognised as something that looks like a URL or web link, it gets rewritten as a clickable link. Typically, you might assume that the link you’ll now be clicking on is the link that was recognised. This may be a bad assumption to make…) If you hover over the URL as written in the document, you get a tooltip that suggests the link is to the same URL. However, if you hover over the tooltip listed URL, (or click on it) you can see from the indicator in the bottom left hand corner of the browser what the actual URL you’re clicking on is. Like this:
In this case, the link you’ll actually click on is a referral to the original link via a Google URL. This one, in fact:
What this means is that if I click on the link, Google tracks the fact that the link was clicked on. From the value of the usg variable (in this case, AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A) it presumably also knows the source document containing the link and whatever follows from that.
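As a rough illustration of what such a rewritten link carries, here is a minimal Python sketch that unpicks one. It assumes the wrapper has the form `https://www.google.com/url?q=<target>&usg=<token>`. The post only names the `usg` variable, so the `q`/`url` parameter names and the example URL are assumptions for illustration, not a documented Google format.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: redirect links are served from these hosts under the path /url.
GOOGLE_HOSTS = {"google.com", "www.google.com"}


def unwrap_google_redirect(link):
    """Return (target, usg) for a Google redirect link, else (link, None)."""
    parts = urlparse(link)
    if parts.hostname not in GOOGLE_HOSTS or parts.path != "/url":
        return link, None
    params = parse_qs(parts.query)  # parse_qs also percent-decodes the values
    # Assumption: the original destination travels in "q" (or "url").
    target = (params.get("q") or params.get("url") or [link])[0]
    usg = (params.get("usg") or [None])[0]
    return target, usg


# Hypothetical wrapped link; only the usg value is taken from the post.
wrapped = (
    "https://www.google.com/url"
    "?q=http%3A%2F%2Fexample.com%2Ftest"
    "&usg=AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A"
)
print(unwrap_google_redirect(wrapped))
```

Links that do not match the assumed pattern are passed through untouched, so the function can be run over any list of URLs.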
Hmmm… If I publish the document, the Google rewrite appears to be removed:
There are also several export options associated with the document:
So what links get exported?
Here’s the Word export:
That seems okay – no tracking. How about odt?
That looks okay too. RTF and HTML export also seem to publish the “clean” link.
What about PDF?
Hmm… so tracking is included here. So if you write a doc in Google docs that contains links that are autolinked, then you export that doc as PDF and share it with other folk, Google will know when folk click on that link from a copy of that PDF document and (presumably) the originally authored Google docs document (and all that that entails…)
How about if we email a doc as a PDF attachment to someone from within Google docs:
So that seems okay (untracked).
What’s the story then? FullFact claimed they cut and paste rich HTML from Google docs into a rich text editor and the Google redirection attack was inserted into the link. I couldn’t recreate that, and nor could the FullFact folk, so either there are some Google “experiments” going on, or the workflow was misremembered.
In my own experiments, I got a Google redirection from clicking links within my original document, and from the exported PDF, but not from any other formats?
So what do we learn? I guess this at least: be aware that when Google linkifies links for you, it may be redirecting clicks on those links through Google tracking servers. And that these tracked links may be propagated to exported and/or otherwise shared versions of the document.
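One way to act on that awareness is to check pasted or exported HTML for redirect-wrapped links before publishing it. The sketch below is a minimal Python example using only the standard library. The host and `/url` path it looks for are assumptions about what the wrapper looks like, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RedirectLinkFinder(HTMLParser):
    """Collect href values that appear to route through a Google redirect."""

    def __init__(self):
        super().__init__()
        self.suspect_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlparse(href)
        # Assumption: wrapped links point at google.com/url?...
        if parts.hostname in ("google.com", "www.google.com") and parts.path == "/url":
            self.suspect_links.append(href)


def find_redirect_links(html):
    """Return every link in the HTML that matches the assumed wrapper pattern."""
    finder = RedirectLinkFinder()
    finder.feed(html)
    return finder.suspect_links


# Hypothetical pasted markup: one wrapped link, one clean link.
sample = (
    '<p><a href="https://www.google.com/url?q=http%3A%2F%2Fexample.com%2F&amp;usg=XYZ">'
    "http://example.com/</a> and "
    '<a href="http://example.com/clean">a clean link</a></p>'
)
for link in find_redirect_links(sample):
    print("Redirected link:", link)
```

An empty result only means nothing matched this particular pattern. Other rewriters use different hosts and paths, so it is not proof that the links are clean.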
PS see also Google/Feedburner Link Pollution or More Link Pollution – This Time from WordPress.com for more of the same, and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There for a quick overview of what might happen when you actually land on a page…
Link rewriters are, of course, to be found in lots of other places too…
Twitter, for example, actually wraps all shared links in its t.co wrapper:
Delicious (which I’ve stopped using – I moved over to Pinboard) also uses its own proxy for clicked on stored bookmarks…
If you have any other examples, particularly of link rewriting/annotation/pollution where you wouldn’t expect it, please let me know via the comments…
I thought this was handy on the OER-DISCUSS mailing list:
Our copyright officer writes:
… US Copyright ‘Fair Use’ or S29 copying for non-commercial research and private study which allows copying but the key word here is ‘private’. i.e. the provisos are that you don’t make the work or copies available to anyone else.
Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.
S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works i.e. books journals etc and
b) negated because licences are now available S.36 (3)
S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broadcasts or sound recordings.
The only educational exception students can rely on is s.32(3) for Examination, although this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award, and any further dealing with the work after the examination process becomes infringement.
I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)
There is also exception s.30 for Criticism or Review, which is a general exception for all, and the copying is necessary for a genuine critique or review of it.
If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons), would be highly recommended.
Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library