Are We Just Google’s Lab Rats?
There are some interesting comments relating to my previous post on Google Lock-In Lock-Out in a comment thread on OSnews: Why Google gets so much credit. Here are some of my own lazy Sunday morning notes/thoughts relating to that, and other comments…
- killing Google Reader does not kill RSS/there was no “malicious intent” mapping out the Reader/RSS strategy:
A nice phrase in an #opentech talk yesterday was that we (technologists and engineers and data scientists, for example) have to “act responsibly”. Google Reader helped popularise feed reading when some of us were hopeful for its future (“We ignore RSS at OUr Peril”), and as such attracted many readers away from other clients (myself included), with the result that competition was harder (“compete against Google? Hmm… maybe not…”). Google Reader’s infrastructure and unofficial APIs enabled folk to build services off the back of it, turning it into de facto infrastructure for other people’s applications and services. (Remember: the Google Maps API was unofficial at first.) There aren’t many OPML bundlers out there, for example, but for hackers into appropriating tech, Google Reader is one. Since I moved away from Google Reader (to theoldreader) I haven’t used Flipboard so much, which as far as I was concerned was using Reader essentially as infrastructure. Caveat emptor, I guess, for developers building on top of other companies’ services (as many Twitter and Facebook app developers keep discovering).
With Feedburner, Google bought up a service that acted as a proxy, taking public syndication feeds, instrumenting them with analytics, and then encouraging the people taking up the syndicated content to subscribe to the Feedburner feed. Where RSS and Atom were designed to support syndication between independent parties, Feedburner – and then Google – insinuated itself between those parties. By replacing self-controlled feeds as the subscription endpoint with Google controlled endpoints, publishers gave up control of their syndication infrastructure. With Google losing interest in open syndication feeds as it pursues its own closed content network agenda, we are faced with a situation whereby Google can potentially trash a widespread syndication infrastructure that would have remained resilient if Google hadn’t insinuated itself into it. Or if we hadn’t been so stupid as to simplistically accept its overtures.
Hmmm… thinks… do we need a Google users’ motto? Don’t be stupid perhaps…?!
I applaud Google for developing the services it does, getting them to scale and opening up API access. But as these services become de facto infrastructure, the question of how Google acknowledges any responsibility that flows from this (even if this responsibility is incorrectly assumed) becomes an issue. Responsibilities arise in other areas too, of course. Such as taxation and corporate transparency. But that’s another issue. (Would Google act differently if its motto was “Be responsible” or “Act responsibly” rather than “Don’t be evil”? It strikes me that “Act responsibly” could work as a motto for both companies and their users?)
It seems to me that with Google+, Google is not adopting open syndication standards in two ways: not using them “internally”, and not making feeds publicly available. There may be good technical reasons for the first, but by the second Google is *not allowing* its community members to participate in an open content syndication network/system. Google’s choice, but I’m not playing.
Google is not killing the open standards by closing off access to them in commercial licensing terms, but it may contribute to stifling their adoption by adopting alternative standards that others feel they have to adopt because of the influence Google has on web traffic.
Consider this other way of looking at it – Google is presumably trying to get other parties to adopt WebP by developing it as an open standard. Google assumes that it can drive adoption of this as a web standard by adopting it itself. In terms of argumentation, it doesn’t follow that by not adopting something Google can prevent it being adopted (i.e. that by not adopting a standard, or by stopping its own use of one, Google kills it generally), but people follow bad logic all the time (and if they follow Google for their technology choices, or have a technology model based on being parasitic on Google infrastructure, Google’s dropping of a standard effectively kills it for those people) …
- control of what we see
Google makes money by putting ad-links in front of eyeballs that people click on. By presenting “relevant” ads, Google presumably tries to maximise the click-thru rate so that it can make more money per displayed link.
To encourage you to spend your attention on pages that Google controls, Google has adopted the idea that by presenting you (and me; us) with “relevant” content, we are likely to remain engaged. With Google web search, the relevance of search results supposedly attracts us back to the Google search tool. With services such as Google Now, Google pre-emptively tries to present you with information it thinks you need, presumably based on predictive models of sequences of action that other people (or you yourself) have demonstrated in the past.
I’m not really up on behavioural psychology models, but I have a vague memory that intermittent reinforcement schedules were demonstrated to be one of the more effective modes of behaviourist training/operant conditioning. So I wonder: how effective are predictive intermittent positive reinforcement schedules? (You get the idea, right? We’re pigeons that peck at Android phones and Google is the experimenter trying to get us to peck the right way, by reinforcing us every now and again by satisfying our intent. That is, has there been a flip away from Google using us to provide reinforcement training signals to its algorithms, to a situation in which we have become Google’s experimental lab rats, coupled in a series of ongoing experiments that train us and its algorithms, jointly, together, to maximise… something…)
There is a danger, I think, in Google chasing the “relevance” thing too far, seeing the maximisation of whatever conversion metrics it decides on as being a sign that it has “got things right” for us, that it is satisfying our “intent”. And if operant conditioning does influence the way we behave, maybe we do actually need to start thinking about what the machine algorithms are training us to do. Are training us to do. Training us.
Google’s stated aim is to “organize the world’s information and make it universally accessible and useful”.
- Through web search, it started to organise information it presented to us through search results that were more appealingly ranked (seemed “more relevant”) than those of the other search engines.
- Through personalised search, it started to organise the way it presented results to each of us individually.
- Through web tracking, it presents us with information – adverts – organised in a way it presumably thinks is more personally meaningful to us (but maximising what metric exactly? More likely to cause us to act in a particular way, as measured by whether we click the link, or linger on a page, or engage in a particular behaviour that can be captured – for model building and exploitation purposes – by web tracking algorithms?)
- Through Google Now, and the new Google image gallery tools, Google is seeking to organise our information (we’re part of the world, right?) on our behalf and present it back to us in a way that the Google algorithms decide.
The old photos in a drawer back at my family home are sorted howsoever (by whatever algorithm “use” and random access results in). Now they’ll be sorted by Google. Maybe the algorithms are similar. Or maybe they’re not. What would be evil, I think, would be if the ranking algorithms that are used to decide the order in which organic information is presented to us start to be influenced by the algorithms that are tied to advertising or marketing, that is, to algorithms that are used to try to maximise the extent to which we are influenced in accord with the goals, beliefs, desires and intents of others (with a hat tip there to agent logic and the theories of intelligent software agents).
At the moment I believe that Google believes it is trying to develop algorithms that benefit us personally, in a utilitarian way. But I’m not sure what function it is they are maximising or how they think it maps onto any personal theories or preferences we may have about what is “accessible” and “useful”. I guess we might also ask whether “accessible” and “useful” are the road to a Good Life (because in the end this comes down to philosophy and ethics, doesn’t it?) or whether we should be “organising the world’s information” with some other purpose in mind?
PS Just by the by, it’s worth noting that the educational arena is seeking to use learning analytics to instrumentalise our behaviour and engagement within learning systems and contexts for our, erm, learning benefit. (Measured how?)
Google Lock-In Lock-Out
As John Naughton feels obliged to remind folk every now and again, the web is not the internet. Because we all know that for many people, Facebook apparently is. Or Google is.
And as anyone following my tweets over the last year or two will know, I’ve started finding Google more and more irksome.
It’s not just that the one or two people I know who use Google Plus (Google+?) are now all but lost to me as sources of neat ideas because I don’t do Gooplus and it doesn’t do RSS…
It’s not just because Google is shutting down the Google Reader backbone that powers a lot of RSS and Atom syndication feed services (and leaves me wondering: how long is Feedburner for this world? Maybe it’s time to start moving your feeds and trying to get folk off that piece of infrastructure…)…
It’s not just that geocoding done within Fusion Tables is not exported – if you look at a KML feed from Google Fusion Tables, you’ll find there’s no lat-long data there. To get a geo-view, you need to stick in Google Fusion Tables or wire the feed into Google Earth, which will then “initiate geocoding of location descriptions while viewing [the] KML file”…
It’s not just that Google is deprecating gadgets from spreadsheets, which as Martin points out means that if I want to visualise data in a spreadsheet all I’m going to be left with is Google’s crappy charts…
It’s not just that Google moved away from using CalDav to support calendar interoperability…
It’s not just that Google is moving away from using the XMPP instant messaging protocol (nor, I think, is it making a move towards using MQTT?)…
It’s not just that Google will be using your photos to create photos you never took and presumably offer them up via your image gallery in favour of photos it thinks aren’t up to scratch…
Though I’m sure that Google wouldn’t start pushing images in just the WebP image format so that you’d feel obliged to use Chrome…
And also in the browser, I’m sure Google wouldn’t start using Google Public DNS as a Chrome default setting. (Is the same true of Chromebook? Presumably folk connected to Google Fiber use Google Public DNS?) But does it use SPDY as a default? How about on Android?
It’s not just that Google will tag your social media posts using tags you might never use yourself, and as it does so altering the externalised memory embodied by that post…
It’s not just that as web search gets increasingly personalised and localised, we lose any sense of Google ground truth; I’m not quite sure how the info-skills trainers are going to address this when training a motley crew of different learners to discover a particular resource other than by using known-item search strategies (which sort of misses the point). Or maybe it’s right that a cohort of students should all get different results when they run ostensibly the same search?
Hmmm.. thinks: if personalised/localised search could be reduced to raw search phrase (whatever I put in the search box) plus a set of invisible search limits that reflect the personalisation/localisation tweaks applied to my search, how might my hidden/invisible search limits compare with yours?
It’s not just that Google uses tax efficient corporate structures to minimise its tax bill, because lots of companies do that…
It’s not just any one of these things, taken on its own merits… it’s all of them taken together…
“Embrace, extend, extinguish”… where have we heard that before?
Drip; drip; drip…
PS see also M. Wunsch on The Great Google Goat Rodeo
Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language
There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners, we can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.
To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).
The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog, then shaped and cleaned in OpenRefine with a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.
To query the data, I’ve popped up a simple query form on Scraperwiki: Google Spreadsheet Explorer
To use the explorer, you need to:
- provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
- preview the table headings;
- construct a query using the column letters;
- select the output format;
- run the query.
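The explorer is really just a front end for building a query URL. As a rough sketch of what that might look like in Python, the following assumes a `gviz/tq` style endpoint that takes `tq` (the query), `tqx` (the output format) and `gid` (the sheet) as parameters. The function name and its defaults are made up for illustration, and the URL pattern is an assumption that may not match the one the explorer actually uses.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern; the explorer may build its URLs differently.
BASE = "https://docs.google.com/spreadsheets/d/{key}/gviz/tq"


def build_query_url(key, query, sheet=0, out="csv"):
    """Return a URL that asks the spreadsheet to run `query` (hypothetical helper)."""
    params = {
        "tq": query,          # the query itself, written using column letters
        "tqx": "out:" + out,  # the output format, for example csv, html or json
        "gid": sheet,         # which sheet in the spreadsheet to query
    }
    # urlencode takes care of escaping the spaces, commas and quotes.
    return BASE.format(key=key) + "?" + urlencode(params)


url = build_query_url(
    "0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc",
    "SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC",
)
print(url)
```

Nothing is fetched here; the sketch only shows how the key, sheet, query and output format end up in a single address.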
So what sort of questions might we want to ask of the data? Let’s build some up.
We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F
We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC
Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').
How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC
Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to and including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the numbers for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results (FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'
Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.
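To make the arithmetic behind the “good vote turnout” query concrete, here is a minimal Python sketch of the same calculation. The rows are invented for illustration (they are not the Isle of Wight figures) and only carry the columns the query touches: A (electoral division), B (NoONRoll), D (description) and F (votes).

```python
# Invented rows laid out as (A: division, B: NoONRoll, D: description, F: votes).
rows = [
    ("Division X", 2000, "Total votes cast", 800),
    ("Division X", 2000, "Papers spoilt", 10),
    ("Division Y", 3000, "Total votes cast", 900),
    ("Division Y", 3000, "Papers spoilt", 30),
]


def good_vote_turnout(rows):
    """Mimic SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC."""
    # Keep only the 'Total' rows and work out the percentage for each one.
    picked = [(a, 100 * f / b) for a, b, d, f in rows if "Total" in d]
    # Highest percentage first, as ORDER BY ... DESC would do.
    return sorted(picked, key=lambda r: r[1], reverse=True)


for division, pct in good_vote_turnout(rows):
    # One decimal place, roughly what the FORMAT clause asks for.
    print(division, format(pct, ".1f"))
```

The same two-vote caveat applies: a division where electors could each cast two votes would produce a misleading percentage here too.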
How about the three electoral divisions with the lowest turnout? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…)
The actual turnout (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A
Who were the top 5 candidates by number of votes received? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5
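The SUM(B)/COUNT(B) trick is easier to trust once you have seen it worked through. The sketch below redoes the grouped calculation in plain Python over invented rows (not the real results), using the same column lettering as the query; like the untidied query, it returns a fraction rather than a percentage.

```python
from collections import defaultdict

# Invented rows laid out as (A: division, B: NoONRoll, D: description, F: votes).
rows = [
    ("Division X", 2000, "Total votes cast", 800),
    ("Division X", 2000, "Papers spoilt", 10),
    ("Division Y", 3000, "Total votes cast", 900),
    ("Division Y", 3000, "Papers spoilt", 30),
]


def actual_turnout(rows):
    """Mimic SELECT A, SUM(F)*COUNT(B)/SUM(B) ... GROUP BY A."""
    sum_f = defaultdict(int)    # SUM(F): good votes plus spoilt papers
    sum_b = defaultdict(int)    # SUM(B): the roll size, counted once per row
    count_b = defaultdict(int)  # COUNT(B): how many times it was counted
    for a, b, d, f in rows:
        if "Total" in d or "spoil" in d:
            sum_f[a] += f
            sum_b[a] += b
            count_b[a] += 1
    # SUM(B)/COUNT(B) recovers the roll size, so this is SUM(F) / roll size.
    return {a: sum_f[a] * count_b[a] / sum_b[a] for a in sum_f}


print(actual_turnout(rows))
```

For Division X the roll is counted twice (4000), so dividing by the count of 2 gets back to 2000 before the 810 papers are compared against it.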
How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5
Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC
How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC
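As a cross-check on what the GROUP BY is doing in the “votes per candidate” query, here is the same calculation sketched in Python. The candidates, parties and vote counts are invented placeholders, with 'NA' marking the non-candidate rows as in the spreadsheet.

```python
from collections import defaultdict

# Invented rows laid out as (D: candidate or description, E: party, F: votes).
rows = [
    ("Candidate 1", "Party P", 500),
    ("Candidate 2", "Party P", 300),
    ("Candidate 3", "Party Q", 450),
    ("Total votes cast", "NA", 1250),
]


def votes_per_candidate(rows):
    """Mimic SELECT E, SUM(F), COUNT(F), SUM(F)/COUNT(F) WHERE E != 'NA' GROUP BY E."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for d, e, f in rows:
        if e != "NA":
            totals[e] += f
            counts[e] += 1
    table = [(e, totals[e], counts[e], totals[e] / counts[e]) for e in totals]
    # ORDER BY SUM(F)/COUNT(F) DESC
    return sorted(table, key=lambda r: r[3], reverse=True)


for party, total, candidates, average in votes_per_candidate(rows):
    print(party, total, candidates, average)
```

Note how a party with a single strong candidate can top this table even though its total vote is lower.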
To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.
To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column), and to return just those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here. The support that follows where is a column (as is candinit inside the brackets), whereas the support that follows each from is the table the columns are being pulled from.
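If you want to try the query without going near Scraperwiki, the following sketch runs it against a throwaway in-memory SQLite database from Python. The table here has only the two columns the query mentions, and the names in it are invented; the real support table presumably carries more columns than this.

```python
import sqlite3

# A throwaway in-memory database standing in for the scraper's datastore.
conn = sqlite3.connect(":memory:")
conn.execute("create table support (candinit text, support text)")

# Invented (candidate, assentor) pairs; two of the assentors are also candidates.
conn.executemany(
    "insert into support values (?, ?)",
    [
        ("A. Able", "B. Baker"),
        ("A. Able", "C. Clark"),
        ("B. Baker", "D. Drew"),
        ("B. Baker", "A. Able"),
    ],
)

# The one-liner: keep rows whose assentor also appears in the candidate column.
query = "select * from support where support in (select candinit from support)"
rows = conn.execute(query).fetchall()

for candidate, assentor in rows:
    print(assentor, "supports", candidate)
```

Only the rows where the assentor is themselves a candidate come back, which is exactly the set of edges the Gephi route picks out.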
Here’s the result of running the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:
Simples:-)
More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?
In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)
This is what we’ll be working towards:
If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.
Here’s a reminder of what the original PDF doc looked like (archive copy):
Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…
When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last). We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
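Here is a small Python sketch of both ideas: exact matching, and recasting a name into the Firstname I. Lastname form. The scraper’s actual code isn’t shown in this post, so the input layout assumed here (“Lastname, Firstname Middlename”) and the function name are guesses made purely for illustration, and the name used is invented.

```python
def to_assentor_form(name):
    """Recast 'Lastname, Firstname Middlename' as 'Firstname M. Lastname'.

    The input layout is an assumption; the scraper may see names differently.
    """
    last, _, given = name.partition(",")
    parts = given.split()
    if not parts:
        # No comma or no given names: leave the name as it is.
        return name.strip()
    first, middles = parts[0], parts[1:]
    # Reduce each middle name to an initial followed by a full stop.
    initials = [m[0].upper() + "." for m in middles]
    return " ".join([first] + initials + [last.strip()])


# Exact matching is case sensitive, as in the example in the text.
print("This string" == "This string")   # the strings are identical
print("This string" == "this string")   # differ in the first character

candidate = to_assentor_form("Example, Alice Mary")
print(candidate == "Alice M. Example")
```

Exact matching is brittle: a stray space or a missing initial is enough to stop two versions of the same name from matching.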
We can download a CSV representation of the data from the scraper directly:
The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.
To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit column as Target and the support column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.
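If you would rather not edit the file by hand, the renaming can be scripted. This is a minimal sketch using Python’s csv module; it assumes the download has a header row containing candinit and support (any other columns are passed through untouched), and the sample row is invented.

```python
import csv
import io

# candinit is where the arrow points (Target); support is where it starts (Source).
RENAME = {"candinit": "Target", "support": "Source"}


def rename_header(csv_text):
    """Return the CSV text with the header renamed for Gephi's edge import."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    # Swap in the new names, leaving any other column names alone.
    writer.writerow([RENAME.get(h, h) for h in header])
    # Copy the data rows across unchanged.
    writer.writerows(reader)
    return out.getvalue()


sample = "candinit,support,party\nA. Able,B. Baker,Party P\n"
print(rename_header(sample))
```

To use it on the real download, read the file into a string, pass it through the function and write the result back out before importing into Gephi.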
Start Gephi, select Data Laboratory tab and then New Project from the File menu.
You should now see a toolbar that includes an “Import Spreadsheet” option:
Import the CSV file as such, identifying it as an Edges Table:
You should notice that the Source and Target columns have been identified as such and we have the choice to import the other columns or not – let’s bring them in…
You should now see the data has been loaded in to Gephi…
If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.
Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:
If you look closely, you’ll be able to see that there are many separate groupings of connected circles – these represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two nodes are connected when in fact they aren’t…)
However, there are also other groupings in which one candidate may support another:
These connections may allow us to see grouping of candidates supporting each other along party lines.
One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network. I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree 1 or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out-degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:
Hmmm.. how about if we extend our filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also resize the nodes/labels so we can better see the candidates’ names. The Neighbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth 2)). Which is to say, it will bring in the candidates who are supported by candidates, and their supporters:
That’s a bit clearer, but there are still overlapping lines, so it may make sense to lay out the network again:
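The degree filter is doing something quite simple under the hood. As a sketch of the idea (not of how Gephi implements it), here is the same selection in Python over an invented edge list, where each edge runs from an assentor (Source) to a candidate (Target).

```python
from collections import Counter

# Invented edges as (Source: assentor, Target: candidate).
edges = [
    ("B. Baker", "A. Able"),
    ("C. Clark", "A. Able"),
    ("A. Able", "D. Drew"),
    ("E. Evans", "D. Drew"),
]


def supported_supporters(edges):
    """Return nodes with in-degree >= 1 and out-degree >= 1."""
    out_degree = Counter(source for source, target in edges)
    in_degree = Counter(target for source, target in edges)
    # Every key of out_degree supports someone; keep those also supported.
    return sorted(n for n in out_degree if in_degree[n] >= 1)


print(supported_supporters(edges))
```

In this toy network A. Able is both supported (by two people) and a supporter (of D. Drew), so is the only node the filter keeps.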
We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:
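For a feel of what a connected components calculation involves, here is a minimal Python sketch that ignores the direction of the arrows and groups together any nodes that can reach each other. It is a plain breadth-first search over an invented edge list, offered as an illustration of the idea rather than as a description of Gephi’s own algorithm.

```python
from collections import defaultdict, deque

# Invented edges as (Source: assentor, Target: candidate).
edges = [
    ("B. Baker", "A. Able"),
    ("C. Clark", "A. Able"),
    ("E. Evans", "D. Drew"),
]


def connected_components(edges):
    """Group nodes into components, treating every edge as undirected."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search outwards from a node we have not yet visited.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(neighbours[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(group))
    return components


for number, group in enumerate(connected_components(edges)):
    print(number, group)
```

Giving each group its own number is what makes the colouring possible: every node in a component can then be painted with that component’s colour.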
Let’s reset the colours and go back to colourings along party lines:
If we go to the Preview view, we can generate a prettified view of the network:
In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:
Going back to the Graph view and zooming in, we see that Paul G. Taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)
To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:
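The same check is easy to sketch outside Gephi. The Python below lists every name that appears as the source of edges to two or more different candidates; the edge list is invented, and because it works on names alone it cannot tell one person from two people who happen to share a name.

```python
from collections import defaultdict

# Invented edges as (Source: assentor, Target: candidate).
edges = [
    ("P. Example", "Candidate One"),
    ("P. Example", "Candidate Two"),
    ("Q. Sample", "Candidate One"),
]


def multi_supporters(edges, minimum=2):
    """Return names supporting at least `minimum` different candidates."""
    targets = defaultdict(set)
    for source, target in edges:
        targets[source].add(target)
    # Out-degree here is the number of distinct candidates supported.
    return {s: sorted(t) for s, t in targets.items() if len(t) >= minimum}


print(multi_supporters(edges))
```

Anything this turns up is a prompt to go and check the original poll notices, not proof that one person assented twice.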
Just that one then…
Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work supports chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtered network up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.
Here’s the result:
And again we can go into Preview mode to generate a nice vectorised version of the graph:
This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).
So there we have it: a quick tour of how to use Gephi to look at the co-support structure of a group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particularly amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check: whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.
If this all seems like too much effort, remember that there’s always the One-Liner, Thirty Second Route to the Info.
PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)
Ephemeral Citations – When Presentations You Have Cited Vanish from the Public Web
A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)
Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.
Since I discovered the presentation, I’ve been recommending it to various folk, particularly slides 20-22? that refer to the educational example. Trying to find the slidedeck today, a websearch failed to turn it up so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:
The Wayback machine had grabbed bits and pieces of text, but not the actual slides…
Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(
For what it’s worth, here’s the commentary:
Introduction to Narrative Science — Presentation Transcript
We Transform Data Into Stories and Insight… In Seconds
Automatically, Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products, Improve Decision-Making, Optimize Customer Interactions
Customer Types Media and Data Business Publishing Companies Reporting
How Does It Work? The Data The Facts The Angles The Structure Stats Tests Calls The Narrative Language Completed Text Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data In. Transforming Data into Stories
Sports Journalism: Big Ten Network – Narrative. Transforming Data into Stories
Financial Journalism: Forbes – Data In. Transforming Data into Stories
Financial Journalism: Forbes – Narrative. Transforming Data into Stories
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview North America Consumer Services Short Interest Update There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the company’s stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the company’s shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – Overview Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY: The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P500 was up 0.4%. Here are a few Healthcare stocks that bucked the sector’s downward trend. MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRK’s volume was 86.1% lower than usual with 2.5 million shares trading hands. Today’s gains still leave the stock about 11.1% lower than its price three months ago. LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volume was 34,155. Today’s gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago. Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data In – Transforming Data into Stories
Real Estate: Hanley Wood – Narrative – Transforming Data into Stories
BI: Leading Fast Food Chain – Data In – Transforming Data into Stories
BI: Leading Fast Food Chain – Store Level Report January Promotion Falling Behind Region The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match Sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year. Transforming Data into Stories
Education: Standardized Testing – Data In – Transforming Data into Stories
Education: Standardized Testing – Study Recommendations – Transforming Data into Stories
Marketing Services & Digital Media: Data In – Transforming Data into Stories
Marketing Services & Digital Media: Narrative – Transforming Data into Stories
Bah…:-(
PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?
Questioning Election Data to See if It Has a Story to Tell
I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…
So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way”): are the candidates standing in different seats local to that ward/electoral division? By “local”, I mean – can they vote in that ward by virtue of having a home address that lies within that ward?
Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):
Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:
So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect of living within, or outside of, the electoral division a candidate is standing in? And so on.
In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some of the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)
Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:
In OpenRefine, we can Create a New Project and then import the data directly:
The data is in comma separated CSV format, so let’s specify that:
We can then name and create the project and we’re ready to start…
…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.
Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.
And what’s more, MapIt also makes the data available in a format that OpenRefine can read, at a web address (aka a URL) that we can construct from a postcode:
Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT
Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT
The %20 is a character encoding for a space. In this case, we can also use a +.
So – to get information about the electoral division an address lies in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)
Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.)
To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.
What we want to do then is to create a new, derived column based on the address:
And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:
Here’s the result…
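For anyone who prefers to see the recipe outside OpenRefine, here is a minimal Python sketch of the same split-at-commas, take-the-last-chunk, strip-the-whitespace idea. The function name and the example address are made up for illustration; only the postcode itself is the one used as an example elsewhere in this post.

```python
def extract_postcode(address: str) -> str:
    """Return the last comma-separated chunk of an address, minus surrounding whitespace."""
    chunks = address.split(",")   # ordered list of comma-separated chunks
    return chunks[-1].strip()     # count back from the end, then strip spaces


# Hypothetical address, invented for this example
print(extract_postcode("1 Example Road, Sandown, Isle of Wight, PO36 0JT"))  # PO36 0JT
```

Note that this only works as long as every address really does end with a comma followed by the postcode; anything else in last place will be returned just as happily.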
Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:
When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:
The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.
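As a rough Python sketch of what is going on here (not what OpenRefine actually does internally): build an escaped MapIt URL from each postcode, then call the URLs one at a time with a pause between requests. The function names are my own, and the `fetch` argument is a stand-in for whatever you use to load a page, passed in so the sketch does not depend on any particular HTTP library.

```python
import time
from urllib.parse import quote

MAPIT_BASE = "http://mapit.mysociety.org/postcode/"


def mapit_url(postcode: str) -> str:
    """Build a MapIt lookup URL, percent-encoding the postcode (space becomes %20)."""
    return MAPIT_BASE + quote(postcode.strip())


def fetch_all(urls, fetch, throttle_ms=500):
    """Call fetch(url) for each URL in turn, pausing throttle_ms between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(throttle_ms / 1000.0)  # be polite to the remote service
        results.append(fetch(url))
    return results


print(mapit_url("PO36 0JT"))  # http://mapit.mysociety.org/postcode/PO36%200JT
```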
When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…
We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:
We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:
filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']
to create a new column containing the electoral division:
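If the GREL looks a bit opaque, the same logic is easier to read in Python: parse the JSON, look through the entries under `areas`, and return the name of the first one whose type is `UTE`. The `areas`, `type` and `name` keys and the `UTE` code are the ones used in the expression above; the sample data below is invented (the ids, the `OTHER` type and the names are placeholders), so treat this as a sketch rather than a description of the full MapIt response.

```python
import json


def electoral_division(mapit_json: str, area_type: str = "UTE"):
    """Return the name of the first area of the given type, or None if there isn't one."""
    data = json.loads(mapit_json)
    for area in data.get("areas", {}).values():
        if area.get("type") == area_type:
            return area.get("name")
    return None


# Made-up sample with the same shape the GREL expression relies on
sample = json.dumps({
    "areas": {
        "1001": {"type": "OTHER", "name": "Some Other Area"},
        "1002": {"type": "UTE", "name": "Example Division"},
    }
})
print(electoral_division(sample))  # Example Division
```

Returning None when no matching area is found makes it easier to spot rows where the lookup failed, rather than having the whole expression fall over.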
Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:
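The comparison itself is trivial; a Python sketch of it might look like the following. The "IN"/"OUT" labels are placeholders rather than the exact messages I used, and tidying up case and whitespace before comparing is an assumption on my part that may or may not be needed for a given dataset.

```python
def in_or_out(standing_ward: str, home_division: str) -> str:
    """Say whether a candidate's home division matches the ward they are standing in."""
    same = standing_ward.strip().lower() == home_division.strip().lower()
    return "IN" if same else "OUT"


# Hypothetical ward names
print(in_or_out("Example Division", "example division "))  # IN
print(in_or_out("Example Division", "Another Division"))   # OUT
```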
If we collapse down the spare columns, we get a clearer picture:
Like this:
If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):
We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…
Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)
There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a stacked bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?
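If pen and paper doesn’t appeal, the in/out tally per party is only a few lines of Python once the data has been exported. This is a sketch that assumes each row has been reduced to a (party, in/out) pair; the party names and rows below are invented for illustration and are not the actual Island results.

```python
from collections import Counter, defaultdict


def in_out_counts(rows):
    """Tally IN/OUT labels per party from (party, status) pairs."""
    counts = defaultdict(Counter)
    for party, status in rows:
        counts[party][status] += 1
    return {party: dict(tally) for party, tally in counts.items()}


# Made-up rows, purely to show the shape of the input and output
rows = [
    ("Party A", "IN"),
    ("Party A", "OUT"),
    ("Party A", "OUT"),
    ("Party B", "IN"),
]
print(in_out_counts(rows))
# {'Party A': {'IN': 1, 'OUT': 2}, 'Party B': {'IN': 1}}
```

The resulting dictionary has one entry per party, which is the shape you would need to feed a bar chart of in versus out candidates.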
One more thing we could try to do is to geocode the postcode of the address of each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into forms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:
'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')
The data, as imported from the Ordnance Survey, looks something like this:
As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.
Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart Harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.
The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:
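In Python, building that lookup URL from an Easting/Northing pair might be sketched like this. The function name and the example co-ordinate values are made up; the base URL and the `easting`/`northing` parameter names are taken from the URL pattern above.

```python
from urllib.parse import urlencode

LOOKUP_BASE = "http://www.uk-postcodes.com/eastingnorthing.php"


def latlong_lookup_url(easting, northing) -> str:
    """Build the Easting/Northing to lat/long lookup URL."""
    return LOOKUP_BASE + "?" + urlencode({"easting": easting, "northing": northing})


# Hypothetical co-ordinates, for illustration only
print(latlong_lookup_url(459000, 85000))
# http://www.uk-postcodes.com/eastingnorthing.php?easting=459000&northing=85000
```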
Here’s what the returned lat/long data looks like:
We can then create a new column derived from this JSON data by parsing it as follows

A similar trick can be used to generate a column containing just the longitude data.
We can then export a view over the data to a CSV file, or direct to Google Fusion tables.
With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:
Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:
We can inspect the location data using a more convenient “natural” view over it…
By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:
(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Tables lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)
We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:
Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “non-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:
Ah – oops… It seems we have a “data quality” issue, albeit a minor one…
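A quick way of catching this sort of thing earlier is to test whether each extracted value at least looks like a postcode before sending it off for geocoding. The Python sketch below uses a deliberately simplified pattern of my own (one or two letters, a digit, an optional letter or digit, an optional space, then a digit and two letters); it is not the full official postcode format, and the non-postcode example is made up.

```python
import re

# Simplified, assumed pattern: good enough to spot obvious non-postcodes,
# not a complete validator for every valid UK postcode.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def looks_like_postcode(value: str) -> bool:
    """Return True if the value has the rough shape of a UK postcode."""
    return bool(POSTCODE_RE.match(value.strip().upper()))


print(looks_like_postcode("PO36 0JT"))       # True
print(looks_like_postcode("Isle of Wight"))  # False
```

A check like this says nothing about whether the postcode actually exists, only whether the last chunk of the address has the right sort of shape.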
So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possibly! caveat emptor applies!) that might confirm, help develop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.