Way back when, when I was full of hope for social feed architectures constructed out of RSS and Atom content syndication feeds, I used to advocate the use of Yahoo Pipes as a means by which folk could start to develop their own content wrangling solutions. At one point, I even started dabbling with the idea of doing a simple recipe book, something I might even be able to make a bit of pin money from. But a completer-finisher I am not, so…
Anyway – as the summer break turns into email nightmare catch-up, and I dream of a life not this one, I started pondering the recipe book idea again. Flicking through the Pipes-related pages I’ve posted, and some ideas I never got round to adding in, I noticed that many of the recipes I’d sketched out over the years are now defunct because the open and accessible technologies they were built on have been closed off.
So for example, using Twitter search feeds (JSON, I think, though RSS used to be an alternative too?) for mapping tweets, or discovering colocation communities. Or a Twitter-to-audio pipe; or a pipe for serendipitously discovering content related to a Twitter stream; and so on…
These pipes are not even pining now – they’re dead; Twitter gave up on RSS/Atom, opting for JSON instead; and while this wasn’t a problem in itself – Pipes handles JSON just as well as XML-based syndication feeds – the addition of authentication as a precursor to accessing Twitter data kills off the easy, free-flowing access to the data that Yahoo Pipes made such good use of.
Authentication also killed off a whole range of Amazon-related mashups (remember mashups? I used to play with what used to be called mashups all the time;-): an Amazon Book Search Pipe, for example, or Looking Up Alternative Copies of a Book on Amazon, via ThingISBN; or even Amazon Reviews from Different Editions of the Same Book.
I also seem to remember making use of Amazon Listmania lists – for example, in support of the feed powered StringLE (String'n'Glue Learning Environment) riff on disaggregated MIT courseware using RSS feeds – although there again, I note that RIP Amazon Listmania.
Way back when, when the web was still opening up, services like Amazon – and then Twitter – helped me cut my teeth on wrangling with web tech and near-frictionless information flows. Those services grew up, closed themselves off (or at least, added more friction than I care to work around). And just as I gave up playing with Amazon – and ceased taking an interest in pondering the flow of book information from Amazon sources – when authentication hit, so too I’ve now given up on playing with Twitter data (and as a result, cut down on my Twitter usage too; I don’t really care for it as much as an information space any more).
Such is life, I guess. The web has moved on, and I have got stuck. So maybe I need to move on too? Offers…?
PS see also Google Lock-In Lock-Out
A quick round-up of some recent-ish posts that I’ve popped up on the School Of Data blog…
- Hunting for Data – Learning How to Read and Write Web Addresses, aka URLs – understanding a little bit about how the web is wired in terms of the web addresses (aka web locations, URLs etc etc) it uses can help you improve the power of your web searches, and also provides a gentle way in to the sort of thinking behind how databases are structured, and how we can query them.
- Asking Questions of Data – Some Simple One-Liners – a simple intro to querying a database, in this case, a Google Spreadsheet…
- When A Government Minister’s Data Laundry is Hung Out to Dry… – government ministers really shouldn’t try to pass off dodgy stats…
- Asking Questions of Data – Garment Factories Data Expedition – a quick intro to interrogating a simple database with a structured query language.
- Analysing UK Lobbying Data Using OpenRefine – I’ve done a couple of tutorials on OpenRefine on this blog recently, but here’s another, that shows how we can use OpenRefine to start to wrangle with text descriptions and turn them into meaningful and well structured data elements.
- Several Takes on Defining Data Journalism – this is an opening pitch, and something that needs iterating several times, I suspect, to fully shake it down. Comments appreciated…
- Get Started With Scraping – Extracting Simple Tables from PDF Documents – how to use Python to scrape data from a simple table spread over several pages of a relatively uncluttered PDF document. Even if the thought of hacking code to write your own scrapers leaves you cold, the post may give you a little insight into the sorts of puzzles that are involved in getting data out of document formats they have no right to be in (there’s a minimal sketch of the sort of thing involved below).
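By way of a taster, here’s a minimal sketch of the kind of recipe involved, here using the pdfplumber library rather than whatever tooling the post itself works through (the filename is a placeholder):

# Minimal sketch: pull a simple table from each page of a PDF.
# pdfplumber is my choice here, not necessarily the tool used in the post;
# "report.pdf" is a placeholder filename.
import pdfplumber

rows = []
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()  # list of rows, or None if no table found
        if table:
            rows.extend(table)

for row in rows:
    print(row)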
As to what’s coming up next…? I’m not sure… I feel a bit worded out at the moment!
PS For other recent writings elsewhere, see also: Recent Robotics Reviews on OpenLearn…
Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. Really? 10,400 new jobs?
In Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine I had a little poke around some of the data that was used to power a map on the Lib Dems’ website, A Million Jobs:
Liberal Democrats have helped businesses create over 1 million new private sector jobs. Click on the map below to find out what we’ve done where you live.
And then there was the map…
One assumption we might take away from this is that the markers correspond to locations or environs where jobs were created, and that by adding up the number of jobs created at those locations, we would get to a number over a million.
Whilst I was poking through the data that powers the map, I started to think this might be an unwarranted assumption. I also started to wonder about how the “a million jobs” figure was actually calculated?
Using a recipe described in the Critiquing Data Stories post, I pulled out marker descriptions containing the phrase “helping to reduce journey” along with the number of jobs created (?!) associated with those claims, where a number was specified.
Claims were along the lines of:
Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 Markeaton improvements, helping to reduce journey times and create 12,300 new jobs. The project will also help build 3,300 new homes.
Note that as well as claims about jobs, we can also pull out claims about homes.
If we use OpenRefine’s Custom Tabular Exporter to upload the data to a Google spreadsheet (here) we can use the Google Spreadsheet-as-a-database query tool (as described in Asking Questions of Data – Garment Factories Data Expedition) to sum the total number of jobs “created” by road improvements (from the OpenRefine treatment, I had observed the rows were all distinct – the count of each text facet was 1).
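For reference, the spreadsheet-as-a-database trick amounts to passing a SQL-like query to the gviz endpoint of a shared Google spreadsheet. A minimal sketch, assuming the sheet is link-readable and that column B holds the jobs numbers (the key is a placeholder):

# Sum a column of a link-readable Google spreadsheet via the gviz endpoint.
# KEY is a placeholder; "SELECT sum(B)" assumes column B holds the jobs numbers.
import requests

KEY = "YOUR_SPREADSHEET_KEY"
url = "https://docs.google.com/spreadsheets/d/" + KEY + "/gviz/tq"
resp = requests.get(url, params={"tqx": "out:csv", "tq": "SELECT sum(B)"})
print(resp.text)  # CSV containing the summed value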
The sum of jobs “created”? 468,184. A corresponding sum for the number of homes gives 203,976.
Looking at the refrain through the descriptions, we also notice that the claim is along the lines of: “Lib Dems in Government have allocated £X to fund [road improvement], helping to reduce journey times and create Y new jobs. The project will also help build Z new homes.” Has allocated. So it’s not been spent yet? [T]o create Y new jobs. So they haven’t been created yet? And if those jobs are the result of other schemes made possible by road improvements, might the numbers be double counted? [W]ill also help build. So the homes haven’t been built yet, but may well be being claimed as achievements elsewhere?
Note that the numbers I calculated are lower bounds, based on scheme descriptions that contained the specified search phrase (“helping to reduce journey”) and a jobs number specified according to the pattern detected by the following Jython regular expression:
tmp=re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*',r'\2',tmp)
In addition, the housing numbers were extracted only from rows where a number of jobs was identified by that regular expression, and where they were described in a way that could be extracted using the following Jython regular expression: re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)
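For convenience, here’s the same pair of extractions rolled into a self-contained Python fragment, using the Markeaton description from above as test data:

import re

desc = ("Lib Dems in Government have allocated £2,600,000 to fund the "
        "A38 Markeaton improvements, helping to reduce journey times and "
        "create 12,300 new jobs. The project will also help build 3,300 new homes.")

jobs = re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*', r'\2', desc)
homes = re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*', r'\1', desc)

# Strip the thousands separators before summing across rows
print(int(jobs.replace(',', '')), int(homes.replace(',', '')))  # 12300 3300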
PS I’m reading The Smartest Guys in the Room at the moment, learning about the double counting and accounting creativity employed by Enron, and how confusing publicly reported figures often went unchallenged…
It also makes me wonder about phrases like “up to” providing numbers that are then used when calculating totals?
So there’s another phrase to look for, maybe? have agreed a new ‘City Deal’ with …
From the glimpses I’ve seen of it over the last few days, the news appears to have been dominated with talk about a US government surveillance operation referred to as “Prism”. I don’t really have much idea what Prism is, or does, nor do I suspect do most of the folk who’ve been wittering on about it. It partly reminded me of Glimmerglass, but there again, I don’t really know what that tech does…; it also made me ponder the extent to which, if there are surveillance taps built in to various systems, they can be co-opted and subverted. As a code word, however, Prism sounds like it could be suitably sinister, although perhaps not quite at the level of “SPECTRE” or “Quantum”, so it’s a great opportunity for the press to play at spooks.
One thing I have noticed is that the reporting has also started referring to the notion of metadata. For example, the Guardian/Observer mention it thus (Boundless Informant: the NSA’s secret tool to track global surveillance data):
The focus of the internal NSA tool is on counting and categorizing the records of communications, known as metadata, rather than the content of an email or instant message.
In the case of email, this could include sender and recipient information, as well as the message timestamp, and maybe data about the size of the email, whether there were any attachments, and so on. For web transactions, the time you viewed a page and the address of that page would count as metadata about that transaction.
One thing I haven’t seen mention of is the signals intelligence (SIGINT) technique known as traffic analysis. In an article on The Origination and Evolution of Radio Traffic Analysis: World War II, a definition of “traffic analysis” from another report is presented as follows:
Traffic analysis comprises the study of enemy communications for the purpose of gathering information of military value without recourse to cryptanalysis of the text of intercepted messages. From such studies a certain amount of special intelligence of a tactical and strategical nature with regard to the enemy order of battle, direction of movements, massing of troops, probable intention, withdrawals, etc., can be derived. In addition … a large amount of technical intelligence valuable to the intercept and cryptanalytic functions of the Signal Security Service is obtained. In general, the technical information obtained from such studies, when applied to global intercept and cryptanalytic problems, must be derived from a global analysis of traffic. For the proper functioning of units collecting data upon which such studies will be based, their administrative control also must parallel the administrative direction of global intercept and cryptanalytic functions.
The local commander can obtain considerable benefit from the results of traffic analysis as regards special tactical and strategical intelligence derived therefrom, because such special intelligence is based primarily upon enemy communications in close proximity to his sphere of activity …
While it is not so far reaching in consequence as that which might be obtained from a successful cryptanalytic study of a high grade enemy cryptographic system, the results may sometimes be available instantaneously, and are subject only to proper interpretation on the part of the local staff and prompt coordination of the pertinent data by the central agency.
The focus of traffic analysis is, therefore, an analysis of the metadata associated with a set of communications, rather than an analysis of the actual content of those communications. Traffic analysis (and social network analysis) is one of the reasons why it can be useful, in intelligence terms, to collect metadata around communications.
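To make that concrete, here’s a minimal sketch of what metadata-only analysis looks like: given nothing but (sender, recipient) pairs lifted from message headers – the log below is invented – we can build a communication graph and pick out the best connected actors:

# Build a who-talks-to-whom graph from message metadata alone -
# no message content required. The (sender, recipient) pairs are invented.
import networkx as nx

messages = [
    ("alice", "bob"), ("alice", "bob"), ("bob", "carol"),
    ("carol", "alice"), ("dave", "carol"), ("dave", "bob"),
]

G = nx.DiGraph()
for sender, recipient in messages:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1  # repeat contact: bump the edge weight
    else:
        G.add_edge(sender, recipient, weight=1)

# Who sits at the centre of the traffic?
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]))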
For some worked examples around traffic analysis, see for example:
- Traffic Analysis of Anonymity Systems – includes a review of how folk might still be able to tell what web pages you’re visiting even if you use an anonymising proxy;
- Exploration of Communication Networks from the Enron Email Corpus
- Some introductory ideas about Inferring Social Network Structure using Mobile Phone Data and a more worked up example: Forensic Analysis of Phone Call Networks
And so on…
PS see also this glimpse of a social network as built by the NSA (also this review of it). How does all this play out in the context of mosaic theory, eg as framed in the sense of states acting against their own citizens and building up incriminating (and possibly hallucinated) pictures of them?
Over the last few weeks, I’ve started pondering what sort of data sets might be “almost available” on local council websites, along with the extent to which we might be able to use these datasets to support transparency goals, such as generating signals about the extent of cuts to local council services, or developing data driven local services, such as pub finders;-)
So for example, by chance I came across a page on my local council website detailing property the council is selling off:
Surplus to requirements, eh? I wonder how much property has gone up for sale or lease across other councils over the last year or so, and what sorts of services they used to house along with whether those services have been replaced with alternatives, in any meaningful sense?
As a starter for ten, here’s a search to try out on your favourite web search engine:
"property for sale" intitle:council site:gov.uk
This won’t search across all council websites, but it’ll have a stab at ones that are hosted on the .gov.uk domain. For more thoughts on searching council websites by proxy, see Aggregated Local Government Verticals Based on LocalGov Service IDs.
And here’s an example of the sort of local news story that might result… @thisissurrey Surrey County Council makes £68million by selling off land and public buildings
Another area of the IW council website that was new to me was the list of public license registers:
So for example, I can look up establishments with a more than a few gaming machines:
No lat/long data, but there are addresses and postcodes, which means weCanHaz maps easily enough…
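By way of example, here’s one way of turning a postcode into a plottable point, using the free postcodes.io lookup service (my choice of service, not anything the council offers; the postcode is just an example):

# Geocode a postcode via postcodes.io (a third-party lookup service).
import requests

postcode = "PO30 1UD"  # example postcode
resp = requests.get("https://api.postcodes.io/postcodes/" + postcode.replace(" ", ""))
result = resp.json()["result"]
print(result["latitude"], result["longitude"])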
public register licenses site:gov.uk intitle:council
Looking for reputable suppliers is something I often turn to the parish magazine for (it’s a proxy for trust…), as well as the local Chamber of Commerce members list. But it seems as if this is also something the trading standards aspect of the council may be able to help out on… in the island’s case, there’s a Buy With Confidence register, for example.
“Trader register” seems to be the phrase to go for?
trader register site:gov.uk intitle:council
For food establishments, whilst the IW council participates in the Food Standards Agency’s ‘Food Hygiene Rating Scheme’, it doesn’t seem to pull any of that data into an access point on the council website? (I think my scraper of the FSA site may have rotted too? Food Standards Agency scraper.)
As well as the statutory disclosure of major spend items, the council also publishes details of local contracts – again, if we’re looking to track evidence about cuts, a log of contracts that don’t get renewed might be interesting over an extended period?
Whilst the council webpages don’t make it easy for you to see all the extant contracts, another scraper can help….
For the holidaymakers, in part, the council produces a table of Beach Water Quality measures, though not on a map as far as I can tell (which reminds me of an old, old map hack mashup…!). I suspect some of the beaches may be designated public places (no booze…), but at the last time of looking I couldn’t find any data identifying the extent of such areas on the island, let alone any shapefiles of the same…? I’m not sure if there’s data around showing when and which parts of the beaches allow dogs on them, either?
"designated public places" site:gov.uk intitle:council
In terms of advertising local events, the council maintain a major events calendar, although I couldn’t spot an iCal feed so I can’t easily subscribe to it in my own calendar…
If you need to find somewhere to park, the council does publish lists of car parks – sort of:
Ooh – my mistake – they do a Google Map too [on which I also spy a KML link]…
As well as accommodation for holiday folk, the island has its fair share of care homes. Quality inspections, it seems, aren’t a council thing – data for that is handled by the Care Quality Commission.
The Isle of Wight Council doesn’t publish FOI disclosure logs as a matter of course, though some other councils do, along with responses:
(foi OR freedom information) +"disclosure log" site:gov.uk intitle:council
And why are FOI disclosure logs interesting? Well for one thing, they allow us to take the FOI Route to Real (Fake) Open Data.
Okay – that’s enough for now, methinks…
Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.
One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.
The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.
In graph terms, we might also represent this as a simpler graph within which edges between two articles indicate that they have been co-cited by documents within a particular corpus, with the weight of each edge representing the number of documents within that corpus that have co-cited them.
Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.
Again, in graph terms, we might think of a simpler undirected network in which edges between two articles act as an indicator that they have cited or referenced the same work, with the weight of the edge representing the number of documents that they have both cited.
A comparison of co-citation and bibliographic coupling networks shows one to be “retrospective” and the other to be “forward looking”. The edges in a bibliographic coupling network can be generated directly from a corpus of articles, and to this extent bibliographic coupling looks to the past. In a co-citation network, the edges that connect two articles can only be generated when a future published article cites them both.
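The two constructions also have a neat matrix formulation, which I find helps fix the distinction: if A is the citation matrix (A[i, j] = 1 where document i cites document j), co-citation counts fall out of AᵀA and bibliographic coupling counts out of AAᵀ. A quick numpy sketch with a toy corpus:

import numpy as np

# Citation matrix: A[i, j] = 1 if document i cites document j.
A = np.array([
    [0, 0, 1, 1],   # doc 0 cites docs 2 and 3
    [0, 0, 1, 1],   # doc 1 cites docs 2 and 3
    [0, 0, 0, 1],   # doc 2 cites doc 3
    [0, 0, 0, 0],   # doc 3 cites nothing
])

cocitation = A.T @ A   # [i, j]: how many documents cite both i and j
coupling = A @ A.T     # [i, j]: how many documents both i and j cite

print(cocitation[2, 3])  # docs 2 and 3 are co-cited by docs 0 and 1 -> 2
print(coupling[0, 1])    # docs 0 and 1 both reference docs 2 and 3 -> 2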
Co-citation, Bibliographic Coupling and Company Director Networks
For some time I’ve been tinkering with the notion of co-director networks, using OpenCorporates data as a data source (eg Mapping Corporate Networks With OpenCorporates). What I’ve tended to focus on are networks built up from active companies and their current directors, looking to see which companies are currently connected by virtue of currently sharing the same directors. On the to do list are timelines showing the companies that a particular director has been associated with, and when, as well as directorial appointments and terminations within a particular company.
In both co-citation and bibliographic coupling analyses, the nodes are the same type of thing (that is, works that are cited, such as articles): a work cites a work. (Note: does author co-citation analysis rely on mappings from works to cited authors, or citing authors to cited authors?) In company-director networks, we have a bipartite representation, with directors and companies representing the two types of node, and where edges connect companies and directors but not companies and companies, or directors and directors; unless a company is a director, but we generally fudge the labelling there.
If we treat “companies that retain directors” as “articles that cite other articles”:
- under a “co-citation” style view, we generate links between companies that share common directors;
- under a “bibliographic coupling” style view, we generate links between directors of the same companies.
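In networkx terms, both views are just the two one-mode projections of the bipartite company–director graph. A minimal sketch (the company and director names are invented):

import networkx as nx
from networkx.algorithms import bipartite

# Bipartite graph: companies on one side, directors on the other.
B = nx.Graph()
companies = ["Acme Ltd", "Bloggs plc"]
directors = ["J Smith", "A Jones", "P Patel"]
B.add_nodes_from(companies, bipartite=0)
B.add_nodes_from(directors, bipartite=1)
B.add_edges_from([("Acme Ltd", "J Smith"), ("Acme Ltd", "A Jones"),
                  ("Bloggs plc", "J Smith"), ("Bloggs plc", "P Patel")])

# "Co-citation" style: companies linked by shared directors.
company_net = bipartite.weighted_projected_graph(B, companies)
# "Bibliographic coupling" style: directors linked via common companies.
director_net = bipartite.weighted_projected_graph(B, directors)

print(company_net.edges(data=True))   # Acme Ltd - Bloggs plc, weight 1 (J Smith)
print(director_net.edges(data=True))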
I’ve been doing this anyway, but the bibliographic coupling/co-citation distinction may help me tighten it up a little, as well as improving ways of calculating and analysing these networks by reusing analyses described by the bibliometricians?
Pondering the “future vs. past” distinction, the following also comes to mind:
- at the moment, I am generating networks based on current directors of active companies;
- could we construct a dynamic (temporal?) hypergraph from hyperedges that connect all the directors associated with a particular company at a particular time? If so, what could we do with this graph?! (As an aside, it’s probably worth noting that I know absolutely nothing about hypergraphs!)
I’ve also started wondering about ‘director pathways’ in which we define directors as nodes (where all we require was that a person was a director of a company at some time) and directed “citation” edges. These edges would go from one director to other director nodes under the condition that the “citing” director was appointed to a particular company within a particular time period t1..t2 before the appointment to the same company of a “cited” director. If one director follows another director into more than one company, we increase the weight of the edge accordingly. (We could maybe also explore modes in which edge weights represent the amount of time that two directors are in the same company together.)
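A rough sketch of how those “follows” edges might be generated from (director, company, appointment date) records – the records and the window size below are invented – with edges oriented from the earlier appointed (“citing”) director to the later appointed (“cited”) one:

from collections import defaultdict
from datetime import date, timedelta
import networkx as nx

# Invented records: (director, company, appointment date)
appointments = [
    ("J Smith", "Acme Ltd", date(2010, 1, 10)),
    ("A Jones", "Acme Ltd", date(2010, 6, 1)),
    ("J Smith", "Bloggs plc", date(2011, 3, 5)),
    ("A Jones", "Bloggs plc", date(2011, 9, 20)),
]

WINDOW = timedelta(days=365)  # the t1..t2 window; chosen arbitrarily

by_company = defaultdict(list)
for director, company, appointed in appointments:
    by_company[company].append((appointed, director))

follows = defaultdict(int)
for company, appts in by_company.items():
    appts.sort()  # order the directors by appointment date
    for i, (t_earlier, earlier) in enumerate(appts):
        for t_later, later in appts[i + 1:]:
            if t_later - t_earlier <= WINDOW:
                follows[(earlier, later)] += 1  # 'later' followed 'earlier' into this company

G = nx.DiGraph()
for (earlier, later), weight in follows.items():
    G.add_edge(earlier, later, weight=weight)

print(G.edges(data=True))  # J Smith -> A Jones carries weight 2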
The aim is… probably pointless and not that interesting. Unless it is… The sort of questions this approach would allow us to ask would be along the lines of: are there groups of directors whose directorial appointments follow similar trajectories through companies; or are there groups of directors who appear to move from one company to another along with each other?
In October 2009, the New York Times Media Decoder blog picked up on a story that had been doing the rounds about a research project called Stats Monkey from the Intelligent Information Laboratory at Northwestern University. The Robots Are Coming!, it declared, with the immediate rejoinder, Oh, They’re Here. Using play-by-play baseball data, Stats Monkey produced human readable reports of a baseball game, formulaic admittedly, but good enough, particularly when complemented by quotes from a post-match press conference report. Mechanical churnalism complementing data-driven analysis, cast into prose. (It’s worth noting that the Media Decoder post itself is little more than a restatement of what was presumably the Stats Monkey website blurb at the time.)
In April 2010, Bloomberg Businessweek Magazine asked Are Sportswriters Really Necessary?, describing how Narrative Science, a company that incorporated at the start of that year, having spun out off the back of the Stats Monkey project, had teamed up with the Big Ten Network to produce automatically generated sports reports, a relationship that presumably continues to this day.
A year or so later, in June 2011, Forbes magazine produced a report about GameChanger and Narrative Science: Fulfilling the Heretofore Unrealized Demand for Stilted Stories About Children’s Games, describing a tie-up between Narrative Science and GameChanger, a company that produces a scorekeeping app that allows sports fans, parents and coaches to capture data about a match.
(What other companies/apps are out there for crowdsourcing sports analytics in this way, I wonder?)
Using GameChanger data and Narrative Science story generation tools, it was possible to automate the creation of match reports for very small audiences. I don’t know if these stories used to be freely accessible, but today the match reports appear to take the form of paywalled “recap” stories.
Paywall aside, examples of other stories generated by Narrative Science using GameChanger data can be found using a simple web search on the phrase “Powered by Narrative Science and GameChanger Media”.
You can also just search for the byline, as for example it appears in this report:
In passing, it’ll be interesting to see how automatically generated stories start to feed into the glitch aesthetic (h/t @danmcquillan for introducing me to this phrase and the related notion of the new aesthetic in his presentation at #opentech last week).
September 2011 saw a media outlook report from Mediabistro’s Media Jobs Daily noting that Narrative Science’s ‘Robot Journalists’ Now Tackling Real Estate. The story links through to a page on Builder Online that provides a summary report of housing data for various US cities.
What this example, and the GameChanger example, show is how the generation of timely text stories can be automated on top of regularly updated datasets. The use of natural language interpretive text to describe patterns observed in the underlying data presumably also has SEO benefits.
That same month, September 2011, saw another stats-to-insight company, again emerging from the automated interpretation of sports data, renaming itself from StatSheet to Automated Insights. Today, StatSheet continues to publish game recaps combining short natural language summaries with statistical charts, all of which are presumably automatically generated. Within a year, the parent company, Automated Insights, had scaled up and begun publishing recaps for Yahoo!’s fantasy sports matches.
More recently, Automated Insights have started producing realtime content feeds to support sports commentators – Real-time Insights for MLB – as well as feeding consumers via the stat.us powered Twitter feeds.
(See also: yseop, a French company that generates automated reports from data. [Any more?])
Fast forward to the start of 2013, and Narrative Science started publishing human readable prose reports based on US schools data (ProPublica: How To Edit 52,000 Stories at Once). They’re also doing a lot more work with financial reporting, for example with Forbes as well as for financial services clients, as this interview with Narrative Science’s Stuart Frankel describes.
Generating human readable reports from Google Analytics data and dashboards also appears to be a hot topic, with both Narrative Science (Automated Insight From Google Analytics With Quill) and Automated Insights (With Site Ai, Automated Insights Provides A Cliffs Notes Version Of Your Web Analytics) recently developing tools around this topic.
What I thought was particularly interesting about the ProPublica example was how it suggests a possible widespread future use of “automatically generated insight” pulling out headline interpretations from open data sets, as touched on in this great introductory technical presentation by Narrative Science’s Larry Adams (which also happens to mention the possibility of Narrative Science offering platform services via an API…? It also mentions work with the NHS?):
At one point during that presentation, Larry Adams suggests that Narrative Science use a small set of narrative templates or story types (“the horserace”, for example, or “top 10”) to frame the construction of their stories, as well as mentioning the sorts of feature that they look for within a data set (trends and changes in trends, for example, or outliers). Another presentation, this time by Narrative Science’s Kris Hammond, also hints at some of the features they look for in data: “inflexion points, trends, correlations”.
So what sorts of techniques might we use ourselves to start generating the insights that we might be able to work up into simple narrative sentences, at least for starters?
Top 10, bottom 5 are easy pickings if we can rank the data somehow. I thought this trick for detecting inflexions by coding a time series symbolically and then using a regular expression to detect features was really interesting: Finding patterns in time series using regular expressions. And I wonder, how does the OpenSecrets anomaly tracker define the anomalies it detects?
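The symbolic coding trick is easy enough to replicate: recode each step of the series as a symbol (u for a rise, d for a fall, say), join the symbols into a string, and then hunt for features with a regular expression. A toy sketch (flat steps are lumped in with falls for simplicity):

import re

series = [1, 2, 3, 5, 4, 3, 2, 4, 6, 7]

# Symbolic encoding: 'u' if the series rose at that step, 'd' otherwise.
symbols = "".join("u" if b > a else "d" for a, b in zip(series, series[1:]))
print(symbols)  # 'uuuddduuu'

# A peak is a run of rises followed by a fall.
for match in re.finditer(r"u+d", symbols):
    print("peak at index", match.end() - 1)  # index 3, i.e. the value 5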
PS seems like generating text summaries from data may be something the intelligence services are also interested in: The CIA Invests in Narrative Science and Its Automated Writers
Other posts you might be interested in:
- The Tesco Data Business – Notes on “Scoring Points”
- More Remarks on the Tesco Data Play
PPS I note that Narrative Science have picked up some more funding… Narrative Science raises $11.5M in equity funding
PPPS See also Data2Text, a start-up spun-out of a Natural language generation research group at the University of Aberdeen.
There are some interesting comments relating to my previous post on Google Lock-In Lock-Out in a comment thread on OSnews: Why Google gets so much credit. Here are some of my own lazy Sunday morning notes/thoughts relating to that, and other comments…
- killing Google Reader does not kill RSS/there was no “malicious intent” mapping out the Reader/RSS strategy:
A nice phrase in an #opentech talk yesterday was that we (technologists and engineers and data scientists, for example) have to “act responsibly”. Google Reader helped popularise feed reading when some of us were hopeful for its future (“We ignore RSS at OUr Peril”), and as such attracted many readers away from other clients (myself included), with the result that competition got harder (“compete against Google? Hmm… maybe not…”). Google Reader’s infrastructure and unofficial APIs enabled folk to build services off the back of it, turning Reader into de facto infrastructure for other people’s applications and services. (Remember: the Google Maps API was unofficial at first.) There aren’t many OPML bundlers out there, for example, but for hackers into appropriating tech, Google Reader is one. Since I moved away from Google Reader (to theoldreader) I haven’t used Flipboard so much, which as far as I was concerned was using Reader essentially as infrastructure. Caveat emptor, I guess, for developers building on top of other companies’ services (as many Twitter and Facebook app developers keep discovering).
With Feedburner, Google bought up a service that acted as a proxy, taking public syndication feeds, instrumenting them with analytics, and then encouraging the people taking up the syndicated content to subscribe to the Feedburner feed. Where RSS and Atom were designed to support syndication between independent parties, Feedburner – and then Google – insinuated itself between those parties. By replacing self-controlled feeds as the subscription endpoint with Google-controlled endpoints, publishers gave up control of their syndication infrastructure. With Google losing interest in open syndication feeds as it pursues its own closed content network agenda, we are faced with a situation whereby Google can potentially trash a widespread syndication infrastructure that would have remained resilient if Google hadn’t insinuated itself into it. Or if we hadn’t been so stupid as to simplistically accept its overtures.
Hmmm… thinks… do we need a Google users’ motto? Don’t be stupid perhaps…?!
I applaud Google for developing the services it does, getting them to scale and opening up API access. But as these services become de facto infrastructure, the question of how Google acknowledges any responsibility that flows from this (even if that responsibility is incorrectly assumed) becomes an issue. Responsibilities arise in other areas too, of course, such as taxation and corporate transparency. But that’s another issue. (Would Google act differently if its motto was “Be responsible” or “Act responsibly” rather than “Don’t be evil”? It strikes me that “Act responsibly” could work as a motto for both companies and their users?)
It seems to me that with Google+, Google is not adopting open syndication standards in two ways: not using them “internally”, and not making feeds publicly available. There may be good technical reasons for the first, but by the second Google is *not allowing* its community members to participate in an open content syndication network/system. Google’s choice, but I’m not playing.
Google is not killing the open standards by closing off access to them in commercial licensing terms, but it may help stifle their adoption by promoting alternative standards that others feel they have to adopt because of the influence Google has on web traffic.
Consider this other way of looking at it – Google is presumably trying to get other parties to adopt WebP by developing it as an open standard, assuming that it can drive adoption of WebP as a web standard by adopting it itself. In terms of argumentation, it doesn’t follow that by not adopting something Google can prevent it being adopted (i.e. that by not adopting, or by stopping its own use of, a standard, Google kills it generally), but people follow bad logic all the time (and if they follow Google for their technology choices, or have a technology model based on being parasitic on Google infrastructure, Google’s dropping of a standard effectively kills it for those people)…
- control of what we see
Google makes money by putting ad-links in front of eyeballs that people click on. By presenting “relevant” ads, Google presumably tries to maximise the click-thru rate so that it can make more money per displayed link.
To encourage you to spend your attention on pages that Google controls, Google has adopted the idea that by presenting you (and me; us) with “relevant” content, we are likely to remain engaged. With Google web search, the relevance of search results supposedly attracts us back to the Google search tool. With services such as Google Now, Google pre-emptively tries to present you with information it thinks you need, presumably based on predictive models of sequences of action that other people (or you yourself) have demonstrated in the past.
I’m not really up on behavioural psychology models, but I have a vague memory that intermittent reinforcement schedules were demonstrated to be one of the more effective modes of behaviourist training/operant conditioning. So I wonder: how effective are predictive intermittent positive reinforcement schedules? (You get the idea, right? We’re pigeons that peck at Android phones and Google is the experimenter trying to get us to peck the right way, by reinforcing us every now and again by satisfying our intent. That is, has there been a flip away from Google using us to provide reinforcement training signals to its algorithms into a situation in which we have become Google’s experimental lab rats, coupled in a series of ongoing experiments that train us and its algorithms, jointly, together, to maximise… something…)
There is a danger, I think, in Google chasing the “relevance” thing too far, seeing the maximisation of whatever conversion metrics it decides on as being a sign that it has “got things right” for us, that it is satisfying our “intent”. And if operant conditioning does influence the way we behave, maybe we do actually need to start thinking about what the machine algorithms are training us to do. Are training us to do. Training us.
- Through web search, it started to organise the information it presented to us through search results that were more appealingly ranked (seemed “more relevant”) than the other search engines’ did.
- Through personalised search, it started to organise the way it presented results to each of us individually.
- Through web tracking, it presents us with information – adverts – organised in a way it presumably thinks is more personally meaningful to us (but maximising what metric exactly? More likely to cause us to act in a particular way, as measured by whether we click the link, or linger on a page, or engage in a particular behaviour that can be captured – for model building and exploitation purposes – by web tracking algorithms?)
- Through Google Now, and the new Google image gallery tools, Google is seeking to organise our information (we’re part of the world, right?) on our behalf and present it back to us in a way that the Google algorithms decide.
The old photos in a drawer back at my family home are sorted howsoever (by whatever algorithm “use” and random access results in). Now they’ll be sorted by Google. Maybe the algorithms are similar. Or maybe they’re not. What would be evil, I think, would be if the ranking algorithms used to decide the order in which organic information is presented to us start to be influenced by the algorithms that are tied to advertising or marketing, that is, by algorithms that are used to try to maximise the extent to which we are influenced in accord with the goals, beliefs, desires and intents of others (with a hat tip there to agent logic and the theories of intelligent software agents).
At the moment I believe that Google believes it is trying to develop algorithms that benefit us personally, in a utilitarian way. But I’m not sure what function it is they are maximising or how they think it maps onto any personal theories or preferences we may have about what is “accessible” and “useful”. I guess we might also ask whether “accessible” and “useful” are the road to a Good Life (because in the end this comes down to philosophy and ethics, doesn’t it?) or whether we should be “organising the world’s information” with some other purpose in mind?
PS Just by the by, it’s worth noting that the educational arena is seeking to use learning analytics to instrumentalise our behaviour and engagement within learning systems and contexts for our, erm, learning benefit. (Measured how?)
As John Naughton feels obliged to remind folk every now and again, the web is not the internet. Because we all know that for many people, Facebook apparently is. Or Google is.
And as anyone following my tweets over the last year or two will know, I’ve started finding Google more and more irksome.
It’s not just that the one or two people I know who use Google Plus (Google+?) are now all but lost to me as sources of neat ideas because I don’t do Gooplus and it doesn’t do RSS…
It’s not just because Google is shutting down the Google Reader backbone that powers a lot of RSS and Atom syndication feed services (and leaves me wondering: how long is Feedburner for this world? Maybe it’s time to start moving your feeds and trying to get folk off that piece of infrastructure…)…
It’s not just that geocoding done within Fusion Tables is not exported – if you look at a KML feed from Google Fusion Tables, you’ll find there’s no lat-long data there. To get a geo-view, you need to stay within Google Fusion Tables or wire the feed into Google Earth, which will then “initiate geocoding of location descriptions while viewing [the] KML file”…
It’s not just that Google is deprecating gadgets from spreadsheets, which as Martin points out means that if I want to visualise data in a spreadsheet all I’m going to be left with is Google’s crappy charts…
It’s not just that Google moved away from using CalDav to support calendar interoperability… (announcement: “CalDAV API will become available for whitelisted developers, and will be shut down for other developers on September 16, 2013. Most developers’ use cases are handled well by Google Calendar API, which we recommend using instead.”) [UPDATE: seems Google may have had a rethink, though I’m thinking not really for the reason given… Making Google’s CalDAV and CardDAV APIs available for everyone]
It’s not just that Google is moving away from using the XMPP instant messaging protocol (nor, I think, is it making a move towards using MQTT?)…
It’s not just that Google will be using your photos to create photos you never took and presumably offer them up via your image gallery in favour of photos it thinks aren’t up to scratch…
Though I’m sure that Google wouldn’t start pushing images in just the WebP image format so that you’d feel obliged to use Chrome…
And also in the browser, I’m sure Google wouldn’t start using Google Public DNS as a Chrome default setting. (Is the same true of Chromebook? Presumably folk connected to Google Fiber use Google Public DNS?) But does it use SPDY as a default? How about on Android?
It’s not just that Google will tag your social media posts using tags you might never use yourself, and as it does so altering the externalised memory embodied by that post…
It’s not just that as web search gets increasingly personalised and localised, we lose any sense of Google ground truth; I’m not quite sure how the info-skills trainers are going to address this when training a motley crew of different learners to discover a particular resource other than by using known-item search strategies (which sort of misses the point). Or maybe it’s right that a cohort of students should all get different results when they run ostensibly the same search?
Hmmm… thinks: if personalised/localised search could be reduced to raw search phrase (whatever I put in the search box) plus a set of invisible search limits that reflect the personalisation/localisation tweaks applied to my search, how might my hidden/invisible search limits compare with yours?
It’s not just any one of these things, taken on its own merits… it’s all of them taken together…
“Embrace, extend, extinguish”… where have we heard that before?
Drip; drip; drip…
PS see also M. Wunsch on The Great Google Goat Rodeo
PPS Although not an open standard, I forgot this one – Google dropped support for the closed Microsoft ActiveSync protocol (see also Google Sync End of Life)
I wonder whether this will be the last round of elections without a national live data feed from somewhere pushing out the results in a standardised form? So far, here are the local “live election data” initiatives I’ve spotted/had pointed out to me:
The Lincolnite – Lincolnshire Local Elections 2013, described here: The Lincolnite to cover elections live with interactive map. (Ex-?) University of Lincoln developer Alex Bilbie (@alexbilbie), who built the Lincolnshire map app, describes a little of the process behind it here Developing an interactive county council election map (part one).
The Warwickshire team are also making shapefiles/KML files (for the plotting of boundary line maps) and live results data (via Google Fusion Tables) available, as well as data about previously elected candidates: 2013 Elections – In Real Time
Here’s the map after the fact… I like the clear statement of seats by party in the bottom left corner too…
Surrey has the Surrey elections dashboard (via @BenUnsworth) that will switch to a live map as the results come in, but currently offers a search box that accepts a postcode and then tells you who your candidates are and where you can vote:
This looks pretty, from Kent County Council:
I was looking forward to seeing how this view played out once the results started to come in, but, erm, oops?!
Managing the bursty load on an election results service server is probably something worth building into the planning…
Bristol opted for a tabbed display showing a range of views over their council election results. A simple coloured symbol map shows the distribution of seats by ward across the parties:
(Would it be useful to also be able to see this as percentage turnouts? Or to depict the proportional turnout on a map to see if any geographical reasons jump out as a source of possible differences?)
Cumbria County Council show how to make use of boundary files to generate choropleth election maps that relate the party affiliation of the candidate taking each particular seat to its electoral division area:
Cumbria also provided a view of seat allocations in districts; I don’t understand the scale they used on the x-axes though? It differs from district to district. For example, it looks to me as if more seats went to Conservatives in Eden than in Carlisle? Or is the scale related to the percentage of seats in the district? I’d class it as “infographic-standard”, i.e. meaningless as a visualisation;-)
Norfolk’s election map looks quite, erm, “child-friendly” (chunky?! kids TV?) to me?
Norfolk also produced a graphic showing how seats might be distributed in the chamber:
I think one of the major issues with this sort of graphic is how you communicate the possible structurings of the chamber based on what sort of affiliations and groupings play out?
Wales Online have a nice clean feel to their results map for Anglesey, but what’s going on with the legend? They don’t make it easy to get the branding into the screengrab either?!
The Telegraph produced a map showing results of the elections at national scale based on control of councils by party:
Hmmm… do I recognise that sort of layout? Ah, I know, it reminds me of this example of Data Journalists Engaging in Co-Innovation… around boundary changes.
For lists of current councillors, see OpenlyLocal, which has data available via a JSON API relating to current councillors and their affiliations. (It would be good if a frozen snapshot of this could be grabbed today, for comparison with the results following today’s election?)
Data relating to general elections can be found on the Electoral Commission website: General Elections. TheyWorkForYou provide an API over current MPs by constituency, and mySociety also produce the MapIt service for accessing constituency and electoral division boundary line data files.
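By way of example, here’s a minimal sketch of looking up the administrative areas covering a given postcode via MapIt (the postcode is a placeholder; mind the service’s usage terms):

# Look up the administrative areas covering a postcode via mySociety's MapIt.
import requests

postcode = "SW1A 1AA"  # placeholder postcode
resp = requests.get("https://mapit.mysociety.org/postcode/" + postcode.replace(" ", ""))
data = resp.json()

for area in data["areas"].values():
    print(area["type"], "-", area["name"])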
If you’re interested in doing something data related around the election, or would like to learn how to do something with the data generated by the election, check out this informal resource co-ordination document. If you’re interested in checking out your local council website to see whether they publish any #opendata that would help you generate your own live maps, dashboards or consoles, the School of Data post “wot I wrote” on Proving the Data – A Quick Guide to Mapping England and Wales Local Elections may provide you with a quick start guide to making use of some of it…
If you know of any other councils or local presses publishing election related data warez, maps, live data feeds, etc, please post a link and brief description in the comments, and I’ll try to keep this post up to date…