Google’s Appetite for Training Data

A placeholder post – I’ll try to remember to add to this as and when I see examples of Google explicitly soliciting training data from users that can be used to train its AI models…

Locations – “Popular Times”


For example: Google Tracking How Busy Places are by Looking at Location Histories [SEO by the Sea] which also refers to a patent describing the following geo-intelligence technique: latency analysis.

A latency analysis system determines a latency period, such as a wait time, at a user destination. To determine the latency period, the latency analysis system receives location history from multiple user devices. With the location histories, the latency analysis system identifies points-of-interest that users have visited and determines the amount of time the user devices were at a point-of-interest. For example, the latency analysis system determines when a user device entered and exited a point-of-interest. Based on the elapsed time between entry and exit, the latency analysis system determines how long the user device was inside the point-of-interest.
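The core of the patent's method – dwell time between entry and exit events, averaged over many devices – is simple enough to sketch (a toy illustration of the idea, not Google's implementation; all names and data are made up):

```python
from datetime import datetime

def dwell_minutes(entry, exit):
    """Elapsed time, in minutes, between a device entering and exiting a point-of-interest."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(exit, fmt) - datetime.strptime(entry, fmt)
    return delta.total_seconds() / 60

def typical_wait(visits):
    """Average dwell time over many (entry, exit) pairs observed at one point-of-interest."""
    times = [dwell_minutes(e, x) for e, x in visits]
    return sum(times) / len(times)

visits = [("2012-02-01 12:00", "2012-02-01 12:45"),
          ("2012-02-01 13:10", "2012-02-01 13:35")]
print(typical_wait(visits))  # 35.0
```

Scale that over every phone reporting location history and you get something like "Popular Times" more or less for free.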

Dangers of a Walled Garden…

Reading a recent Economist article (The value of friendship) about the announcement last week that Facebook is to float as a public company, and being amazed as ever about how these valuations, err, work, I recalled a couple of observations from a @currybet post about the Guardian Facebook app (“The Guardian’s Facebook app” – Martin Belam at news:rewired). The first related to using Facebook apps to (only partially successfully) capture the attention of folk on Facebook and get them to refocus it on the Guardian website:

We knew that 77% of visits to the Guardian from Facebook only lasted for one page. A good hypothesis for this was that leaving the confines of Facebook to visit another site was an interruption to a Facebook session, rather than a decision to go off and browse another site. We began to wonder what it would be like if you could visit the Guardian whilst still within Facebook, signed in, chatting and sharing with your friends. Within that environment could we show users a selection of other content that would appeal to them, and tempt them to stay with our content a little bit longer, even if they weren’t on our domain.

The second thing that came to mind related to the economic/business models around the Facebook app itself:

The Guardian Facebook app is a canvas app. That means the bulk of the page is served by us within an iFrame on the Facebook domain. All the revenue from advertising served in that area of the page is ours, and for launch we engaged a sponsor to take the full inventory across the app. Facebook earn the revenue from advertising placed around the edges of the page.

I’m not sure if Facebook runs CPM (cost per thousand) display-based ads, where advertisers pay per impression, or follows the Google AdWords model, where advertisers pay per click (PPC), but it got me wondering… A large number of folk on Facebook (and Twitter) share links to third party websites external to Facebook. As Martin Belam points out, the user return rate back to Facebook for folk visiting third party sites from Facebook seems very high – folk seem to follow a link from Facebook, consume that item, return to Facebook. Facebook makes an increasing chunk of its revenue from the ads it sells on its own site (though with the amount of furniture and Facebook open graph code it’s getting folk to include on their own websites, it presumably wouldn’t be so hard for them to roll out their own ad network to place ads on third party sites?) so keeping eyeballs on Facebook is presumably in their commercial interest.

In Twitter land, where the VC folk are presumably starting to wonder when the money tap will start to flow, I notice “sponsored tweets” are starting to appear in search results:

Another twitter search irrelevance

Relevance still appears to be quite low, possibly because they haven’t yet got enough ads to cover a wide range of keywords or prompts:

Dodgy twitter promoted tweet

(Personally, if the relevance score was low, I wouldn’t place the ad, or I’d serve an ad tuned to the user, rather than the content, per se…)

Again, with Twitter, a lot of sharing results in users being taken to external sites, from which they quickly return to the Twitter context. Keeping folk in the Twitter context for images and videos, through pop-up viewers or content embedded in the client, is also a strategy pursued by many Twitter clients.

So here’s the thought, though it’s probably a commercially suicidal one: at the moment, Facebook and Twitter and Google+ all automatically “linkify” URLs (though Google+ also takes the strategy of previewing the first few lines of a single linked to page within a Google+ post). That is, given a URL in a post, they turn it into a link. But what if they turned that linkifier off for a domain, unless a fee was paid to turn it back on. Or what if the linkifier was turned off if the number of clickthrus on links to a particular domain, or page within a domain, exceeded a particular threshold, and could only be turned on again at a metered, CPM rate. (Memories here of different models for getting folk to pay for bandwidth, because what we have here is access to bandwidth out of the immediate Facebook, Twitter or Google+ context).
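Just to make the thought experiment concrete, a metered linkifier might look something like this (entirely hypothetical – the function names, threshold and HTML output are all invented for illustration):

```python
import re

URL_RE = re.compile(r"(https?://([^/\s]+)\S*)")

def linkify(post, clickthrus, threshold=1000, paid=frozenset()):
    """Turn URLs in a post into links - unless the target domain has exceeded
    its free clickthru threshold and hasn't paid to be linkified again."""
    def repl(m):
        url, domain = m.group(1), m.group(2)
        if clickthrus.get(domain, 0) > threshold and domain not in paid:
            return url  # leave it as plain, unclickable text
        return '<a href="%s">%s</a>' % (url, url)
    return URL_RE.sub(repl, post)

post = "Read this: http://example.com/story"
print(linkify(post, {"example.com": 5000}))                        # left as plain text
print(linkify(post, {"example.com": 5000}, paid={"example.com"}))  # linkified again
```

The CPM variant would just swap the `paid` set for a metered billing check per clickthru.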

As a revenue model, the losses associated with irritating users would probably outweigh any revenue benefits, but as a thought experiment, it maybe suggests that we need to start paying more attention to how these large attention-consuming services are increasingly trying to cocoon us in their context (anyone remember AOL, or to a lesser extent Yahoo, or Microsoft?), rather than playing nicely with the rest of the web.

PS Hmmm…”app”. One default interpretation of this is “app on a phone”, but a “Facebook app” means an app that runs on the Facebook platform… So for any given app, calling it an “app” implies “software application that runs on a proprietary platform”, where the platform might actually be a combination of hardware and software platforms (e.g. the Facebook API and an Android phone)???

So Where Am I Socially Situated on Google+?

I haven’t really entered into the spirit of Google Plus yet – I haven’t created any circles or started populating them, for example, and I post rarely – but if you look at my public profile page you’ll see a list of folk who have added me to their circles…

This is always a risky thing of course – because my personal research ethic means that for anyone who pops their head above the horizon in my social space by linking publicly to one of my public profiles, their public data is fair game for an experiment… (I’m also aware that via authenticated access I may well be able to grab even more data – but again, my personal research ethic is such that I try to make sure I don’t use data that requires any form of authentication in order to acquire it.)

So, here’s a starter for 10: a quick social positioning map generated around who the folk who have added me to public circles on Google+ publicly follow… Note that for folk who follow more than 90 people, I’m selecting a random sample of 90 of their friends to plot the graph. The graph is further filtered to only show folk who are followed by 5 or more of the folk who have added me to their circles (bear in mind that this may miss people out because of the 90-sample-size hack).
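The sampling-and-thresholding recipe is easy enough to sketch in pure Python (the actual script uses networkx for the graph handling; the function and names here are invented for illustration):

```python
import random
from collections import Counter

def positioning_map(circlers, friends_of, sample=90, threshold=5):
    """For each person who has circled me, take (up to) a random sample of
    `sample` people they follow, then keep only the people followed by at
    least `threshold` of the circlers. Returns {circler: [shared friends]}."""
    sampled = {}
    for person in circlers:
        friends = friends_of(person)
        if len(friends) > sample:
            friends = random.sample(friends, sample)
        sampled[person] = friends
    # count how many circlers follow each candidate, then apply the threshold
    counts = Counter(f for friends in sampled.values() for f in friends)
    keep = {f for f, n in counts.items() if n >= threshold}
    return {p: sorted(set(friends) & keep) for p, friends in sampled.items()}

circlers = ["a", "b", "c", "d", "e"]
print(positioning_map(circlers, lambda p: ["hub", "rare_" + p]))
```

Everyone's idiosyncratic friends fall below the threshold; only the commonly-followed "hub" survives into the map – which is exactly why the clusters that remain are interpretable as groupings.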

Who folk who put me in a g+ circle follow

Through familiarity with many of the names, I spot what I’d loosely label as an OU grouping, a JISC grouping, an ed-techie grouping and a misc/other grouping…

Given one of the major rules of communication is ‘know your audience’, I keep wondering why so many folk who “do” the social media thing have apparently no interest in who they’re bleating at or what those folk might be interested in… I guess it’s a belief in “if I shout, folk will listen…”?

PS if you want to grab your own graph and see how you’re socially positioned on Google Plus, the code is here (that script is broken… I’ve started an alternative version here). It’s a Python script that requires the networkx library. (The d3 library is also included but not used – so feel free to delete that import…)

Fragment: Looking up Who’s in Whose Google+ Circles…

Just a quick observation about how to grab the lists of circles an individual is in on Google+, and who’s in their circles… From looking at the HTTP calls that are made when you click on the ‘View All’ links for who’s in a person’s circles, and whose circles they are in, on their Google Profile page, we see URLs of the form:

– in X’s circles:

– in whose circles?

You can find the GOOGLEPLUSUSERID by looking in the URL of a person’s Google+ profile page. I’m not sure if the &rt=j is required/what it does exactly?

Results are provided via some crappy, hacky, weird JSON-ish format, with individual records of the form:

,["Liam Green-Hughes",,,,"4af0a6e759a1b","EIRbEkFnHHwjFFNFCJwnCHxA", "BoZjAo3532KEBnJjHkFxCmRz",,"//",,1,
"Milton Keynes",,,"Developer and blogger",0,,[]
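Standard JSON parsers choke on those empty slots between commas, so one (fragile) way of coercing a record into something loadable is to pad the gaps with nulls first – a hypothetical sketch:

```python
import json
import re

def degooglify(raw):
    """Coerce Google's hacky not-quite-JSON into something json.loads will
    swallow, by padding the empty slots between commas with nulls.
    (Fragile - a quoted string containing ',,' would break it.)"""
    fixed = raw
    while re.search(r"[\[,],", fixed):
        fixed = re.sub(r"([\[,]),", r"\1null,", fixed)
    return json.loads(fixed)

record = '["Liam Green-Hughes",,,,"4af0a6e759a1b",,1,"Milton Keynes",,,"Developer and blogger",0,,[]]'
print(degooglify(record)[0])  # Liam Green-Hughes
```

The loop is needed because `re.sub` only handles non-overlapping matches, so runs of three or more commas take a couple of passes.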

A scratch API, of a sort, is available from

Note that these connections don’t seem to be available via the Google Social Graph API? (Try popping in the URL to your Google+ profile page and see who it thinks you’re linked to…)

Google Spreadsheets API: Listing Individual Spreadsheet Sheets in R

In Using Google Spreadsheets as a Database Source for R, I described a simple function for pulling data into R from a Google Visualization/Chart Tools API query language query applied to a Google spreadsheet, given the spreadsheet key and worksheet ID. But how do you get a list of the sheets in a spreadsheet, without opening up the spreadsheet and finding the sheet names or IDs directly? [Update: I’m not sure the query language API call lets you reference a sheet by name…]

The Google Spreadsheets API, that’s how… (see also the GData Samples – the documentation appears to be all over the place…)

To look up the sheets associated with a spreadsheet identified by its key value KEY, construct a URL of the form:

This should give you an XML output. To get the output as a JSON feed, append ?alt=json to the end of the URL.

Having constructed the URL for sheets listing for a spreadsheet with a given key identifier, we can pull in and parse either the XML version, or the JSON version, into R and identify all the different sheets contained within the spreadsheet document as a whole.

First, the JSON version. I use the RJSONIO library to handle the feed:

library(RJSONIO)
ssURL=paste( sep="", '', sskey, '/public/basic?alt=json' )
spreadsheet=fromJSON( ssURL )
sheets=c(); for (el in spreadsheet$feed$entry) sheets=c( sheets, el$title['$t'] )

Using a variant of the function described in the previous post, we can look up the data contained in a sheet by the sheet ID (I’m not sure you can look it up by name….?) – I’m not convinced that the row number is a reliable indicator of sheet ID, especially if you’ve deleted or reordered sheets. It may be that you do actually need to go to the spreadsheet to look up the sheet number for the gid, which actually defeats a large part of the purpose behind this hack?:-(

library(RCurl)  # provides curlEscape
gsqAPI = function( key, query, gid=0 ){ return( read.csv( paste( sep="", '', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
gsqAPI(sskey,"select * limit 10", 9)

getting a list of sheet names from a goog spreadsheet into R

The second approach is to pull on the XML version of the sheet data feed. (This PDF tutorial got me a certain way along the road: Extracting Data from XML, but then I got confused about what to do next (I still don’t have a good feel for identifying or wrangling with R data structures, though at least I now know how to use the class() function to find out what R thinks the type of any given item is;-) and had to call on the lazy web to work out how to do this in the end!)

library(XML)
library(stringr)  # provides str_sub
ssURL=paste( sep="", '', ssKey, '/public/basic' )
ssd=xmlTreeParse( ssURL, useInternal=TRUE )
nodes=getNodeSet( ssd, "//x:entry", "x" )
titles=sapply( nodes, function(x) xmlSApply( x, xmlValue ) )
data.frame( sheetName = titles['content',], sheetId = str_sub( titles['id',], -3, -1 ) )

data frame in r

In this example, we also pull out the sheet ID that is used by the Google spreadsheets API to access individual sheets, just in case. (Note that these IDs are not the same as the numeric gid values used in the chart API query language…)

PS Note: my version of R seemed to choke if I gave it https: headed URLs, but it was fine with http:

Google’s Universal User Channel

After a relaxing day spent finishing off Steven Levy’s In the Plex, here are a handful of very quick reflections:

– Google is an AI company: their scope for supervised training algorithms (with training signals coming from user actions and corrections), as well as unsupervised training (the algorithms for which are made available, for a fee, via the Google Prediction API), is unrivalled;

– Google is a hardware company: when I used to run robotics outreach activities, I used to joke that Lego was the world’s biggest manufacturer of tyres. Nowadays, I’m as likely to quip that at one point in its history at least, Google was the world’s biggest computer manufacturer (not strictly true – I think they’re third or fourth…?)

– Google has unprecedented tracking ability: through Google cookies, DoubleClick cookies, Google Analytics and the Chrome browser, they have the potential to track a ridiculous amount of web usage… (Are there any services that allow you to use e.g. your DoubleClick cookie to get a view over what data has been collected and stored against that cookie in Google’s databases? That is, can I turn my cookies back on the companies that set them and demand to see what data is associated with them, VRM style?)

– Google has a channel to a *huge* audience through AdSense text ads, DoubleClick display ads, and embedded YouTube videos. I’m largely blind to Google AdWords – I avoid right hand side ad-filled columns like the plague – but if Google started putting my upcoming calendar events or priority email headers into the AdSense display box alongside page contextual ads, I might start looking at them as I go to look at my personal messages… Imagine it: the AdSense block (with a publisher’s permission) including updates from members of your social circle, as well as ads… I guess if the presence of updates (“Personal network messages”, compared to “Sponsored Links”) in the AdSense block increased ad click-thrus, it would make commercial sense anyway..? If I was watching a YouTube video and an urgent/priority message hit my GMail inbox, I could get a lower third alert box appearing in the movie to tell me. And so on…

If you haven’t read it yet, I recommend it: Steven Levy’s In the Plex.

As for me, I’m going to chase this reference that I came across in the book…: David Gelernter’s Mirror Worlds: or The Day Software Puts the Universe in a Shoebox… How it Will Happen and What it Will Mean

Googling the Future – from the Present and the Past

An XKCD cartoon today described Googling the future using search terms such as “in <year>” and “by <year>”:

So I tried it:

Hmm – results from the future?

So I had a play in Google News… could this be a good way of searching forecasts?

By searching the past, we can search for old forecasts of the future…

I leave it as an exercise for the reader to search results from 2006, 2001, and 1991 for the 5, 10 and 20 years forecasts respectively for this year… let me know in the comments if anything interesting turns up;-)

See also: Google Impact…? The “Google Suggest” Factor

PS And this: Quantifying the Advantage of Looking Forward, which looks, for different countries, at the ratio of searches for year+1 and year-1 over the course of a year, then plots the resulting quotient against GDP. The results appear to suggest that there is a correlation between GDP and the forward looking tendency of the population. (But is this right? Do the search volumes get normalised (on Google Trends) by the volume of the first term at the start of the trend period? If the user numbers are growing over the course of the year, might we be skewing the future looking component because of loaded terms at the end of the year?)
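For what it’s worth, the index that study computes is just a ratio of search volumes – a sketch with made-up numbers:

```python
def future_orientation_index(volumes, this_year):
    """Ratio of search volume for next year's date vs last year's date,
    logged during this_year - higher means more forward-looking searching.
    (Volumes here are invented; the real study uses Google Trends data.)"""
    return volumes[str(this_year + 1)] / volumes[str(this_year - 1)]

# e.g. searches logged during 2011 for "2012" vs for "2010"
print(future_orientation_index({"2010": 40, "2012": 52}, 2011))  # 1.3
```

Which makes the normalisation question above concrete: if the denominator is measured early in the year and the numerator late, a growing user base inflates the index on its own.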

Getting Started With Google APIs – Round Up of Google Interactive API Explorers

The official Google Code blog announced a Google APIs Explorer earlier today that allows you to experiment with a variety of Google APIs – Google APIs Explorer.

Currently supported APIs include:

– Buzz API
– Custom Search API
– Diacritize API
– Moderator API
– Shopping API
– Translate API
– URLShortener API

though you have to assume that more will be added as and when…

The API Explorer provides a quick way of checking API calls, as well as generating RESTful calls to the corresponding API (simply grab the GET request URL that is generated).

If your favourite API isn’t supported by the explorer, fear not, because there are several other interactive explorers for Google APIs and services around…

First up, the Google Analytics API Explorer (aka the Google Analytics Data Feed Query Explorer).

Then there’s the Google AJAX APIs Code Playground that lets you explore all manner of Google APIs in an interactive fashion (including the Google visualisation API, maps and Google Earth APIs, search APIs etc etc):

If it’s charts you want to experiment with, try out the Chart Tools Live Chart Playground (aka the old Chart API):

Any more that I’ve missed?

Google I/O: Gulp…

It’s that Google I/O time of year again, when Google releases more developer headf*ck APIs and code goodness in one go than it does at other times of the year (fanboy? me? Nah….. heh heh;-)

So what do we have this year? I’m guessing Liam will, if he hasn’t already, cover the Google TV announcement (and how much ad revenue will that bring them, if it takes off?), so I’ll take a look at the bits no-one’s been saying much about but that will (IMVHO) make a difference.

First up, the Google Prediction API (and who saw that one coming…?! Sigh ;-) I’ve been playing with AI on and off for the last 15 years (if you want a quick techie way in, with code examples on tap, Programming Collective Intelligence: Building Smart Web 2.0 Applications is hard to beat) but the novelty of this for me is two-fold: first up, intelligence (in the form of supervised training) on tap, as a service/web app. The Google Prediction API will see a steady increase in the number of people considering how to make their applications more intelligent, or provide a playground, if nothing else, for people who want to be able to train over large data sets. However, because Google owns the training algorithm, you can’t necessarily tune it yourself… It’s worth bearing in mind that Google is a master of casting applications so that they can benefit from supervised training (see for example People Powered Supervised Training Algorithms: Google Does it Again?) so with their weight behind it, the Prediction API could be an early indicator of a path that will lead to the commoditisation of computational intelligence, via industrial-scale, paid-for services (“In the future, we plan to launch a paid version of this API”). In the same way that Amazon started selling web services off the back of infrastructure developed for its core business, Google is just following a similar course of action. Of course, the Prediction API may lead to nothing… although along the way it might encourage companies to start putting large amounts of their data into its Google Storage service (which in turn is opening up a front against Amazon S3, maybe?;-)
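To give a feel for the train/predict workflow the Prediction API exposes, here’s a toy stand-in – a bag-of-words, best-overlap classifier in pure Python. (Purely illustrative: Google’s actual training algorithms are exactly the black-box bit you can’t tune.)

```python
from collections import Counter

class TinyPredictor:
    """A toy stand-in for the Prediction API's supervised workflow:
    train on labelled text examples, then predict a label for new text
    by picking the training example with the biggest word overlap."""
    def __init__(self):
        self.examples = []

    def train(self, label, text):
        self.examples.append((label, Counter(text.lower().split())))

    def predict(self, text):
        words = Counter(text.lower().split())
        def overlap(bag):
            # size of the multiset intersection between bags of words
            return sum((bag & words).values())
        label, _ = max(self.examples, key=lambda ex: overlap(ex[1]))
        return label

p = TinyPredictor()
p.train("lang:en", "the cat sat on the mat")
p.train("lang:fr", "le chat est sur le tapis")
print(p.predict("where is the cat"))  # lang:en
```

The point is the API shape – upload labelled examples, call train, call predict – not the algorithm, which in Google’s case you never see.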

Secondly, the Google Latitude API. Some commentators are claiming that with the likes of Foursquare and Gowalla already fighting over location, Google is well behind the curve on this one. I’m not so sure… Several location related Google APIs are starting to require users to declare whether the data is coming from a sensor or not (e.g. Google Elevation API: Denoting Sensor Usage) so if you think past location as people-related and start to think about telemetry and instrumentation in a wider sense, the Latitude API could be a place where the geo sensor data goes (cf. Pachube. If Pachube is ahead of the curve, I wonder if Google will snap them up, maybe as a short term complement to Google PowerMeter?)

Also on the mapping front are Google Styled Maps, which let you customise the appearance of Google Maps. CloudMade has been offering this sort of service for some time over OpenStreetMap content, so I hope that, rather than sounding a death knell for CloudMade, Styled Maps actually leads to a wider uptake of CloudMade. (I wonder too if anyone will fire up an extension to Mapstraction, the library that abstracts over a wide variety of map APIs, to cover styling?) If the opportunity arose, would the CloudMade folks make the jump to Google, I wonder?

Third up are the creepy bits… AdSense for AJAX/Search Ads and Gmail contextual gadgets, now available to all developers. AdSense for AJAX lets you pull contextual AdSense ads into your own page, even if the page is filled with dynamic content, by providing you with “the ability to supply hints to help ensure that ads with high relevancy are shown to your users.” AdSense for Search Ads seems to let you pair a Google custom search results panel and an AdSense panel so that you can pull back relevant AdSense ads into a page based on corresponding search results. Way up on my to do list is looking at whether we can use adservers to serve contextual content, so here’s another possible route for trying that out… On the GMail front, if you want to push contextually relevant content to people based on the contents of their email folder, (and why wouldn’t you?!;-)

A Gmail contextual gadget is a gadget that is triggered by clues in Gmail, such as the contents of Subject lines and email messages. A Gmail contextual gadget can be triggered by content in any of the following parts of an email: From, To, CC, Subject, Body, Sent/received timestamps

The gadgets can be deployed either within an Enterprise environment (does this include Google Apps for Edu, I wonder?) or via the marketplace. On seeing these gadgets, my first thought was some sort of phishing-like expedition. Could I send emails to people from a particular email address, and then pull in additional related content via the gadget, or somehow build up a profile of someone via the content of their mailbox in a two pronged attack that identifies them through tracer emails and reconciles this with the results of the content analysis?

I have to admit the list of filter elements shocked me a little:

GMail gadget contextual filters

I can see this being hugely powerful if you think of your email as public goods, at least within the context of a public that exists within an enterprise, but more generally…? Err…? Let’s say I might have concerns… Or have I completely misunderstood how this all works? (See also: Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There. Suppose: I send you email, and you’re running a gadget…)
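Mechanically, the triggering presumably amounts to running each gadget’s declared pattern over the email fields it registers an interest in – something like this hypothetical sketch (the field names and filter structure are invented):

```python
import re

def matching_triggers(email, filters):
    """Run each gadget's trigger regex over the email fields it declares an
    interest in; return the gadgets that fire. `filters` maps a gadget name
    to (list of fields, regex pattern)."""
    fired = []
    for gadget, (fields, pattern) in filters.items():
        blob = " ".join(email.get(f, "") for f in fields)
        if re.search(pattern, blob, re.IGNORECASE):
            fired.append(gadget)
    return fired

email = {"From": "tracker@example.com", "Subject": "Your parcel",
         "Body": "Tracking no. AB123456"}
print(matching_triggers(email, {
    "parcel-tracker": (["Subject", "Body"], r"tracking no\.\s*\w+"),
    "meeting-finder": (["Body"], r"\bmeeting\b"),
}))  # ['parcel-tracker']
```

Which is also why the filter list is worth staring at: whoever writes the patterns gets to decide what in your mailbox counts as a trigger.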

Anyway – thought for the day: what would a GMail learning environment look like?, where email messages sent to the user contained the course content, one (daily) chunk at a time, and the contextual widget pulled in additional materials based on the lesson/email the student was reading at the time? (NB I think it would be reasonable to assume that Google docs contextual gadgets might be a possibility some time soon?)

Finally, the big news… :-) Google Feed API supports PUSH and Google Apps Script gains external triggers. First up – PUSH. You may have noticed that on certain Google search results pages, you get a small area with realtime web results that get pushed to the page almost immediately after they are created. What this means is that you can be pushed updates from a feed as soon as Google spots new content on that feed. What the Feed API now supports is this realtime PUSH updating. Complementing this, we have external triggers on Google Apps Script. I haven’t found any documentation on this yet, but the promise is that you can trigger Google Apps scripts (for example, scripts associated with a spreadsheet?) from a third party site. The Apps Script documentation site is also announcing Installable Event Handlers which currently “support clock events which allow us to trigger a script based on the time”. For the serverless web developer, being able to run what are essentially cron jobs has always been a problem. But now it seems as if I should be able to run a script according to a particular schedule from within the Google Apps Script environment. (I could probably do this in the Google App Engine environment, but I don’t see that as ever being a mass-user environment – it’s too programmer-techie for mortals…) What does this mean? Well, it means I can tell my spreadsheet to go and grab some fresh data from some location at scheduled times of the day. Remember what I was saying about Pachube….?!

Okay, so that’s my take on this year’s Google I/O… a quirky perspective, maybe, but one that could have more consequences in terms of the way things are done, could be done and might be done (particularly in a realtime/live web environment) than a music store or leanback TV search app… As for the Android announcements… I still don’t have a good feeling for the mobile ‘verse…

PS pound to a penny any OU newsletters on this only pick up on the TV bit;-)

PPS Oh yeah, forgot this one… WebM (i.e. video’s gonna change…;-) Gulp….

Hidden Talents of the Google Streetview Car…

Whilst playing with some Google maps last night, I noticed a new control:

Click it, and the browser throws up a request:

For those of you who haven’t seen this sort of thing before, the latest browsers come complete with location aware browsing. In the case of my browser, “Firefox gathers information about nearby wireless access points and your computer’s IP address. Then Firefox sends this information to the default geolocation service provider, Google Location Services, to get an estimate of your location.”

If you’re using a mobile phone, additional cues are available, such as a GPS fix if your phone is GPS enabled, and cell tower triangulation, where the phone’s location can be detected not only from the current cell the phone is registered with, but also from the signal strength of surrounding cells.

If you accept the location finding, the new Google map control turns out to be a blue dot control…

You can revoke the location aware privilege by going to the site you granted access to, selecting “Page Info” from the Firefox tools menu, and then tweaking the Location Awareness setting:

Adding location awareness to a web page is trivial (e.g. Where are you? Find out with geolocation in Javascript) and is something I suspect that Facebook will soon have a privacy setting for…;-)

Anyway, in order for wifi network detection to be usable, a service is required that can map a network identifier onto a location. Skyhook Wireless is one provider of this service (I don’t think Google has acquired it – yet…), but Google also appears to be building its own…

There are several ways for Google to do this, of course…. If you have an Android phone, then it’s in principle possible for the phone to reconcile GPS data with cell tower and wifi network identifiers and signal strengths. And the Google Streetview car? Well it appears that it doesn’t just collect imagery… On Google Street View Car Logging Wifi Networks: “Google’s roving Street View spycam may blur your face, but it’s got your number. The Street View service is under fire in Germany for scanning private WLAN networks, and recording users’ unique Mac (Media Access Control) addresses, as the car trundles along.” In the past, of course, there have also been privacy concerns about Google Street View capturing faces and car number plates. (See also: Large-scale Privacy Protection in Google Street View [PDF]).
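Once you have a database mapping wifi MAC addresses to known locations, a crude positioning service is just a signal-strength-weighted average over the access points a device can currently see – a hypothetical sketch with made-up data:

```python
def estimate_position(sightings, mac_locations):
    """Estimate (lat, lon) as the signal-strength-weighted average of the
    known locations of visible wifi access points. `sightings` is a list of
    (mac, strength) pairs; `mac_locations` maps mac -> (lat, lon).
    (Illustrative only - real services do rather more than this.)"""
    weighted_lat = weighted_lon = total = 0.0
    for mac, strength in sightings:
        if mac not in mac_locations:
            continue  # AP not in the database, so it contributes nothing
        lat, lon = mac_locations[mac]
        weighted_lat += strength * lat
        weighted_lon += strength * lon
        total += strength
    if total == 0:
        return None
    return (weighted_lat / total, weighted_lon / total)

db = {"00:11:22:33:44:55": (52.02, -0.71), "66:77:88:99:aa:bb": (52.04, -0.73)}
print(estimate_position([("00:11:22:33:44:55", 3), ("66:77:88:99:aa:bb", 1)], db))
```

The same weighted-average trick works for cell towers, which is why a drive-by scan of MAC addresses is such useful "assist" data.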

Ever one to take an idea and run too far with it, I had a little think around what other sorts of “assist” information Google might be able to capture from Street View. So for example, in December last year (2009) it was announced that Google takes another stab at QR codes. Will it work this time?: “Google announced a broad plan to introduce QR code stickers in the windows of over 100,000 local businesses nationwide.” Hmm…so that means if Street View captures the QR code, it can then reconcile that location with your business…

(Street View captured QR-codes also provides a launchpad for augmented reality ads in Google Maps and Google Earth, e.g. by using the QR-code as the augmented reality registration image. See for example Real-Time Ads Coming to Google Street View?.)

Something else that was announced this week – Google Cloud Print, in which printers become accessible, and fax machines can be laid to rest…

Our goal is to build a printing experience that enables any app (web, desktop, or mobile) on any device to print to any printer anywhere in the world.

The Goog will quickly work out where in the world those printers are, of course… (I can’t wait to see a “Printers near me” option appearing in context menus… Err…;-)

(Just in passing, this also caught my eye this week: Digital Photocopiers Loaded With Secrets. In short, digital photocopiers are scanners, with hard drives. So assuming that you know all those stories about sensitive information leaking from organisations via hard drives on scrapped PCs, well, err..? What happened to your last workplace photocopier?)

Okay, enough loose threads there for you to weave into your own nightmare scenario… @andysc suggested this was all getting a bit like Halting State, so I’m going to track that book (which is new to me) down right now…

See also: So What Do You Think You’re Doing, Sonny?