
Google’s Appetite for Training Data

A placeholder post – I’ll try to remember to add to this as and when I see examples of Google explicitly soliciting training data from users that can be used to train its AI models…

Locations – “Popular Times”

Screenshots: Google search results for “tesco extra ryde isle of wight”, showing “Popular Times” information.

For example: Google Tracking How Busy Places are by Looking at Location Histories [SEO by the Sea], which also refers to a patent describing the following geo-intelligence technique, latency analysis:

A latency analysis system determines a latency period, such as a wait time, at a user destination. To determine the latency period, the latency analysis system receives location history from multiple user devices. With the location histories, the latency analysis system identifies points-of-interest that users have visited and determines the amount of time the user devices were at a point-of-interest. For example, the latency analysis system determines when a user device entered and exited a point-of-interest. Based on the elapsed time between entry and exit, the latency analysis system determines how long the user device was inside the point-of-interest.
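A minimal sketch of the sort of dwell-time calculation that describes – the visit records here are entirely made up, and in practice they would be derived from many devices’ location histories:

from collections import defaultdict
from datetime import datetime

# Hypothetical visit records derived from user location histories:
# (device, point_of_interest, entry_time, exit_time)
visits = [
    ("device1", "Tesco Extra, Ryde", datetime(2016, 1, 9, 10, 5), datetime(2016, 1, 9, 10, 35)),
    ("device2", "Tesco Extra, Ryde", datetime(2016, 1, 9, 10, 10), datetime(2016, 1, 9, 10, 50)),
]

# The elapsed time between entry and exit gives the dwell ("latency") period for each visit;
# aggregating over devices gives an estimate of typical wait/visit times at that place.
dwell_times = defaultdict(list)
for device, poi, entered, exited in visits:
    dwell_times[poi].append((exited - entered).total_seconds() / 60)

for poi, minutes in dwell_times.items():
    print(poi, "- average dwell time (mins):", sum(minutes) / len(minutes))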

Dangers of a Walled Garden…

Reading a recent Economist article (The value of friendship) about the announcement last week that Facebook is to float as a public company, and being amazed as ever about how these valuations, err, work, I recalled a couple of observations from a @currybet post about the Guardian Facebook app (“The Guardian’s Facebook app” – Martin Belam at news:rewired). The first related to using Facebook apps to (only partially successfully) capture the attention of folk on Facebook and get them to refocus it on the Guardian website:

We knew that 77% of visits to the Guardian from facebook.com only lasted for one page. A good hypothesis for this was that leaving the confines of Facebook to visit another site was an interruption to a Facebook session, rather than a decision to go off and browse another site. We began to wonder what it would be like if you could visit the Guardian whilst still within Facebook, signed in, chatting and sharing with your friends. Within that environment could we show users a selection of other content that would appeal to them, and tempt them to stay with our content a little bit longer, even if they weren’t on our domain.

The second thing that came to mind related to the economic/business models around the Facebook app itself:

The Guardian Facebook app is a canvas app. That means the bulk of the page is served by us within an iFrame on the Facebook domain. All the revenue from advertising served in that area of the page is ours, and for launch we engaged a sponsor to take the full inventory across the app. Facebook earn the revenue from advertising placed around the edges of the page.

I’m not sure if Facebook runs CPM (cost per thousand) display-based ads, where advertisers pay per impression, or follows the Google AdWords model, where advertisers pay per click (PPC), but it got me wondering… A large number of folk on Facebook (and Twitter) share links to third-party websites external to Facebook. As Martin Belam points out, the user return rate back to Facebook for folk visiting third-party sites from Facebook seems very high – folk seem to follow a link from Facebook, consume that item, then return to Facebook. Facebook makes an increasing chunk of its revenue from ads it sells on Facebook.com (though with the amount of furniture and Facebook open graph code it’s getting folk to include on their own websites, it presumably wouldn’t be so hard for them to roll out their own ad network to place ads on third-party sites?), so keeping eyeballs on Facebook is presumably in their commercial interest.

In Twitter land, where the VC folk are presumably starting to wonder when the money tap will start to flow, I notice “sponsored tweets” are starting to appear in search results:

Another Twitter search irrelevance

Relevance still appears to be quite low, possibly because they haven’t yet got enough ads to cover a wide range of keywords or prompts:

Dodgy twitter promoted tweet

(Personally, if the relevance score was low, I wouldn’t place the ad, or I’d serve an ad tuned to the user, rather than the content, per se…)

Again, with Twitter, a lot of sharing results in users being taken to external sites, from which they quickly return to the Twitter context. Keeping folk in the Twitter context for images and videos through pop-up viewers or embedded content in the client is also a strategy pursued in many Twitter clients.

So here’s the thought, though it’s probably a commercially suicidal one: at the moment, Facebook and Twitter and Google+ all automatically “linkify” URLs (though Google+ also takes the strategy of previewing the first few lines of a single linked-to page within a Google+ post). That is, given a URL in a post, they turn it into a link. But what if they turned that linkifier off for a domain, unless a fee was paid to turn it back on? Or what if the linkifier was turned off once the number of clickthrus on links to a particular domain, or page within a domain, exceeded a particular threshold, and could only be turned on again at a metered, CPM rate? (Memories here of different models for getting folk to pay for bandwidth, because what we have here is access to bandwidth out of the immediate Facebook, Twitter or Google+ context.)
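Just to make the thought experiment concrete, here’s a toy sketch of what a metered linkifier might look like – the free clickthru threshold, the paid domains set and the clickthru counts are all made-up placeholders:

import re

# Toy sketch of the metered "linkifier" idea: only turn a URL into a clickable
# link if its domain has paid up, or hasn't yet used up its free clickthru quota.
URL_RE = re.compile(r'(https?://([^/\s]+)\S*)')
FREE_CLICKTHRU_THRESHOLD = 10000  # made-up figure

def linkify(post_text, paid_domains, clickthrus):
    def maybe_link(match):
        url, domain = match.group(1), match.group(2)
        if domain in paid_domains or clickthrus.get(domain, 0) < FREE_CLICKTHRU_THRESHOLD:
            return '<a href="%s">%s</a>' % (url, url)
        return url  # leave the URL as plain, unclickable text
    return URL_RE.sub(maybe_link, post_text)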

As a revenue model, the losses associated with irritating users would probably outweigh any revenue benefits, but as a thought experiment, it maybe suggests that we need to start paying more attention to how these large attention-consuming services are increasingly trying to cocoon us in their context (anyone remember AOL, or to a lesser extent Yahoo, or Microsoft?), rather than playing nicely with the rest of the web.

PS Hmmm…“app”. One default interpretation of this is “app on phone”, but “Facebook app” means an app that runs on the Facebook platform… So for any given app, that it is an “app” implies that the particular variant means “software application that runs on a proprietary platform”, which might actually be a combination of hardware and software platforms (e.g. Facebook API and Android phone)???

So Where Am I Socially Situated on Google+?

I haven’t really entered into the spirit of Google Plus yet – I haven’t created any circles or started populating them, for example, and I post rarely – but if you look at my public profile page you’ll see a list of folk who have added me to their circles…

This is always a risky thing of course – because my personal research ethic means that for anyone who pops their head above the horizon in my social space by linking publicly to one of my public profiles, their public data is fair game for an experiment… (I’m also aware that via authenticated access I may well be able to grab even more data – but again, my personal research ethic is such that I try to make sure I don’t use data that requires any form of authentication in order to acquire it.)

So, here’s a starter for 10: a quick social positioning map generated around who the folk who have added me to public circles on Google+ publicly follow… Note that for folk who follow more than 90 people, I’m selecting a random sample of 90 of their friends to plot the graph. The graph is further filtered to only show folk who are followed by 5 or more of the folk who have added me to their circles (bear in mind that this may miss people out because of the 90 sample size hack).
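For what it’s worth, the logic behind the map amounts to something like the following rough sketch, where get_public_followees() is a hypothetical placeholder for the scraping that the Python script linked in the PS below actually does:

import random
import networkx as nx

# Rough sketch of the sampling/filtering described above.
def positioning_graph(circlers, get_public_followees, sample_size=90, min_weight=5):
    g = nx.DiGraph()
    for person in circlers:
        followees = get_public_followees(person)
        if len(followees) > sample_size:
            followees = random.sample(followees, sample_size)  # cap the pull per person
        for friend in followees:
            g.add_edge(person, friend)
    # Only keep folk followed by at least min_weight of the people who have circled me
    keep = [n for n in g.nodes() if g.in_degree(n) >= min_weight or n in circlers]
    return g.subgraph(keep)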

Who folk who put me in a g+ circle follow

Through familiarity with many of the names, I spot what I’d loosely label as an OU grouping, a JISC grouping, an ed-techie grouping and a misc/other grouping…

Given one of the major rules of communication is ‘know your audience’, I keep wondering why so many folk who “do” the social media thing have apparently no interest in who they’re bleating at or what those folk might be interested in… I guess it’s a belief in “if I shout, folk will listen…”?

PS if you want to grab your own graph and see how you’re socially positioned on Google Plus, the code is here (that script is broken… I’ve started an alternative version here). It’s a Python script that requires the networkx library. (The d3 library is also included but not used – so feel free to delete that import…)

Fragment: Looking up Who’s in Whose Google+ Circles…

Just a quick observation about how to grab the lists of circles an individual is in on Google+, and who’s in their circles… From looking at the HTTP calls that are made when you click on the ‘View All’ links for who’s in a person’s circles, and whose circles they are in, on their Google Profile page, we see URLs of the form:

– in X’s circles:
https://plus.google.com/u/0/_/socialgraph/lookup/visible/?o=%5Bnull%2Cnull%2C%22GOOGLEPLUSUSERID%22%5D&rt=j

– in whose circles?
https://plus.google.com/u/0/_/socialgraph/lookup/incoming/?o=%5Bnull%2Cnull%2C%22GOOGLEPLUSUSERID%22%5D&n=1000&rt=j

You can find the GOOGLEPLUSUSERID by looking in the URL of a person’s Google+ profile page. I’m not sure if the &rt=j is required/what it does exactly?

Results are provided via JSON – or rather, via some crappy, hacky, weird JSON-ish format – with individual records of the form:

[[,,"101994985523872719331"]
,[]
,["Liam Green-Hughes",,,,"4af0a6e759a1b","EIRbEkFnHHwjFFNFCJwnCHxA", "BoZjAo3532KEBnJjHkFxCmRz",,"//lh5.googleusercontent.com/-z4ZRcOrNx7I/AAAAAAAAAAI/AAAAAAAAAAA/cQBO1TSuucI/photo.jpg",,1,
"Milton Keynes",,,"Developer and blogger",0,,[]
]
,[]
]

A scratch API, of a sort, is available from http://html5example.net/entry/tutorial/simple-python-google-plus-api
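If you just want a quick and dirty way of pulling out the IDs of the folk returned by the “in whose circles?” lookup, something like the following crude sketch sidesteps the format altogether – it just regexes out anything that looks like a 21-digit Google+ user ID (an assumption based on the sample record above) rather than parsing the response properly:

import re
from urllib.request import urlopen

# Crude sketch: fetch the undocumented "in whose circles?" lookup for a user and
# scrape out what look like Google+ user IDs (21-digit strings - an assumption),
# rather than trying to parse the hacky not-quite-JSON response properly.
def incoming_circlers(user_id):
    url = ('https://plus.google.com/u/0/_/socialgraph/lookup/incoming/'
           '?o=%5Bnull%2Cnull%2C%22' + user_id + '%22%5D&n=1000&rt=j')
    raw = urlopen(url).read().decode('utf-8', 'ignore')
    return set(re.findall(r'"(\d{21})"', raw))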

Note that these connections don’t seem to be available via the Google Social Graph API? (Try popping in the URL to your Google+ profile page and see who it thinks you’re linked to…)

Google Spreadsheets API: Listing Individual Spreadsheet Sheets in R

In Using Google Spreadsheets as a Database Source for R, I described a simple R function for pulling data into R from a Google Visualization/Chart tools API query language query applied to a Google spreadsheet, given the spreadsheet key and worksheet ID. But how do you get a list of the sheets in a spreadsheet, without opening up the spreadsheet and finding the sheet names or IDs directly? [Update: I’m not sure the query language API call lets you reference a sheet by name…]

The Google Spreadsheets API, that’s how… (See also the GData Samples; the documentation appears to be all over the place…)

To look up the sheets associated with a spreadsheet identified by its key value KEY, construct a URL of the form:

http://spreadsheets.google.com/feeds/worksheets/KEY/public/basic

This should give you an XML output. To get the output as a JSON feed, append ?alt=json to the end of the URL.

Having constructed the sheets-listing URL for a spreadsheet with a given key identifier, we can pull either the XML version or the JSON version into R, parse it, and identify all the different sheets contained within the spreadsheet document as a whole.

First, the JSON version. I use the RJSONIO library to handle the feed:

library(RJSONIO)
# The spreadsheet key is taken from the spreadsheet's URL
sskey='0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc'
ssURL=paste( sep="", 'http://spreadsheets.google.com/feeds/worksheets/', sskey, '/public/basic?alt=json' )
spreadsheet=fromJSON(ssURL)
# Each entry in the feed corresponds to one worksheet; collect the sheet titles
sheets=c()
for (el in spreadsheet$feed$entry) sheets=c(sheets,el$title['$t'])
as.data.frame(sheets)

Using a variant of the function described in the previous post, we can look up the data contained in a sheet by the sheet ID (I’m not sure you can look it up by name….?) – I’m not convinced that the row number is a reliable indicator of sheet ID, especially if you’ve deleted or reordered sheets. It may be that you do actually need to go to the spreadsheet to look up the sheet number for the gid, which actually defeats a large part of the purpose behind this hack?:-(

library(RCurl)
# Run a chart tools query language query against a public spreadsheet sheet (identified by gid), returning the CSV result as a data frame
gsqAPI = function( key, query,gid=0){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
gsqAPI(sskey,"select * limit 10", 9)

getting a list of sheet names from a goog spreadsheet into R

The second approach is to pull in the XML version of the sheet data feed. (This PDF tutorial got me a certain way along the road: Extracting Data from XML, but then I got confused about what to do next (I still don’t have a good feel for identifying or wrangling with R data structures, though at least I now know how to use the class() function to find out what R thinks the type of any given item is;-) and had to call on the lazy web to work out how to do this in the end!)

library(XML)
ssURL=paste( sep="", 'http://spreadsheets.google.com/feeds/worksheets/', sskey, '/public/basic' )
# Parse the worksheets feed and pull out each entry element (one per sheet)
ssd=xmlTreeParse( ssURL, useInternal=TRUE )
nodes=getNodeSet( ssd, "//x:entry", "x" )
titles=sapply( nodes, function(x) xmlSApply( x, xmlValue ) )
library(stringr)
# The sheet name sits in the entry's content; the sheet ID is the last three characters of the entry's id URL
data.frame( sheetName = titles['content',], sheetId = str_sub(titles['id',], -3, -1 ) )

data frame in r

In this example, we also pull out the sheet ID that is used by the Google spreadsheets API to access individual sheets, just in case. (Note that these IDs are not the same as the numeric gid values used in the chart API query language…)

PS Note: my version of R seemed to choke if I gave it https: headed URLs, but it was fine with http:

Google’s Universal User Channel

After a relaxing day spent finishing off Steven Levy’s In the Plex, here are a handful of very quick reflections:

– Google is an AI company: their scope for supervised training algorithms (with training signals coming from user actions and corrections), as well as unsupervised training (the algorithms for which are made available, for a fee, via the Google Prediction API), is unrivalled;

– Google is a hardware company: when I used to run robotics outreach activities, I used to joke that Lego was the world’s biggest manufacturer of tyres. Nowadays, I’m as likely to quip that at one point in its history at least, Google was the world’s biggest computer manufacturer (not strictly true – I think they’re third or fourth…?)

– Google has unprecedented tracking ability: through Google cookies, DoubleClick cookies, Google Analytics and the Chrome browser, they have the potential to track a ridiculous amount of web usage… (Are there any services that allow you to use e.g. your DoubleClick cookie to get a view over what data has been collected and stored against that cookie in DoubleClick’s/Google’s databases? That is, can I turn my cookies back on the companies that set them and demand to see what data is associated with them, VRM style?)

– Google has a channel to a *huge* audience through AdSense text ads, DoubleClick display ads, and embedded YouTube videos. I’m largely blind to Google AdWords – I avoid right-hand-side ad-filled columns like the plague – but if Google started putting my upcoming calendar events or priority email headers into the AdSense display box alongside page contextual ads, I might start looking at them as I go to check my personal messages… Imagine it: the AdSense block (with a publisher’s permission) including updates from members of your social circle, as well as ads… I guess if the presence of updates (“Personal network messages”, compared to “Sponsored Links”) in the AdSense block increased ad click-thrus, it would make commercial sense anyway…? If I was watching a YouTube video and an urgent/priority message hit my GMail inbox, I could get a lower-third alert box appearing in the movie to tell me. And so on…

If you haven’t read it yet, I recommend it: Steven Levy’s In the Plex.

As for me, I’m going to chase this reference that I came across in the book…: David Gelernter’s Mirror Worlds: or The Day Software Puts the Universe in a Shoebox… How it Will Happen and What it Will Mean

Googling the Future – from the Present and the Past

An XKCD cartoon today described Googling the future using search terms such as “in <year>” and “by <year>”:

So I tried it:

Hmm – results from the future?

So I had a play in Google News… could this be a good way of searching for forecasts?

By searching the past, we can search for old forecasts of the future…

I leave it as an exercise for the reader to search results from 2006, 2001, and 1991 for the 5, 10 and 20 year forecasts, respectively, for this year… let me know in the comments if anything interesting turns up;-)

See also: Google Impact…? The “Google Suggest” Factor

PS And this: Quantifying the Advantage of Looking Forward, which looks, for different countries, at the ratio of searches for year+1 and year-1 over the course of a year, then plots the resulting quotient against GDP. The results appear to suggest that there is a correlation between GDP and the forward-looking tendency of the population. (But is this right? Do the search volumes get normalised (on Google Trends) by the volume of the first term at the start of the trend period? If the user numbers are growing over the course of the year, might we be skewing the forward-looking component because of loaded terms at the end of the year?)
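For what it’s worth, the quotient itself is just a ratio; here’s an illustrative calculation using made-up search volume figures (the actual paper works from normalised Google Trends data, which is where my normalisation worry comes in):

# Illustrative only: the "forward looking" quotient for a single country during 2011,
# using made-up search volumes rather than real Google Trends data.
searches_for_2012 = 135000.0  # hypothetical volume of searches for "2012" made during 2011
searches_for_2010 = 118000.0  # hypothetical volume of searches for "2010" made during 2011
future_orientation = searches_for_2012 / searches_for_2010
print(round(future_orientation, 2))  # values above 1 suggest a more forward-looking search population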