Chasing Data – Are You Datablogging Yet?

It’s strange to think that the web search industry is only 15 years or so old, and in that time the race has been run on indexing and serving up results for web pages, images, videos, blogs, and so on. The current race is focused on chasing the mobile (local) searcher, making use of location awareness to serve up ads that are sensitive to spatial context, but maybe it’s data that is next?

(Maybe I need to write a “fear post” about how we’re walking into a world with browsers that know where we are, rather than “just” GPS enabled devices and mobile phone cell triangulation? ;-) [And, err, it seems Microsoft are getting in there too: Windows 7 knows where you are – “So just what is it that Microsoft is doing in Windows 7? Well, at a low level, Microsoft has a new application programming interface (API) for sensors and a second API for location. It uses any of a number of things to actually get the location, depending on what’s available. Obviously there’s GPS, but it also supports Wi-Fi and cellular triangulation. At a minimum.”]

So… data. Take for example this service on the Microsoft Research site: Data Depot. To me, this looks like a site that will store and visualise your telemetry data, or more informally collected data (you can tweet in data points, for example):

Want to ‘datablog’ your running miles or your commute times or your grocery spending? DataDepot provides a simple way to track any type of data over time. You can add data via the web or your phone, then annotate, view, analyze, and add related content to your data.

Services like Trendrr have also got the machinery in place to take daily “samples” and produce trend lines over time from automatically collected data. For example, here are some of the data sources they can already access:

  • Weather details – High and low temperatures on weather.com for a specific zipcode.
  • Amazon Sales Rank – Sales rank on amazon.com.
  • Monster Job Listings – Number of job results from Monster.com for the given query in a specific city.

Now call me paranoid, but I suddenly twigged why I thought the Google announcement about an extension to the Google Visualisation API – one that will enabl[e] “developers to display data from any data source connected to the web (any database, Excel spreadsheet, etc.), not just from Google Spreadsheets” – could have some consequences.

At the moment, the API will let you pull datatable formatted data from your database into the Google namespace. But suppose the next step is for the API to make a call on your database using a query you have handcrafted; then add in some fear – Google has already sussed out how to crawl through HTML forms by “parsing a form and then automatically generating and posting queries using those forms to find more links from deep within a website” – and you can see how giving the Google API a single query on your database would tell them some “useful info” (?!;-) about your database schema – info they could use to scrape and index a little more data out of your database…
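To make the “datatable formatted data” bit concrete, here’s a minimal sketch of my own (in Python; the column names and values are made up for illustration) of the sort of response a data source hands back to the Visualisation API – a JSON table wrapped in a callback:

```python
# A sketch of my own, not Google's code: the shape of a Visualisation API
# "data source" response - a JSON datatable wrapped in a callback. The
# columns and rows here are invented datablogging values.
import json

def datasource_response(rows, req_id="0"):
    table = {
        "cols": [{"id": "d", "label": "Date", "type": "string"},
                 {"id": "m", "label": "Miles", "type": "number"}],
        "rows": [{"c": [{"v": d}, {"v": m}]} for d, m in rows],
    }
    payload = {"version": "0.6", "reqId": req_id, "status": "ok", "table": table}
    # The client-side charting code evaluates this callback to get its data.
    return "google.visualization.Query.setResponse(%s);" % json.dumps(payload)

print(datasource_response([("2008-11-01", 3.2), ("2008-11-02", 5.0)]))
```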

Now of course the Viz API service may never extend that far, and I’m sure Google’s T&C’s would guarantee “good Internet citizenry practices”, but the potential for evil will be there…

And finally, it’s probably also worth mentioning that even if we don’t give the Goog the keys to our databases, plenty of us are in the habit of feeding public data stores anyway. There are several sites built specifically around visualising user submitted data (if you make it public…) – Many Eyes and Swivel, for example. And then of course, there’s also Google Spreadsheets, DabbleDB, Zoho Sheet, etc.

The race for data is on… what are the consequences?!;-)

PS see also Track’n’graph, iCharts and widgenie. Or how about Daytum and mycrocosm.

Also related: “Self-surveillance”.

Confused About the Consequences

In the previous couple of posts, I’ve rambled about web apps that will find a book from its cover, identify a song just by hearing it played, and track down your online contacts across a myriad of services from your username on a single service.

But today I saw something that brought home to me the consequences of aggregating millions of tiny individual actions, in this case photo uploads to the flickr social photo site.

From my reading of the post, the purple overlays in the images above – not the blue bounding boxes – are generated automatically by clustering geotagged and placename tagged images and extrapolating a well contoured shape around them.

That is, from the photos tagged “London” [more precisely, photos that are tagged with London in Yahoo’s WOE service], the algorithm creates the purple “London city” overlay in the above diagram.

For each and every photo upload, there is maybe a tiny personal consequence. For millions of photo uploads, there are consequences like this… (From millions of personal votes cast, there’s the possible consequence of change…) [Update: apparently, flickr received its 3 billionth upload at the start of November…]

And it struck me that even the relatively unsophisticated form of signals intelligence that is traffic analysis was capable of changing the face of war. So what are the consequences of traffic analysis at this scale?

What are the possible consequences? What are we walking into?

(Of course, following a brief moment of “I want to stop contributing to this; I’m gonna kill my computer and go and grow onions somewhere”, I then started wondering: “hmm, maybe if we also mine the info about what camera took each photo, and looked up the price of that camera, we might be able to generate socio-economic overlays over different neighbourhoods, and then… arrghh… stop, no, evil, evil…”;-))

So to add to the mix, here are a couple more things that the web made easy this week. Firstly, the Google Visualisation API was extended so that it can consume data in a simple format from your own data sources. That is, if you allow your own database to output data in a simple tabular structure, the Google Visualisation API makes it trivial to generate charts and graphs from that data. Secondly, Google added RSS feed support to their Google Alerts service, which makes it easy to subscribe to an RSS feed that will alert you to new results on Google for a particular search. What really surprised me was how, after setting up a couple of alerts, they appeared in my Google Reader account without me doing anything (or maybe that should be – without me changing something to say “no”?).

Small components is one thing.

Small components loosely coupled is another – and one where many of us see value.

Small components automatically wired together is yet another thing – and one that is increasingly going to happen. A consequence I hadn’t anticipated of setting up a Google RSS alert was that the feed appeared automatically in my feed reader.

Yesterday, an unanticipated consequence of me adding my blog URL to my Google Profile page was that several other URLs I control were automatically suggested to me as things I might want to add to my profile.

Whenever I go into Facebook, the platform suggests a list of people I might know, whom I might want to “friend”.

Now this recommendation may be because we share a large number of friends, or it might be that I’ve appeared in the same photograph as some of these people… How would Facebook know? Maybe Microsoft, their search provider, told them: Why “People” Tags? describes how the beta version of Microsoft Live Photo Gallery automatically identifies faces in photos and then prompts you to tag them with people’s names… Google already does this, of course, in Picasa, with its “name tags“.

And finally… a chance clickthru from someone on the Copac developments blog, which lists OUseful.info in the blogroll, alerted me through my blog stats to this post on Spooky Personalisation (should we be afraid?), which discusses the extent to which “adaptive personalisation” may appear “spooky” to the user.

(A serendipitous link discovery for me? Surely… Spooky? Maybe!;-)

And that maybe is going to be an ever more apparent unanticipated consequence of the way in which it’s getting so much easier to glue apps together? Spookiness…

PS see also Does Google Know Too Much? (h/t Ray@B2FXXX)

The Future of Search Is Already Here

One of my favourite quotes (and one I probably misquote – which is a pre-requisite of the best quotes) is William Gibson’s “the future is already here, it’s just not evenly distributed yet”…

Several times tonight, I realised that the future is increasingly happening around me, and it’s appearing so quickly I’m having problems even imagining what might come next.

So here for your delectation are some of the things I saw earlier this evening:

  • SnapTell: a mobile and iPhone app that lets you photograph a book, CD or game cover and it’ll recognise it, tell you what it is and take you to the appropriate Amazon page so you can buy it… (originally via CogDogBlog);

  • Shazam, a music recognition application that will identify a piece of music that’s playing out loud, pop up some details, and then let you buy it on iTunes or view a version of the song being played on Youtube (the CogDog also mentioned this, but it was arrived at tonight independently);

    So just imagine the “workflow” here: you hear a song playing, fire up the Shazam app, it recognises the song, then you can watch someone play a version of the song (maybe even the same version) on Youtube.

  • A picture of a thousand words?: if you upload a scanned document to the web as a PDF, Google will now have a go at running an OCR service over the document, extracting the text, indexing it and making it searchable. Which means you can just scan and post, flag the content to the Googlebot via a sitemap, and then search into the OCR’d content. (I’m not sure if the OCR service is built on top of the Tesseract OCR code?)
  • Barely three months ago, Youtube added the ability to augment videos with captions. With a little bit of glue, the Google Translate service will take those captions and translate them into another language for you (Auto Translate Now Available For Videos With Captions):

    “To get a translation for your preferred language, move the mouse over the bottom-right arrow, and then over the small triangle next to the CC (or subtitle) icon, to see the captions menu. Click on the “Translate…” button and then you will be given a choice of many different languages.” [Youtube blog]

Another (mis)quote, this time from Arthur C. Clarke: “any sufficiently advanced technology is indistinguishable from magic”. And by magic, I guess one thing we mean is that there is no “obvious” causal relationship between the casting of a spell and the effect? And a second thing is that if we believe something to be possible, then it probably is possible.

I think I’m starting to believe in magic…

PS Google finally got round to making their alerts service feed a feed: Feed me! Google Alerts not just for email anymore, so now you can subscribe to an alerts RSS feed, rather than having to receive alerts via email. If you want to receive the updates via Twitter, just paste the feed URL into a service like Twitterfeed or f33d.in.

PPS I guess I should have listed this in the list above – news that Google has (at least in the US) found a way of opening up its book search data: Google pays small change to open every book in the world. Here’s the blog announcement: New chapter for Google Book Search: “With this agreement, in-copyright, out-of-print books will now be available for readers in the U.S. to search, preview and buy online — something that was simply unavailable to date. Most of these books are difficult, if not impossible, to find.”

Time to Get Scared, People?

Last week, I posted a couple of tweets (via http://twitter.com/psychemedia) that were essentially doodles around the edge of what services like Google can work out about you from your online activity.

As ever in these matters, I picked on AJCann in the tweets, partly because he evangelises social web tool use to his students;-)

So what did I look at?

  • the Google Social Graph API – a service that tries to mine your social connections from public ‘friendships’ on the web. Check out the demo services… (there’s also a minimal lookup sketch after this list);

    For example, here’s what the Google social API can find from Alan’s Friendfeed account using the “My Connections” demo:

    • people he links to on twitter and flickr;
    • people who link to him as a contact on twitter, delicious, friendfeed and flickr;
    • a link picked up from Science of the Invisible (which happens to be one of Alan’s blogs) also picks out his identi.ca identity; adding that URL to the Social Graph API form pulls out more contacts – via FOAF records – from Alan’s identi.ca profile;

    The “Site Connections” demo pulls out all sorts of info about an individual by looking at URLs prominently associated with them, such as a personal blog:

The connections reveal Alan’s possible identity on Technorati, Twitter, identi.ca, friendfeed, swurl, seesmic and mybloglog.

  • For anyone who doesn’t know what Alan looks like, you can always do a “face” search on Google images;
  • increasingly, there are “people” search engines out there that are built solely for searching for people. One example is Spock (other examples include pipl, zoominfo and wink [and 123people, which offers an interesting federated search results page]). The Spock “deep web” search turns up links that potentially point to Alan’s friendfeed and twitter pages, his revver videos, slideshare account and so on;
  • Alan seems to be pretty consistent in the username he uses on different sites. This makes it easy to guess his account on different sites, of course – or use a service like User Name Check to do a quick search.
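As promised above, here’s a rough Python sketch of the sort of lookup the Social Graph API demos are making behind the scenes. The edo/edi/fme parameter names and the “claimed_nodes” key are as I remember them from the docs, so treat them as assumptions and double-check before relying on this:

```python
# Hedged sketch: ask Google's Social Graph API what public connections it
# has crawled around a URL. Parameter names are from my reading of the docs.
import json
import urllib.request
from urllib.parse import urlencode

def social_graph_lookup(url):
    params = urlencode({"q": url, "edo": 1, "edi": 1, "fme": 1})
    endpoint = "http://socialgraph.apis.google.com/lookup?" + params
    with urllib.request.urlopen(endpoint) as resp:
        return json.load(resp)

# Point it at any public profile or blog URL (example.com is a stand-in):
data = social_graph_lookup("http://example.com/")
for node, info in data.get("nodes", {}).items():
    print(node, sorted(info.get("claimed_nodes", [])))
```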

Now I wasn’t going to post anything about this, but today I saw the following on Google Blogoscoped: Search Google Profiles, which describes a new Google search feature. (Didn’t know you had a Google Profile? If you have a Google account, you probably do – http://www.google.com/s2/profiles/me/? And if you want to really scare yourself with what your Google account can do to you, check http://www.google.com/history/… go on, I dare you…)

I had a quick look to see if I could find a link for the new profile search on my profile page, but didn’t spot one, although it’s easy enough to find the search form here: http://www.google.com/s2/profiles. (Maybe I don’t get a link because my profile isn’t public?)

Anyway, while looking over my profile, I thought I’d add my blog URL (http://ouseful.info) to it – and as soon as I clicked enter, got this:

A set of links that I might want to add to my profile – taken in part from the Social Graph API, maybe? Over the next 6 months I could see Google providing a de facto social network aggregation site, just from re-posting to you what they know about your social connections from mining the data they’ve crawled, and linking some of it together…

And given that the Goog can learn a lot about you by virtue of crawling public pages that are already out there, how much more comprehensive will your profile on Google be (and how confident will Google be in the profile it can automatically generate around you?) if you actually feed it yourself? (Bear in mind things like health care records exist already…)

PS I just had a look at my own Web History page on Google, and it seems like they’ve recently added some new features, such as “popular searches related to my searches”, and also something on search trends that I don’t fully (or even partially) understand? Or maybe they were already there and I’ve not noticed before/forgotten (I rarely look at my search history…)

PPS does the web know when your birthday is??? Beware of “Happy Birthday me…”. See also My Web Birthday.

[Have you heard about Google’s ‘social circle’ technology yet? read more]

Amazon Reviews from Different Editions of the Same Book

A couple of days ago I posted a Yahoo pipe that showed how to Look Up Alternative Copies of a Book on Amazon, via ThingISBN. The main inspiration for that hack was that it could be useful to get “as new” prices for different editions of the same book if you’re not so bothered about which edition you get, but you are bothered by the price. (Or maybe you wanted an edition of a book with a different cover…)

It struck me last night that it might also be useful to aggregate the reviews from different editions of the same book, so here’s a hack that will do exactly that: produce a feed listing the reviews for the different editions of a particular book, and label each review with the book it came from via its cover:

The pipe starts exactly as before – get an ISBN, check that the ISBN is valid, then look up the ISBNs of the alternative editions of the book. The next step is to grab the Amazon comments for each book, before annotating each item (that is, each comment) with a link to the book cover that the review applies to; we also grab the ISBN (the ASIN) for each book and make a placeholder using it for the item link and image link:

Then we just create the appropriate URLs back to the Amazon site for that particular book edition:

The patterns are as follows:
– book description page: http://www.amazon.co.uk/exec/obidos/ASIN/ISBN
– book cover image: http://images.amazon.com/images/P/ISBN.01.TZZZZZZZ

Here’s how the nested pipe that grabs the comments works (Amazon book reviews lookup by ISBN pipe): first construct the URL to call the webservice that gets details for a book with a particular ISBN – the large report format includes the reviews:

Grab the results XML and point to the reviews (which are at Items.Item.CustomerReviews.Review):

Construct a valid RSS feed containing one comment per item:

And there you have it – a pipe that looks up the different editions of a particular book using ThingISBN, and then aggregates the Amazon reviews for all those editions.
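And if you’d rather see the logic outside of Pipes, here’s a rough Python sketch of my own that follows the same recipe – ThingISBN for the alternative editions, then an Amazon ECS “large report” lookup for the reviews. You’d need your own Amazon developer key (the AWS_KEY below is a placeholder), and live ECS responses are namespaced, so the element paths may need adjusting:

```python
# A sketch, not the pipe itself: aggregate Amazon reviews across all the
# editions of a book that ThingISBN knows about.
import urllib.request
import xml.etree.ElementTree as ET

AWS_KEY = "YOUR-AWS-ACCESS-KEY"  # placeholder - use your own key
ECS = "http://ecs.amazonaws.com/onca/xml"
COVER = "http://images.amazon.com/images/P/%s.01.TZZZZZZZ"

def alternate_isbns(isbn):
    # ThingISBN returns a simple list of <isbn> elements.
    xml = urllib.request.urlopen(
        "http://www.librarything.com/api/thingISBN/" + isbn).read()
    return [e.text for e in ET.fromstring(xml).findall("isbn")]

def reviews_for(isbn):
    # The large report includes the Items.Item.CustomerReviews.Review
    # path mentioned above.
    url = (ECS + "?Service=AWSECommerceService&Operation=ItemLookup"
           "&AWSAccessKeyId=" + AWS_KEY +
           "&ItemId=" + isbn + "&ResponseGroup=Large")
    tree = ET.fromstring(urllib.request.urlopen(url).read())
    for review in tree.iter("Review"):
        yield {"summary": review.findtext("Summary"),
               "content": review.findtext("Content"),
               "cover": COVER % isbn,  # label each review with its edition
               "link": "http://www.amazon.co.uk/exec/obidos/ASIN/" + isbn}

for alt in alternate_isbns("0441172717"):
    for review in reviews_for(alt):
        print(review["summary"], "->", review["link"])
```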

Time for a TinyNS?

In a comment to Printing Out Online Course Materials With Embedded Movie Links, Alan Levine suggests: “I’d say you are covered for people lacking a QR reader device since you have the video URL in print; about all you could [do] is run through some process that generates a shorter link” [the emphasis is mine].

I suspect that URL shortening services have become increasingly popular because of the rise of the blog-killing (wtf?!) microblogging services, but they’ve also been used for quite some time in magazines and newspapers. And making use of them in (printed out) course materials might also be a handy thing to do. (Assessing the risks involved in using such services is the sort of thing Brian Kelly may well have posted about somewhere; but see also towards the end of this post.)

Now anyone who knows me knows that my mobile phone is a hundred years old and won’t go anywhere near the interweb (though I can send short emails through a free SMS2email gateway I found several years ago!). So I don’t know if the browsers in smart phones can do this already… but it seems to me a really useful feature for a mobile browser would be something like the Mozilla/Firefox smart keywords.

Smart keywords are essentially bookmarks that are invoked by typing a keyword in the browser address bar and hitting return – the browser will then take you to the desired URL. Think of it like a URL “keyboard shortcut”…

One really nice feature of smart keywords is that they can handle an argument… For example, here’s a smart keyword I have defined in my browser (Flock, which is built from the Firefox codebase): the bookmark’s location is http://tinyurl.com/%s and its keyword is t; whatever I type after the keyword is substituted for the %s.

Given a TinyURL (such as http://tinyurl.com/6nf2z), all I need to type into my browser address bar is t 6nf2z to go there.

Which would seem like a sensible thing to be able to do in a browser on a mobile device… (maybe you already can? But how many people know how to do it, if so?)

(NB To create a TinyURL for the page you’re currently viewing at the click of a button, it’s easiest to use something like the TinyURL bookmarklet.)

Now one of the problems with URL shortening services is that you become reliant on the short URL provider to decode the shortened URL and redirect you to the intended “full length” URL. The relationship between the actual URL and the shortened URL is arbitrary, which is where the problem lies – the shortened URL is not a “lossless compressed” version of the original URL, it’s effectively the assignment of a random code that can be used to look up the full URL in a database owned by the short URL service provider. Cf. the scheme used by services like delicious, which generate an “MD5 hash” of a URL – a hash that can’t be reversed as such, but that can (usually!) be looked up to recover the original URL (see Pivotal Moments… (pivotwitter?!) for links to Yahoo pipes that decode both TinyURLs and delicious URL encodings).
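(By way of illustration, the delicious-style identifier is just the MD5 digest of the URL string – a couple of lines of Python:)

```python
import hashlib

# delicious identifies a URL by the MD5 hexdigest of the URL string; the
# digest can't be reversed, but delicious can look it up in its own database.
print(hashlib.md5(b"http://tinyurl.com/6nf2z").hexdigest())
```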

So this got me thinking – what would a “TinyNS” resolution service look like, sitting one level above DNS – the domain name system that takes you from a human readable domain name (e.g. www.open.ac.uk) to an IP (Internet Protocol) address (something like 194.66.152.28)?

Could (should) we set up trusted parties to mirror the mapping of shortened URL codes from the different URL shortening services (TinyURL, bit.ly, is.gd and so on) and provide distributed resolution of these short form URLs, just in case the original services go down?
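As a strawman, the client side of such a “TinyNS” might look something like the Python sketch below – the mirror host is entirely hypothetical; the only real trick is reading the redirect’s Location header rather than following it:

```python
# Strawman "TinyNS" client: expand a short code by asking the originating
# service first, then falling back to mirrors. The mirror host below is
# entirely hypothetical.
import urllib.error
import urllib.request

SERVICES = {
    "tinyurl": ["http://tinyurl.com/%s",
                "http://tinyns.example.org/tinyurl/%s"],  # hypothetical mirror
}

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # surface the 30x response instead of following it

def expand(service, code):
    opener = urllib.request.build_opener(NoRedirect)
    for pattern in SERVICES[service]:
        try:
            resp = opener.open(pattern % code)
            headers = resp.headers
        except urllib.error.HTTPError as err:
            headers = err.headers  # the 301/302 surfaces here as an "error"
        except urllib.error.URLError:
            continue  # this resolver is down - try the next one
        location = headers.get("Location")
        if location:
            return location
    raise LookupError("no resolver could expand %s:%s" % (service, code))

print(expand("tinyurl", "6nf2z"))
```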

Looking Up Alternative Copies of a Book on Amazon, via ThingISBN

As Amazon improves access to the long tail of books through its marketplace sellers, and maybe even its ownership of Abebooks, it’s increasingly easy to find multiple editions of the same book. So when I followed a link to a book that Mike Ellis recommended last week (to The Victorian Internet, in fact) and found that none of the editions of the book were in stock, as new, on Amazon, I had the tangential thought that it’d be quite handy to have a service that would take an ISBN and then look up the prices for all the various editions of that book on Amazon.

Given an ISBN for a book, there are at least a couple of ways of finding the ISBNs for other editions of the book – the Worldcat xISBN service, and ThingISBN from LibraryThing (now part owned by Amazon through Amazon’s ownership of Abebooks; for who else Amazon owns, see Amazon “Edge Services” – Digital Manufacturing).

So here’s a couple of Yahoo pipes for looking up the alternative editions of a book on the Amazon website, after discovering those editions from ThingISBN.

First of all a pipe that takes an ISBN and looks up alternative editions using ThingISBN:

What this pipe does is construct a URL that calls for the list of alternative ISBNs for a given ISBN. That is, it constructs a URL of the form http://www.librarything.com/api/thingISBN/ISBNHERE, which returns an XML file containing the alternative ISBNs (example); grabs the XML file back using the Fetch Data block; renames the internal representation of the grabbed XML so that the pipe will generate a valid RSS feed; and outputs the result.

So now we have an RSS feed that contains a list of alternative ISBNs, via ThingISBN, for a given ISBN.

Now to find out how much these books cost on Amazon. For that, we shall find it convenient to construct a pipe that will look up the details of a book on Amazon using the Amazon Associates web service, given an ISBN. (For a brief intro to Amazon Associates web services, see Calling Amazon Associates/Ecommerce Web Services from a Google Spreadsheet.)

Here’s a pipe to do that:

(If you use the AWSzone scratchpad to construct a URL that calls the Amazon web service with a look up for book by ISBN, you can just paste it into the “Base” entry form in the Pipe’s URL Builder block and hit return, and it will explode the arguments into the appropriate slots for you.)

So now we have a pipe that will look up the details of a book on Amazon given its ISBN.
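For reference, here’s roughly what that lookup looks like outside of Pipes, as a Python sketch – my parameter choices (the ResponseGroup in particular) are indicative rather than definitive, the AWS key is a placeholder, and live ECS responses are namespaced, so the paths may need adjusting:

```python
# Sketch of the Amazon leg of the pipe: look up a book's details (and a
# price) by ISBN via the ECS ItemLookup operation.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def amazon_book_lookup(isbn, aws_key="YOUR-AWS-ACCESS-KEY"):  # placeholder key
    params = urlencode({
        "Service": "AWSECommerceService",
        "Operation": "ItemLookup",
        "AWSAccessKeyId": aws_key,
        "ItemId": isbn,
        "ResponseGroup": "Medium,OfferSummary",  # details plus price summary
    })
    url = "http://ecs.amazonaws.com/onca/xml?" + params
    tree = ET.fromstring(urllib.request.urlopen(url).read())
    return {
        "isbn": isbn,
        "title": tree.findtext(".//Title"),
        "new_price": tree.findtext(".//LowestNewPrice/FormattedPrice"),
    }

print(amazon_book_lookup("0441172717"))
```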

We can now put the ThingISBN pipe and the Amazon ISBN lookup pipe together, to create a compound pipe that will lookup details for all the alternative versions of a particular book, given that particular book’s ISBN:

Okay – so now we have a pipe that takes an ISBN, looks up the alternative ISBNs using ThingISBN, then grabs details for each of those alternatives from Amazon…

Now what? Well, if you use this pipe in your own mashup, you may find that if you don’t handle the ISBN properly in your own code, you end up passing a badly formed ISBN to the pipe. The most common example of this is dropping a leading 0 from the ISBN – passing 441172717 rather than 0441172717, for example.

Now it just so happens that LibraryThing offers another webservice that can correct this sort of error – ISBN check API – and it’s easy enough to create a pipe to call it:

Good – so now we can defensively programme the front end of our pipe to handle badly formed ISBNs by sticking this pipe at the front of the compound pipe that calls ThingISBN and then loops through Amazon calls.
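(If you’d rather not make the extra web service call, the leading-zero case at least is easy to handle locally – here’s a Python sketch of my own that re-pads a dropped leading zero and verifies the ISBN-10 check digit:)

```python
# Defensive ISBN handling without the remote API call: re-pad a leading
# zero lost in transit and verify the ISBN-10 checksum (weighted digit sum
# must be divisible by 11, with 'X' standing for 10 in the final position).
def normalise_isbn10(raw):
    s = str(raw).strip().replace("-", "").upper()
    if len(s) == 9 and s[:1].isdigit():  # a likely dropped leading zero
        s = "0" + s
    if len(s) != 10:
        return None
    total = 0
    for i, ch in enumerate(s):
        if ch.isdigit():
            value = int(ch)
        elif ch == "X" and i == 9:
            value = 10
        else:
            return None
        total += (10 - i) * value
    return s if total % 11 == 0 else None

print(normalise_isbn10(441172717))    # -> '0441172717'
print(normalise_isbn10("0441172717")) # -> '0441172717'
```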

But there’s something we can do at the other end of the pipe too, and that is make use of a ‘slideshow’ feature that Yahoo pipes offers as an interface to the pipe. If the elements of a feed contain image items that are packaged in an appropriate way, the Yahoo pipes interface will automatically create a slideshow of those images.

What this means is that if we package URLs that point to the book cover image of each alternative version of a book, we can get a slideshow of the book covers of all the alternative editions of that book.

Here’s just such a pipe:

And here’s the example output:

If you click on the “Get as Badge” option, you can then embed this slideshow on your own website or start page:

For example, here I’ve added the slideshow to my iGoogle page:

Now to my mind, that’s a fun (and practical) way of introducing quite a few ideas about web service orchestration that can be unpacked at a later date. But of course, it’s not very academic, so it’s unlikely to appear in a course near you anytime soon… ;-) But I’d argue that it does stand up as a demo that could be given to show people how much fun this stuff can be to play with, before we inflict SOAP and WS-* on them…