Referrer Traffic from Amazon – WTF?!

As it happens, I have been know to look at my blog stats from time (!), and today I noticed something odd in the referrer stats:

A referral from Amazon. WTF?

The link goes to a book detail page for a book about Wikipedia:

Scrolling down a bit, I found this:

A blog post syndicated in the product page from one of the book’s authors that linked to my post on Data Scraping Wikipedia with Google Spreadsheets.

My immediate thought – is there any way we can blog info about courses that use set textbooks back into the related Amazon product page?

Amazon “Edge Services” – Digital Manufacturing

When is a web service not a web service? When it’s an edge service, maybe?

Last night I was pondering the Amazon proposition, which at first glance broadly seems to break down into:

The retail bit splits down further: physical goods and digital downloads, shipped by Amazon; and marketplace goods, where products from other retailers are listed (using Amazon ecommerce webservices, I guess) and Amazon takes a cut from each sale.

It was while I was looking at the digital downloads that the idea of “edge services” came to mind – web services that result in physical world actions (if you’re familiar with the Terminator movies, think: “Skynet manufacturing”;-) [It seems an appropriate phrase has already been coined: direct digital manufacturing, (DDM) – “the process of going directly from an electronic digital representation of a part to the final product [for example] via additive manufacturing”. See also “Digital Manufacturing — Bridging Imagination and Manufacturing“.]

But first let’s set the scene: just what is Amazon up to in the digital download space?

Quite a lot, as it happens – here’s what they offer directly under the Amazon brand, for example (on the domain):
Amazon MP3 Downloads store, a DRM free music downloads site;
Amazon Video on Demand Store – for movie and TV downloads;
Amazon e-books and docs – download e-books and electronic documents (“eDocs”);
– the Kindle store. If you haven’t heard about it already, Kindle is Amazon’s consumer electronics play, an e-book reader with wi-fi connectivity and a direct line back to the Amazon store;
– and just this week (and what prompted this post initially), Amazon bought up Reflexive, a company that among other things is into the online game distribution business.

And although it doesn’t quite fit into the “digital download” space, don’t forget the person-in-the-machine product Amazon Mechanical Turk, a web service for farming out piece work to real people.

But that’s not all – here, for example, are the companies that I know about that are in the Amazon Group of Companies:
IMDb – The Internet Movie Database (which apparently is now streaming movies and TV programmes for free);
Audible – audio book downloads;
Booksurge – book printing on-demand (just by-the-by, in the UK, Amazon’s Milton Keynes fulfilment centre is about to go into the PoD business (press release);
CreateSpace: PoD plus, I guess? Create print on demand books DVDs and CDs, backed up by online audio and video distribution services.

(Amazon also own Shelfari, a site for users to organise and manage their own online bookshelves, and have a stake in LibraryThing, another service in the same vein, through the acquisition of second-hand, rare and out-of-print book retailer Abebooks.)

UPDATE: And they’ve just bought the Stanza e-book reader.

So here’s where it struck me: Amazon is increasingly capable of turning digital bits into physical stuff. This is good for warehousing, of course – the inventory in a PoD driven distribution service is blanks, not one or two copies of as many long tail books you can fit in the warehouse – though of course the actual process of PoD is possibly a huge bottleneck. And it takes Amazon from retailer, to manufacturer? Or to a retailer with an infinite inventory?

If this is part of the game plan, then maybe we can expect Amazon to buy up the following companies (or companies like them) over the next few months: – personalised business card printing, that’s also moving into more general card printing. Upload your photos (or import them from services like flickr) and then print ’em out… photobox does something similar, though it maybe prints onto a wider range of products than MOO currently does?
Spreadshirt – design (and sell) your own printed T-shirts;
Ponoko, or Shapeways – upload your CAD plans and let their 3D printers go to work fabricating your design;
Partybeans – personalised “candy boxes”. Put your own image on a tin containing your favourite sweets:-)

(For a few more ideas, see Money on Demand: New Revenue Streams for Online Content Publishers.)

That said, Amazon built up it’s retail operation based on reviews and recommendations (“people who bought this, also bought that”). The recommendation engine was (is) one way of surfacing long tail products to potential purchasers. And I’m not convinced that the long tail rec engine will necessarily work on ‘user-generated’ content (although maybe it will scale across to that?!). But if you run an inventoryless operation, then does it matter?! Because maybe you can resell uploaded, user-contributed content to friends and family anyway, and several sales for the price of one upload that way?

Or maybe they’ll move into franchising POD and fab machines, and scale-up manufacturing that way? One thing I keep noticing at conferences and events is that coffee increasingly comes in Starbucks labeled dispensers (Starbucks – For Business). So maybe we’ll start seeing Amazon branded POD and fab machines in our libraries, bookstores and catalogue shops? (Espresso, anyone? Blackwell brews up Espresso: “Blackwell is introducing an on-demand printer the Espresso Book Machine to its 60-store chain after signing an agreement with US owner On Demand Books.“) Also on the coffee front – brand your latte

A few further thoughts:
– if Amazon is deliberately developing a digital manufacturing capacity to supplement it’s retail operation (and find ways of reducing stock levels of “instanced products” (i.e. particular books, or particular DVDs) then is the next step moving into the design and user-contributed content business? Like photo-sharing, or video editing..? How’s the Yahoo share price today, I wonder?! ;-)
– Amazon starts (privacy restrictions allowing, and if it doesn’t already do so) to use services like Shelfari and IMDb (though its playlists) to feed its recommendation engine, and encourages the growth of consumer curated playlists; will it have another go at pushing the Your Media Library service, or will it happily exploit verticals like IMDb and Shelfari?
– will companies running on Amazon webservices that are offering “edge services” start to become acquisition targets? After all, if they’re already running on Amazon infrastructure, that makes integration easier, right? Because the Amazon website itself is built on top of those services (and is itself actually a presentation layer for lots of loosely coupled web services already?) (And if you go into conspiracy mode, was the long term plan always to use Amazon webservices as a way of fostering external business innovation that might then be bought up and rolled up into Amazon itself?!)

There’s a book in there somewhere, I think?!

PS another riff on services at the edge, AWS satellite ground stations: Instead of building your own ground station or entering in to a long-term contract, you can make use of AWS Ground Station on an as-needed, pay-as-you-go basis. You can get access to a ground station on short notice in order to handle a special event: severe weather, a natural disaster, or something more positive such as a sporting event. If you need access to a ground station on a regular basis to capture Earth observations or distribute content world-wide, you can reserve capacity ahead of time and pay even less. AWS Ground Station is a fully managed service. You don’t need to build or maintain antennas, and can focus on your work or research.

[Dec 2020] Or how about this? Amazon Monitron, "an end-to-end system that uses machine learning (ML) to detect abnormal behavior in industrial machinery, enabling you to implement predictive maintenance and reduce unplanned downtime [using] Monitron Sensors to capture vibration and temperature data". So: Amazon start to automate manufacturing analytics, use that to in part bootstrap the development of their own (additive) manufacturing processes, initially for producing own-brand lines (whose adoption is identified from analysis of their marketplace sales data) but then rolled as a service and then sold as industrial equipment (built in part, rep-rap style, from their own manufacturing lines).

I wonder if Jeff Bezos (or Elon Musk..] has ever read this 1981 NASA funded research report: A SELF-REPLICATING, GROWING LUNAR FACTORY.

Mashup Reuse – Are You Lazy Enough?

Late on Friday night, I picked up a challenge (again on Twitter) from Scott Leslie:

After a little more probing, the problem turned out to be to do with pulling a list of URLs from a page on the Guardian blog together into a playlist: The 50 greatest arts videos on YouTube.

As Scott suggested, it would have been far more useful to provide the list as a Youtube playlist. But they didn’t… So was there an easy way of creating one?

Now it’s quite possible that there is a way to programmatically create a playlist via the Youtube gdata API, but here’s a workaround that uses a Grazr widget as a player for a list of Youtube URLs:

So let’s work back from this widget to see how it was put together.

The Grazr widget works by loading in the URL of an OPML or RSS feed, or the URL of an HTML page that contains an autodiscoverable feed:

The URL loaded into the widget is this:

If we load this URL into our normal browser, (and then maybe also “View Source” from the browser Edit menu, or “Page Source” from the browser View menu) this is what the Grazr widget is consuming:

If you know anything about Grazr widgets, then you’ll maybe know that if the feed contains a media enclosure, Grazr will try to embed it in an appropriate player…

So where is the feed the Grazr widget is feeding on actually come from? The answer, a Yahoo pipe. This pipe in fact:

Let’s have a look at how it works – click on the “Edit Pipe” button (or maybe Clone the pipe first to get your own copy of it – though you’ll have to log in and/or create a Pipes user account using your Yahoo account first…):

And here’s what we find:

What this pipe does is Fetch a CSV file from another URL and rename the pipe’s internal representation of the data that was loaded in from the CSV file in such a way that the pipe now represents a valid, if not too informative, RSS feed:

The Loop element is used to run another pipe function (TinyURL to full (preview) URL), whose name suggests that it returns the target (original) URL from a TinyURL:

This pipe block is actually one I created before (and remembered creating) – if you inspect the debug output of the block, you’ll see the TinyURLs have been unpacked to Youtube video page URLs.

(If you want to explore how it works, you can find it here: TInyurl to full (preview) URL.)

The final bit of the pipe renames the attribute that was added to the pipe’s internal feed representation as an enclosure.url, and then rewrites the links from URLs that point to a Youtube video (splash) page to the URL of the video asset itself (a Flash swf file).

So that’s how the pipe works – it takes a CSV file input from somewhere on the web, and generates an RSS feed with a video file enclosure that can be played in a Grazr widget.

So where does the CSV file come from? If we look at the URL that the CSV block is loading in in the pipe we can find out:

Here’s the spreadsheet itself:

And the cells we want are cells A30:A82:

TinyURLs :-)

But where are they coming from?

Hmm, they are CONTINUEing in from cell A1:

The little orange square in the top right of cell A1 in the spreadsheet shows a formula is being used in that cell…

So let’s see what the formula is:

Here it is:

I recognise that URL! ;-) So what this formula does is is load in the The 50 greatest arts videos on YouTube page from the Guardian website, and then pull out all the anchor tags – all the <a> tags… which happen to include the links to the movies which we found at cells A30:A82.

Just looking back at the original page, here’s what it looked like:

And here’s what it really looks like if we view the page source:

So to recap, what appears to be going on is this:

A Google spreadsheet loads in the Guardian web page as an XML document, and strips out the URLs. The top 50 video URLs appear in contiguous cells as TinyURLs. These are published as a CSV file and consumed by a Yahoo pipe. The pipe takes the TinyURLs in the CSV feed, creates an RSS feed from them, unpacks them to their full Youtube URL form, and adds a link to the actual Youtube video asset as a feed enclosure. The feed is then rendered in a Grazr widget that automatically loads an embedded video player when it sees the Youtube video enclosure.

So how did I put this “mashup” together?

Firstly, I looked at the original page that contained the links that Scott (remember Scott?… strains of Alice’s Restaurant etc etc ;-), and knowing that I could use a Grazr widget as a player for a feed that contained Youtube movie enclosures all I had to do was get the URLs into a feed… and so I looked at the 50 top videos for cluses as to whether the links were in a form I could do something with, maybe using the Yahoo pipes ‘impoort HTML’ page scraping block; but the page didn’t look that friendly, so then I View source‘d. And the page structure didn’t look overly helpful either; but the links were there so rather than look t the page too closely, I though (on the off chance) I’d see what they looked like if I just link scraped the page. And knowing that the Google importXML function (with the //a XPATH) is a link scraper, I gave it a try; and the TinyURLs were all blocked together, so I knew I could use ’em by publishing that block of cells via a CSV file. And I remembered I’d created a TinyURL decoder block in Yahoo pipes some time ago, and I remembered creating a Youtube enclosure pipe before too, so I could crib that as well. And so it fell together…

And it fell together because I’d built reuable bits before, and I remembered where I could find them, so l took the lazy solution and wired them up together.

And I think that’s maybe what we need more of it mashups are going to become more widely used – more laziness… Many mashups are bespoke one-off’s because it’s getting easier to build “vertical” disposable mashup solutions. But sometimes there’s even easier to put together if they’re made out of Lego… ;-)

See also: Mashup Recycling: Now this is Green IT! and The Revolution Starts (near) Here

Getting an RSS Feed Out of a Google Custom Search Engine (CSE)

Alan posted me a tweet earlier today asking me to prove my “genius” credentials (heh, heh;-):

As far as I know, Google CSEs don’t offer an RSS output (yet: Google websearch doesn’t either, though rumour has it that it will, soon… so maybe CSEs will open up with opensearch too?)

So here’s a workaround…

If you make a query in a Google CSE – such as the rather wonderful How Do I? instructional video CSE ;-) – you’ll notice in the URL an argument that says &cx=somEGobbleDYGookNumber234sTUfF&cof….

google cse

The characters between cx= and either the end of the URL or an ampersand (&) are the ID of the CSE. In the case of How Do I?, the ID is 009190243792682903990%3Aqppoopa3lxa – almost; the “%3A” is a safe encoding for the web of the character “:”, so the actual CSE ID is 009190243792682903990:qppoopa3lxa. But we can work round that, and work with the encoded CSE ID cut straight from the URL.

Using the Google AJAX search API, you can create a query on any CSE that will return a result using the JSON format (a javascript object that can be loaded into a web page). The Google AJAX search API documentation tells you how: construct a Google AJAX web search query using the root and add a few extra arguments to pull in results from a particular CSE: Web Search Specific Arguments.

JSON isn’t RSS, but we can get it into RSS quite easily, using a Yahoo pipe…

Just paste in the ID of a CSE (or the whole results URL), add your query, and subscribe to the results as an RSS feed from the More Options menu:

The pipe works as follows…

First up, create a text box to let a user enter a CSE ID cut and pasted from a CSE results page URL (this should work if you paste in the whole of the URL of the results page from a query made on your CSE):

Then create the search query input box, and along with the CSE ID use it to create a URL that calls the Google AJAX API:

Grab the JSON data feed from the Google AJAX Search API and translate the results so that the pipe will output a valid RSS feed:

And there you have it – an RSS feed for a particular query made on a particular Google CSE can be obtained from the Get as RSS output on the pipe’s More Options menu.

Viewing Campaign Finance Data In a Google Spreadsheet via the New York Times Campaign Data API

It’s all very well when someone like the New York Times opens up an API to campaign finance data (announced here), but only the geeks can play, right? It’s not like they’re making the data available in a spreadsheet so that more people have a chance to try and do something with it, is it? Err, well, maybe this’ll help? A way of getting data out from the NYT API and into an online Google spreadsheet, where you can do something with it… (you can see an example here: campaign data in a Google Spreadsheet (via The New York Times Campaign Data API)).

Following up on a comment to Visualising Financial Data In a Google Spreadsheet Motion Chart by Dan Mcquillan: “FWIW, there’s a lot of campaigning potential in these tools”, I just had a quick tinker with another Google spreadsheet function – importXML – that pulls XML data into a Google spreadsheet, and in particular pointed it at the newly released New York Times Campaign Finance API.

To use this API you need to get yourself a key, which requires you to register with the New York Times, and also provide a URL for your application: create a new Google spreadsheet, and use the URL for it (it’ll look something like as the URL for your NYT API app.

First up, enter “API KEY” as a label in a cell – I used G22 – and paste your New York Times API Campaign Finance API key for that spreadsheet into the cell next to it – e.g. H32. (My spreadsheet was cobbled together in scratchpad mode, so the cells are all over the place!)

In cell G23 enter the label “Names URL” and in cell H23 the following formula, which constructs a URL for the “Candidate Summaries”:

[aaargghhh – crappy styylesheet… the formula is this, but all on one line:


(The formula concatenates a NYT API calling URL with your API key.)

We’re now going to pull in the list of candidates: in e.g. cell A24 enter the formula:

H23, remember, is a reference to the cell that contains the URL for the API call that grabs all the candidate summaries. The =importXML grabs the XML result from the API call, and the “//candidate” bit looks through the XML file till it finds a “candidate” element. If you look at the XML from the webservice, you’ll see it actually contains several of these:

The spreadsheet actually handles all of these and prints out a summary table for you:

All we did was configure one cell to grab all this information, remember: cell A24 which contains =ImportXML(H23,”//candidate”).

We’re now going to pull in more detailed campaign info for each candidate using the “Candidate Details” API request, which looks something like

(The (lack of) capitalisation of the surname of the candidate in the URL is not necessary – it seems to work okay with “Romney.xml” for example).

To construct the URLs for each candidate, let’s write a formula to construct the URL for one of them.

In my spreadsheet, the Candidate Summary import in cell A24 built a table that filled cells A24:E42, so we’re going to start working in additional columns alongside that table.

You’ll notice that the URL for the detailed report just requires the candidate’s surname, whereas the candidate summary provides the forename as well.

In cell F24, add the following formula: SPLIT(A24, “,”)

What this does is split the contents of cell A24 (e.g. Obama, Barack) at the comma, and populate two cells (in columns F and G) as a result:

If you highlight the cell (F24), click on the little blue square in the bottom right hand corner, and drag down, you should be able to magically copy the formula and so spilt all the candidates’ names.

Now what we need to do is construct the URL that will pull back the detailed campaign information for each candidate. We’ll build the URLs in the next free column along (column H in my spreadsheet – the split name is across columns F and G).

In cell H24, enter the following formula (all on one line – split here for convenience):


H$22 refers to the cell that contains your NYT API key. The $ is there to anchor the row number in the formula, so when you drag the cell to copy it, the copied formula still refers to that exact same cell.

Highlight cell H24 and drag the little blue square down to complete the table and construct URLs for each candidate.

Good stuff :-)

Now we can construct another formula to import the full campaign details for each candidate.

in cell A2 in my spreadsheet, I entered the following:


H24 is the URL for the candidate details API call for the first candidate (Barack Obama in my case).

Hopefully, you should get a row of data pulled into the spreadsheet:

The XML file responsible looks something like this:

Use your eyes and work out what the contents of each cell might refer to:

Click on cell A2, and drag the blue square down to A20. The contents of cell A2 will be copied in a relative way, and should now be importing detailed information for all the candidates.

Now you have the data in a form you can play with :-)

And it should update whenever the New York Times updates the data it exposes through the API.

(I would publish the spreadsheet in all its glory, but then I’d have to give away my API key… Ah, what the heck = here it is: Spreadsheet of Campaign Finances, courtesy of the New York Times. (Maybe I should read the T&C to check this is a fair use…!))

For other New York Times Campaign Data API functions, see the documentation. You should know enough now to be able to work out how to use it…

PS the “//candidate” XPATH stuff can be tricky to get your head round. If anyone wants to post as a comment a list of XPATH routines to pull out different data elements, feel free to do so :-)

If you make use of the data in any interesting ways, please link back here so we can see what you’re getting up to and how you’re using the data…

Link Love for Martin – “I Heart Twitter” Video

Just released, a follow up to Martin’s Edupunk response to my Changing Expectations video:

A Twitter Love Song (Weller)

Now available on Youtube…

PS (Weller)? Hmmm – any relation?!

PPS So it’ll be my turn again… hmm – you’ve been practising with Camtasia, haven’t you Martin? Maybe I’ll have to see what I can do with Jing and Jumpcut or Jaycut?

=GoogleLookup: Creating a Google Fact Engine Directory

Following on from a the previous posts in which I looked at pulling data tables out of wikipedia into a Google spreadsheet and accessing financial data within Google spreadsheets, in this post I’ll show you some of the things you can do with the Google spreadsheet formula “=GoogleLookup”.

If you’ve ever entered a search such as “capital of brazil“, you might have noticed that the Google web search engine will try to treat your query as a factual question, and answer it as such:

The Google spreadsheet =GoogleLookup(“entity”; “attribute”) formula provides a way of asking this sort of factual query within a Google spreadsheet:

Why’s this useful? One thing it seems to be good for is as a way of doing lots of factual look-ups in one go. How so?

Create a new Google spreadsheet and type the following into cell B2:


In cell A2 enter the word France and in cell B1 the word Population.

If you hover over cell A2, now populated with a number corresponding to the population of France, a tooltip will pop up showing the provenance of the data (i.e. where it came from), with a link to the data source.

Select cell A2 (the one containing the formula) and drag it using the small square in the bottom right hand corner of the cell across the row to cell C5. Highlight cells A2 to D2 and drag the small square marker in the bottom right hand corner of cell D2 across to cell D5. The formula will be copied into cells A2 to D5. The “$” symbols in the formula anchor various aspects of the formula so that it will look for the entity in column A of the current row, and the attribute in row 1 of the current column.

If you now add appropriate values into row 1 and column A, the GoogleLookup formula will be lookup the corresponding entity/attribute values (i.e. the corresponding facts).

Although I’ve had a quick look around for a comprehensive directory of allowable Google Lookup attribute values (and the sort of entity they should be paired with), I haven’t managed to find one. So I’ve put together a spreadsheet at where I’ll attempt to collate all the allowable lookup attributes I can find, along with appropriate exemplar entity values.

Using the ‘relative lookup formula’ approach shown above, the tables will show attribute values as column headings, and example entities as row headings.

At the moment I’ve started putting together Geography, Business and Chemistry lookup collections together:

In order to do a “bulk fact lookup”, enter a list of row label entity values into a spreadsheet, a set of attribute (fact) values as column headings, enter an appropriate ‘relativised GoogleLookup formula” into the first table data cell, and then drag it across the set of table data cells: voila – a bulk query made in a single shot across multiple entity and attribute values:-)

If you come across any more GoogleLookup attributes that seem to work, post a couple of example entity/attribute pairs in a comment to this post and I’ll add them to the =GoogleLookup formula attribute directory spreadsheet.

PS Looking at the sample spreadsheet just now, I notice that quite a few cells in the table may be unpopulated for particular entity values. So in a read/write web world, it should be possible to enter a value into the cell and have it logged, err, somewhere? Such as in a Google consensual fact engine?! ;-)

Visualising Financial Data In a Google Spreadsheet Motion Chart

Following on from Data Scraping Wikipedia With Google Spreadsheets, here’s a quick post showing how you can use another handy Google spreadsheet formula:

=GoogleFinance(“symbol”, “attribute”, “start_date”, “end_date”, “interval”)

This function will pull in live – and historical – price data for a stock.

Although I noticed this formula yesterday as I was exploring the “importHTML” formula described in the Wikipedia crawling post, I didn’t have time to have a play with it; but after a quick crib of HOWTO – track stocks in Google Spreadsheets, it struck me that here was something I could have fun with in a motion chart (you know, one of those Hans Rosling Gapminder charts….;-)

NB For the “official” documentation, try here: Google Docs Help – Functions: GoogleFinance)

# Stock quotes and other data may be delayed up to 20 minutes. Information is provided “as is” and solely for informational purposes, not for trading purposes or advice. For more information, please read our Stock Quotes Disclaimer.
# You can enter 250 GoogleFinance functions in a single spreadsheet; two functions in the same cell count as two.

So – let’s have some fun…

Fire up a new Google spreadsheet from, give the spreadsheet a name, and save it, and then create a new sheet within the spreadsheet (click on the “Add new sheet” button at the bottom of the page). Select the new sheet (called “Sheet2” probably), and in cell A1 add the following:

=GoogleFinance(“AAPL”, “all”, “1/1/2008”, “10/10/2008”, “WEEKLY”)

In case you didn’t know, AAPL is the stock ticker for Apple. (You can find stock ticker symbols for other companies on many finance sites.)

The formula will pull in the historical price data for Apple at weekly intervals from the start of 2008 to October 10th. (“all” in the formula means that all historical data will be pulled in on each sample date: opening price, closing price, high and low price, and volume.)

(If this was the live – rather than historical – data, it would be updated regularly, automatically…)

It’s easy to plot this data using a Google chart or a Google gadget:

Google spreadsheet create chart/gadget menu

I’m not sure a bar chart or scatter chart are quite right for historical stock pricing… so how about a line chart:

Et voila:

If you want to embed this image, you can:

If I was using live pricing data, I think the image would update with the data…?

Now create a few more sheets in your spreadsheet, and into cell A1 of this new sheet (sheet3) paste the following:

=GoogleFinance(“IBM”, “all”, “1/1/2008”, “10/10/2008”, “WEEKLY”)

This will pull in the historical price data for IBM.

Create two or three more new sheets, and in cell A1 of each pull in some more stock data (e.g. MSFT for Microsoft, YHOO for Yahoo, and GOOG…)

Now click on Sheet1, which should be empty. Fill in the following title cells by hand across cells A1 to G1:

Now for some magic…

Look at the URL of your spreadsheet in the browser address bar – mine’s “;

That key value – the value between “key=” and “&hl=en_GB#” is important – to all intents and purposes it’s the name of the spreadsheet. Generally, the key will be the characters between “key=” and an “&” or the end of the URL; the “&” means “and here’s another variable” – it’s not part of the key.

In cell B2, enter the following:

=ImportRange(“YOURKEY”, “Sheet2!A2:F42”)

YOURKEY is, err, your spreadsheet key… So here’s mine:

=ImportRange(“p1rHUqg4g423seyxs3O31LA”, “Sheet2!A2:F42”)

What ImportRange does is pull in a range of cell values from another spreadsheet. In this case, I’m pulling in the AAPL historical price data from Sheet2 (but using a different spreadsheet key, I could pull in data from a different spreadsheet altogether, if I’ve made that spreadsheet public).

In cell A2, enter the ticker symbol AAPL; highlight cell A2, click on the square in the bottom right hand corner and drag it down the column – when you release the mouse, the AAPL stock ticker should be added to all the cells above. Label each row from the imported data, and then in the next row, B column, import the data from Sheet 3:

These rows will need labeling “IBM”.

Import some more data if you like and then… let’s create a motion chart (info about motion charts.

Highlight all the cells in sheet1 (all the imported data from the other sheets) and then from the Insert menu select Gadget; from the Gadget panel that pops up, we want a motion chart:

Configure the chart, and have a play [DEMO]:

Enjoy (hit the play button… :-)

PS And remember, you could always export the data from the spreadsheet – though there are probably better API powered ways of getting hold of that data…

PPS and before the post-hegemonic backlash begins (the .org link is broken btw? or is that the point?;-) this post isn’t intended to show how to use the Google formula or the Motion Chart well or even appropriately, it’s just to show how to use it to get something done in a hacky mashery way, with no heed to best practice… the post should be viewed as a quick corridor conversation that demonstrates the tech in a casual way, at the end of a long day…

PS for a version of this post in French, see here: Créer un graphique de mouvement à partir de Google Docs.

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:


And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…


Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with it’s original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.


And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Visualising the OU Twitter Network

Readers of any prominent OU bloggers will probably have noticed that we appear to have something of Twitter culture developing within the organisation (e.g. “Twitter, microblogging and living in the stream“). After posting a few Thoughts on Visualising the OU Twitter Network…, I couldn’t resist the urge to have a go at drawing the OpenU twittergraph at the end of last week (although I had hoped someone else on the lazyweb might take up the challenge…) and posted a few teaser images (using broken code – oops) via twitter.

Anyway, I tidied up the code a little, and managed to produce the following images, which I have to say are spectacularly uninteresting. The membership of the ‘OU twitter network’ was identified using a combination of searches on Twitter for “” and “Open University”, coupled with personal knowledge. Which is to say, the membership list may well be incomplete.

The images are based on a graph that plots who follows whom. If B follows A, then B is a follower and A is followed. In the network graphs, an arrow goes from A to B if A is followed by B (so in the network graph, the arrows point to people who follow you. The graph was constructed by making calls to the Twitter API for the names of people an individual followed, for each member of the OU Twitter network. An edge appears in the graph if a person in the OU twitter network follows another person in the OU Twitter network. (One thing I haven’t looked at is to see whether there are individuals followed by a large number of OpenU twitterers who aren’t in the OpenU twitter network… which might be interesting…)

Wordle view showing who in the network has the most followers (the word size is proportional to the number of followers, so the bigger your name, the more people there are in the OU network that follow you). As Stuart predicted, this largely looks like a function of active time spent on Twitter.

We can compare this with a Many Eyes tag cloud showing how widely people follow other members of the OU network (the word size is proportional to the number of people in the OU network that the named individual follows – so the bigger your name, the more people in the OU network you follow).

Note that it may be interesting to scale this result according to the total number of people a user is following:

@A’s OU network following density= (number of people @A follows in OU Twitter network)/(total number of people @A follows)

Similarly, maybe we could also look at:

@A’s OU network follower density= (number of people in OU Twitter network following @A)/(total number of people following @A)

(In the tag clouds, the number of people following is less than the number of people followed; I think this is in part because I couldn’t pull down the names of who a person was following for people who have protected their tweets?)

Here’s another view of people who actively follow other members of the OU twitter network:

And who’s being followed?

These treemaps uncover another layer of information if we add a search…

So for example, who is Niall following/not following?

And who’s following Niall?

I’m not sure how useful a view of the OU Twittergraph is itself, though?

Maybe more interesting is to look at the connectivity between people who have sent each other an @message. So for example, here’s how Niall has been chatting to people in the OU twitter network (a link goes from A to B if @A sends a tweet to @B):

ou personal activer twittermap

We can also compare the ‘active connectivity’ of several people in the OU Twitter network. For example, who is Martin talking to, (and who’s talking to Martin) compared with Niall’s conversations?


As to why am I picking on Niall…? Well, apart from making the point that by engaging in ‘public’ social networks, other people can look at what you’re doing, it’s partly because thinking about this post on ‘Twitter impact factors’ kept me up all night: Twitter – how interconnected are you?.

The above is all “very interesting”, of course, but I’m not sure how valuable it is, e.g. in helping us understand how knowledge might flow around the OU Twitter network? Maybe I need to go away and start looking at some of the social network analysis literature, as well as some of the other Twitter network analysis tools, such as Twinfluence (Thanks, @Eingang:-)

PS Non S. – Many Eyes may give you a way of embedding a Wordle tagcloud…?)