Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml(“http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population”,”table”,1)

And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with it’s original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

190 comments

  1. redbaiters

    My liege,

    Wow, I think I saw what I think I saw. We really have to start talking about visualizing data in a WPMu/MediaWiki environment. I will be blogging some semantic stuff I really don’t understand shortly, because you inspire me to no end.

  2. Brian

    Oh my goodness, I’m experiencing this familiar “Tony just did something I only half-get but am totally excited about” sensation…

    I gotta get me to a remedial digital literacy class, like, yesterday.

  3. Tom

    That made my day.

    I wish I’d have realized that was possible when I did this with Jim Groom. It would have saved me a lot of time and it’s just such an elegant way to do it.

    Thanks.

    Tom

  4. Pingback: Wikipedia : Google Docs : Yahoo Pipes Mashup - ../learninglab/joss/
  5. Matty Fo

    Amazing mashup. More importantly the =import function in Google spreadsheets is an amazing tool for building data sets for visualization and mashup projects, thanks for some killer knowledge.

  6. Gary

    Been reading your blog for a while and this just made me shake my head… am amazed and do agree with Brian thanks!!!

  7. Pingback: Mashing up the PLE (Tony Hirst) « Doug Clow’s Imaginatively-Titled Blog
  8. Katrineholm Uncensored

    Tony, you did a fine thing indeed! I’d like to see, for example, more transparency in our local government here in Katrineholm, Sweden where all documents (invoices included) with the exception of those involving personal privacy are online. Now, you offer interesting possibilities for information gathering and online publication via Wikipedia and then with Google Docs.

    Really, you did an excellent job Tony. Most impressive.

  9. Marius Hanganu

    Really cool stuff. This is what web 3.0 should be all about. Integrating services.

    The disadvantage in having so many pieces is that you have little control over how the flow should update the end results. If you want to publish data real time, you’d want to see it as fast as possible on google/yahoo maps. This is where some more work is needed.

  10. jq

    Well done. Just a minor correction. This:

    “consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Yahoo map.”

    should be:

    “consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a GOOGLE map.”

    Or Yahoo then Google map.

  11. chrisco

    Very cool, thanks for sharing :)

    Question/request:

    I would like to do something like that with data from Yahoo Finance (and/or other source) for end of day market data (indexes, trading volume, advancing and declining issues, new highs/lows, etc.).

    Anyone can/care to help me with that, please get in touch through the contact form on my blog (linked to in this comment, I think, but if not go to http://www.chrisco.us). Would be a paid freelance project, of course.

    Thanks,
    Chris

  12. learningspanish

    Wow… I think my head just exploded….

    That’s impressive. I didn’t realize you could pull data in that way from a webpage in Google Spreadsheets. Nice implementation to get it to Google Maps.

  13. stacey

    how do you make sure the data you are using stays current? for example, if i took it from another website and put it on a map.

  14. Tony Hirst

    First of all – thanks for all the comments folks – and apologies for the delay in replying; I’ve been offline traveling to and from – and delivering – a presentation on mashups today… :-)

    @katrineholm Uncensored – does your council have any web pages we could start to scrape?

    @chrisco – I’ll be posting a follow up post in an hour or so that may be of interest?

    @jq ta :-)

    @stacey – the google spreadsheet polls the original HTML page every so often, so updates to the page will update the data in the spreadsheet. (Of course, if the structure of the page is changed – eg by adding or removing another table in the page before the table you want to pull out – the import function will try to import the now incorrect table; screenscraping based mashups like this are always quite brittle in that a change in the page can break the way you try to get data out of it; which is why APIs are better…:-)
    The Yahoo pipe will also cache the data being pulled from the spreadsheet in the csv file.
    So whilst in an ideal world, a change to the wikipedia page would immediatley be reflected each time you refreshed the map, there is “cacheing induced latency” in the google spreadsheet and the Yahoo pipe which could mean the data is potentially minutes or hours out of date… but hopefully no more than that…

  15. Greg

    As I understand it there are two ways: when you want to update something in google spreadsheets you have to click on the cell with the calculation and press enter, so that it refreshes, or use a cron job to run a script that fiddles with the spreadsheet so that it updates/grabs fresh data (don’t ask me if/how that would work)

  16. Jon

    (I just posted this on Brian’s blog, but thought I might as well post it here, given the plaudits that this is attracting…)

    Of course, superficially this is fine. But if you start to look at the data, it’s rather problematic. I was surprised to see that Manchester was listed as smaller than Bristol, of all places. It was only by going to the Wikipedia page that I understood the way in which the statistics were gathered.

    In short, a lot of information has been stripped out in this process. I don’t see much in the way of gain. And in fact, it’s become very misleading.

    Where’s the upside?

    OK, call me a cynic, but…

  17. Jon

    Or to put it another way (and in case I didn’t sound grumpy and fuddy-duddy enough the first time), the result is a USA Today style graphic, with all the apparent virtues of clarity but the real vices of decontextualized data masquerading as information. Worse, it’s framed as a multiple-choice “pop test” as though it had some kind of educational value. It doesn’t. It’s miseducation.

  18. Pingback: Infovore » links for October 15th
  19. Pingback: Visualising Financial Data In a Google Spreadsheet Motion Chart « OUseful.Info, the blog…
  20. Pingback: The NAvAP » Blog Archive » Example of a web mashup at work
  21. Tony Hirst

    @Jon the intention of the post wasn’t to produce an infographic, it was to document a mashup recipe. I was looking for a data table with a list of locations that I could scrape using a Google spreadsheet, then output into a geocoding Yahoo pipe and the wikipedia page was the first thing I came across that fitted the bill…

    Your comment makes an interesting point about the way data gets used, abused and inappropriately visualised, though…

  22. Jon

    Well, your mashup recipe provides an excellent case in point of the ways in which data gets used, abused, and inappropriately visualized.

    The problem is not with the Wikipedia page, but with the way in which the information it contains has been repackaged.

    Again, in devil’s advocate role: where’s the upside here?

  23. Pingback: links for 2008-10-15 « Simply… A User
  24. Tony Hirst

    @Jon “The problem is not with the Wikipedia page, but with the way in which the information it contains has been repackaged.”

    Agreed – as I just posted in a reply on Brian’s Abject blog, I should have explicitly stated the provenance of the data in both the spreadsheet and each description element of the feed generated in the pipe.

    Re: the visualisation being a crap one, my original intention was to visualise the actual population data in some way (I thought about something like this: http://ouseful.open.ac.uk/blogarchive/013580.html then thought better of it), but in the end I ran out of time… As I have done now – 1.40am and it’s been way too long a day to argue any more… ;-)

  25. Pingback: datatube « Greg Is Overcritical Of Himself And Lots Of Other Things.
  26. nzhappygirl

    I second Brian:digital remedial literacy class for me too, because that looks seriously cool.

  27. Pingback: On est une bande de potes, on se scrape la gueule | taggle.org
  28. Chloe

    It would be neat to export the chart or map as an image, and import it into the original Wiki page, so that whenever someone updates the table on the Wiki, the map & chart are automatically updated too. I’m sure there are services which will take a screenshot of a web page. I don’t know how to export the chart, or if Wikipedia allows images to be on another site.

  29. Sarsen56

    I think the phrase is ‘well impressed’. The bad news is I just wasted two days doing some tedious longhand stuff! Thanks for this post.

  30. Pingback: The Great Geek Manual » Link Round-Up: October 15, 2008
  31. dave

    I think it’s worth pointing out that while Google’s importHTML is very cool, for many purposes selecting, cutting and pasting a table from a website into a spreadsheet (Excel, OOCalc, or Google Spreedsheets) in the standard manner will the do the job.

    While less gee-whizz, it’s a bit more approachable, and a strangley overlooked ability. (There are some odd corner cases, particularly if you use Microsoft products, but simply knowing it’s possible usually means you’ll get there in the end. I find Fireox plus OpenOffice gives the best results and most powerful results. A quick Google suggests there’s a Firefox extension for people who find themselves needing to do more)

  32. Jon

    Tony, I’m sorry you see this as arguing.

    Again, however, it would be great if someone, anyone, could tell me the upside.

  33. Pingback: The Wikipedia Blog » Blog Archive » Scraping Wikipedia tables with Google Spreadsheets
  34. Tony Hirst

    @jon i see it as an argument in the sense of – you know – a discussion with different positions that is capable of raising different perspectives and moving the, err,argument on…;-) And the upside is that it was just a sort of coherent way of demonstrating lots of separate techie mashup tricks…

    @dave the problem with cur and paste is that if the original table changes, your cut and pasted data is obsolete (unless the cut and paste cuts’n’pastes a reference to the table that is then imported behind the scenes). A comment on the follow-up post to this one – http://ouseful.wordpress.com/2008/10/16/visualising-financial-data-in-a-google-spreadsheet-motion-chart/ – attempts to work out just how frequently the spreadsheet does update the data.

  35. Brian

    Tony – I’m glad you mentioned that you had posted a comment on my blog – I had yours and three or four others inexplicably in a junk folder, which I hadn’t checked in weeks.

    Jon – I took a crack at describing what I see as the upside in response to your comments on my blog, and linked to an article I wrote last year… I suspect it will fail to convince you, but I hope it will give a sense why this is interesting to some of us.

  36. Jon

    Well, once you can tell me what’s the upside of these “techie mashup tricks,” then that would indeed be “kewel.”

    It would be marvellous if the argument could move on, but it doesn’t seem to me that you are willing to look at the issues here.

    It’s shiny, shiny. And worse than useless: it’s positively unhelpful.

    No wonder people’s heads are exploding. It helps if you don’t actually think about what’s going on.

  37. Ivan

    Jaw-Dropping.

    It’s almost like this web of interconnected services will actually happen after all…

  38. Tony Hirst

    @Jon “It would be marvellous if the argument could move on, but it doesn’t seem to me that you are willing to look at the issues here.”

    I’ve so lost the thread now… anywhere, here are a few answers (hopefully) of a sort:

    1) no provenance – agreed; I touched on that in a previous comment and on Brian’s blog; at a more substantive level, on the point that mashups strip provenance info and context away from data – agreed: we do have a “metadata maintenance” issue here. I’ll come back to this in a moment to suggest that there may be situations where that’s okay… (err, thin ice up ahead, I suspect ;-) But generally, yes – agreed: it’s an issue; only not here, not now. This is where we try to enthuse people – then we can get on to the issues…

    To pull a quote of yours from the comment thread on http://weblogs.elearning.ubc.ca/brian/archives/048641.php “Um, what is the point, then? To do something that’s “kewel”? All surface and no substance? All process and technical trickery and no actual thought or context?”

    2) Part of the point was to just document for myself some whizzy techie wiring for a workshop session I ran yesterday, and rehearse a demo to finish off the session. I’ve sat through that many dull cubed presentations, I’m trying to develop my own presentation skills (a long hard slog if you’ve ever suffered one of my presentations) and figured for the w/s I needed to try and finish with some buzz in the room. Shiny shiny…. but that’s okay, because no-one remembers anything from a presentation, just whether they came away happy, dazed, bored or enthused.

    3) Value generating and information displaying enterprise mashups are arguably on the way (e.g. http://www.forrester.com/Research/Document/Excerpt/0,7211,44213,00.html ); and one reason (IMHO) for this is that the barriers to entry in terms of usability are being lowered by the experimentation that’s going on in the public and beta web. Unfortunately, most people have no idea about what the hell a mashup is – not even the very idea; (this R/W/W post provides one way of trying to explain what mashups are about in an accessible-ish way: http://www.readwriteweb.com/archives/forrester_enterprise_mashups.php ) So the demo was in part to try to get across some the relative easy (and getting easier) process techniques available for putting these things together (pipelines, data transformation, etc), and leave me with an opportunity to say: “so that’s a mashup” at the end of the session.

    I didn’t touch on stripping away the context and maintaining provenance info in the post because that would have BORING and is best saved for a later day – when people are at least thinking about what the consequences of mashups are having seen one possible example.

    @utellit told it as they saw it: “I love this. Learning is so fun! Thanks:)” Good – yes – learning should be fun; rarely is; my view? get enthused and then do the hard work of really getting to grips with it, because you’ll be more motivated…

    4) Most people are turned off by techie stuff, or techie stuff being “too hard” – how many people do you know who don’t have a concept of right-click, or it’s benefits? So I tried to tell a story that embedded some how to’s, and if you abstract away from it and start to think about what’s going on, there is some deeper stuff in there; like what sort of processing is going on in a ‘mashup’, and what the risks are…

    5) Thin ice here: in an enterprise, where the data is hopefully quality data(?!), you can hopefully trust it; and if you learn how to mash it up properly, you can get value from it. The provenance is assumed – there is still a risk wrt loss of context though. Related story – you see lots of web stats reporting averages (averages? wtf? MEANS…); and this are often a total waste of time – the info is in the distribution, and the distribution is often not normal… Context gets dropped from data all the time… and the info processing that goes on is oftentimes lossy… but that’s not a problem if the info doesn’t mislead you into making a bad decision…

    6) as we generate more and more data, we’re gonna have to start learning about visualisation. Many people don;t get the idea of uncovering relationships in data through visualising it; we have to start somewhere with the visualisation education programme; maps are good for that… we have to start somewhere… and if the visualisation is a bad one, in viz 101 that’s maybe not so bad because the idea is to get across the idea of doing visualisations at all. I remember physics at school: each year we were told to forget what we’d done the previous year because it was just an approximation to the truth – and we’d be doing the truth this year…

    And finally, as I understand “kewel”, it’s a way of making fun of (people who use) the word cool who are maybe not the sort of people who can get away with saying cool… so I said “kewel” to suggest that I suspected people would get off on the shiny, shiny of the post… which is ok, because the path to enlightenment is a long one, and it’s important to enthuse people about this stuff in the beginning…

    Phew – too long a comment – If the above doesn’t address any of your points, could you try expressing them in a different way, ‘cos I’m obviously not understanding them…

  39. Pingback: =GoogleLookup: Creating a Google Fact Engine Directory « OUseful.Info, the blog…
  40. Chris McAvoy

    Tony, got your comment on my blog. Thanks for the inspiration. I ended up using Pipes to clean up a feed of market information for the game Eve Online, fed it to a Google Spreadsheet, and use it to calculate profit margins on the production of pretend ships and pretend guns.

    Here’s my pipe and here’s a snapshot of the spreadsheet it feeds.

    Not really a *real* application, but its certainly useful within the pretend world of Eve.

    • Aundreaus Fortiz

      Chris: I love your pipe, not sure how to use it yet though. I started today thinking I would find a way to get up-to date prices for materials needed for POS operation.

      I think where i’m stuck on is where to find the typeID of the item i’m looking for. My spread sheet is set up at the moment but not sure how to pull the info in.

  41. Pingback: » From Wikipedia to Google Mashup mapkist
  42. Jon

    Tony, I think that the basic issue is that what you regard as “BORING” seems to me to be rather fundamental. Indeed, it seems rather clearly to be a way of avoiding asking the question of “the consequences of mashups.”

    Yes, you’ve given an example. And I’m now asking about the consequences of this process. I think that, certainly for this example, they are uniformly negative. Which is why I was asking if there were any upside to this representation of a partial, decontextualized dataset in a new environment. Again, I still see no advantages, while I see many, many disadvantages.

    No doubt you’ll think I’m exaggerating, but I think that this post, and the reaction to it, is profoundly discouraging. I’m sorry to be the drag on the “kewel” party by talking about what’s so damn “BORING.” But it seems to me to be important. I’ve written up some further thoughts here.

  43. Pingback: Weekly linkdump #147 - max - блог разработчиков
  44. Jon

    And perhaps finally, at least for now, a quick response to your points here…

    1) I think the issue is about more than provenance, though providing provenance would certainly be a leap forward. We’re agreed on that. We also agreed that at least some mashups can “strip provenance info and context away from data.” But I should make it clear that I’m not against mashups per se. Some can definitely enhance rather than degrade both data and understanding. But I do think that this question (as to whether there is enhancement or degradation, or more generally the effects of the various transformations undergone in a mashup process) needs to be front and central at all times. It would seem that that’s our basic disagreement: you want to postpone that until some unspecified later date; for now it’s “BORING” and beside the point. Again, I strongly disagree. Postponing such questions is (as I say on my blog) a textbook case of data illiteracy. Moreover, I’m disappointed that enthusiasm is allowed to overwhelm critique so easily.

    2) I like interesting presentations, too. I disagree that thinking critically is necessarily boring. And I don’t like presentations that are whizzy but lack substance.

    3) I’d like to believe that “value generating and information displaying enterprise mashups are arguably on the way.” I think, however, that the links you provided are rather worrying examples of quite the opposite trend: “there is plenty of money to be made selling mashup platforms. . . . Those vendor strategists that move quickly, plan a mashup strategy, and build a partner ecosystem will come out on top”. Ugh. That’s certainly a different definition of “value” than the one I would hope might come out of such tools. Ugh. And learning should be fun. But all that is fun is not learning.

    4) Techie stuff can certainly be bewildering. It makes some people’s heads explode. Hence the vital importance of critical data literacy, precisely so that people “start to think about what’s going on, there is some deeper stuff in there; like what sort of processing is going on in a ‘mashup’, and what the risks are.” I’m all for this.

    5) See above for my fear as to what you may mean by getting “value” from data. Meanwhile, my more basic point is that it’s not so much a question of having quality data or not: I’m happy to accept that for this purposes the Wikipedia data is of sufficient quality. What’s at stake is the ability to interpret that data. Again, that’s what I consider to be the realm of data literacy. I’m sure that some mashups, and some transformations, aid in such interpretation. Others do not. But figuring out which is which is, again, vital. It can’t be postponed.

    6) Finally, I’m all for a “visualisation education programme.” But again, my comparison with the USA Today visualizations is to point out that such education needs to have critique at its core. Not necessarily enthusiasm. Or rather, surely the point is to encourage enthusiasm for critique. That’s what I see my job, at least, as being about. More broadly, as I say, it’s what I see as being the role of the university.

  45. Tony Hirst

    @jon it was being held as spam for some reason (not my doing… not sure I understand how WordPress makes its spam detection decisions?); should be in the comment stream now…

  46. Pingback: Brokekid.net » Blog Archive » Tips & Tricks for Google Apps
  47. Pingback: Brokekid.net » Blog Archive » Monkey Vs Robot - My Weekly Idealist post
  48. Maniquí

    When I was kid, I did this kind of maps when I was at school, by using a table on some schollar book and a map. Who didn’t? What’s new? The auto-magical update through the pipes?

    From the linked page at Forrester:
    “The Mashup Opportunity.
    How To Make Money In The Evolving Mashup Ecosystem.
    Download and print PDF immediately. Price: US $775″

    They will certainly make money if the sell an e-book at that price. :|

    It seems the Web is becoming more and more a cash register, an entertainement place, than a place to share documents and scientific data. Nothing wrong with that, my job at work is creating commercial/entertaining sites for small/medium business.

  49. Pingback: Nodalities » Blog Archive » The Revolution Starts (near) Here
  50. Pingback: Media Literacy: Making Sense Of New Technologies And Media by George Siemens - Oct 18 08 | Geekblog.biz
  51. Pingback: No coding required? « Red56 blog
  52. Pingback: triple|scape » Blog Archive » TWILD for October 18, 2008
  53. playoutintelligence

    BTW: you can do that with MS Excel since Office XP (I think). Data->Import External Data->New Web Query. Click on one or more of the arrows to switch it to a green checkmark and press the “Import” button.

    Works even with password protected websites, Excel will ask you for your login data when you open the spreadsheet then.

  54. Pingback: Daily Del.icio.us
  55. Pingback: Zaubern mit Google Documents | der leumund.
  56. Pingback: Data Scraping Wikipedia with Google Spreadsheets « OUseful.Info, the blog… | clhmedia.com
  57. Pingback: Manipolare i dati sul Web at Anto’stuff
  58. Pingback: Referrer Traffic from Amazon - WTF?! « OUseful.Info, the blog…
  59. Pingback: :: Tom Smith’s: theOTHERblog
  60. Pingback: Terror in the Heartland » ma.gnolia linkdump
  61. Pingback: ollyjackson.co.uk - fine quality hypertexts since 1998
  62. Pingback: ollyjackson.co.uk - fine quality hypertexts since 1998
  63. Pingback: Data Scraping: Using Google Docs to grab table data « Everyday Scripting
  64. pop

    If I have name and address only(Japan Language) , I would like to show location how can do. I try to do many an hour but Y Location not appears.
    Plz Help Me —

  65. Pingback: » Blog Archive » links for 2008-10-23
  66. Pingback: links for 2008-10-30 - schorleblog
  67. Pingback: rascunho » Blog Archive » links for 2008-10-30
  68. DoctorDave

    Has anyone figured out how to use a published Google spreadsheet as an external data source for Excel?

  69. Pingback: Data Scraping Wikipedia « LocalLab : Foire aux Infos
  70. plerzelwupp

    thank you, ESA – I mixed your proposal (geocode by google) with this one (kml by pipes) …. the pipes location extractor works better, when you feed it with longitude and latitude. Google nearly found every of those 52 cities (except of four). The last 4 cities where found, as I feeded google with city and country (not only city).

    Thanks again, Tony and Esa – and many greetings from germany

  71. Pingback: Scraping avec Google SpreadSheets | taggle.org
  72. Richard Fahey

    Thanks so much for this awesome post. The only issue I have is that the wiki I want to scrape data from is behind a firewall and cannot be accessed without signing in. I wonder is there anyway out of this because it’s RSS feeds etc cannot be accessed externally. I’m looking at some mediawiki extension that might push the wiki page data externally so I can then use google spreadsheet, yahoo pipes etc. Any ideas?

  73. Tony Hirst

    @richard
    It won’t be approved of, but if you have a DMZ server that sits on the internal network (ie can see internal URLs) and also faces outside, you can set up a simple proxy that will relay the feed into the public domain…

    Of course, I can’t recommend such practice ;-)

    If the login to the wiki is via http, then you may be able to pass user id and pwd in via URL. That is, if you can get a feed out using something like http://example.com/feed.xml?id=myID&password=secret

  74. Pingback: October 2008’s most-clicked posts to the link list: Data scraping, design experts, copy editors, killing the article, local. - Joe Think
  75. Pingback: Data Sets - A huge list | Austin Rude
  76. Pingback: Google MyMaps Now With RSS (= Easy Geoblogging) « OUseful.Info, the blog…
  77. Pingback: Is Scraping Data Wrong? « Alexander The Great
  78. Pingback: Approxi-mapping Mash-ups, with a Google MyMaps Tidy Up to Follow « OUseful.Info, the blog…
  79. Pingback: Abject Learning » Blog Archive » Is there a remedial data literacy class I should be signing up for?
  80. Pingback: My CETIS 2008 Presentations « OUseful.Info, the blog…
  81. Pingback: Discover “What a Group Are Saying” with Yahoo Pipes and GoogleDocs :: Tom Smith’s: theOTHERblog
  82. Pingback: Data Scraping at Research Pipeline Blog
  83. Pingback: Wikipedia-Tabellen in Google Spreadsheet importieren und in Google-maps darstellen - “, jede x-beliebige Tabelle, =importhtmlQuelladresse;table;n , , , Hinweis die Formel sollte eingetippt werden - wobei Sie die Wikipedia-Adresse einfügen können Wenn
  84. Pingback:   Cool google and wikipedia mashups by andydickinson.net
  85. Pingback: Mastering RSS: Control your Inputs, and Improve your Outputs « Rick Martin
  86. Pingback: News update « Daily Link
  87. Pingback: Mashing Wikipedia Data with Google Spreadsheets | Workplace Learning Today
  88. Pingback: Horse Racing Data - Apuestas Forobet: Foro de apuestas deportivas
  89. Pingback: Tagz | "Data Scraping Wikipedia with Google Spreadsheets « OUseful.Info, the blog…" | Comments
  90. Pingback: 10 great Google hacks | www.amusingourselves.com
  91. Pingback: Ten top Google hacks «10brilliant
  92. Olga Revilla

    Hi,

    I have tried to do the first step (importHTML) in Google Docs and it does not work with comma (,) separator, but changed to semi-colon (;) and worked fine.

    Great article, congratulations!

  93. Pingback: Initial Thoughts on “Mashup Patterns” « OUseful.Info, the blog…
  94. Pingback: Scraping, scripting, hacking « electronic museum
  95. Pingback: Online utilities for the non-hacker « just a dumping ground for now
  96. Pingback: Blind Ape Seo » How to scrape without programming
  97. Edward Metz

    I wanted to ask if someone out there might be able to demonstrate how I could introduce into my map a third data point to display. Right now I have the location and name displaying but also need the date if at all possible. Any ideas?

  98. Pingback: The Simple Way to Scrape an HTML Table: Google Docs « WebGlide - Data-Visualization
  99. Pingback: Speed Up Keyphrase Research with ImportXML function on Google Docs | AUKSEO - Blog from a Search Engine Optimiser based the UK
  100. Pingback: Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets « OUseful.Info, the blog…
  101. Pingback: links for 2010-02-16 « A Web editor's tale
  102. Pingback: links for 2010-02-17 « Onlinejournalismtest's Blog
  103. Pingback: Literary Twist Update « Information Twist
  104. Paul Bradshaw

    This is plain lovely, and Jon’s comments are useful too. I’m curious why the resulting map on both Pipes and Google Maps only shows certain cities/towns?

    • Tony Hirst

      The Yahoo pipes geocoder is a little hit or miss, and sometimes doesn’t disambiguate or identfy placenames so well; if a placename isn’t recognised and geocoded, then that item doesn’t get plotted.

      What I really need to get round to doing is looking at using something like Yahoo Placemaker to see if it’s any more reliable (I don’t think the Pipes Location Extractor is Placemaker… though I’d have to run a few tests to check…)

  105. falko

    I found this article very inspiring. I would love to know if there is a way to flatten nested tables e.g. table>tr>td*3>table>tr>td*2 would become talbe>tr>td*4 (hope you understand this zen html syntax) nested tables are a mess! But I guess this is what data scraping is all about, getting data from information sources that are messy :)

  106. Pingback: Data Mashups with Google Spreadsheets and Yahoo Pipes | BryanPovlinski.com
  107. Pingback: Javascript Charts… « The Journeyler
  108. Pingback: links for 2010-04-19 « A Web editor's tale
  109. Pingback: links for 2010-05-11 « Onlinejournalismtest's Blog
  110. Luke Metcalfe

    That’s great! Otherwise you can go through APIs from services like freebase, but going direct means you get the most recent data.

    We went direct with our Wikipedia Google Maps mashup. http://maps.thefullwiki.org We went through the entire Wikipedia dump and created a map for each article showing each location.

  111. Pingback: An introduction to data scraping with Scraperwiki | Online Journalism Blog
  112. Pingback: A Quick Play with Google Static Maps: Dallas Crime « OUseful.Info, the blog…
  113. Pingback: How to be a data journalist « INCLUSIVE SOCIETY
  114. Pingback: 1001 Medios » Blog Archive » Periodismo de datos: “Everything is data”
  115. Andrea Moro

    Thank you very much for sharing this info. I’m using Yahoo! pipes for quite a while, but I never thought to generate mash-up like this.

    Very useful. I will have a try just to make a bit more practice.

    Enjoy the rest of your day.

  116. Pingback: creating buzzword app part 1 | escapingeggshells
  117. Freelance Unbound

    FYI: The formula for scraping with a Google Spreadsheet has changed (or there is a typo here).

    It’s =importhtml(“”,”table”,N) – ie there’s an “L” after the HTM.

    Super useful post though – thanks…

  118. Pingback: How to be a data journalist « beatsnpeace
  119. Pingback: What are the most interesting things done with Wikipedia data? - Quora
  120. Pingback: How can I get tabular data from Wikipedia into a Google Spreadsheet? - Quora
  121. Pingback: How to be a data journalist | Tech2Crave
  122. jmagnone

    Hi, this is pretty interesting. Thanks for sharing.

    I have been working on a plugin for WordPress that helps to import data into individual WP posts.
    I think you can try it or might be useful. http://wordpress.org/extend/plugins/wpsync/

    I am envisioning to use your technique so I can build a site quickly from data processed with pipes using your approach. Thanks.

  123. Pingback: Scraping tabular data from the web | visurus
  124. Pingback: data scraping « DevXS-UCAS
  125. Pingback: how to scrape data from webpages using googledoc
  126. Pingback: Digital methods workshop | rlturenhout
  127. Pingback: And the most engaging OUseful.info post is… JISC CETIS MASHe
  128. Pingback: Periodismo de datos | webperiodistas
  129. Company Chittagong

    With havin so much content and articles do you ever run into any issues of plagorism or copyright violation?
    My blog has a lot of unique content I’ve either created myself or outsourced but it looks like a lot of it is popping it up all over the internet without my agreement. Do you know any methods to help prevent content from being stolen? I’d genuinely appreciate it.

  130. Pingback: Javascript Charts… | The Journeyler
  131. Pingback: #CAST12 DataViz Sandbox – Resources « OUseful.Info, the blog…
  132. Pingback: Analyzing World and Country Data using R | Janalytics
  133. Pingback: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes « OUseful.Info, the blog…
  134. ibook21.com

    Hiya! I know this is kinda off topic nevertheless I’d figured I’d ask.
    Would you be interested in exchanging links or maybe guest authoring a blog article
    or vice-versa? My site goes over a lot of the same subjects
    as yours and I believe we could greatly benefit from each
    other. If you’re interested feel free to send me an e-mail. I look forward to hearing from you! Excellent blog by the way!

  135. Andy

    Awesome. Now I just need to work out how to do it on a private (ie logon required) site, as I have loads of data i could manipulate that i can’t really get to otherwise, any ideas?

  136. Pingback: WordPress – Weekly Dump (weekly) | ResponsiPaul
  137. Kevin

    Can’t get my pipes to work. any ideas? i can get it in the correct format using this =QUERY(ImportHtml(“http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population”,”table”,2),”SELECT Col1, Col2, Col3 LABEL Col1 ‘rank’, Col2 ‘city’, Col3 ‘population'”)
    in google docs, but it is not showing up correcting in pipes. Appreciate any help!!

  138. Pingback: Scraping tabular data from the web - Spatialists
  139. Pingback: Veri yönetimi gazeteciliği ve güven değeri #DJCamp2013 @arzugokdai @pinardag @mehvesevin @dagmedyaveri | Dağ Medya
  140. Pingback: Veri Gazeteciliği|| Kendi Wikileaks’inizi Kurmak İçin 10 Ücretsiz Uygulama | Dağ Medya