Extracting Data From Misbehaving RSS/Atom Feeds

A quick recipe prompted by a query from Joss Winn about getting data out of an apparently broken Atom feed: http://usesthis.com/feed/?r=n&n=150

The feed previews in Google Reader okay – http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 – and is also viewable in my browser, but neither Google Spreadsheets (via the =importFeed() formula) nor YQL (?!) appear to like it.

[Note: h/t to Joss for pointing this out to me: http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 is a recipe for accessing Google Reader’s archive of a feed, and pulling out e.g. n=150 items (r=n is maybe an ordering argument?) Which is to say: here’s a way of accessing an archive of RSS feed items…:-)]

However, Yahoo Pipes does, so a simple proxy pipe normalises the feed and gives us one that is properly formatted:

Sample Yahoo feed proxy - http://pipes.yahoo.com/ouseful/feedproxy

The normalised feed can now be accessed via:

http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed%2F%3Fr%3Dn%26n%3D150

We can also get access to a CSV output:

http://pipes.yahoo.com/ouseful/feedproxy?_render=csv&furl=http%3A%2F%2Fusesthis.com%2Ffeed

The CSV can be imported in to a Google spreadsheet using the =importData() formula:
=importData(“http://pipes.yahoo.com/ouseful/feedproxy?_render=csv&furl=http%3A%2F%2Fusesthis.com%2Ffeed”)

[Gotcha: if you have ?&_render in the URL (i.e. ?&), Spreadsheets can’t import the data…]

Once in the spreadsheet it’s easy enough to just pull out e.g. the description text from each feed item because it all appears in a single column.

Google spreadsheets can also query the feed and just pull in the description element. For example:

=ImportFeed(“http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed”,”items summary”)

(Note that it seemed to time out on me when I tried to pull in the full set of 150 elements in Joss’ original feed, but it worked fine with 15.)

We can also use YQL developer console to just pull out the description elements:

select description from rss where url=’http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed%2F%3Fr%3Dn%26n%3D150

YQL querying an rss feed

YQL then provides XML or JSON output as required.

3 comments

  1. Tony Hirst

    @joss Doh! Will fix… Btw, you mentioned tidying up the feed in Excel. It struck me it’s easy enough to pull out the description column from the text file using a Unix command line expression:

    cut -f 20 -d ‘,’ josstxt.csv > short.csv

    The short.csv file then contains informally structured data containing the different questions and answers. If i wasn’t supposed to be marking, it strikes me that Stanford Data Wrangler [ http://vis.stanford.edu/wrangler/ ] might be an appropriate tool to pull out the answers to each question and structure it more formally?

    As to being able to get stuff out of Google Reader archive… nice:-)

  2. Joss

    Cheers, for that. Text wrangling and data munging is something I wish I could find time to learn.

    Google reader is just one big bloody hoover for sucking up feed enabled sites. It doesn’t matter if you specifically limit your feed to n items. It will pull the lot in if you keep scrolling down the ‘all items’ list. What’s nice is that we’ve found a way to get all the data out, too.