Extracting Data From Misbehaving RSS/Atom Feeds
A quick recipe prompted by a query from Joss Winn about getting data out of an apparently broken Atom feed: http://usesthis.com/feed/
The feed previews in Google Reader okay – http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 – and is also viewable in my browser, but neither Google Spreadsheets (via the =importFeed() formula) nor YQL (?!) appear to like it.
[Note: h/t to Joss for pointing this out to me: http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 is a recipe for accessing Google Reader's archive of a feed, and pulling out e.g. n=150 items (r=n is maybe an ordering argument?) Which is to say: here's a way of accessing an archive of RSS feed items...:-)]
However, Yahoo Pipes does, so a simple proxy pipe normalises the feed and gives us one that is properly formatted:
The normalised feed can now be accessed via:
We can also get access to a CSV output:
The CSV can be imported in to a Google spreadsheet using the =importData() formula:
[Gotcha: if you have ?&_render in the URL (i.e. ?&), Spreadsheets can't import the data...]
Once in the spreadsheet it’s easy enough to just pull out e.g. the description text from each feed item because it all appears in a single column.
Google spreadsheets can also query the feed and just pull in the description element. For example:
(Note that it seemed to time out on me when I tried to pull in the full set of 150 elements in Joss’ original feed, but it worked fine with 15.)
We can also use YQL developer console to just pull out the description elements:
select description from rss where url=’http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed
YQL then provides XML or JSON output as required.