First Attempt at RSS/Atom Feed Autodetection With BeautifulSoup
I’ve held off playing with BeautifulSoup, the Python screenscraping library, so today I thought I’d have a little play. Here’s a first attempt – RSS feed autodetection. I think the following autodiscovers the first feed in an HTML page, if one exists… though with provisos (the title element must exist in the page, and the HTML must be well formed).
import urllib
from BeautifulSoup import BeautifulSoup
url="http://blog.ouseful.info"
data=urllib.urlopen(url)
soup = BeautifulSoup(data)
title=soup.find('title')
# Print the title of the page returned
print title.contents[0].strip()
# Scrape any link elements used for feed URL declaration
alt= soup.find('link', rel="alternate", type="application/rss+xml")
# The feed URL is stored in the href attribute
if alt is not None:
    print alt['href'],alt['title']
If that code can be improved (and I’m sure it probably can!) please share a gist or code fragment in the comments below…
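One possible direction, sketched here in Python 3 with only the standard library's html.parser rather than BeautifulSoup (so it is a swapped-in technique, not the code above): collect every link element that declares an RSS or Atom feed, instead of just the first RSS one. The names FeedLinkParser, find_feed_links and FEED_TYPES are made up for illustration, and the list of feed MIME types is an assumption rather than an exhaustive set.

```python
from html.parser import HTMLParser

# MIME types commonly used to declare feeds (assumed list, not exhaustive)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Hypothetical helper: records <link rel="alternate"> feed declarations."""

    def __init__(self):
        super().__init__()
        self.feeds = []  # list of (href, title) tuples, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively
        rel = (attrs.get("rel") or "").lower().split()
        type_ = (attrs.get("type") or "").strip().lower()
        href = attrs.get("href")
        if "alternate" in rel and type_ in FEED_TYPES and href:
            # The title attribute is optional, so fall back to an empty string
            self.feeds.append((href, attrs.get("title") or ""))


def find_feed_links(html):
    """Return every (href, title) feed declaration found in an HTML string."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds
```

To try it on a live page, fetch the HTML as a string first (for example with urllib.request.urlopen(url).read().decode("utf-8", "replace")) and pass it to find_feed_links.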
The following is my second attempt – it loads in URLs from a local text file, checks they are URLs, displays the URL of the page that was actually returned, and prints the URL of the first feed to be autodetected, if one was:
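import urllib,re
from BeautifulSoup import BeautifulSoup
fname="homepageurls.txt"
f=open(fname)
for line in f:
    print "Fetching ",line.strip()
    # Pull out anything in the line that looks like a URL
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', line)
    for url in urls:
        try:
            data=urllib.urlopen(url)
            # Report the URL that was actually loaded if we were redirected
            if data.geturl() != url:
                print data.geturl(),'was actually loaded'
            soup = BeautifulSoup(data)
            title=soup.find('title')
            # Guard against a missing or empty title element
            if title is not None and title.contents:
                print title.contents[0].strip()
            alt= soup.find('link', rel="alternate", type="application/rss+xml")
            if alt is not None:
                # The title attribute is optional, so don't assume it is there
                print alt['href'],alt.get('title','')
        except Exception as e:
            # Report the problem rather than silently skipping the page
            print "Failed on",url,":",e
f.close()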
For some reason, the autodetect doesn’t always seem to work, though I’m not sure why…
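Two things that could plausibly look like a miss, though neither is confirmed as the cause here: the lookup above only asks for type="application/rss+xml", so a page that declares only an Atom feed would return nothing, and a feed declared with a relative href (such as /feed/) isn't usable until it is resolved against the URL of the page that was actually loaded. A small Python 3 sketch of the second point, using urljoin from the standard library; the helper name resolve_feed_url is hypothetical.

```python
from urllib.parse import urljoin


def resolve_feed_url(page_url, href):
    """Hypothetical helper: turn a feed href into an absolute URL.

    page_url should be the URL that was actually loaded (after redirects),
    since relative hrefs are resolved against it.
    """
    return urljoin(page_url, href.strip())
```

An href that is already absolute passes through unchanged, so the helper can be applied to every detected feed link without checking first.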

[...] a quick follow up to the post on using Beautiful Soup for RSS feed autodetection – it struck me that I should be able to do a similar thing with [...]
Feed Autodetection With YQL « OUseful.Info, the blog…
October 22, 2010 at 2:52 pm
Here is a solution with feedparser and Beautiful Soup that handles most of the use cases you are likely to encounter on the internet:
https://gist.github.com/1308133
It’s commented and unit-tested.
ksamuel
October 24, 2011 at 12:41 am