First Attempt at RSS/Atom Feed Autodetection With BeautifulSoup

I’ve held off playing with BeautifulSoup, the Python screenscraping library, so today I thought I’d have a little play. Here’s a first attempt – RSS feed autodetection. I think the following autodiscovers the first feed in an HTML page, if one exists… though with provisos (the title element must exist in the page, and the HTML must be well formed).

import urllib
from BeautifulSoup import BeautifulSoup

url = "http://blog.ouseful.info"

data = urllib.urlopen(url)

soup = BeautifulSoup(data)

title = soup.find('title')

# Print the title of the page returned
print title.contents[0].strip()

# Scrape the first link element used for feed URL declaration
alt = soup.find('link', rel="alternate", type="application/rss+xml")

# The feed URL is stored in the href attribute
if alt is not None:
  print alt['href'], alt['title']

If that code can be improved (and I’m sure it probably can!) please share a gist or code fragment in the comments below…
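For comparison, here’s a sketch of the same autodiscovery idea using only the standard library’s html.parser rather than BeautifulSoup, and matching Atom as well as RSS declarations. (The `FEED_TYPES` set, the `FeedLinkParser` class and the `find_first_feed` helper are names I’ve made up for this sketch – they aren’t part of any library.)

```python
# A stdlib-only autodiscovery sketch (Python 3); collects href/title pairs
# from <link rel="alternate"> elements whose type declares an RSS or Atom feed.
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collect (href, title) pairs from feed declaration link elements."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # rel may contain several space-separated values
        rel = (a.get("rel") or "").lower().split()
        if "alternate" in rel and (a.get("type") or "").lower() in FEED_TYPES:
            self.feeds.append((a.get("href"), a.get("title", "")))

def find_first_feed(html):
    """Return the first declared feed as (href, title), or None."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds[0] if parser.feeds else None

html_doc = """<html><head><title>Example</title>
<link rel="alternate" type="application/rss+xml"
      title="Example Feed" href="http://example.com/feed/rss" />
</head><body></body></html>"""

print(find_first_feed(html_doc))
# -> ('http://example.com/feed/rss', 'Example Feed')
```

One advantage of the html.parser route is that it doesn’t insist on well formed HTML, which sidesteps one of the provisos above.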

The following is my second attempt – it loads in URLs from a local text file, checks that they are URLs, displays the URL of the page that was actually returned, and prints the URL of the first feed to be autodetected, if there is one:

import urllib, re
from BeautifulSoup import BeautifulSoup

fname = "homepageurls.txt"
f = open(fname)

for url in f:
  # Lines read from the file include a trailing newline
  url = url.strip()
  print "Fetching", url
  if url != '':
    # Pull out anything in the line that looks like an http(s) URL
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
    for url in urls:
      try:
        data = urllib.urlopen(url)
        # Report if we were redirected to a different URL
        if data.geturl() != url:
          print data.geturl(), 'was actually loaded'

        soup = BeautifulSoup(data)

        title = soup.find('title')
        if title is not None and title.contents:
          print title.contents[0].strip()

        # Scrape the first link element used for feed URL declaration
        alt = soup.find('link', rel="alternate", type="application/rss+xml")
        if alt is not None:
          print alt['href'], alt.get('title', '')
      except Exception:
        # Swallow fetch/parse errors and move on to the next URL
        pass
f.close()

For some reason, the autodetect doesn’t always seem to work, though I’m not sure why…
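I’m guessing here, but two things may be worth checking: pages that only declare an Atom feed (type application/atom+xml) won’t match the rss+xml filter above, and some pages declare the feed href as a relative URL, which needs resolving against the page URL before it can be fetched. The standard library’s urljoin handles the second case – a sketch (the `resolve_feed_url` helper name is my own):

```python
# Sketch: resolve a (possibly relative) feed href against the page URL
# using the stdlib urljoin; absolute hrefs pass through untouched.
from urllib.parse import urljoin

def resolve_feed_url(page_url, href):
    """Return an absolute feed URL, whether href is absolute or relative."""
    return urljoin(page_url, href)

# A root-relative href, as some pages declare them...
print(resolve_feed_url("http://blog.ouseful.info/about/", "/feed/"))
# -> http://blog.ouseful.info/feed/

# ...and an already absolute href comes back unchanged
print(resolve_feed_url("http://blog.ouseful.info/", "http://blog.ouseful.info/feed/"))
# -> http://blog.ouseful.info/feed/
```

The bare except in the loop above will also silently hide any fetch or parse errors, so it may be worth printing the exception while debugging rather than just passing.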