OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

First Attempt at RSS/Atom Feed Autodetection With BeautifulSoup

I’ve held off playing with BeautifulSoup, the Python screenscraping library, until now, so today I thought I’d have a little play. Here’s a first attempt – RSS feed autodetection. I think the following autodiscovers the first feed in an HTML page, if one exists… though with provisos (the title element must exist in the page, and the HTML must be well formed).

import urllib
from BeautifulSoup import BeautifulSoup

url="http://blog.ouseful.info"

data=urllib.urlopen(url)

soup = BeautifulSoup(data)

title=soup.find('title')

# Print the title of the page returned
print title.contents[0].strip()

# Scrape any link elements used for feed URL declaration
alt= soup.find('link', rel="alternate", type="application/rss+xml")

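# The feed URL is stored in the href attribute; the title attribute is
# optional, so use get() rather than risk a KeyError when it is missing
if alt is not None:
  print alt['href'],alt.get('title','')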

If that code can be improved (and I’m sure it probably can!) please share a gist or code fragment in the comments below…
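One possible refinement, offered as a hedged sketch rather than a definitive fix: the snippet above only matches `application/rss+xml`, so a page that advertises only an Atom feed is skipped, and it assumes both a title element and a title attribute on the link. The version below swaps BeautifulSoup for Python 3's standard-library `html.parser` so that it runs without any third-party install. The names `FeedLinkParser` and `FEED_TYPES` are hypothetical, made up for this illustration, and `example.com` is a placeholder.

```python
from html.parser import HTMLParser

# MIME types commonly used in feed autodiscovery links (assumed list)
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class FeedLinkParser(HTMLParser):
    """Collect the page title and any <link rel="alternate"> feed declarations."""

    def __init__(self):
        super().__init__()
        self.page_title = None   # stays None if the page has no title text
        self.feeds = []          # list of (href, title) tuples, in page order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link":
            # rel may hold several space-separated tokens; compare case-insensitively
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower().strip()
            href = attrs.get("href")
            if "alternate" in rel and mime in FEED_TYPES and href:
                # The title attribute is optional, so fall back to ""
                self.feeds.append((href, attrs.get("title") or ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Keep only the first non-empty run of text found inside a title element
        if self._in_title and self.page_title is None and data.strip():
            self.page_title = data.strip()


sample = """<html><head><title> My Blog </title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/feed/">
<link rel="Alternate" type="application/atom+xml" href="http://example.com/atom/">
</head><body></body></html>"""

parser = FeedLinkParser()
parser.feed(sample)
parser.close()
print(parser.page_title)
for href, title in parser.feeds:
    print(href, title)
```

Collecting every declared feed, rather than stopping at the first, leaves the choice of RSS or Atom to the caller; `parser.feeds[0]` still gives the first one when the list is not empty.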

The following is my second attempt – it loads in URLs from a local text file, checks they are URLs, displays the URL of the page that was actually returned, and prints the URL of the first feed to be autodetected, if one was:

import urllib,re
from BeautifulSoup import BeautifulSoup

fname="homepageurls.txt"
f=open(fname)

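for line in f:
  # Each line read from the file still carries its newline, so strip it
  line = line.strip()
  if line != '':
    print "Fetching ",line
    # Pull out anything on the line that looks like an http(s) URL
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', line)
    for url in urls:
      try:
        data=urllib.urlopen(url)
        # Report the final URL if the request was redirected
        if data.geturl() != url:
          print data.geturl(),'was actually loaded'
        soup = BeautifulSoup(data)

        title=soup.find('title')

        # Guard against a missing or empty title element
        if title is not None and title.contents:
          print title.contents[0].strip()

        alt= soup.find('link', rel="alternate", type="application/rss+xml")
        if alt is not None:
          # The title attribute is optional on link elements
          print alt['href'],alt.get('title','')
      except Exception as e:
        # Report the problem rather than silently skipping the page
        print "Problem with",url,":",e
f.close()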

For some reason, the autodetect doesn’t always seem to work, though I’m not sure why…
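The cause is left open here, so the following are only possibilities to check, none of them confirmed: a page may declare only an Atom feed (`application/atom+xml`), which the `type="application/rss+xml"` filter would not match; the `rel` or `type` values may differ in case or carry extra tokens; or the `href` may be relative to the page URL. One way to investigate is to list every link element a page declares, together with its `rel`, `type` and resolved `href`. The sketch below does that using only the Python 3 standard library (not BeautifulSoup); `LinkDumper` and `dump_links` are hypothetical names, and `example.com` is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkDumper(HTMLParser):
    """Record rel, type and absolute href for every <link> element seen."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        self.links.append({
            "rel": attrs.get("rel"),
            "type": attrs.get("type"),
            # Resolve relative hrefs against the URL the page was loaded from
            "href": urljoin(self.base_url, href) if href else None,
        })


def dump_links(html, base_url):
    """Return a list of dicts describing each <link> element in the HTML."""
    parser = LinkDumper(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<html><head><title>Demo</title>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/atom+xml" href="feed/atom/">
</head><body></body></html>"""

for link in dump_links(sample, "http://example.com/blog/"):
    print(link["rel"], link["type"], link["href"])
```

Running this over a page where autodetection failed shows at a glance whether a feed link is present at all, and if so which attribute value kept it from matching.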

Written by Tony Hirst

October 21, 2010 at 9:17 pm

Posted in Tinkering

2 Responses

  1. [...] a quick follow up to the post on using Beautiful Soup for RSS feed autodetection – it struck me that I should be able to do a similar thing with [...]

  2. Here is a solution with feedparser and Beautiful Soup that handles most of the use cases you can encounter on the internet:

    https://gist.github.com/1308133

    It’s commented and unit tested.

    ksamuel

    October 24, 2011 at 12:41 am

