First Attempt at RSS/Atom Feed Autodetection With BeautifulSoup

I’ve held off playing with BeautifulSoup, the Python screenscraping library, so today I thought I’d have a little play. Here’s a first attempt – RSS feed autodetection. I think the following autodiscovers the first feed in an HTML page, if one exists… though with provisos (the title element must exist in the page, and the HTML must be well formed).
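For reference, feed autodiscovery works by looking for link elements in the head of the HTML page; a typical declaration looks something like this (the title and href values here are just illustrative):

<link rel="alternate" type="application/rss+xml" title="Example Feed" href="http://example.com/feed/" />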

import urllib
from BeautifulSoup import BeautifulSoup

url="http://blog.ouseful.info"

# Fetch the page...
data=urllib.urlopen(url)

# ...and parse it
soup = BeautifulSoup(data)

title=soup.find('title')

# Print the title of the page returned
print title.contents[0].strip()

# Scrape any link elements used for feed URL declaration
alt= soup.find('link', rel="alternate", type="application/rss+xml")

# The feed URL is stored in the href attribute
if alt is not None:
  print alt['href'],alt['title']

If that code can be improved (and I’m sure it probably can!) please share a gist or code fragment in the comments below…
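For example, one small hardening of my own: the link element isn’t guaranteed to carry a title attribute, so alt['title'] can raise a KeyError; the tag’s .get() method sidesteps that (a quick untested sketch):

# Use .get() so a missing title attribute doesn't raise a KeyError
if alt is not None:
  print alt.get('href', ''), alt.get('title', '')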

The following is my second attempt – it loads URLs in from a local text file, checks that they look like URLs, displays the URL of the page that was actually returned, and prints the URL of the first feed to be autodetected, if one was:

import urllib,re
from BeautifulSoup import BeautifulSoup

fname="homepageurls.txt"
f=open(fname)

for url in f:
  # Strip the trailing newline so the blank-line check below actually works
  url = url.strip()
  print "Fetching ",url
  if url != '':
    # Pull out anything in the line that looks like an http(s) URL
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
    for url in urls:
      data=urllib.urlopen(url)
      if data.geturl() != url:
        print data.geturl(),'was actually loaded'
      try:
        soup = BeautifulSoup(data)

        title=soup.find('title')

        # Only print the title if the element exists and isn't empty
        if title is not None and title.contents:
          print title.contents[0].strip()
  
        alt= soup.find('link', rel="alternate", type="application/rss+xml")
        if alt is not None:
          print alt['href'],alt['title']
      except:
        # Quietly skip pages that can't be fetched or parsed
        pass
f.close()
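The homepageurls.txt file is just a plain text list of homepage URLs, one per line, along the lines of:

http://blog.ouseful.info
http://example.com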

For some reason, the autodetect doesn’t always seem to work, though I’m not sure why… One suspect is that the search only matches type="application/rss+xml", so pages that only declare an Atom feed (application/atom+xml) will be missed.
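As a quick untested sketch, a more forgiving version of the search might try the Atom content type too, reusing the soup object from above:

# Look for the first link element declaring either an RSS or an Atom feed
for feedtype in ['application/rss+xml', 'application/atom+xml']:
  alt = soup.find('link', rel='alternate', type=feedtype)
  if alt is not None:
    print alt.get('href', ''), alt.get('title', '')
    break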

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

2 thoughts on “First Attempt at RSS/Atom Feed Autodetection With BeautifulSoup”

  1. Here is a solution with feedparser and Beautiful Soup that handles most of the use cases you can encounter on the internet:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # vim: ai ts=4 sts=4 et sw=4
    """
    Tools to extract feed links, test if they are valid and parse them
    with feedparser, returning content or a proper error.
    """
    import urllib2
    import feedparser
    from BeautifulSoup import BeautifulSoup

    # list of attributes that can mark a feed link in the <HEAD> section,
    # so we can identify at least one in a page
    FEED_LINKS_ATTRIBUTES = (
        (('type', 'application/rss+xml'),),
        (('type', 'application/atom+xml'),),
        (('type', 'application/rss'),),
        (('type', 'application/atom'),),
        (('type', 'application/rdf+xml'),),
        (('type', 'application/rdf'),),
        (('type', 'text/rss+xml'),),
        (('type', 'text/atom+xml'),),
        (('type', 'text/rss'),),
        (('type', 'text/atom'),),
        (('type', 'text/rdf+xml'),),
        (('type', 'text/rdf'),),
        (('rel', 'alternate'), ('type', 'text/xml')),
        (('rel', 'alternate'), ('type', 'application/xml')),
    )

    def extract_feed_links(html, feed_links_attributes=FEED_LINKS_ATTRIBUTES):
        """
        Return a generator yielding potential feed links in an HTML page.

        >>> url = urllib2.urlopen('http://www.codinghorror.com/blog/')
        >>> links = extract_feed_links(url.read(1000000))
        >>> tuple(links)
        (u'http://feeds.feedburner.com/codinghorror/',)
        """
        soup = BeautifulSoup(html)
        head = soup.find('head')
        for attrs in feed_links_attributes:
            for link in head.findAll('link', dict(attrs)):
                href = dict(link.attrs).get('href', '')
                if href:
                    yield unicode(href)

    def get_first_working_feed_link(url):
        """
        Try to use the current URL as a feed. If it works, return it.
        If it doesn't, load the HTML and extract candidate links from it,
        then test them one by one and return the first one that works.

        >>> get_first_working_feed_link('http://www.codinghorror.com/blog/')
        u'http://feeds.feedburner.com/codinghorror/'
        >>> get_first_working_feed_link('http://feeds.feedburner.com/codinghorror/')
        u'http://feeds.feedburner.com/codinghorror/'
        """
        # if the url is a feed itself, return it
        html = urllib2.urlopen(url).read(1000000)
        feed = feedparser.parse(html)
        if not feed.get("bozo", 1):
            return unicode(url)

        # construct the site url from the protocol and the domain name
        parsed_url = urllib2.urlparse.urlparse(url)
        site_url = u"%s://%s" % (parsed_url.scheme, parsed_url.netloc)

        # parse the html extracted from the url, get all the potential
        # links from it, then try them one by one
        for link in extract_feed_links(html):
            if '://' not in link:  # if we got a relative URL, make it absolute
                link = site_url + link
            feed = feedparser.parse(link)
            if not feed.get("bozo", 1):
                return link
        return None

    if __name__ == "__main__":
        import doctest
        doctest.testmod()


    It’s commented and unit tested.
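    As a minimal usage sketch, assuming the code above is saved as feeds.py alongside your own script:

    from feeds import get_first_working_feed_link

    # prints the first feed URL that feedparser can actually parse, or None
    print get_first_working_feed_link('http://blog.ouseful.info')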

