I’ve held off playing with BeautifulSoup, the Python screenscraping library, so today I thought I’d have a little play. Here’s a first attempt – RSS feed autodetection. I think the following autodiscovers the first RSS feed declared in an HTML page, if one exists… though with provisos (the title element must exist in the page, and the HTML must be well formed).
    import urllib
    from BeautifulSoup import BeautifulSoup

    url="http://blog.ouseful.info"
    data=urllib.urlopen(url)
    soup = BeautifulSoup(data)

    title=soup.find('title')
    # Print the title of the page returned
    print title.contents[0].strip()

    # Scrape any link elements used for feed URL declaration
    alt= soup.find('link', rel="alternate", type="application/rss+xml")
    # The feed URL is stored in the href attribute
    if alt is not None:
        print alt['href'],alt['title']
If that code can be improved (and I’m sure it probably can!) please share a gist or code fragment in the comments below…
The following is my second attempt – it loads in URLs from a local text file, checks they are URLs, displays the URL of the page that was actually returned, and prints the URL of the first feed to be autodetected, if one was:
    import urllib,re
    from BeautifulSoup import BeautifulSoup

    fname="homepageurls.txt"
    f=open(fname)
    for url in f:
        print "Fetching ",url
        if url !='':
            urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
            for url in urls:
                data=urllib.urlopen(url)
                if data.geturl() != url:
                    print data.geturl(),'was actually loaded'
                try:
                    soup = BeautifulSoup(data)
                    title=soup.find('title')
                    if title.contents[0]:
                        print title.contents[0].strip()
                    alt= soup.find('link', rel="alternate", type="application/rss+xml")
                    if alt is not None:
                        print alt['href'],alt['title']
                except:
                    pass
    f.close()
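As an aside on the "checks they are URLs" step: lines read from a file keep their trailing newline, so comparing a line with '' never filters out a blank line. Here is a minimal sketch of an alternative check, written for Python 3 with only the standard library (the script above is Python 2). The helper name `iter_urls` and the sample lines are made up for illustration. Unlike the regular expression, it treats each whole line as one candidate URL rather than pulling URLs out of surrounding text.

```python
from urllib.parse import urlparse


def iter_urls(lines):
    """Yield stripped lines that look like absolute http(s) URLs."""
    for line in lines:
        candidate = line.strip()  # drop the trailing newline and any padding
        if not candidate:
            continue  # skip blank lines
        parts = urlparse(candidate)
        # Require an http/https scheme and a host part
        if parts.scheme in ("http", "https") and parts.netloc:
            yield candidate


# Hypothetical file contents, one entry per line
sample = ["http://blog.ouseful.info\n", "\n", "not a url\n", "https://example.com/page\n"]
print(list(iter_urls(sample)))
```

An open file object can be passed in directly, since iterating over it yields lines in the same way.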
For some reason, the autodetect doesn’t always seem to work, though I’m not sure why…
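A few guesses, and they are only guesses, at what might be going on: the lookup asks for exactly type="application/rss+xml", so a page that only declares an Atom feed would not match; the rel value may be written in a different case or carry more than one token; and if a link element has no title attribute, alt['title'] raises an error that the bare except then silently swallows. The sketch below is a more tolerant detector under those assumptions. It is written for Python 3 and uses the standard library's html.parser rather than BeautifulSoup, so it runs without extra installs. The names `FeedLinkParser`, `find_feeds` and `FEED_TYPES`, and the example.com page, are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to declare feeds (an assumption, not an exhaustive list)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect <link rel="alternate"> elements that declare a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []  # list of (absolute_url, title_or_None)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)  # the parser lowercases attribute names
        rels = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").strip().lower()
        href = attr.get("href")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL; the title is
            # optional, so .get() returns None instead of raising
            self.feeds.append((urljoin(self.base_url, href), attr.get("title")))


def find_feeds(html, base_url):
    """Return every feed declared in the HTML text, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Made-up page: mixed-case rel, Atom type, relative href, no title attribute
page = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="Alternate" type="application/atom+xml" href="/feed/atom">
</head><body></body></html>"""
print(find_feeds(page, "http://example.com/blog/"))
# → [('http://example.com/feed/atom', None)]
```

Returning every match rather than only the first makes it easier to see what a page actually declares when the result looks wrong.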
Here is a solution using feedparser and Beautiful Soup that handles most of the use cases you are likely to encounter on the internet:
feeds.py
It’s commented and unit tested.