Feed-detection From Blog URL Lists, with OPML Output
Picking up on Adding Value to the Blog Award Nomination Collections…, here’s a way of generating an OPML feed bundle of categorised feed URLs from a list of tagged blog homepage URLs. What the OPML allows you to do is take a list of URLs, such as the URLs for the blogs nominated in the Computer Weekly 2010 IT Blog Awards, and subscribe to them all in one go using something like Google Reader. In addition, the OPML is structured so that the feeds are organised in separate “folders” according to the award category that they are nominated in.
So how does it work?
The recipe goes something like this… You will need:
- a CSV file containing two columns: the first column contains a blog URL (that is, the URL of the blog’s HTML homepage), and the second column contains the award category (there’s an example fragment just after this list);
- err, that’s it…;-)
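By way of example, the CSV might look something like this (the URLs and category names are just placeholders, not real nominees):

http://blog1.example.com,Category A
http://blog2.example.org,Category A
http://blog3.example.net,Category B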
The code works something like this:
- grab a URL and use the YQL feed autodiscovery custom query to find any RSS or Atom feeds that are auto-discoverable for the blog (there’s a sketch of the sort of response this returns just after this list);
- if necessary, add path information to any auto-discovered feeds that are specified using relative paths (e.g. /feed rather than http://example.com/feed);
- in many cases, a less than informative feed title is specified in the autodiscovery link title, so use the “blog details from feed” custom YQL query to get the feed title from the feed itself. In addition, get the alternate HTML page URL from the feed, and use this as the htmlUrl attribute in the OPML outline element.
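For reference, here’s roughly the shape of JSON the autodiscovery query hands back, at least as far as the script below reads it (the values are made up; note that when only a single feed link is found, the link entry comes back as a single object rather than a list, which is why the code special-cases a count of ’1′):

# hypothetical autodiscovery response, showing just the fields the script uses
data = {
    'query': {
        'count': '2',  # handled as a string in the script
        'results': {
            'link': [
                {'href': '/feed', 'title': 'Example Blog » Feed'},
                {'href': '/comments/feed', 'title': 'Example Blog » Comments Feed'}
            ]
        }
    }
}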
There’s also a bit of juggling with the tag categories – these are used to create OPML outline elements that contain the outline elements that identify the URLs for blog feeds in that category.
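The net result is an OPML file with a skeleton along these lines (the URLs, titles and category names here are purely illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head>
    <title>Generated OPML file</title>
  </head>
  <body>
    <outline title="Category A" text="Category A">
      <outline text="Example Blog" title="Example Blog" type="rss" xmlUrl="http://blog1.example.com/feed" htmlUrl="http://blog1.example.com"/>
    </outline>
    <outline title="Category B" text="Category B">
      ...
    </outline>
  </body>
</opml>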
Here’s the code (such as it is) (and as a gist):
import re, urllib, simplejson, csv
import xml.sax.saxutils as saxutils
from urlparse import urlparse

fname = "homepageurls2.csv"

def opmlFromCSV(fname):
    # rows in the CSV are assumed to be grouped by category; each change of tag
    # closes the previous OPML outline (if any) and opens a new one
    fout = "test"
    fo = open(fout + '.xml', 'w')
    writeOPMLHeadopenBody(fo)
    f = csv.reader(open(fname, "rb"))
    #url="http://ukwebfocus.wordpress.com"
    first = True
    curr = ''
    for line in f:
        url, tag = line
        if curr != tag:
            if first is True:
                first = False
            else:
                closeOPMLoutline(fo)
            curr = tag
            openOPMLoutline(fo, tag)
        handleOPMLitem(fo, url)
    closeOPMLoutline(fo)
    closeOPMLbody(fo)
    fo.close()

def handleOPMLitem(fo, url):
    # pull anything URL-shaped out of the cell, then run each URL through the
    # YQL feed autodiscovery custom query
    if url != '':
        try:
            urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
            for url in urls:
                print "testing", url
                o = url
                url = 'http://query.yahooapis.com/v1/public/yql/psychemedia/feedautodetect?url=' + urllib.quote(url) + '&format=json'
                try:
                    data = simplejson.load(urllib.urlopen(url))
                    # the count comes back as a string; a single result is a dict rather than a list
                    if data['query']['count'] > '0':
                        print data['query']
                        if data['query']['count'] == '1':
                            l = data['query']['results']['link']
                            furl = checkPathOnFeedURL(l['href'], o)
                            print "*****", furl, l['title']
                            handleFeedDetails(fo, furl)
                        else:
                            for r in data['query']['results']['link']:
                                furl = checkPathOnFeedURL(r['href'], o)
                                print furl, r['title']
                                handleFeedDetails(fo, furl)
                except:
                    pass
        except:
            pass

def checkPathOnFeedURL(furl, o):
    # patch up relative feed paths (e.g. /feed) with the domain of the original blog URL
    if furl.startswith('/'):
        x = urlparse(o)
        furl = 'http://' + x.netloc + furl
    return furl

def writeOPMLHeadopenBody(fo):
    fo.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fo.write('<opml version="1.0">\n<head>\n\t<title>Generated OPML file</title>\n</head>\n\t<body>\n')

def closeOPMLbody(fo):
    fo.write("</body>\n</opml>")

def openOPMLoutline(f, t):
    f.write('\t\t<outline title="' + t + '" text="' + t + '">\n')

def closeOPMLoutline(f):
    f.write('\t\t</outline>\n')

def writeOPMLitem(f, htmlurl, xmlurl, title):
    title = saxutils.escape(title)
    f.write('\t\t\t<outline text="' + title + '" title="' + title + '" type="rss" xmlUrl="' + xmlurl + '" htmlUrl="' + htmlurl + '"/>\n')

def handleFeedDetails(fo, furl):
    # ask the feed itself for its title and alternate (HTML) link; optionally skip comments feeds
    nocomments = True
    url = 'http://query.yahooapis.com/v1/public/yql/psychemedia/feeddetails?url=' + urllib.quote(furl) + '&format=json'
    print "Trying feed url", furl
    try:
        details = simplejson.load(urllib.urlopen(url))
        detail = details['query']['results']['feed']
        #print "Acquired",detail
        for i in detail:
            if i['link']['rel'] == 'alternate':
                title = i['title'].encode('utf-8')
                hlink = i['link']['href']
                print 'Using', hlink, furl, title
                if nocomments is True:
                    if not (furl.find('/comments') > -1 or title.startswith('Comments for')):
                        writeOPMLitem(fo, hlink, furl, title)
                else:
                    writeOPMLitem(fo, hlink, furl, title)
                return
    except:
        pass

#-------
opmlFromCSV(fname)
If you have comments/suggested improvements, they’d be gratefully received. My code is generally, err, a bit quirky, so I’m always grateful when “proper” coders can tidy it up for me…;-)
[UPDATE: Methinks I should create a proper XML doc in the script using an XML library...]
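By way of a sketch of what that might look like (untested, with the element and attribute names simply mirroring the strings the script above writes out), something like ElementTree would do the job:

import xml.etree.ElementTree as ET

def opmlDoc(title='Generated OPML file'):
    # build the <opml><head><title/></head><body/></opml> skeleton
    opml = ET.Element('opml', version='1.0')
    head = ET.SubElement(opml, 'head')
    ET.SubElement(head, 'title').text = title
    body = ET.SubElement(opml, 'body')
    return opml, body

def addCategory(body, tag):
    # one outer outline element per award category
    return ET.SubElement(body, 'outline', title=tag, text=tag)

def addFeed(category, htmlurl, xmlurl, title):
    # ElementTree handles the escaping, so saxutils wouldn't be needed
    ET.SubElement(category, 'outline', text=title, title=title, type='rss', xmlUrl=xmlurl, htmlUrl=htmlurl)

# ...and once the tree has been built up from the CSV:
# ET.ElementTree(opml).write('test.xml', encoding='utf-8', xml_declaration=True)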
So what does this do again? Well, given a simple text/CSV file with two columns – blog URL and category – it tries to autodiscover an RSS or Atom feed from each blog’s homepage, then calls the feed to find out what the blog is called and to get the “official” alternate HTML page URL for the feed. It shoves the feed URL, HTML page URL and blog title into an OPML outline element, and further nests these elements in an outer outline element that identifies the category the feed is in. When you import the OPML file into Google Reader, a separate folder is created for each category, and the feeds are subscribed to and placed in the appropriate folder.
In short, given a list of blog HTML URLs, you can easily generate a file that lets you import and subscribe to their feed URLs in one go (subject to the feeds being autodiscoverable from the original URL).
The approach is generalisable, too… For example, it’s easy enough to reuse most of the code to generate an OPML feed bundle that contains the feed URLs for blogs identified from the homepage URLs of people on Twitter (identified from their Twitter biography/details), as grouped together on a particular Twitter list, or who have used a particular hashtag.
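As a minimal sketch of that sort of reuse, assuming you’ve already pulled candidate homepage URLs out of the relevant Twitter bios by some other means (the URLs and list name below are purely made up), all you really need to do is write them into the same two-column CSV format and call opmlFromCSV() again:

import csv

# hypothetical (url, tag) pairs harvested from the bios of members of a Twitter list
homepages = [
    ('http://blog1.example.com', 'mytwitterlist'),
    ('http://blog2.example.org', 'mytwitterlist'),
]

f = open('twitterhomepages.csv', 'wb')
w = csv.writer(f)
for url, tag in homepages:
    w.writerow([url, tag])
f.close()

opmlFromCSV('twitterhomepages.csv')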
That is, we can have a go at automatically generating a blog feed roll for folk on a Twitter list, or using a particular Twitter hashtag. I’d say that was an example of “little l, little d” linked data, wouldn’t you?;-)