OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for October 2010

What Happens If Java Dies?

with 5 comments

Can’t sleep…:-( So here’s one of the thoughts that started gnawing away at me, and that may let me get back to sleep if I post it: what happens if Java dies? What if this posturing from Apple about dropping support for Java on OS X comes to anything…

One of the promises of Java was that it was cross-platform, of course. And for whatever reason(s), Steve Jobs prefers native, rather than cross-platform, code running on his machines…

One of the tools I’ve been playing a lot with lately is based on Java, namely the desktop application Gephi. IBM’s Many Eyes is also a Java application, though it runs as a Java applet/plugin in a browser. (Hmmm, applet. I bet that narks Jobsworth…) So if Java support on the desktop dies, I’ll be p*****d off…

When my first, sleep deprived thought of “who cares if Java dies, we can just move to cloud services” occurred, it was quickly followed by: “hmmm, but what if those cloud services make use of Java clients in the browser…?” The cloud may be good for some things, but at the end of the day, we need to run clients of some sort.

So if Java dies, where do we run to – or more specifically, what do we run away from? Apple doesn’t like Flash much either (is the same true of Air?), and via YouTube, Google has also been looking for an alternative to Flash for delivery of video streams; Silverlight doesn’t seem to have many takers, and there are already more than a few applications I can’t run in a browser because I don’t run Windows…

So what is a sustainable cross-platform future? Running HTML clients/UIs and pulling on backend services and storage that do run in the cloud (which may even be Java…!;-)?

PS Just by the by, earlier this week I caught a glimpse of the application calls available from an emerging web operating system that appears to be well on the road to development:

Google's web operating system function calls

So can I now go back to sleep, please….?:-(

PS I did get back to sleep, and woke up with two related half-remembered ideas in mind:
1) The GWT (Google Web Toolkit) “allow(s) you to write AJAX applications in Java and then compile the source to highly optimized JavaScript that runs across all browsers, including mobile browsers for Android and the iPhone”. So how general is the source Java that GWT can cope with?
2) Will we start to see more applications running in their own virtualisation containers, carrying just enough of an operating system of their own to let them do what the application calls for?

Written by Tony Hirst

October 31, 2010 at 5:45 am

Posted in Thinkses

First Dabblings With Scraperwiki – All Party Groups

with 7 comments

Over the last few months there’s been something of a roadshow making its way around the country giving journalists, et al. hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame :-( ).

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
- grab the list of links to the All Party Groups I was interested in – which was subject based ones rather than country groups;
- for each group, grab its individual record page and extract the list of 20 qualifying members;
- add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName)

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
- the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>

<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>

- there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
- starting my scrape proper after the Section 2 header:

import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.
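The "start after the Section 2 header" strategy can also be sketched without Scraperwiki or BeautifulSoup. This is only an illustration, written for Python 3 using the standard library's HTMLParser rather than the Python 2 setup used above; the SubjectGroupLinks class, the SAMPLE markup and the "Section 1: Country Groups" heading text are assumptions made up for the example, not taken from the real directory page.

```python
from html.parser import HTMLParser

class SubjectGroupLinks(HTMLParser):
    """Collect (group name, href) pairs that appear after a marker heading."""

    def __init__(self, marker='Section 2: Subject Groups'):
        super().__init__()
        self.marker = marker
        self.in_heading = False    # currently inside an <h2>
        self.started = False       # the marker heading has been seen
        self.in_link_para = False  # currently inside <p class="contentsLink">
        self.href = None           # href of the <a> whose text is being read
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_heading = True
        elif tag == 'p':
            self.in_link_para = attrs.get('class') == 'contentsLink'
        elif tag == 'a' and self.started and self.in_link_para:
            self.href = attrs.get('href')
            self.text = []

    def handle_data(self, data):
        if self.in_heading and data.strip() == self.marker:
            self.started = True
        elif self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_heading = False
        elif tag == 'p':
            self.in_link_para = False
        elif tag == 'a' and self.href is not None:
            self.links.append((''.join(self.text).strip(), self.href))
            self.href = None

# Hypothetical cut-down version of the directory page markup
SAMPLE = '''
<h2>Section 1: Country Groups</h2>
<p class="contentsLink"><a href="zimbabwe.htm">Zimbabwe</a></p>
<h2>Section 2: Subject Groups</h2>
<p class="contentsLink"><a href="accident-prevention.htm">Accident Prevention</a></p>
'''

parser = SubjectGroupLinks()
parser.feed(SAMPLE)
subject_links = parser.links
print(subject_links)
```

Because the country and subject links share the same markup, the only thing that tells them apart here is whether the marker heading has already gone by, which is the same trick the findAllNext call relies on.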

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table
    
    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])
            
    if len(members)>20:
        members=members[:20]
    
    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record) 
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
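The spreadsheet step (dropping the key column and treating member names and group names as nodes) could probably be scripted instead. Here is a rough sketch in Python 3, assuming the CSV export carries the same id, mp and apg column names as the records saved by the scraper; the sample rows and the helper names edges_from_csv and write_edge_list are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of the scraper's CSV output (id, mp, apg columns)
SAMPLE_CSV = '''id,mp,apg
Beer:Ann Example,Ann Example,Beer
Beer:Bob Sample,Bob Sample,Beer
Cycling:Ann Example,Ann Example,Cycling
'''

def edges_from_csv(csv_text):
    """Return (member, group) pairs, ignoring the unique id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row['mp'], row['apg']) for row in reader]

def write_edge_list(edges):
    """Serialise the edges as a Source,Target CSV edge list."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Source', 'Target'])
    writer.writerows(edges)
    return out.getvalue()

edges = edges_from_csv(SAMPLE_CSV)
# Number of groups each member belongs to (what node size reflects in the graph)
groups_per_member = Counter(member for member, _ in edges)
print(write_edge_list(edges))
print(groups_per_member.most_common())
```

A Source,Target edge list is one of the CSV layouts Gephi can import, so the output of write_edge_list should load as a member-to-group network.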

We can also use an ego filter depth 2 to see which people an individual is connected to by virtue of common group membership – so for example (if the scraper worked correctly, and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections
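The "number of common groups" measure behind that view can be worked out directly from the (member, group) pairs, without going through Gephi. A minimal sketch in Python 3; the names and groups in the sample are placeholders, and shared_groups is a hypothetical helper rather than anything Gephi or Scraperwiki provides.

```python
from collections import Counter, defaultdict

def shared_groups(edges, person):
    """Count the groups each other member has in common with `person`."""
    members_of = defaultdict(set)   # group name -> set of member names
    for member, group in edges:
        members_of[group].add(member)
    common = Counter()
    for members in members_of.values():
        if person in members:
            for other in members - {person}:
                common[other] += 1
    return common

# Placeholder data: (member, group) pairs
edges = [
    ('Ann Example', 'Beer'), ('Bob Sample', 'Beer'),
    ('Ann Example', 'Cycling'), ('Bob Sample', 'Cycling'),
    ('Cat Placeholder', 'Cycling'), ('Dan Other', 'Jazz'),
]
print(shared_groups(edges, 'Ann Example'))
```

Anyone who shares no group with the chosen person simply never appears in the result, which matches what the depth 2 ego filter does on the member-group graph.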

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with…(but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append data collected according to the schedule to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code using something like fauxscraperwiki.datastore.empty() to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?

Written by Tony Hirst

October 29, 2010 at 12:24 pm

Visualising The Life of a (Code) Repository

with 2 comments

One of the many things that I suppose most people never think about is what makes up the raw ingredients of a software application. The answer is code, of course, and code that’s distributed over hundreds, if not thousands, of files; and not just static files, but files that may get edited repeatedly. By different people. At different times.

Keeping track of all these files, including which are the current ones, and also which are the previous versions (because it’s generally considered good practice to keep copies of all the old versions of your files (and the versions of the files around them that were current at the time!) in case you need to go back to them…) can be a nightmare-ish task, which is why software projects tend to use version control systems or code repositories to manage the various files.

Recently, I’ve come across a couple of visualisation tools that show how various software projects have evolved by virtue of the check-ins to a repository over time…

Firstly, CodeSwarm:

And today, gource:

So I’m wondering…could we do similar things for:
- the life of a wiki? It has page creations, deletions and updates…
- the growth of citation tree from an academic paper (so plot the original paper, the papers that cite it, the papers that cite the papers that cite the original, and so on?)
- the submissions to an open publication repository, such as eprints, with submissions as checkins, and maybe links between nodes when one paper cites another in the repository?
- the evolution of tweets around a hashtag (nodes are tweets containing the hashtag, forks are to other hashtags mentioned in the same tweet)?
- what else? (ideas in the comments, please;-)

Maybe visualisations for these exist (though I’m thinking more of animated tree based representations than things like Heavy Metal Umlaut+;-)?

Maybe there’s a common representation possible that would let us use the tools developed e.g. for visualising code repository checkins on other sorts of tree? (Because the structure of the code in a repository, or the on-citations from a publication, are trees, right?)
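One candidate for such a common representation is gource's own custom log format: pipe-delimited lines of timestamp|username|type|path, where the type is A, M or D for added, modified or deleted. As a sketch of the idea (Python 3), here is how a list of wiki page events might be turned into that format; the event tuples, user names and page paths are made up, and the times are simply assumed to be UTC.

```python
from datetime import datetime, timezone

# Map generic event kinds onto the A(dded)/M(odified)/D(eleted) type codes
ACTIONS = {'create': 'A', 'update': 'M', 'delete': 'D'}

def to_gource_log(events):
    """Turn (iso_time, user, kind, path) events into pipe-delimited log lines."""
    lines = []
    for iso_time, user, kind, path in events:
        # Assumption: event times are naive ISO strings that should be read as UTC
        moment = datetime.fromisoformat(iso_time).replace(tzinfo=timezone.utc)
        ts = int(moment.timestamp())
        lines.append((ts, '%d|%s|%s|%s' % (ts, user, ACTIONS[kind], path)))
    # Entries need to be in chronological order, so sort on the timestamp
    return [line for ts, line in sorted(lines)]

# Hypothetical wiki history, deliberately out of order
wiki_events = [
    ('2010-10-28T10:20:00', 'bob', 'update', 'wiki/Home'),
    ('2010-10-27T09:00:00', 'alice', 'create', 'wiki/Home'),
    ('2010-10-29T12:00:00', 'alice', 'delete', 'wiki/Sandbox/Scratch'),
]
log_lines = to_gource_log(wiki_events)
print('\n'.join(log_lines))
```

Saved to a file, the output could then be tried with gource's custom log option. The path column is what gives the tree its shape, so a citation chain or a hashtag fork would need to be flattened into something path-like first.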

PS See also: GLtrail server log visualisation.

PPS Visualising Google Analytics using gource via googalytics api and python [via @aneesha]

Written by Tony Hirst

October 28, 2010 at 10:20 am

Posted in Library, Visualisation


Could Librarians Be Influential Friends? And Who Owns Your Search Persona?

with 13 comments

Every so often, I’ve posted about the erosion of a universal Google ground truth as Google rolls out personalisation features that tweak the ranking of search results presented to you based on what Google knows about you. So with a recent announcement from Bing about its search integration with Facebook, I started wondering: could academic subject librarians (in a professional capacity) start to influence the search results of their charges (students, researchers, academics), simply by developing a strong persona as seen by the search engines, and friending their patrons in a public way also visible to the search engines?

So what exactly did Bing announce? Search Engine Land’s Danny Sullivan described it in Bing, Now With Extra Facebook: See What Your Friends Like & People Search Results as follows: “Bing is now making use of it to show new “Liked By Your Friends” matches and Facebook-powered people search results.” Liked results (when they appear), are currently presented in a specially marked out “liked by your Facebook friends” listing (Danny’s post shows some screenshot examples). However:

[o]utside the Liked Results, Facebook’s data is not being used to reshape the “regular” results, the listings found from crawling the web. Rather, traditional ranking factors such as the content on the pages and how people link to them is used — similar to what Google does.

Like Results are also unique to each person. What I get depends upon who my friends are. Someone else, with a different set of friends, will see different links suggested.

One thing is certain. If you haven’t been paying attention to Facebook like buttons, get moving. There’s already some direct benefit in search, and chances are this will grow.

So, the question that immediately came to my mind was: if librarians become Facebook friends of their patrons, and start “Liking” high quality resources they find on the web, might they start influencing the results that are presented to their patrons on particular searches?

That is: could librarians take on a role of “influential friends” in a particular topic area, much as a subject librarian helps guide a patron in a traditional library? Or how about recasting the idea of the “embedded librarian” as a librarian who is embedded in the network, and whose role is essentially to provide SEO services for content they want to help their patrons discover? (This relates to the question: if discovery happens elsewhere, how can librarians influence that discovery? Is SEO of other peoples’ content in some way akin to a weak form of collection development?!)

Where else might this line of thinking take us? If the Goog can track folk signed into a Google Apps for Edu domain, such as open.ac.uk, could that network of people be used to influence search results somehow…?

Just by the by, here are a couple of other examples of how content published or curated by one person might appear in or influence* the results of a person they are socially or organisationally connected to:

- Explore Interesting, Personal Photos on Yahoo! Search describes how their “new ‘Facebook Album Search beta’ feature, [allows you to] find public albums from the friends and family you’re connected to on Facebook (after you have linked your Yahoo! and Facebook accounts)”.

- Is Google Custom Search Influencing Google Web Search? starts to consider how the curation of a custom search engine might influence the discovery or ranking of sites and pages listed in the CSE in the general web search context. (Or by extension of the above, maybe CSEs curated by trusted sources in a Google Apps for Edu domain could be used to provide additional ranking factors to searches run by logged in members of that domain?) If CSEs do influence rankings, maybe CSE development is a form of collection development that can influence the search results of others at a distance (i.e. on Google web search)?!

*I think this is a distinction worth bearing in mind as things play out: the ability for one person to publish content that is directly favourably ranked in another person’s results, versus the ability for one person to directly influence the ranking of third party content that appears in another person’s results.

Search Histories, Personas and Profiles as Intellectual Capital

Given the above, let us suppose that an individual can gain influence over the search results of people they are connected to by virtue of the way they have “touched” the web. If we consider the actual searches made by an individual themselves, this may also have value (as for example when a search engine tunes the results it displays for you based on your personal search history). I’ve touched on this before, e.g. in the context of a discussion I had with Martin Weller a couple of years ago (Your search is valuable to us) that crystallised for me this idea that I keep coming back to – that your profile as a search engine user is something of value not only to an individual, but potentially also to an institution or a service. Which is to say: the combination of what a search engine knows about you (incl social circle, things you search for, click on, search history, etc etc) and how it uses that information to tweak your personal search engine ranking factors define a “search engine persona”, which is a valuable knowledge commodity.

I think this question then follows: should institutions develop role-based personas that run searches, Like things on the web and so on, that are the “property” of the institution and inhabited by individuals employed to the role (a user employed as web-embedded Science librarian must use the weblibrarian_science account for example), or should the liking, research librarian search history and so on be carried out by individuals using their personal Google accounts? In the former case, when an individual leaves the role, they also leave behind the persona and the machine advantage it brings (e.g. in terms of personal search recommendations) they have developed.

Time was when academics used to leave behind valuable collections of books and papers (valuable in the sense of being a particular collection). We’re now getting to a stage where if you work with machines that learn from your actions, that learning is valuable. So who has a right to it? (I think it wouldn’t be too hard to push this argument into the realm of transhumanism and “downloading”?!)

PS It seems that Google+ may now be influencing personalised search results, tweaking them to include public Google+ updates from members of your Google+ Circles: The latest update to Google Social Search: Public Google+ Posts

Written by Tony Hirst

October 27, 2010 at 3:36 pm

Posted in Library, Thinkses

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

with 7 comments

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are colocated in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names.

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I've tweaked the pipe to make it a little more robust...]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result cacheing.. in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the reoccurrences, here’s a Python recipe:

import re, simplejson, urllib

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      u=i['from_user'].strip()
      if u in tweeters:
        tweeters[u]['count']+=1
      else:
        tweeters[u]={}
        tweeters[u]['count']=1
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      for tag in ttags:
        if tag not in tags:
          tags[tag]=1
        else:
          tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
- use Yahoo placemaker to geocode the address provided;
- search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
- identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
- identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm… people there who shouldn’t be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error….? Or is there maybe a UKOLN event on today I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old; (it would be so nice if Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
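The time window idea in 1) could be sketched along these lines (Python 3). It assumes each search result carries a created_at string in the RFC 2822 style, e.g. "Wed, 27 Oct 2010 12:45:00 +0000", which is an assumption about the search API's output rather than something checked here; recent_tweets and the sample results are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_tweets(results, now, max_age=timedelta(hours=1)):
    """Keep only results whose created_at falls within max_age before now."""
    keep = []
    for tweet in results:
        # Assumption: created_at is an RFC 2822 style date with a UTC offset
        created = parsedate_to_datetime(tweet['created_at'])
        if timedelta(0) <= now - created <= max_age:
            keep.append(tweet)
    return keep

# Hypothetical search results
sample_results = [
    {'from_user': 'someone_nearby', 'created_at': 'Wed, 27 Oct 2010 12:45:00 +0000'},
    {'from_user': 'passed_through', 'created_at': 'Tue, 26 Oct 2010 09:00:00 +0000'},
]
now = datetime(2010, 10, 27, 13, 15, tzinfo=timezone.utc)
fresh = recent_tweets(sample_results, now)
print([t['from_user'] for t in fresh])
```

Widening max_age to days or weeks gives the "who passed through here" profile suggested in 2), using the same function.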

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter retweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

PS if you want to see how folk tweeting around a location are socially connected (i.e. whether they follow each other), check out A Bit of NewsJam MoJo – SocialGeo Twitter Map.

Written by Tony Hirst

October 27, 2010 at 1:15 pm

More Python Floundering… Stripping Google Analytics Tracking Codes Out of URLs

with 5 comments

I’m really floundering on the Python front at the moment, so here’s the next thing I want to do but can’t seem to do cleanly (partly because I’m running an old version (2.5.1), mainly because I’m not familiar with Python libraries or how to use Python data structures properly;-)

The problem is easily stated: given a URL that contains Google Analytics tracking garbage, strip it out. So for example, go from this:
http://example.com?q=1&q2=foo&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blah
and return something like this:
http://example.com?q=1&q2=foo

That is, strip out the arguments in the set:
['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content']

What I came up with (bearing in mind I’m using Py 2.5.1) is the very horrible:

import cgi, urllib, urlparse

url='http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+bis-innovation-latest+%28BIS+Innovation+latest%29'

p=urlparse.urlparse(url)

# p[4] contains query string arguments
q=cgi.parse_qs(p[4])
qt={}

stopkeys=['utm_source','utm_medium','utm_campaign','utm_term','utm_content']
for i in q:
  if i not in stopkeys:
    qt[i]=q[i][0]

p2=urllib.urlencode(qt)
p2=(p[0],p[1],p[2],p[3],p2,p[5])
url=urlparse.urlunparse(p2)

print url

As to why I’m posting this and showing off my appalling programming skills and pure cruft code…?

It’s a way of exploring the gulf between the desire to programme and the ability to write code. I want an application or service that can perform a particular function (tidy Googalytics stuff out of a URL), and I can come up with the steps in a programme to do that…

…I’m also enough of a cut and paste dabbler to cobble together something that half works, (i.e. something that’s just about good enough to be getting on with), and that sort of demonstrates a working specification for an actual service I’d like to see.

But it’s not pretty code, it’s not elegant code, and it may not even be code that works properly…

Things like Yahoo Pipes help in this respect, because the pipes UI provides a programming interface that largely means the user can avoid writing code, yet still helps them write programmes that implement pipeline services.

The question is – how do we bridge the gap between people articulating things they want programmes to do, and generating the code to do them?

PS if you can tidy up the above code, bearing in mind the version of Python I’m running imposes certain constraints on the libraries you can use, I will happily accept corrections ;-)

PPS Andy Theyers, aka @offmessage, suggested this approach, which includes the handy isinstance trap to detect/identify whether something is of a particular type (in this case, a list), which is something I must try to remember:

import cgi
import urllib
import urlparse

DISCARD = [
    'utm_source',
    'utm_medium',
    'utm_campaign',
    'utm_term',
    'utm_content',
    ]
    
def reencode(parsed_list):
    """This is nasty, but necessary in python2.5 because cgi.parse_qs is not
    the exact opposite of urllib.urlencode:
    >>> cgi.parse_qs(p[4])
    {'ReleaseID': ['416174'], 'NewsAreaID': ['2']}
    >>> urllib.urlencode({'ReleaseID': ['416174'], 'NewsAreaID': ['2']})
    'ReleaseID=%5B%27416174%27%5D&NewsAreaID=%5B%272%27%5D'
    """
    ret = []
    for key, value in parsed_list:
        if isinstance(value, list):
            ret.extend([ (key, item) for item in value ])
        else:
            ret.append((key,value))
    return ret
    
def stripargs(url, discard=None):
    if discard is None:
        discard = DISCARD
    p = urlparse.urlparse(url)
    qs = cgi.parse_qs(p[4])
    qs = [ (k, v) for k, v in qs.items() if k not in discard ]
    qs = reencode(qs)
    new_p = (p[0],p[1],p[2],p[3],urllib.urlencode(qs),p[5])
    return urlparse.urlunparse(new_p)
    
if __name__ == '__main__':
    url='http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+bis-innovation-latest+%28BIS+Innovation+latest%29'
    print url
    print stripargs(url)
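For comparison, here is roughly how the same clean-up might look on a current Python 3, where urllib.parse offers parse_qsl and urlencode as a matched pair. This is only a sketch and is not a drop-in for the 2.5.1 constraint mentioned above; strip_tracking and TRACKING_KEYS are names made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = frozenset([
    'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
])

def strip_tracking(url, discard=TRACKING_KEYS):
    """Return url with any query arguments named in discard removed."""
    parts = urlsplit(url)
    # parse_qsl gives an ordered list of (key, value) pairs, so repeated
    # keys and the original argument order both survive the round trip
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in discard]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = ('http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2'
       '&utm_source=feedburner&utm_medium=feed'
       '&utm_campaign=Feed%3A+bis-innovation-latest+%28BIS+Innovation+latest%29')
print(strip_tracking(url))
```

Working with a list of pairs rather than a dict of lists sidesteps the re-encoding problem that the reencode function above has to patch up.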

Written by Tony Hirst

October 26, 2010 at 12:14 pm

Posted in Thinkses

Rant About URL Shorteners…

with 9 comments

It’s late, I’m tired, and I have a 5am start… but I’ve confused several people just now with a series of loosely connected idle ranty tweets, so here’s the situation:

- I’m building a simple app that looks at URLs tweeted recently on a twitter list;
- lots of the URLs are shortened;
- some of the shortened URLs are shortened with different services but point to the same target/destination/long URL;
- all I want to do – hah! ALL I want to do – is call a simple webservice example.com/service?short2long=shorturl that will return the long url given the short URL;
- I have two half solutions at the moment; the first is using python to call the url (urllib.urlopen(shorturl)), then use .geturl() on the return to look up the page that was actually returned; then I use Beautiful Soup to try and grab the <title> element for the page so I can display the page title as well as the long (actual) URL; BUT – sometimes the urllib call appears to hang (and I can’t see how to set a timeout/force an except), and sometimes the page is so tatty Beautiful Soup borks on the scrape;
- my alternative solution is to call YQL with something like select head.title from html where url="http://icio.us/5evqbm" and xpath="//html" (h/t to @codepo8 for pointing out the xpath argument); if there’s a redirect, the diagnostics YQL info gives the redirect URL. But for some services, like the Yahoo owned delicious/icio.us shortener, the robots.txt file presumably tells the well-behaved YQL to f**k off, because 3rd party resolution is not allowed.
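On the hanging-call problem in the first half solution: here is a minimal sketch, assuming Python 3 (the code elsewhere in this post is Python 2, where urllib2.urlopen takes the same timeout argument). It passes a timeout to urlopen so a slow shortener raises instead of hanging, and it pulls the title with the standard library's forgiving HTMLParser rather than Beautiful Soup, which is a swapped-in technique and not what I was using. The names resolve, extract_title and TitleParser are made up for the example.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def resolve(short_url, timeout=5.0):
    """Return (long_url, title), or (None, None) if the call fails or times out."""
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as resp:
            long_url = resp.geturl()  # the URL actually reached after redirects
            # only read the first chunk; the <title> is normally near the top
            html = resp.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers strings that are not URLs at all
        return None, None
    return long_url, extract_title(html)
```

Calling resolve("http://icio.us/5evqbm") would then give back either a (long URL, title) pair or (None, None), rather than blocking indefinitely.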

It seems to me that in exchange for us giving shorteners traffic, they should conform to a convention that allows users, given a shorturl, to:

1) lookup the long URL, necessarily, using some sort of sameas convention;
2) lookup the title of the target page, as an added value service/benefit;
3) (optionally) list the other alternate short URLs the service offers for the same target URL.

If I was a militant server admin, I’d be tempted to start turning traffic away from the crappy shorteners… but then, that’s because I’m angry and ranting at the mo…;-)

Even if I could reliably call the short URL and get the long URL back, this isn’t ideal… suppose 53 people all mint their own short URLs for the same page. I have to call that page 53 times to find the same URL and page title? WTF?
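A partial workaround, sketched under assumptions: each distinct short URL still has to be resolved once, but the results can be grouped by long URL so that the page title only needs looking up once per target rather than once per short URL. This is Python 3; group_by_target is a hypothetical helper, and the resolver argument stands in for whatever lookup is actually available (it is assumed to return the long URL, or None on failure).

```python
from collections import defaultdict


def group_by_target(short_urls, resolver):
    """Map each long URL to the list of short URLs that resolve to it.

    resolver is any callable taking a short URL and returning the long URL
    (or None on failure). It is called once per distinct short URL.
    """
    seen = {}                    # short URL -> long URL (or None)
    targets = defaultdict(list)  # long URL -> short URLs pointing at it
    for short in short_urls:
        if short not in seen:
            seen[short] = resolver(short)
        long_url = seen[short]
        if long_url is not None and short not in targets[long_url]:
            targets[long_url].append(short)
    return dict(targets)
```

The title lookup can then loop over the keys of the returned dict, so 53 short URLs for one page cost 53 redirect lookups but only one title fetch. That only helps if the resolver can find the redirect target without downloading the page itself, which is an assumption about the resolver.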

… or suppose the page is actually an evil spam filled page on crappyevilmalware.example.com with page title “pr0n t1t b0ll0x”; maybe I see that and don’t want to go anywhere near the page anyway…

PS see also Joshua Schachter on url shorteners

PPS sort of loosely related, ish, err, maybe ;-) Local or Canonical URIs?. Chris (@cgutteridge) also made the point that “It’s vital that any twitter (or similar) archiver resolves the tiny URLs or the archive is, well, rubbish.”

Written by Tony Hirst

October 25, 2010 at 9:56 pm

Posted in Evilness

Backup and Run Yahoo Pipes Pipework on Google App Engine

with 3 comments

Wouldn’t it be handy if you could use Yahoo Pipes as a code-free rapid prototyping development environment, then export the code and run it on your own server, or elsewhere in the cloud? Well, now you can, using Greg Gaughan’s pipe2py and the Google App Engine “Pipes Engine” app.

As many readers of OUseful.info will know, I’m an advocate of using the visual editor/drag and drop feed-oriented programming application that is Yahoo Pipes. Some time ago, I asked the question What Happens if Yahoo Pipes Dies?, partly in response to concerns raised at many of the Pipes workshops I’ve delivered about the sustainability, as well as the reliability, of the Yahoo Pipes platform.

A major issue was that the “programmes” developed in the Pipes environment could only run in that environment. As I learned from pipes guru @hapdaniel, however, it is possible to export a JSON representation of a pipe and so at least grab some sort of copy of a pipes programme. This led to me doodling some ideas around a Yahoo Pipes Documentation Project, which would let you essentially export a functional specification of a pipe (I think the code appears to have rotted or otherwise broken on this? :-( ).

This in turn led naturally to Starting to Think About a Yahoo Pipes Code Generator, whereby we could take a description of a pipe and generate a code equivalent version from it.

Greg Gaughan took up the challenge with Pipe2Py (described here) to produce a pipes compiler capable of generating and running Python equivalents of a Yahoo pipe (not all Pipes blocks are implemented yet, but it works well for simple pipes).

And now Greg has gone a step further, by hosting pipe2py on Google App Engine so you can make a working Python backup of a pipe in that environment, and run it: Running Yahoo! Pipes on Google App Engine.

As with pipe2py, it won’t work for every Yahoo pipe (yet!), but you should be okay with simpler pipes. (Support for more blocks is being added all the time, and implementations of currently supported blocks also get an upgrade if, as and when any issues are found with them. If you have a problem, or suggestion for a missing block, add a comment on Greg’s blog;-)

(Looking back over my old related posts, deploying to Google App Engine also seems to be supported by Zoho Creator.)

Quite by chance, @paulgeraghty tweeted a link to an old post by @progrium on the topic of “POSS == Public Open Source Services: … or User Powered Self-sustaining Cloud-based Services of Open Source Software”:

How many useful bits of cool plumbing are made and abandoned on the web because people realize there’s no true business case for it? And by business case, I mean make sense to be able to turn a profit or at least enough to pay the people involved. Even as a lifestyle business, it still has to pay for at least one person … which is a lot! But forget abandoned … how much cool tech isn’t even attempted because there is an assumption that in order for it to survive and be worth the effort, there has to be a business? Somebody has to pay for hosting! Alternatively, what if people built cool stuff because it’s just cool? Or useful (but not useful enough to get people to pay — see Twitter)?

Well this is common in open source. A community driven by passion and wanting to build cool/useful stuff. A lot of great things have come from open source. But open source is just that … source. It’s not run. You have to run it. How do you get the equivalent of open source for services? This is a question I’ve been trying to figure out for years. But it’s all coming together now …

Enter POSS

POSS is an extension of open source. You start with some software that provides a service (we’ll just say web service … so it can be a web app or a web API, whatever — it runs “in the cloud”). The code is open source. Anybody can fix bugs or extend it. But there is also a single canonical instance of this source, running as a service in the cloud. Hence the final S … but it’s a public service. Made for public benefit. That’s it. Not profit. Just “to be useful.” Like most open source.

Hmmm….. ;-)

Written by Tony Hirst

October 25, 2010 at 11:25 am

Feed-detection From Blog URL Lists, with OPML Output

with one comment

Picking up on Adding Value to the Blog Award Nomination Collections…, here’s a way of generating an OPML feed bundle of categorised feed URLs from a list of tagged blog homepage URLs. What the OPML allows you to do is take a list of URLs, such as the URLs for the blogs nominated in the Computer Weekly 2010 IT Blog Awards, and subscribe to them all in one go using something like Google Reader. In addition, the OPML is structured so that the feeds are organised in separate “folders” according to the award category that they are nominated in.
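To give a feel for the "folders" structure, here is a rough Python 3 sketch that writes categorised feeds out as nested OPML outline elements using the standard library's ElementTree. It is an illustration under assumptions and not the code behind the full post: build_opml and the shape of its input (category name mapped to a list of (feed title, feed URL) pairs) are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_opml(title, feeds_by_category):
    """Return an OPML string with one folder outline per category."""
    opml = ET.Element("opml", version="1.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for category, feeds in feeds_by_category.items():
        # an outline with no xmlUrl is treated as a folder by feed readers
        folder = ET.SubElement(body, "outline", text=category, title=category)
        for feed_title, feed_url in feeds:
            ET.SubElement(
                folder,
                "outline",
                type="rss",
                text=feed_title,
                title=feed_title,
                xmlUrl=feed_url,
            )
    return ET.tostring(opml, encoding="unicode")
```

Each award category becomes an outer outline element, and each nominated blog's feed becomes an inner outline carrying the feed address in its xmlUrl attribute.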

So how does it work?

Read the rest of this entry »

Written by Tony Hirst

October 23, 2010 at 6:39 pm

Posted in Tinkering

Blog Details from an RSS/Atom Feed

with one comment

Picking up on Feed Autodetection With YQL, where I described a YQL custom query for autodetecting RSS and Atom feed URLs in a web page given the web page URL, here’s a complementary YQL custom query function which polls a feed URL through the YQL feed normaliser and returns the title and URL of the alternate HTML page for the feed:

select title,link from feednormalizer where url=@url and output='atom_1.0'

You can call the query using this alias:
http://query.yahooapis.com/v1/public/yql/psychemedia/feeddetails?url=FEED_URL&format=json

(Leave the &format=json off if you want an XML response.)
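The feed URL needs to be URL-encoded before it is dropped in place of FEED_URL, otherwise any query arguments of its own get mixed up with the alias's arguments. A small Python 3 sketch of building the call (the helper name feeddetails_url is made up; it only constructs the URL and does not make the request):

```python
from urllib.parse import urlencode

# the query alias given above
ALIAS = "http://query.yahooapis.com/v1/public/yql/psychemedia/feeddetails"


def feeddetails_url(feed_url, as_json=True):
    """Build the alias URL for a feed; as_json=False asks for the XML response."""
    params = [("url", feed_url)]
    if as_json:
        params.append(("format", "json"))
    # urlencode percent-encodes the feed URL, including any ? and & it contains
    return ALIAS + "?" + urlencode(params)
```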

Here’s an example of that query with a URL instantiated, via a specific query in the YQL developer console:

Feed details via YQL

You’ll notice several alternatives are also given; the HTML page URL is given in the result where rel=”alternate”, which is somewhat reminiscent of the case of feed autodetection in an HTML page, where rel=”alternate” identifies a <link> element that includes the URL for a feed alternative…

It’s now easy enough to do a two-pass procedure where we autodetect a feed URL from an HTML blog homepage using the autodetection query described previously, and then lookup the details of the feed using the query described above.

And why exactly might we want to do this? Because in many HTML docs that do specify an alternate RSS/Atom feed, the title element provided is often something like the uninformative “RSS2.0”, rather than the title of the original blog…
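For illustration, the two-pass idea can be sketched locally in Python 3. Note the swap: instead of the YQL autodetection query, the first pass here parses the HTML directly with the standard library's HTMLParser, and the second pass is left as a pluggable function (lookup_feed_details, a hypothetical stand-in for the feed details query) so that nothing about the remote call is assumed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class FeedLinkParser(HTMLParser):
    """Collect (href, title) for <link rel="alternate"> elements that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
            self.feeds.append((a["href"], a.get("title")))


def autodetect_feeds(html, base_url):
    """Pass one: return absolute (feed_url, link_title) pairs found in the page."""
    parser = FeedLinkParser()
    parser.feed(html)
    parser.close()
    return [(urljoin(base_url, href), title) for href, title in parser.feeds]


def blog_details(html, base_url, lookup_feed_details):
    """Pass two: look up the first feed's real title, falling back to the link title."""
    feeds = autodetect_feeds(html, base_url)
    if not feeds:
        return None
    feed_url, link_title = feeds[0]
    return feed_url, lookup_feed_details(feed_url) or link_title
```

With a page whose link element is titled “RSS2.0”, the second pass is what replaces that label with the blog's actual title.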

Written by Tony Hirst

October 23, 2010 at 5:23 pm

Posted in Tinkering
