It’s late, I’m tired, and I have a 5am start… but I’ve confused several people just now with a series of loosely connected idle ranty tweets, so here’s the situation:
– I’m building a simple app that looks at URLs tweeted recently on a twitter list;
– lots of the URLs are shortened;
– some of the shortened URLs are shortened with different services but point to the same target/destination/long URL;
– all I want to do – hah! ALL I want to do – is call a simple webservice example.com/service?short2long=shorturl that will return the long url given the short URL;
– I have two half solutions at the moment; the first is using Python to call the URL (urllib.urlopen(shorturl)), then use .geturl() on the return to look up the page that was actually returned; then I use Beautiful Soup to try and grab the <title> element for the page so I can display the page title as well as the long (actual) URL; BUT – sometimes the urllib call appears to hang (and I can’t see how to set a timeout/force an exception), and sometimes the page is so tatty Beautiful Soup borks on the scrape;
– my alternative solution is to call YQL with something like select head.title from html where url="http://icio.us/5evqbm" and xpath = "//html" (h/t to @codepo8 for pointing out the xpath argument); if there’s a redirect, the diagnostics YQL info gives the redirect URL. But for some services, like the Yahoo owned delicious/icio.us shortener, the robots.txt file presumably tells the well-behaved YQL to f**k off, because 3rd party resolution is not allowed.
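For what it's worth, here is a rough sketch of the first half solution with the two sore points patched up, assuming Python 3 rather than the old urllib.urlopen call: urllib.request.urlopen accepts a timeout argument, so a hanging call raises instead of blocking, and the standard library's html.parser stands in for Beautiful Soup as a more forgiving way to pull out the title. The names resolve and extract_title, the 10 second timeout and the 64 KB read limit are my own illustrative choices, not anything a shortener or library prescribes.

```python
import http.client
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element only."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    parser = TitleParser()
    try:
        parser.feed(html)
    except Exception:
        # Tatty markup: keep whatever was collected before the parser gave up.
        pass
    title = " ".join("".join(parser.parts).split())
    return title or None


def resolve(short_url, timeout=10):
    """Follow redirects and return (long_url, title), or (None, None) on failure."""
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as response:
            long_url = response.geturl()  # URL of the page actually returned
            # Only the start of the page is needed to find the <title>.
            body = response.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Timeouts and network errors are OSError subclasses; a malformed
        # URL raises ValueError.
        return None, None
    return long_url, extract_title(body)
```

Calling resolve("http://icio.us/5evqbm") would then give back either the long URL and title, or (None, None) if the call timed out or failed, rather than hanging.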
It seems to me that in exchange for us giving shorteners traffic, they should conform to a convention that allows users, given a shorturl, to:
1) look up the long URL, necessarily, using some sort of sameas convention;
2) look up the title of the target page, as an added value service/benefit;
3) (optionally) list the other alternate short URLs the service offers for the same target URL.
If I was a militant server admin, I’d be tempted to start turning traffic away from the crappy shorteners… but then, that’s because I’m angry and ranting at the mo…;-)
Even if I could reliably call the short URL and get the long URL back, this isn’t ideal… suppose 53 people all mint their own short URLs for the same page. I have to call that page 53 times to find the same URL and page title? WTF?
… or suppose the page is actually an evil spam filled page on crappyevilmalware.example.com with page title “pr0n t1t b0ll0x”; maybe I see that and don’t want to go anywhere near the page anyway…
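On the 53-short-URLs problem above, the best I can do on my side is a sketch like the following, which remembers each short URL it has already resolved and groups the short URLs by the long URL they point to. The function name group_by_target and the resolver argument (any function that takes a short URL and returns the long URL, or None on failure) are hypothetical names of my own. Note what it doesn't fix: 53 different short URLs still cost 53 lookups; only repeats of the same short URL come for free.

```python
from collections import defaultdict


def group_by_target(short_urls, resolver):
    """Map each long URL to the list of distinct short URLs that resolve to it.

    resolver is assumed to take a short URL and return the long URL,
    or None if the lookup failed.
    """
    cache = {}  # short URL -> long URL, so repeated short URLs cost one call
    groups = defaultdict(list)
    for short_url in short_urls:
        if short_url not in cache:
            cache[short_url] = resolver(short_url)
        long_url = cache[short_url]
        # Skip failed lookups and avoid listing the same short URL twice.
        if long_url is not None and short_url not in groups[long_url]:
            groups[long_url].append(short_url)
    return dict(groups)
```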
PS see also Joshua Schachter on url shorteners
PPS sort of loosely related, ish, err, maybe ;-) Local or Canonical URIs?. Chris (@cgutteridge) also made the point that “It’s vital that any twitter (or similar) archiver resolves the tiny URLs or the archive is, well, rubbish.”