It’s late, I’m tired, and I have a 5am start… but I’ve confused several people just now with a series of loosely connected idle ranty tweets, so here’s the situation:
– I’m building a simple app that looks at URLs tweeted recently on a twitter list;
– lots of the URLs are shortened;
– some of the shortened URLs are shortened with different services but point to the same target/destination/long URL;
– all I want to do – hah! ALL I want to do – is call a simple webservice example.com/service?short2long=shorturl that will return the long url given the short URL;
– I have two half solutions at the moment; the first is using python to call the url (urllib.urlopen(shorturl)), then use .geturl() on the return to look up the page that was actually returned; then I use Beautiful Soup to try and grab the <title> element for the page so I can display the page title as well as the long (actual) URL; BUT – sometimes the urllib call appears to hang (and I can’t see how to set a timeout/force an except), and sometimes the page is so tatty Beautiful Soup borks on the scrape;
– my alternative solution is to call YQL with something like select head.title from html where url="http://icio.us/5evqbm" and xpath="//html" (h/t to @codepo8 for pointing out the xpath argument); if there’s a redirect, the YQL diagnostics info gives the redirect URL. But for some services, like the Yahoo owned delicious/icio.us shortener, the robots.txt file presumably tells the well-behaved YQL to f**k off, because 3rd party resolution is not allowed.
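For what it’s worth, here’s a rough sketch of the first half solution with the two sticking points patched over. It assumes Python 3, where urllib.request.urlopen takes a timeout argument, and it swaps Beautiful Soup for the standard library’s more forgiving html.parser. The names short2long, extract_title and TitleParser are made up for illustration, and the timeout only bounds individual socket operations, so treat it as a sketch rather than a robust resolver.

```python
import http.client
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element, tolerating messy markup."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <title> is recorded.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None if no title text was seen.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html):
    """Return the page title from an HTML string, or None if there isn't one."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def short2long(short_url, timeout=10):
    """Follow redirects; return (long_url, title), or (None, None) on failure."""
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as resp:
            long_url = resp.geturl()  # the URL actually reached after redirects
            # Read only the first 64 KB, assuming the title lives in the head.
            body = resp.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers timeouts, URL and HTTP errors, and malformed URLs.
        return None, None
    return long_url, extract_title(body)
```

Calling short2long on a short URL would then give back the destination URL and its title, or a pair of Nones if the call timed out or failed.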
It seems to me that in exchange for us giving shorteners traffic, they should conform to a convention that allows users, given a shorturl, to:
1) look up the long URL, necessarily, using some sort of sameas convention;
2) look up the title of the target page, as an added value service/benefit;
3) (optionally) list the other alternate short URLs the service offers for the same target URL.
If I was a militant server admin, I’d be tempted to start turning traffic away from the crappy shorteners… but then, that’s because I’m angry and ranting at the mo…;-)
Even if I could reliably call the short URL and get the long URL back, this isn’t ideal… suppose 53 people all mint their own short URLs for the same page. I have to call that page 53 times to find the same URL and page title? WTF?
… or suppose the page is actually an evil spam filled page on crappyevilmalware.example.com with page title “pr0n t1t b0ll0x”; maybe I see that and don’t want to go anywhere near the page anyway…
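On the 53 short URLs problem, the most I can do from my side is avoid fetching the same target page more than once. A minimal sketch, assuming two hypothetical callables that I’d have to supply myself: resolve(short_url), which returns the long URL, and fetch_title(long_url), which returns the page title. Each distinct short URL still costs one resolution call; only a lookup service offered by the shorteners would get rid of that.

```python
def titles_for(short_urls, resolve, fetch_title):
    """Map each short URL to (long_url, title), fetching each title only once.

    `resolve` and `fetch_title` are hypothetical callables supplied by the
    caller; they are not part of any shortener's API.
    """
    long_by_short = {}
    title_by_long = {}
    for short in short_urls:
        # Resolve each distinct short URL once.
        if short not in long_by_short:
            long_by_short[short] = resolve(short)
        long_url = long_by_short[short]
        # Fetch the target page's title only the first time we see it.
        if long_url not in title_by_long:
            title_by_long[long_url] = fetch_title(long_url)
    return {s: (l, title_by_long[l]) for s, l in long_by_short.items()}
```

With 53 short URLs all pointing at one page, this makes 53 resolution calls but only one title fetch.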
PS see also Joshua Schachter on url shorteners
PPS sort of loosely related, ish, err, maybe ;-) Local or Canonical URIs?. Chris (@cgutteridge) also made the point that “It’s vital that any twitter (or similar) archiver resolves the tiny URLs or the archive is, well, rubbish.”
There seem to have been a lot of posts recently about URL shorteners/minifiers, such as this or this, which linked back to On URL shorteners by delicious founder Joshua Schachter. I’m not sure if Brian Kelly has done a risk assessment post about it yet, though? ;-)
So what are we to do in the case of URL shorteners going down, or disappearing from the web?
How about this?
When you publish a page, do a lookup using the most popular URL shortener sites to grab the shortened URL for that page from those services, and hard code those URLs into the page as metadata:
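A minimal sketch of generating that sort of metadata in Python, assuming a made-up meta name of shorturl (not an established convention, as far as I know) and placeholder short URLs that would really come from each shortener’s own lookup:

```python
from html import escape


def shorturl_meta_tags(short_urls):
    """Render one <meta> tag per short URL.

    The meta name "shorturl" is an illustrative assumption, not a standard.
    """
    return "\n".join(
        '<meta name="shorturl" content="{}" />'.format(escape(u, quote=True))
        for u in short_urls
    )


# Placeholder short URLs, for illustration only.
print(shorturl_meta_tags(["http://is.gd/abc", "http://tinyurl.com/xyz"]))
```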
Then if a particular URL shortener service goes down, there’s a fallback position available in the form of using the web search engines to track down your page, as long as they index the page metadata?
PS it also strikes me that if a URL service were to go down, it’d be in e.g. Google’s interests to buy up their databases in the closing down fire sale…
PPS annotating every page would potentially overload the URL shortening services, I suspect, so I wonder this: maybe page publishers should inject the metadata into a page if they see incoming referrer traffic coming in to the page from a URL shortening service? So for example, if the server sees incoming traffic to a page from is.gd, it grabs the is.gd short URL for the page and adds it to the page metadata? This is not a million miles away from a short URL trackback service? (Cf. also e.g. things like Tweetback.)
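The server-side check might look something like the sketch below. The list of shortener hosts is illustrative rather than complete, and the function name is made up. Whether a shortener actually shows up in the Referer header depends on how it redirects, so that part is an assumption too.

```python
from urllib.parse import urlsplit

# Illustrative list only; a real deployment would maintain its own.
SHORTENER_HOSTS = {"is.gd", "bit.ly", "tinyurl.com"}


def shortener_from_referrer(referrer):
    """Return the shortener host if the referrer is a known shortener, else None."""
    if not referrer:
        return None
    try:
        host = (urlsplit(referrer).hostname or "").lower()
    except ValueError:
        # Malformed referrer values are treated as "not a shortener".
        return None
    if host.startswith("www."):
        host = host[4:]
    return host if host in SHORTENER_HOSTS else None
```

If this returns a host, the server could then ask that service for the page’s short URL and add it to the page metadata.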
PPPS via Downes: Short URL Auto-Discovery (looks like it’s being offered as an RFC).