Rant About URL Shorteners…

It’s late, I’m tired, and I have a 5am start… but I’ve confused several people just now with a series of loosely connected idle ranty tweets, so here’s the situation:

– I’m building a simple app that looks at URLs tweeted recently on a twitter list;
– lots of the URLs are shortened;
– some of the shortened URLs are shortened with different services but point to the same target/destination/long URL;
– all I want to do – hah! ALL I want to do – is call a simple webservice example.com/service?short2long=shorturl that will return the long url given the short URL;
– I have two half solutions at the moment; the first is using python to call the url (urllib.urlopen(shorturl)), then use .geturl() on the return to look up the page that was actually returned; then I use Beautiful Soup to try and grab the <title> element for the page so I can display the page title as well as the long (actual) URL; BUT – sometimes the urllib call appears to hang (and I can’t see how to set a timeout/force an except), and sometimes the page is so tatty Beautiful Soup borks on the scrape;
– my alternative solution is to call YQL with something like select head.title from html where url="http://icio.us/5evqbm" and xpath="//html" (h/t to @codepo8 for pointing out the xpath argument); if there’s a redirect, the diagnostics YQL info gives the redirect URL. But for some services, like the Yahoo owned delicious/icio.us shortener, the robots.txt file presumably tells the well-behaved YQL to f**k off, because 3rd party resolution is not allowed.
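
For what it’s worth, here’s a minimal sketch of the first half solution with a timeout in place. It assumes Python 3, where urllib.request.urlopen takes a timeout argument, and it swaps Beautiful Soup for the standard library’s html.parser, which tends to shrug at tatty markup rather than bork. The names resolve, extract_title and TitleParser are made up for illustration, and reading only the first 64KB of the page is an arbitrary choice:

```python
import http.client
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if there isn't one."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def resolve(short_url, timeout=5.0):
    """Follow redirects from short_url; return (long_url, title) or (None, None)."""
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as response:
            long_url = response.geturl()  # the URL actually served after redirects
            # Assumption: the <title> lives in the first 64KB and utf-8 is close enough.
            body = response.read(65536).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Timeouts, DNS failures, HTTP errors and malformed URLs all land here.
        return None, None
    return long_url, extract_title(body)
```

Calling resolve(shorturl) then either comes back within a few seconds or gives up with (None, None), rather than hanging.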

It seems to me that in exchange for us giving shorteners traffic, they should conform to a convention that allows users, given a shorturl, to:

1) lookup the long URL, necessarily, using some sort of sameas convention;
2) lookup the title of the target page, as an added value service/benefit;
3) (optionally) list the other alternate short URLs the service offers for the same target URL.
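
To make that wish list a bit more concrete, here’s a toy sketch of what a shortener would need to hold in order to answer all three lookups. Nothing here corresponds to any real service’s API; ToyShortener, ShortLinkInfo, mint and lookup are all invented names, and a real service would presumably expose the same information over HTTP:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShortLinkInfo:
    short_url: str
    long_url: str                                         # 1) the "sameas" target
    title: Optional[str] = None                           # 2) title of the target page
    alternates: List[str] = field(default_factory=list)   # 3) other short URLs, same target


class ToyShortener:
    """In-memory stand-in for a shortener that follows the convention."""

    def __init__(self) -> None:
        self._long_by_short: Dict[str, str] = {}
        self._title_by_long: Dict[str, str] = {}

    def mint(self, short_url: str, long_url: str, title: Optional[str] = None) -> None:
        self._long_by_short[short_url] = long_url
        if title is not None:
            self._title_by_long[long_url] = title

    def lookup(self, short_url: str) -> Optional[ShortLinkInfo]:
        long_url = self._long_by_short.get(short_url)
        if long_url is None:
            return None
        # Every other short URL that points at the same target.
        alternates = sorted(
            s for s, target in self._long_by_short.items()
            if target == long_url and s != short_url
        )
        return ShortLinkInfo(short_url, long_url,
                             self._title_by_long.get(long_url), alternates)
```

The point of keying the title on the long URL is that it only has to be found once, however many short URLs get minted for the page.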

If I was a militant server admin, I’d be tempted to start turning traffic away from the crappy shorteners… but then, that’s because I’m angry and ranting at the mo…;-)

Even if I could reliably call the short URL and get the long URL back, this isn’t ideal… suppose 53 people all mint their own short URLs for the same page. I have to call that page 53 times to find the same URL and page title? WTF?

… or suppose the page is actually an evil spam filled page on crappyevilmalware.example.com with page title “pr0n t1t b0ll0x”; maybe I see that and don’t want to go anywhere near the page anyway…
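
On the 53 short URLs problem: the best I can do on my side is resolve each distinct short URL once and then group them by where they end up. A rough sketch, assuming some resolver function that maps a short URL to its long URL (or None on failure); group_by_target is a made-up name. Note that it only saves work on repeats of the same short URL; 53 different short URLs still cost 53 calls, which is rather the point of the rant:

```python
from collections import defaultdict


def group_by_target(short_urls, resolve):
    """Map each long URL to the list of distinct short URLs that point at it.

    `resolve` is any callable taking a short URL and returning the long URL,
    or None if it could not be resolved.
    """
    cache = {}                  # short URL -> long URL, so repeats cost nothing
    groups = defaultdict(list)  # long URL -> short URLs seen for it
    for short_url in short_urls:
        if short_url not in cache:
            cache[short_url] = resolve(short_url)
        long_url = cache[short_url]
        if long_url is None:
            continue            # unresolved links are simply skipped
        if short_url not in groups[long_url]:
            groups[long_url].append(short_url)
    return dict(groups)
```

The page title then only needs fetching once per long URL rather than once per short URL.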

PS see also Joshua Schachter on url shorteners

PPS sort of loosely related, ish, err, maybe ;-) Local or Canonical URIs?. Chris (@cgutteridge) also made the point that “It’s vital that any twitter (or similar) archiver resolves the tiny URLs or the archive is, well, rubbish.”

9 comments

  1. Steph Gray

    Yes, true. Is this what BackType manages to do, somehow?

    I once had this challenge with Bit.ly-created links (wanted to retrieve links recently tweeted by my account, get click stats for them, and work out the actual destination).

    The PHP looked something like:

    foreach($item as $i) {
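    // NB: the text between "<" and ">" on the next two statements was stripped when this was posted;
    // $maxitems and the "< 1" test are guesses at what was lost, not the original code
    if ($counter < $maxitems) {
    $date = ($i->pubDate) ? $i->pubDate : $i->published;

    $link = (strlen($i->link['href']) < 1) ? strval($i->link) : strval($i->link['href']);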
    $hash = str_replace("http://bit.ly/","",$link);

    // foreach, grab the click count and the destination+title of the link
    $apiurl = "http://api.bit.ly/stats?version=2.0.1&shortUrl={$link}&login={$username}&apiKey={$apikey}";
    $statscall = file_get_contents($apiurl);
    $nicejson = $json->decode($statscall);

    $apiurl2 = "http://api.bit.ly/info?version=2.0.1&hash={$hash}&login={$username}&apiKey={$apikey}";
    $statscall2 = file_get_contents($apiurl2);
    $nicejson2 = $json->decode($statscall2);

    $feedoutput[] = array(
    "title" => strip_tags($i->title),
    "date" => strval($date),
    "description" => strip_tags($i->description),
    "link" => strval($link),
    "hash" => $hash,
    "clicks" => $nicejson->results->userClicks,
    "desturl" => $nicejson2->results->$hash->longUrl
    );
    $counter++;
    }
    }

  2. Terry

    If urllib isn’t going to cut it, and Yahoo is too nice, have you thought about using Mechanize? Getting one page should be very quick, no images or extra stuff to get. I’d then use lxml to get the title, I think it works better than using BeautifulSoup. Good luck!

    • Tony Hirst

      But if twitter opens up its annotations API, it’d be able to carry the long URL there? Links could also be reduced to a single character or pair of characters in a tweet ( e.g. \1 ) that refer to a link annotation carried elsewhere? (Ok, so this wouldn’t work when tweets are sent by SMS, which I guess was the original constraint). I’m also amazed that folk still have to use http://..?

  3. Jeremy

    I like your suggestion that URL shorteners should provide a common way to access essential information. Perhaps this could be along the lines of [shortURL]/title or [shortURL]/url, these being links to simple text representations of the title and original URLs respectively. [shortURL]/info could return some form of array for fuller info, I guess.
    It struck me that it might fit into the agenda of 301works.org (which George Oates kindly brought to my attention in a different exchange). A number of services have signed up to that, and perhaps it could become a forum where conventions of this sort could be established and monitored?

  4. Martin Hawksey

    Ah, you’ll be singing (or crying into your soup) when Twitter take a stranglehold on shortened urls in tweets.

    One way around this might be to pass the headache to someone else like using the LongURL service http://longurl.org/api

    Whole list of services supported (unfortunately it doesn’t include icio.us, which is why I didn’t mention it last night, but I can’t work out how to send links from delicious anyway ;-)

    Martin
