OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Rant About URL Shorteners…

It’s late, I’m tired, and I have a 5am start… but I’ve confused several people just now with a series of loosely connected idle ranty tweets, so here’s the situation:

- I’m building a simple app that looks at URLs tweeted recently on a Twitter list;
- lots of the URLs are shortened;
- some of the shortened URLs are shortened with different services but point to the same target/destination/long URL;
- all I want to do – hah! ALL I want to do – is call a simple webservice example.com/service?short2long=shorturl that will return the long URL given the short URL;
- I have two half-solutions at the moment; the first uses Python to call the URL (urllib.urlopen(shorturl)), then uses .geturl() on the response to look up the page that was actually returned; then I use Beautiful Soup to try to grab the <title> element for the page so I can display the page title as well as the long (actual) URL; BUT – sometimes the urllib call appears to hang (and I can’t see how to set a timeout/force an except), and sometimes the page is so tatty Beautiful Soup borks on the scrape;
- my alternative solution is to call YQL with something like select head.title from html where url="http://icio.us/5evqbm" and xpath="//html" (h/t to @codepo8 for pointing out the xpath argument); if there’s a redirect, the YQL diagnostics info gives the redirect URL. But for some services, like the Yahoo-owned delicious/icio.us shortener, the robots.txt file presumably tells the well-behaved YQL to f**k off, because third-party resolution is not allowed.
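For what it’s worth, here’s a minimal sketch of the timeout fix using only the standard library (Python 3 names, so urllib.request rather than the old urllib, and html.parser standing in for Beautiful Soup – often more forgiving on tatty pages):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleParser(HTMLParser):
    """Grab the text content of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def resolve(short_url, timeout=5):
    # The timeout argument stops urlopen() hanging forever;
    # urlopen follows redirects, and .geturl() gives the final (long) URL
    resp = urlopen(short_url, timeout=timeout)
    long_url = resp.geturl()
    parser = TitleParser()
    # Only read the first chunk of the page - the <title> is near the top
    parser.feed(resp.read(65536).decode("utf-8", errors="replace"))
    return long_url, parser.title.strip()
```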

It seems to me that in exchange for us giving shorteners traffic, they should conform to a convention that allows users, given a short URL, to:

1) look up the long URL (that one’s a necessity), using some sort of sameAs convention;
2) look up the title of the target page, as an added-value service/benefit;
3) (optionally) list the other alternate short URLs the service has minted for the same target URL.
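To make that concrete, here’s the sort of thing such a lookup might return – every field name below is invented for illustration, not any real shortener’s API:

```python
import json

# A hypothetical response from example.com/service?short2long=shorturl;
# all field names here are made up to illustrate the convention.
hypothetical_response = json.loads("""
{
  "shortUrl": "http://short.example/abc123",
  "longUrl": "http://example.org/some/long/page",
  "title": "The target page's title",
  "alternates": ["http://other.example/xyz", "http://third.example/q9"]
}
""")
```

A sameAs-style alternates list like that would let an aggregator collapse all the variant short URLs for a page in a single call.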

If I were a militant server admin, I’d be tempted to start turning traffic away from the crappy shorteners… but then, that’s because I’m angry and ranting at the mo… ;-)

Even if I could reliably call the short URL and get the long URL back, this isn’t ideal… suppose 53 people all mint their own short URLs for the same page. I have to call that page 53 times to find the same URL and page title? WTF?

… or suppose the page is actually an evil spam filled page on crappyevilmalware.example.com with page title “pr0n t1t b0ll0x”; maybe I see that and don’t want to go anywhere near the page anyway…
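Even without cooperative shorteners, resolution results can at least be cached and grouped by long URL on the client side, so each distinct short URL is only ever fetched once – a minimal sketch, where resolve is assumed to be any short-to-long lookup function:

```python
def group_by_long_url(short_urls, resolve, cache=None):
    """Resolve each distinct short URL once (via the supplied resolve
    function) and group the short URLs by their shared long URL."""
    cache = {} if cache is None else cache
    grouped = {}
    for s in short_urls:
        if s not in cache:
            cache[s] = resolve(s)  # one network hit per distinct short URL
        grouped.setdefault(cache[s], []).append(s)
    return grouped
```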

PS see also Joshua Schachter on url shorteners

PPS sort of loosely related, ish, err, maybe ;-) Local or Canonical URIs?. Chris (@cgutteridge) also made the point that “It’s vital that any twitter (or similar) archiver resolves the tiny URLs or the archive is, well, rubbish.”

Written by Tony Hirst

October 25, 2010 at 9:56 pm

Posted in Evilness

9 Responses

  1. Yes, true. Is this what BackType manages to do, somehow?

    I once had this challenge with Bit.ly-created links (wanted to retrieve links recently tweeted by my account, get click stats for them, and work out the actual destination).

    The PHP looked something like:

    foreach ($item as $i) {
        // NB: the comment form has eaten everything between < and > signs,
        // so $limit and the ternary tests below are reconstructed guesses
        if ($counter < $limit) {
            $date = ($i->pubDate) ? $i->pubDate : $i->published;

            $link = (strlen($i->link['href']) == 0) ? strval($i->link) : strval($i->link['href']);
            $hash = str_replace("http://bit.ly/", "", $link);

            // for each item, grab the click count and the destination+title of the link
            $apiurl = "http://api.bit.ly/stats?version=2.0.1&shortUrl={$link}&login={$username}&apiKey={$apikey}";
            $statscall = file_get_contents($apiurl);
            $nicejson = $json->decode($statscall);

            $apiurl2 = "http://api.bit.ly/info?version=2.0.1&hash={$hash}&login={$username}&apiKey={$apikey}";
            $statscall2 = file_get_contents($apiurl2);
            $nicejson2 = $json->decode($statscall2);

            $feedoutput[] = array(
                "title" => strip_tags($i->title),
                "date" => strval($date),
                "description" => strip_tags($i->description),
                "link" => strval($link),
                "hash" => $hash,
                "clicks" => $nicejson->results->userClicks,
                "desturl" => $nicejson2->results->$hash->longUrl
            );
            $counter++;
        }
    }

    Steph Gray

    October 25, 2010 at 10:06 pm

  2. I may be being really dumb, but why can’t you just curl the short link and grab the page title returned?

    curl -L http://bit.ly/bvpoJE

    brings back the HTML DOM structure you need, right?

    mr c.

    October 25, 2010 at 10:11 pm

  3. If urllib isn’t going to cut it, and Yahoo is too nice, have you thought about using Mechanize? Getting one page should be very quick, no images or extra stuff to get. I’d then use lxml to get the title, I think it works better than using BeautifulSoup. Good luck!

    Terry

    October 26, 2010 at 2:09 am

  4. Ah, you’ll be singing (or completely screwed) when Twitter puts its stranglehold on shortened URLs using t.co.

    In the meantime, what about using the LongURL API http://longurl.org/api

    Martin

    Martin Hawksey

    October 26, 2010 at 11:42 am

    • But if Twitter opens up its annotations API, it’d be able to carry the long URL there? Links could also be reduced to a single character or pair of characters in a tweet (e.g. \1) that refer to a link annotation carried elsewhere? (OK, so this wouldn’t work when tweets are sent by SMS, which I guess was the original constraint.) I’m also amazed that folk still have to use http://..?

      Tony Hirst

      October 26, 2010 at 12:21 pm

  5. I like your suggestion that URL shorteners should provide a common way to access essential information. Perhaps this could be along the lines of [shortURL]/title or [shortURL]/url, these being links to simple text representations of the title and original URLs respectively. [shortURL]/info could return some form of array for fuller info, I guess.
    It struck me that it might fit into the agenda of 301works.org (which George Oates kindly brought to my attention in a different exchange). A number of services have signed up to that, and perhaps it could become a forum where conventions of this sort could be established and monitored?

    Jeremy

    October 26, 2010 at 12:00 pm

    • Thanks for linking to that – will check it out…

      Tony Hirst

      October 26, 2010 at 12:23 pm

  6. Ah, you’ll be singing (or crying into your soup) when Twitter takes a stranglehold on shortened URLs in tweets.

    One way around this might be to pass the headache to someone else, e.g. by using the LongURL service http://longurl.org/api

    There’s a whole list of services supported (unfortunately it doesn’t include icio.us, which is why I didn’t mention it last night – but I can’t work out how to send links from delicious anyway ;-)

    Martin

    Martin Hawksey

    October 26, 2010 at 12:08 pm

  7. Tony Hirst rants (rightly) about URL shorteners…

    Tony Hirst is currently annoyed by the various URL shortener services (short-URL services) such as is.gd or icio.us. Shorteners became popular, at the latest, with the rise of Twitter and its 140-character limit. S…


Comments are closed.
