More Python Floundering… Stripping Google Analytics Tracking Codes Out of URLs

I’m really floundering on the Python front at the moment, so here’s the next thing I want to do but can’t seem to do cleanly (partly because I’m running an old version, 2.5.1, but mainly because I’m not familiar with Python libraries or how to use Python data structures properly ;-)

The problem is easily stated: given a URL that contains Google Analytics tracking garbage, strip it out. So for example, go from this:
http://example.com?q=1&q2=foo&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blah
and return something like this:
http://example.com?q=1&q2=foo

That is, strip out the arguments in the set:
['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content']

What I came up with (bearing in mind I’m using Py 2.5.1) is the very horrible:

import cgi, urllib, urlparse

url='http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+bis-innovation-latest+%28BIS+Innovation+latest%29'

p=urlparse.urlparse(url)

# p[4] contains query string arguments
q=cgi.parse_qs(p[4])
qt={}

stopkeys=['utm_source','utm_medium','utm_campaign','utm_term','utm_content']
for i in q:
  if i not in stopkeys:
    qt[i]=q[i][0]

p2=urllib.urlencode(qt)
p2=(p[0],p[1],p[2],p[3],p2,p[5])
url=urlparse.urlunparse(p2)

print url

As to why I’m posting this and showing off my appalling programming skills and pure cruft code…?

It’s a way of exploring the gulf between the desire to programme and the ability to write code. I want an application or service that can perform a particular function (tidy Googalytics stuff out of a URL), and I can come up with the steps in a programme to do that…

…I’m also enough of a cut and paste dabbler to cobble together something that half works (i.e. something that’s just about good enough to be getting on with), and that sort of demonstrates a working specification for an actual service I’d like to see.

But it’s not pretty code, it’s not elegant code, and it may not even be code that works properly…

Things like Yahoo Pipes help in this respect, because the pipes UI provides a programming interface that largely means the user can avoid writing code, yet still helps them write programmes that implement pipeline services.

The question is – how do we bridge the gap between people articulating things they want programmes to do, and generating the code to do them?

PS if you can tidy up the above code, bearing in mind the version of Python I’m running imposes certain constraints on the libraries you can use, I will happily accept corrections ;-)

PPS Andy Theyers, aka @offmessage, suggested this approach, which includes the handy isinstance trap to detect/identify whether something is of a particular type (in this case, a list), which is something I must try to remember:

import cgi
import urllib
import urlparse

DISCARD = [
    'utm_source',
    'utm_medium',
    'utm_campaign',
    'utm_term',
    'utm_content',
    ]
    
def reencode(parsed_list):
    """This is nasty, but necessary in python2.5 because cgi.parse_qs is not
    the exact opposite of urllib.urlencode:
    >>> cgi.parse_qs(p[4])
    {'ReleaseID': ['416174'], 'NewsAreaID': ['2']}
    >>> urllib.urlencode({'ReleaseID': ['416174'], 'NewsAreaID': ['2']})
    'ReleaseID=%5B%27416174%27%5D&NewsAreaID=%5B%272%27%5D'
    """
    ret = []
    for key, value in parsed_list:
        if isinstance(value, list):
            ret.extend([ (key, item) for item in value ])
        else:
            ret.append((key,value))
    return ret
    
def stripargs(url, discard=None):
    if discard is None:
        discard = DISCARD
    p = urlparse.urlparse(url)
    qs = cgi.parse_qs(p[4])
    qs = [ (k, v) for k, v in qs.items() if k not in discard ]
    qs = reencode(qs)
    new_p = (p[0],p[1],p[2],p[3],urllib.urlencode(qs),p[5])
    return urlparse.urlunparse(new_p)
    
if __name__ == '__main__':
    url='http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+bis-innovation-latest+%28BIS+Innovation+latest%29'
    print url
    print stripargs(url)
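
For anyone on a more recent Python, the same approach might look something like the sketch below. It is not part of Andy's suggestion: it assumes Python 3, where the parsing functions live in urllib.parse, and it swaps parse_qs for parse_qsl, which returns (key, value) pairs directly, so the reencode step is no longer needed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same discard list as in the code above.
DISCARD = frozenset([
    'utm_source',
    'utm_medium',
    'utm_campaign',
    'utm_term',
    'utm_content',
])


def stripargs(url, discard=DISCARD):
    """Return url with any query arguments named in discard removed."""
    parts = urlsplit(url)
    # parse_qsl yields (key, value) pairs, so repeated keys and their
    # order are preserved without flattening lists by hand.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in discard
    ]
    # The split result is a named tuple; _replace returns a copy with
    # only the query field swapped.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Calling stripargs on the example URL from this post should give back http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=416174&NewsAreaID=2. One caveat: because the surviving arguments are decoded and then re-encoded, their escaping may change slightly (a %20 would come back as a +, for instance).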

Author: Tony Hirst


5 thoughts on “More Python Floundering… Stripping Google Analytics Tracking Codes Out of URLs”

  1. From a security point of view, would you not be better using a white-list of elements you DO accept rather than a black-list of things you are trying to filter out?

    As in all these matters nowadays, a properly posed question on StackOverflow is usually the best and fastest bet.

    1. Chatting to @offmessage around this, I came round to thinking that I really do need to start posting in to Stack Overflow, as well as voting more on the answers I make use of from there (I Google into SO *all* the time, and am mindful that I don’t contribute ‘I found this useful’ signals back… which is something that I intend to rectify.)

  2. I think you’re being too hard on yourself. The code’s not that bad. Elegance is really hard to define – but there are a few smells here.

    So, first of all, there are lots of index numbers being used – the only important one is that the p[4] is the query string. So we should try to lose the others.

    Secondly, the code has a symmetry about it that probably should be better expressed. The url is unpacked, and then later packed up again; the query string is similarly unpacked and then packed up again. We should try to get those steps closer together so that the symmetry is obvious.

    I tend to lose track of all the single character variable names; we can make those more expressive.

    And finally, we’re iterating over all the parameters in the query, when the only ones we’re interested in are the ones in the stop_key list. So let’s change that.

    Off the top of my head, something like this would be better:

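    import cgi, urllib, urlparse

    def strip_tracking_keys(query):
        tracking_keys = ['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content']
        # parse_qs returns a dict mapping each key to a list of values
        query_hash = cgi.parse_qs(query)
        for stop_key in tracking_keys:
            if stop_key in query_hash:
                del query_hash[stop_key]
        # doseq=True writes each list item back out as its own key=value pair
        return urllib.urlencode(query_hash, doseq=True)

    # url is the URL defined in the post above; list() makes the parts assignable
    urlparts = list(urlparse.urlparse(url))
    urlparts[4] = strip_tracking_keys(urlparts[4])
    print urlparse.urlunparse(urlparts)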

    Is that better? I don’t know. To me it seems a bit easier to get hold of. Elegant, I’m not sure. What do you think?

    It’s worth saying that things like dict comprehension in more modern pythons would perhaps make it even easier to express the intention.

    Anyway, it wasn’t that bad to start with…

    1. Thanks for that – those are some really handy principles for me to bear in mind – some of which I know and occasionally use (meaningful variable names rather than a, aa, aaa, etc;-) – others which I think I half know, but which are made concrete through your articulation of them – like symmetry, and better management of the iterator parameters.

      A couple of other bits of code crept in because of how I interpreted the errors Python threw; for example, I couldn’t seem to make an assignment into p[4] (“‘ParseResult’ object does not support item assignment”)?
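
On the item assignment error: the result of urlparse behaves like a tuple, and tuples cannot be modified in place, which is why the code in the post rebuilds the whole six-part tuple by hand. The sketch below shows two possible ways round that, together with the dict comprehension mentioned in the comment above. It assumes Python 3 (urllib.parse rather than the 2.5 modules used in the post), and the function names strip_url and strip_url_via_list are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

TRACKING_KEYS = {'utm_source', 'utm_medium', 'utm_campaign',
                 'utm_term', 'utm_content'}


def strip_tracking_keys(query):
    # Dict comprehension: keep every argument that is not a tracking key.
    kept = {key: values for key, values in parse_qs(query).items()
            if key not in TRACKING_KEYS}
    # parse_qs maps each key to a list of values; doseq=True writes each
    # list item back out as its own key=value pair.
    return urlencode(kept, doseq=True)


def strip_url(url):
    parts = urlparse(url)
    # parts[4] = ... would raise TypeError here, because the result is a
    # tuple. Option 1: _replace returns a new result with one field swapped.
    return urlunparse(parts._replace(query=strip_tracking_keys(parts.query)))


def strip_url_via_list(url):
    # Option 2: copy the parts into a mutable list and assign by index.
    parts = list(urlparse(url))
    parts[4] = strip_tracking_keys(parts[4])
    return urlunparse(parts)
```

Both functions should return the same cleaned URL; the _replace version avoids the index numbers altogether, while the list version stays closer to the original code.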
