Google Insights for Search (on Youtube too…)

It seems that Google opened up a supercharged variant of Google Trends over the last week or two: Google Insights for Search.

One useful feature the new service offers over the original trends service is the ability to compare the relative volumes for the same search term over several different time periods:

It’s also possible to get a breakdown by geography, or, as with Google Trends, compare volumes for different search terms.

Along with search volume trends, you also get insight into the geographical distribution of where searches are originating from (though this sort of view is always subject to interpretation!), and maybe more interestingly, related search terms and “rising searches” – that is, search phrases that have increased in volume over the specified period.

The URLs appear to be hackable/bookmarkable, too, which means that I can also bookmark them in Trendspotting (which I really need to tinker with on the templates front, at least to display inline graphs on the most recent entry, and maybe offer a preview link, too…).

I have to admit I probably wouldn’t have posted about this were it not for the fact that some of the insight views have also appeared on Youtube, at least for personally uploaded videos:

And here are the views…

Viewing by geography:

Relative popularity by geographical region:

How people came to view the movie… (i.e. “discovery”):

And finally, viewer demographics:

It’ll be interesting to see where Google go with their data products; as well as Google Insights for Search (and Google Trends), there’s also Google Analytics, Feedburner (which hasn’t yet been integrated into Blogspot – which is lacking in any stats/data tools, I think?) and Google Webmaster Tools.

(There are also tools relating to Adsense/Adwords as well, of course, including this one I just found – a keyword recommender for a given URL: Google Adwords: Keyword Tool.)

And then, of course, there are all the Google visualisation widgets that are starting to appear for Google Spreadsheets, as well as around the Google visualisation API…

Library Analytics (Part 1)

Having had a wonderful time at ILI2007 last year (summary of my talk, according to Brian Kelly – “For most of the people, most of the time, Google’s good enough – get over it…”, though I like to think I was actually talking about the idea of search hubs), I’ve joined forces with Hassan Sheikh from the OU Library on a paper at this year’s ILI2008 on the topic of using Google Analytics to track user behaviour on the Library website…

First up, it’s probably worth pointing out the unique organisation of the OU, because this impacts on the way the Library website is used.

The OU is a distance learning organisation with tens of thousands of active, offsite students; a campus, which is home to teaching academics (course writers), researchers, “academic related” services (software developers, etc.), and administrators; several regional offices; and part-time Associate Lecturers (group tutors), who typically work from home, although they may also work full- or part-time for other educational institutions.

The Library is a “trad” Library, in that it is home to books and a physical journal collection (as well as an OU course materials archive and several other collections) that are typically used by on-campus academics and researchers. The Library has also been quite go-ahead in obtaining online access to journal, ebook, image and reference collections – online access means that these services can be delivered to our student body (whereas the physical collections are used in the main by OU academic and research staff…. I assume…!;-)).

Anyway, to ease myself back into thinking about “Library Analytics”, (I haven’t looked at the Library stats for several months now), here are some warm-up exercises/starting point observations I made, for whatever they’re worth… (i.e. statements of the bleedin’ obvious;-)

Firstly, can we segment users into onsite and offsite users? (I’m pretty sure Hassan was running separate reports for these different groups, but if so, I don’t have access to them…)

Even from just the headline report, it appears that a ‘just about significant’ amount of traffic is coming from the intranet.

Just to get my eye in, is this traffic coming from the OU campus at Walton Hall? If we look at the intranet as the traffic source, and segment according to the Network Location of the user (that is, the IP network they’re on), we can see the traffic is predominantly local:

By the by, if I’m reading the following report correctly, we can also see that most of the intranet traffic is incoming from the intranet homepage…

And as you might expect, this traffic comes on weekdays…

So here’s a working assumption then (and one that we could probe later for real insight in any principled cases where it doesn’t hold true!): most referrals from the OU intranet occur Monday to Friday, from onsite users, via the intranet homepage.

Secondly, how well is the Library front page working? Whilst not as quick to read as a heat map, the Google Analytics site overlay can provide a quick way of summarising the most popular links on a page (notwithstanding its faults, such as appearing not to disambiguate certain links…)

A quick glimpse suggests the search links need dumping, and more real estate should be given over to the “Journals” and “Databases” links that are currently in the left hand sidebar, and which get 20% and 19% of the click-thrus respectively. Despite the large areas of the screen given over to the image-based navigation, they aren’t pulling much traffic. (That said, if we segment the users it might well be the case that the images in the middle of the page disproportionately attract clicks from certain sorts of user? I don’t think it’s possible to segment this out in the general report, however? For that, I guess we need to define some separate reports that are pre-segmented according to referrer?)

Just chasing the traffic a little more, I wonder if there are a few, popular databases or whether traffic is distributed over all of them equally? The Library databases page is pretty horrible – a long alphabetical list of databases – so can the analytics suggest ways of helping people find the pages they want?

So how are things distributed?

Well – it seems like some databases are more popular than others… but just how true is that observation…?

Let’s do a bit more drilling to see what people are clicking through to from the databases pages… I have to admit that here I start to get a bit confused, because the analytics are giving me two places where databases are being reached from, whereas I can only find one of the paths on the website…

Here’s the one I can find – traffic from:
http://library.open.ac.uk/find/databases/index.cfm:

And here’s what I can’t find on the website – traffic from:
http://library.open.ac.uk/databases/database/:

They both identify the same databases as most popular, though which databases those are I’ll leave for another day… because as you’ll see in a minute, this might be false popularity…

Why? Well let’s just see where the traffic for one of the most popular databases is coming from over the sample period I’ve been playing with:

Any idea why the traffic isn’t coming from the OU, but is coming from other HEIs???

Well, I happen to know that Bath, Brighton and Durham are used for OU residential schools, so I suspect that residential school students, after a reminder about the OU online Library services, are having a play, and maybe even participating in some information literacy activities that the OU Library trainers (as well as some of the courses) run at residential school…

Data – don’t ya just love it…? ;-) It sets so many traps for you to fall into!

Library Analytics (Part 2)

In Library Analytics (Part 1), I did a few “warm-up” exercises looking at the OU Library website from a Google Analytics perspective.

In this post, I’m going to do a little more scene setting, looking at how some of Google Analytics visual reports can provide a quick’n’dirty, glanceable display of the most heavily trafficked areas of the Library website, along with the most significant sources of traffic.

It may seem like the observations are coming from all over the place, but there is a method to the madness as I’ll hopefully get round to describing in a later post!

Whilst these reports are pretty crude, they do provide a good starting point for taking the first steps in a series of more refined questions which, in turn, will hopefully start to lead us towards some sort of insight about which areas of the website are serving which users and maybe even for what purpose… And as that rather clunky sentence suggests, this is likely to be quite a long journey, with the likelihood of more than a few wrong turns!

Most Popular Pages
Here’s a glimpse of the most heavily trafficked pages for the Library website – just to check there are no ‘artefacts’ arising from things like residential schools, I’ve compared the data for two consecutive two month periods (the idea being that if the bimonthly averages are similar, we can hope that this is a reasonably fair ‘steady state’ report of the state of the site).

Most significant pages (“by eye” – that is, using the pie chart display):

To view the proportions excluding the homepage, we can filter the report using a regular expression:

What this does is exclude the “/” page – that is, the library homepage. (IMHO, some understanding of regular expressions is a core information skill ;-)
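As a concrete sketch of what that regular expression filter is doing, here’s the same idea applied to some made-up report rows (the paths and pageview counts are invented purely for illustration):

```javascript
// Invented report rows standing in for the Google Analytics top content report.
const rows = [
  { page: "/",                 pageviews: 5000 },
  { page: "/find/journals/",   pageviews: 1200 },
  { page: "/find/databases/",  pageviews: 1100 },
  { page: "/find/eresources/", pageviews: 900 },
];

// "^/$" matches the bare "/" path and nothing else, so negating the test
// excludes just the homepage row and keeps everything else.
const homepage = /^\/$/;
const filtered = rows.filter(row => !homepage.test(row.page));
```

The same pattern dropped into the report filter box has the same effect: every row except the “/” homepage row survives.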

A bar graph allows us to compare the bimonthly figures – they seem to be reasonably correlated (we could of course do a rank correlation, or similar, to see if the top pages ordering is really the same…):
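For what it’s worth, the rank correlation mentioned above is easy enough to sketch. Here’s a minimal Spearman calculation over invented bimonthly pageview counts (it assumes no tied values, which the quick `ranks()` helper doesn’t handle):

```javascript
// Rank values in descending order: rank 1 = largest. Assumes no ties.
function ranks(values) {
  const sorted = values.slice().sort((a, b) => b - a);
  return values.map(v => sorted.indexOf(v) + 1);
}

// Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
// where d is the difference between the two ranks of each item.
function spearman(xs, ys) {
  const rx = ranks(xs), ry = ranks(ys);
  const n = xs.length;
  const d2 = rx.reduce((sum, r, i) => sum + (r - ry[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}

// Invented pageview counts for the same five pages in two periods:
const periodA = [5000, 1200, 1100, 900, 400];
const periodB = [4800, 1150, 1250, 870, 390];
const rho = spearman(periodA, periodB); // close to 1 => ordering is stable
```

A rho near 1 would confirm by calculation what the bar chart suggests by eye: the top-pages ordering is essentially the same in both periods.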

So to summarise – the top pages (homepage aside) are (from the URLs):

  • Journals
  • Databases
  • eResources

The eResources URL actually refers to the subject collection (“Online collections by subject”) page.

The top three pages are all linked to from the same navigation area on the OU Library website homepage – the left-hand navigation sidebar:

The eResources link (that is, the subject collections/online collections by subject page) is actually the Your subject link.

Going forward, a good thing to find out at the next level down would be to see which are the most popular databases, journals and resource collections and maybe check that these are in line with Library expectations.

We might also want to explore the extent to which different user segments (students, researchers etc.) use the different areas of the site in similar or different ways. (Going deeper into the analysis (i.e. to a deeper level of user segmentation), we might even want to track the behaviour of students on different courses (or residential school maybe?) and report these findings back to the appropriate course team.)

Top Content areas
The previous report gave the top page views on the site – but what are the most heavily used “content areas”? The Library site is, in places, reasonably disciplined in its use of a hierarchical URL structure, so by using the content drilldown tool, we should be able to see which are the most heavily used areas of the website:

The “/find” page/path element is a bit of a kludge, really (as a note to self: explore the use of this page in some detail…)

If we drill down into the content being hit below http://library.open.ac.uk/find/*, we find that the eresources area (i.e. subject collections/Your subject) is actually a hotbed of activity:

(Note that in the above figure, the “/” refers to the “/find” homepage http://library.open.ac.uk/find/ rather than the Library website homepage http://library.open.ac.uk/.)

So what can we say? The front page is driving lots of traffic to database, journal and subject collection/”Your subject” areas, and lots of activity is going on in the subject area in particular.

Questions we might want to bear in mind going forward – how well does activity in different subject areas compare?

Top traffic sources
Again using the pie chart display, we can look at the top traffic sources by eye:

Again, let’s just check (by eye) that the bi-monthly reports are “typical”:

(It’s interesting to see the College of Law cropping up in there… Do we run a course from a learning environment on that domain, I wonder?)

learn.open.ac.uk is the Moodle/VLE domain, so it certainly seems like traffic is coming in from there, which is a Good Thing:-). From the previous post, we can guess that most of the intranet traffic is coming from people onsite at the OU – i.e. they’re staff or researchers.

Just to check it’s the students that are coming in from the VLE, rather than OU staff, we can use the technique from the previous post in this series (where we found that most intranet-sourced traffic comes from the OU campus) to check the Network Location view of users referred from learn.open.ac.uk:

So, we can see that the learn.open.ac.uk traffic is in the main not coming from the OU campus (network location: open university), which is as we’d expect, because we have no significant numbers of onsite undergraduate students.

In a traditional university library, you’d maybe expect way more traffic to be coming from onsite computer facilities, and in that case you may be able to find a way of segmenting users according to how they are accessing the network – via personal wifi connected laptops, for example, or public access machines in the library itself.

(Just by the by, I don’t know whether the ISP data is valuable (particularly if you look at analytics from the http://www.open.ac.uk domain, which gets way more traffic than the library) in terms of being information we can sell to ISPs or use as the basis for exploring a partnership with a particular ISP?)

Okay, that’s enough for today, a bit of a ramble again, but we’re trying to get our eye in, right, and see what sorts of questions we might be able to ask, whilst checking along the way that the bleedin’ obvious actually is…;-)

And today’s insight? The inconsistency in naming around “Your Subject”, “Online Collections by Subject”, http://library.open.ac.uk/find/eresources etc makes reading the report tricky. This could be addressed by using a filter to rewrite the URLs etc as displayed in the report, but it also indicates possible confusion for users in the site design itself? There’s also a recurrence of the potential confusion around http://library.open.ac.uk/databases and http://library.open.ac.uk/find/databases, that I picked up on in the previous post?

A second insight? The content drilling view helps show where most of the onsite activity is taking place – in the collections by subject area.

Library Analytics (Part 3)

In this third post of an open-ended series looking at the OU Library website using Google Analytics, I’ll pick out some ‘headline’ reports that describe the most popular items in one of the most popular content areas identified in Library Analytics (Part 2): databases.

Most Popular Databases
I can imagine that a headline report that everyone will go “ooh” about (notwithstanding the fact that the report is more likely to be properly interesting when you start to segment out the possibly different databases being looked at by different user segments;-) is the list of “top databases” (produced by filtering the top content report page on URLs that contain the term “database”).

So how do we work out what those database URLs actually point to? Looking at the HTML of the http://library.open.ac.uk/find/databases page, here’s where the reference to the most popular database crops up:

<a title="This link opens up a new browser window" target="_blank" href="/find/databases/linking.cfm?id=337296" onClick="javascript:pageTracker._trackPageview('/databases/database/337296');">LexisNexis Butterworths</a>

The implied URL http://library.open.ac.uk/databases/database/337296 doesn’t actually go anywhere real… it’s an artefact created for the analytics tracking (though it does contain the all-important internal OU Library database ID (337296 in this case)).

It is possible to create a set of ‘rewrite’ rules that will map these numerical database IDs onto the name of the database. Alternatively, I’m guessing that when the database collection page is written, the HTML could track against database name, rather than ID (e.g. using a construction along the lines of onClick="javascript:pageTracker._trackPageview('Database: LexisNexis Butterworths');").
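By way of a sketch, here’s one way such an ID-to-name rewrite might work (the ID/name pairs are ones identified elsewhere in this post; the function name and lookup-table approach are made up for illustration):

```javascript
// Hypothetical lookup table mapping internal OU Library database IDs
// (as they appear in the tracked URLs) onto human-readable names.
const databaseNames = {
  "337296": "LexisNexis Butterworths",
  "271892": "JSTOR",
  "338947": "Westlaw",
  "208607": "PsycINFO",
  "403673": "Academic Search Complete",
};

// Rewrite a tracked path like "/databases/database/337296" into a
// readable report label; leave anything else untouched.
function labelDatabasePath(path) {
  const match = path.match(/\/databases\/database\/(\d+)$/);
  if (!match) return path;
  const name = databaseNames[match[1]];
  return name ? "Database: " + name : path;
}
```

Running a rule like this over the report (or configuring the equivalent as an Analytics filter) would make the “top databases” table readable at a glance, without touching the page markup.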

For now though, here’s a quick summary of the top 5 databases, worked out by code inspection!

  1. LexisNexis Butterworths (337296);
  2. JSTOR (271892);
  3. Westlaw (338947)
  4. PsycINFO (208607)
  5. Academic Search Complete (403673)

Just to show you what I mean by things being more interesting when you start to segment traffic to the most popular databases by identifiable referrer, here’s a comparison of the referral source for users looking at Academic Search Complete (403673), PsycINFO (208607) and Westlaw (338947).

Firstly, Academic Search Complete:

In this case, there is a large amount of traffic coming from the intranet. Bearing in mind a comment on the first post, this traffic may be coming from personal bookmarks?

I may be in the error bar (i.e. outlier), but I do almost all my research / library work at home – but I log into the OU and go onto the library via the “my links” bit set to the OU journals and OU databases www page. So that would show as in intranet user? but I work remotely.

I could be wrong of course – so that’s one question to file away for a later day…

Secondly, PsycINFO (208607), the Content Detail report for which is easily enough found by searching on the Content Detail report page:

Here’s the source of traffic that spends some time looking at PsycINFO:

Here, we find a different effect. Most of the identifiable traffic is coming from direct links or the VLE, and the intranet is nowhere to be seen.

Note however the large amount of direct/unidentifiable traffic – this could hide a multitude of sins (and mask a multitude of user origins), so we should just remain wary and open to the idea we may have been misled!

So how can we try to gain an insight into that direct referral traffic (the traffic that arises from people typing the URL directly into their browser, or clicking on a browser bookmark)?

Well, to check that the traffic isn’t coming from direct traffic/bookmarks from users on the OU network other than via the intranet, we can look at the Network Location segment:

No sign of open university there in any significant numbers – so it seems that PsycINFO is more of a student resource than an onsite researcher resource.

Thirdly, Westlaw (338947). Who’s using this database?

It seems here that the single largest referrer is actually the College of Law.

We can segment against network location just to check the direct traffic isn’t coming from users on campus via browser bookmarks:

But some of it is coming from the College of Law? Hmmm.. Could that be a VPN thing, I wonder, or do they have an actual physical location?

Summary
So what insight(s) have we picked up in this post? Firstly, a dodgy ranking of most popular databases (dodgy in that the databases appear to be used by different constituencies of user). Secondly, a crude technique for getting a feel for who the users of a particular database are, based on original source/referrer and network location segmentations.

I guess there’s also a recommendation – that the buyer or owner of each database checks out the analytics to see if the users appear to be who they expect…!

And finally, to wrap this part up, it’s worth being sceptical no matter what precautions you put in place when trying to interpret the results; for example: How Does Google Analytics Track Conversion Referrals?.

Library Analytics (Part 4)

One of the things I fondly remember about doing physics at school was being told, at the start of each school year, how what we had been told the previous year was not quite exactly true, and that this year we would actually learn how the world worked properly…

And so as this series of posts about “Library Analytics” continues (that is, this series about the use of web analytics as applied to public Library websites), I will continue to show you examples of headline reports I have found initially compelling (or not), and then show why they are not quite right, and actually confusing at best, or misleading at worst…

Most Popular Journals
In the previous post in this series, we saw the most popular databases that were being viewed from the databases page. Is the same possible from the journals area? A quick look at the report for the find/journals/journals page suggests that such a report should be possible, but something is broken:

From the small amount of data there, the most popular journals/journal collections were as follows:

  1. JSTOR (271892);
  2. Academic Search Complete (403673);
  3. Blackwell Synergy (252307);
  4. ACM Digital Library (208448);
  5. IEEE Xplore (208545);

As with the databases, segmenting the traffic visiting these collections may provide insight as to which category of user (researcher, student) and maybe even which course is most active for a particular collection.

But what happened to the reporting anyway? Where has all the missing data gone?

I just had a quick look – the reporting from within the Journals area doesn’t currently appear to be showing anything…. err, oops?

Looking at the javascript code on each journal link:
onClick="javascript:urchinTracker('/find/journals/journals/323808')"
it’s not the same structure as the working code on the databases pages (which you may recall from the previous post in this series uses the tracking function pageTracker._trackPageview).

Looking at which tracking script is being used on the journals page (google-analytics.com/ga.js), I think the pageTracker._trackPageview function should be being used. urchinTracker is a function from the older urchin.js tracking script. Oops… I wonder whether anyone has been (not) looking at Journal use indicators lately (or indeed, ever…?!)
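Assuming that diagnosis is right, the fix would presumably be to bring the journal links into line with the markup on the databases page; something along these lines (the href shown is invented – only the onClick call is taken from the snippet above):

```html
<!-- Before: old urchin.js-style call, which ga.js doesn't track -->
<a href="/find/journals/linking.cfm?id=323808"
   onClick="javascript:urchinTracker('/find/journals/journals/323808')">...</a>

<!-- After: ga.js-style call, matching the working databases page -->
<a href="/find/journals/linking.cfm?id=323808"
   onClick="javascript:pageTracker._trackPageview('/find/journals/journals/323808');">...</a>
```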

Where is Journal Traffic Coming From (source location)?
So what sites are referring traffic to the journals area?

Well it looks as if there’s a lot of direct traffic coming in (so it may be worth looking at the network location report to see if we can tunnel into that), but there’s also a good chunk of traffic coming from the VLE (learn.open.ac.uk). It’d be handy to know which courses were sending that traffic, so we’ll just bear that in mind as a question for a later post.

Where is Journal Traffic Coming From (network locations)?
To get a feel for how much of the traffic to the journals “homepage” is coming from on campus (presumably OU researchers?) we can segment the report for the journals homepage according to network location.

The open university network location corresponds to traffic coming in from an OU domain. This report potentially gives us the basis for an “actionable” report, and maybe even a target… That is, to increase the number of page views (if not the actual proportion of traffic from on campus – we may be wanting to grow absolute traffic numbers from the VLE too) from the OU domain, as a result of increasing the number of researchers looking up journals from the Library journals homepage whilst at work on campus.

At this point, it’s probably as good a time as any to start to think about how we might interpret data such as ‘number of pages per visit’, ‘average time on site’ and bounce rate (see here for some definitions).

Just looking at the numbers going across the columns, we can see that there are different sorts of groupings of the numbers.

ip pools and open university have pages/visit around 12, an average time on site tending towards 4 minutes, about 16% new visits in the current period (down from 36% in the first period, so people keep coming back to the site, which is good, though it maybe means we’re not attracting so many new visitors), and a bounce rate a bit less than 60%, down from around 70% in the earlier period (so fewer people are entering at the journals page and then leaving the site immediately).

Compare this to the addresses ip for home clients and greenwich university reports, where there is just over one page per visit, only a few seconds on site, hardly any new visits (which I don’t really understand?) and a very high bounce rate. These visitors are not getting any value from the site at all, and are maybe being misdirected to it? Whatever the case, their behaviour is very different to that of the open university visitors.

Now if I was minded to, I’d run this data through a multidimensional clustering algorithm to see if there were some well defined categories of user, but I’m not in a coding mood, so maybe we’ll just have a look to see what patterns are visually identifiable in the data.
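Just to pin down what I mean by that, here’s a minimal k-means sketch over two of the report dimensions (pages/visit and average time on site, in minutes). The visit data and the choice of two clusters are invented purely for illustration:

```javascript
// Toy k-means over 2D points [pagesPerVisit, avgMinutesOnSite].
// Initial centroids are just the first k points; fine for a sketch.
function kmeans(points, k, iterations = 20) {
  let centroids = points.slice(0, k).map(p => p.slice());
  let labels = new Array(points.length).fill(0);
  for (let iter = 0; iter < iterations; iter++) {
    // Assignment step: each point joins its nearest centroid.
    labels = points.map(p => {
      let best = 0, bestDist = Infinity;
      centroids.forEach((c, j) => {
        const d = (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2;
        if (d < bestDist) { bestDist = d; best = j; }
      });
      return best;
    });
    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((c, j) => {
      const members = points.filter((_, i) => labels[i] === j);
      if (members.length === 0) return c;
      return [
        members.reduce((s, p) => s + p[0], 0) / members.length,
        members.reduce((s, p) => s + p[1], 0) / members.length,
      ];
    });
  }
  return labels;
}

// Invented data: 'engaged' visits (~12 pages, ~4 min) vs 'bouncing'
// visits (~1 page, a few seconds), echoing the patterns in the report.
const visits = [[12, 4.1], [11, 3.8], [13, 4.3], [1.2, 0.1], [1.0, 0.05], [1.1, 0.08]];
const labels = kmeans(visits, 2);
```

On this toy data the two clusters that fall out correspond to the ‘engaged’ and ‘bouncing’ visit patterns described above; real report data would of course be messier.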

So, taking the top 20 results from the most recent reporting period shown above, let’s upload it to Many Eyes and have a play (you can find the data here).

First up, let’s see if we can spot a correlation between time on site and pages/visit (which is exactly what we’d expect, of course) (click through the image to see the interactive visualisation on Many Eyes):

Okay – so that looks about right; and the higher bounce rates seem to correspond to low average time on site/low pages per visit, which is what we’d expect too. (Note that by hovering over a data point, we can see which network location the data corresponds to.)

We can also see how the scatterplot gives us a way of visualising 3 dimensions at the same time.

If we abuse the histogram visualisation, we have an easy way of looking at which network locations have a high bounce rate, or time on site (a visual equivalent of ‘sort on column’, I guess? ;-)

Finally, a treemap. Abusing this visualisation gives us a way of comparing two numerical dimensions at the same time.

Note that using network location here is not necessarily that interesting as a base category… I’m just getting my eye in as to what Many Eyes visualisations might be useful! For the really interesting insights, I reckon a grand or two per day, plus taxes and expenses, should cover it ;-) But to give a tease, here’s the raw data relating to the Source URLs for traffic that made it to the Journals area:

Can you see any clusters in there?! ;-)

Summary
Okay – enough for now. Take homes are: a) the wrong tracking function is being used on the journals page; b) the VLE is providing a reasonable amount of traffic to the journals area of the Library website, though I haven’t identified (yet!) exactly which courses are sourcing that traffic; c) Many Eyes style visualisations may provide a glanceable, visual view over some of the Google Analytics data.

Library Analytics (Part 5)

Another day, another Library Analytics post… Today, a quick glimpse at another popular content area on the OU Library website, the “Subject Resource Collections” that dangle off http://library.open.ac.uk/find/eresources/.

Most Popular Subject Resource Collections
The distribution of visits to subject resource collections is pretty flat, as the following report shows:

That said, the most popular categories are:

  1. the law/law collection:

  2. the Law_legislation page:

  3. the Psychology collection;
  4. the Education collection;
  5. the Science – General collection.

Thinking back to the previous post in this series, and the example of using Many Eyes to visualise multiple data dimensions at the same time, a similar technique might be useful here just to check that each resource is attracting similar usage stats in respect of time on site, average pages per visit, bounce rates, etc.?

Just by the by, if we look at the Entrance Source for traffic that ends up on the selector page for Psychology eresources, we can see that most of the traffic is coming in from the VLE.

The College of Law appears to be providing most of the Law/Law traffic though…

Going forwards, it would probably be useful for the collections whose traffic sourced from the VLE to try to identify which courses were providing that traffic. This information might then provide the basis for “KPIs” relating to the performance of particular Library resources on a particular course.

Onsite Search Behaviour
One of the optional reports on Google Analytics (that is, one that needs to be enabled) is tracking of onsite search behaviour using the website’s own search tool. Popular search terms identified by this report may well indicate failures in support for navigation-through-browsing – in the case of the OU Library site, it seems that information about “Athens” isn’t the easiest thing to find just by clicking…

The following report is particularly interesting from a trends point of view:

The step change at the end of March, with the higher incidence of internal search terms prior to then, suggests a change in user behaviour (given that all the other reports have been showing pretty steady traffic numbers over the whole period). I’m guessing – and this is checkable – that there was a Library website redesign at the end of March, although step changes (particularly in the case of users segmented by course, if such a thing were possible) might also be indicative of participation in scheduled Library related activities within a course in presentation. I’ll try to post a bit more about that at a later date…

Another informative report describes the proportion of visits in which the user engages in onsite search. Users tend to navigate websites either by browsing (clicking on links) or by search. A high incidence of search may indicate weaknesses in navigation design via clickable links. So how does the Library website appear to do?

Well – it seems that users are clicking their way to pages rather than searching for them… (though this may in turn reflect issues with discovery and design of the search page…!)

The Help Page
Another source of information about how well the site is working for visitors is to look at usage around the Help page. I’m not going to go into this page in any depth, but here’s an inkling of what sorts of information we might be able to extract from it…

Who’s looking at how to cite a reference?

Seems like Google traffic is high up here? So maybe another role for the Library website is outreach, in the sense of informal education? And maybe the “How to cite a reference” page would be a good place to place a link to the free Safari info skills minicourse, and an ad for TU120 Beyond Google? ;-)

Library Analytics (Part 6)

In this post, I’m going to have a quick look at some filtered reports I set up a few days ago to see if they are working as I expected.

What do I mean by this? Well, Google Analytics lets you create filters that can be used to create reports for a site that focus on a particular area of the website or user segment.

At their simplest, filters work in one of two main ways (or a combination of both). Firstly, you can filter the report so that it only covers activity on a subset of the website as a whole (such as all pages along the path http://library.open.ac.uk/find/databases). Secondly, you can filter the report so that it only covers traffic that is segmented according to user characteristics (such as users arriving from a particular referral source).

Here are a couple of examples: firstly, a filter that will just report on traffic that has been referred from the VLE:

Using this filter will allow us to create a report that tracks how the Library website is being used by OU students.

Another filter in a similar vein lets us track just the traffic arriving from the College of Law:

A second type of filter allows us to just provide a report on activity within the eresources area of the Library website:

Note that multiple filters can be applied to a single report profile, so I could for example create a report profile that just looked at activity in the Journals area of the website (by applying a subdirectory filter) that came from users on the OU campus (by also applying a user segment filter).
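The two filter types above boil down to regular expression matches, one against the referrer and one against the page path. Here’s a sketch (the patterns are illustrative, not the exact ones configured in the live report profiles):

```javascript
// 1. A user-segment filter: keep only traffic referred from the VLE domain.
const fromVLE = /^learn\.open\.ac\.uk/;

// 2. A content-area filter: keep only pages under the /find/databases path.
const databasesArea = /^\/find\/databases\//;

// A referrer of "learn.open.ac.uk/mod/resource/..." passes the first filter;
// a page path of "/find/databases/index.cfm" passes the second. Applying
// both patterns to one profile combines the segment and the content area.
```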

So how does this help?

If we assume there are several different user types on the Library website (students, onsite researchers, students on partner courses (such as with the College of Law), users arriving from blind Google searches, and so on), then we can use filters to create a set of reports, each one covering a different user segment. Adding all the separate reports together would give us the “total” website report that I was using in the first five posts in this series. Looking at each report separately allows us to understand the different needs and behaviours of the different user types.

Although it is possible to segment reports from the whole site report, as I have shown previously, segmenting the report ‘on the way in’ through the application of one or more filters allows you to use the whole raft of Google Analytics reports to look at a particular segment of the data as a whole.

So for example, here’s a view of the report filtered by referrer (college of law):

Where is the traffic from the College of Law landing?

Okay – it seems like all the traffic is coming in to one page on the Library website from the College of Law?! Now this may be true (there may be a single link on the College of Law website to the OU Library), or it may reflect an error in the way I have crafted the rule. One to watch…

How about the report filtered by users referred from the VLE?

This report looks far more natural – users are entering the site at a variety of locations, presumably from different links in the VLE.

Which is all well and good – but it would be really handy if we knew which courses the students were coming from, and/or which VLE pages were actually sending the traffic.

The way to do this is to capture the whole referrer URL (not just the “http://learn.open.ac.uk” part) and report this as a user defined value, something we can do with another filter:

Segmenting the majority landing page data (the Library homepage) by this user defined value gives the following report:

The full referrer URLs are, in the main, really nasty Moodle URLs that obfuscate the course behind an arbitrary resource ID number.
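For what it’s worth, pulling the arbitrary resource ID back out of one of those nasty Moodle referrer URLs is straightforward; here’s a minimal sketch (the lookup from ID to course code would still have to be built by hand or scraped):

```python
from urllib.parse import urlparse, parse_qs

def moodle_resource_id(referrer):
    """Return the 'id' parameter from a Moodle resourcepage URL, or None."""
    query = parse_qs(urlparse(referrer).query)
    return query.get("id", [None])[0]

rid = moodle_resource_id(
    "http://learn.open.ac.uk/mod/resourcepage/view.php?id=36196")
# Given a (hand-built) table mapping resource IDs to course codes,
# rid could then be resolved to something like "E891-07J".
```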

Having a quick look at the pages, the top five referrers over the short sample period the report has been running (and a Bank Holiday weekend at that!) are:

  1. EK310-08: Library Resources (53758);
  2. E891-07J: Library Resources (36196);
  3. DD308-08: Library Resources (54466);
  4. DD303-08: Library Resources (49710);
  5. DXR222-08E: Library Resources (89798);

If we knew all the VLE pages in a particular course that linked to the Library website, we could produce a filtered report that just recorded activity on the Library website that came from that course on the VLE.

Library Analytics (Part 7)

In the previous post in this series, I showed how it’s possible to identify traffic referred from particular course pages in the OU VLE, by creating a user defined variable that captured the complete (nasty) VLE referrer URL.

Now I’m not entirely sure about this, but I think that the Library provides URLs to the VLE via an RSS feed. That is, the Library controls the content that appears on the Library Resources page when a course makes such a page available.

In the Google Analytics FAQ answer How do I tag my links?, a method is described for adding additional tags to a referrer URL that Google Analytics can use to segment traffic referred from that URL. Five tags are available (as described in Understanding campaign variables: The five dimensions of campaign tracking):

Source: Every referral to a web site has an origin, or source. Examples of sources are the Google search engine, the AOL search engine, the name of a newsletter, or the name of a referring web site.
Medium: The medium helps to qualify the source; together, the source and medium provide specific information about the origin of a referral. For example, in the case of a Google search engine source, the medium might be “cost-per-click”, indicating a sponsored link for which the advertiser paid, or “organic”, indicating a link in the unpaid search engine results. In the case of a newsletter source, examples of medium include “email” and “print”.
Term: The term or keyword is the word or phrase that a user types into a search engine.
Content: The content dimension describes the version of an advertisement on which a visitor clicked. It is used in content-targeted advertising and Content (A/B) Testing to determine which version of an advertisement is most effective at attracting profitable leads.
Campaign: The campaign dimension differentiates product promotions such as “Spring Ski Sale” or slogan campaigns such as “Get Fit For Summer”.

(For an alternative description, see Google Analytics Campaign Tracking Pt. 1: Link Tagging.)

The recommendation is that campaign source, campaign medium, and campaign name should always be used (I’m not sure if Google Analytics requires this, though?).

So here’s what I’m proposing: how about we treat a “course as campaign”? What are sensible mappings/interpretations for the campaign variables?

  • source: the course?
  • medium: the sort of link that has generated the traffic, such as a link on the Library resources page?
  • campaign: the mechanism by which the link got into the VLE, such as a particular class of Library RSS feed or the addition of the link by a course team member?

By creating URLs that point back to the Library website for display in the VLE, tagged with “course campaign” variables, we can more easily track (i.e. segment) user activity on the Library website that results from students entering the Library site via that referring link.

Where course teams upload Library URLs themselves, we could maybe provide a “URL Generator Tool” (like the “official” Tool: URL Builder) that accepts a library URL and then automatically adds the course code (source), a campaign flag saying the link was uploaded by the course team, a medium flag saying the link is provided as part of assessment, and so on. The “content” variable might capture a section number in the course, or information about the particular activity the resource relates to?

For example, the tool would be able to create something like:
http://learn.open.ac.uk/mod/resourcepage/view.php?id=36196&utm_source=E891-07J&utm_medium=Library%2Bresource&utm_campaign=Library%2BRSS%2Bfeed
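A sketch of such a “URL Generator Tool” might look like the following – note that the parameter values and the function name are illustrative, not an agreed naming scheme:

```python
from urllib.parse import urlencode

def tag_library_url(url, course_code, medium, campaign, content=None):
    """Append Google Analytics campaign variables to a library URL."""
    params = {
        "utm_source": course_code,
        "utm_medium": medium,
        "utm_campaign": campaign,
    }
    if content:
        params["utm_content"] = content
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)

tagged = tag_library_url(
    "http://library.open.ac.uk/find/databases",
    course_code="E891-07J",
    medium="Library resource",
    campaign="Library RSS feed")
```

A course team member would paste in a plain library URL and get back the tagged version to drop into the VLE page.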

Annotating links in this way would allow Library teams to see what sorts of link (in terms of how they get into the VLE) are effective at generating traffic back to the Library, and could also enable the provision of reports back to course teams showing how effectively students on a particular course are engaging with Library resources from links on the VLE course pages.

The Tesco Data Business (Notes on “Scoring Points”)

One of the foundational principles of the Web 2.0 philosophy that Tim O’Reilly stresses relates to “self-improving” systems that get better as more and more people use them. I try to keep a watchful eye out for business books on this subject – books about companies who know that data is their business; books like the somehow unsatisfying Competing on Analytics, and a new one I’m looking forward to reading: Data Driven: Profiting from Your Most Important Business Asset (if you’d like to buy it for me… OUseful.info wishlist;-).

So as part of my summer holiday reading this year, I took away Scoring Points: How Tesco Continues to Win Customer Loyalty, a book that tells the tale of the Tesco Loyalty Card. (Disclaimer: the Open University has a relationship with Tesco, which means that you can use Tesco clubcard points in full or part payment of certain OU courses. It also means, of course, that Tesco knows far, far more about certain classes of our students than we do…)

For those of you who don’t know of Tesco, it’s the UK’s dominant supermarket chain, taking a huge percentage of the UK’s daily retail spend, and is now one of those companies that’s so large it can’t help but be evil. (They track their millions of “users” as aggressively as Google tracks theirs.) Whenever you hand over your Tesco Clubcard alongside a purchase, you get “points for pounds” back. Every 3 months (I think?), a personalised mailing comes with vouchers that convert points accumulated over that period into “cash”. (The vouchers are in nice round sums – £1, £2.50 and so on. Unconverted points are carried over to the convertible balance in the next mailing.) The mailing also comes with money off vouchers for things you appear to have stopped purchasing, rewards on product categories you frequently buy from, or vouchers trying to entice you to buy things you might not be in the habit of buying regularly (but which Tesco suspects you might desire!;-)

Anyway, that’s as may be – this is supposed to be a brief summary of corner-turned pages I marked whilst on holiday. The book reads a bit like a corporate briefing book, repetitive in parts, continually talking up the Tesco business, and so on, but it tells a good story and contains more than a few gems. So here for me were some of the highlights…

First of all, the “Clubcard customer contract”: more data means better segmentation, means more targeted/personalised services, means better profiling. In short, “the more you shop with us, the more benefit you will accrue” (p68).

This is at the heart of it all – just like Google wants to understand its users better so that it can serve them with more relevant ads (better segmentation * higher likelihood of clickthru = more cash from the Google money machine), and Amazon seduces you with personal recommendations of things it thinks you might like to buy based on your purchase and browsing history, and the purchase history of other users like you, so Tesco Clubcard works in much the same way: it feeds a recommendation engine that mines and segments data from millions of people like you, in order to keep you engaged.

Scale matters. In 1995, when Tesco Clubcard launched, dunnhumby, the company that has managed the Clubcard from when it was still an idea to the present day, had to make do with the data processing capabilities that were available then, which meant that it was impossible to track every purchase, in every basket, from every shopper. (In addition, not everything could be tracked by the POS tills of the time – only “the customer ID, the total basket size and time the customer visited, and the amount spent in each department” (p102)). In the early days, this meant data had to be sampled before analysis, with insight from a statistically significant analysis of 10% of the shopping records being applied to the remaining 90%. Today, they can track everything.

Working out what to track – first order “instantaneous” data (what did you buy on a particular trip, what time of day was the visit) or second order data (what did you buy this time you didn’t buy last time, how long has it been between visits) – was a major concern, as were indicators that could be used as KPIs in the extent to which Clubcard influenced customer loyalty.

Now I’m not sure to what extent you could map website analytics onto “store analytics”, but some of the loyalty measures seem familiar to me. Take, for example, the RFV analysis (pp95-6):

  • Recency – time between visits;
  • Frequency – “how often you shop”;
  • Value – how profitable is the customer to the store (if you only buy low margin goods, you aren’t necessarily very profitable), and how valuable is the store to the customer (do you buy your whole food shop there, or only a part of it?).
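The RFV calculation itself is simple enough to sketch; here’s a rough toy version over one customer’s visit history (the dates and margin figures are invented, and real value scoring would be far more involved than summing margins):

```python
from datetime import date

# (visit date, profit margin on the basket) - invented example data
visits = [
    (date(2008, 8, 1), 4.20),
    (date(2008, 8, 8), 3.10),
    (date(2008, 8, 15), 5.60),
]

today = date(2008, 8, 20)
recency = (today - max(d for d, _ in visits)).days  # days since last visit
frequency = len(visits)                             # visits in the period
value = sum(margin for _, margin in visits)         # profit contribution
```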

Working out what data to analyse also had to fit in with the business goals – the analytics needed to be actionable (are you listening, Library folks?!;-). For example, as well as marketing to individuals, Clubcard data was to be used to optimise store inventory (p124). “The dream was to ensure that the entire product range on sale at each store accurately represented, in selection and proportion, what the customers who shopped there wanted to buy.” So another question that needed to be asked was how should data be presented “so that it answered a real business problem? If the data was ‘interesting’, that didn’t cut it. But adding more sales by doing something new – that did.” (p102). Here, the technique of putting data into “bins” meant that it could be aggregated and analysed more efficiently in bulk and without loss of insight.

Returning to the customer focus, Tesco complemented the RFV analysis with the idea of “Loyalty Cube” within which each customer could be placed (pp126-9).

  • Contribution: that is, contribution to the bottom line, the current profitability of the customer;
  • Commitment: future value – “how likely that customer is to remain a customer”, plus “headroom”, the “potential for the customer to be more valuable in the future”. If you buy all your groceries in Tesco, but not your health and beauty products, there’s headroom there;
  • Championing: brand ambassadors; you may be low contribution, low commitment, but if you refer high value friends and family to Tesco, Tesco will like you:-)

By placing individuals in separate areas of this chart, you can tune your marketing to them, either by marketing items that fall squarely within that area, or, if you’re feeling particularly aggressive, by trying to move them through the different areas. As ever, it’s contextual relevancy that’s the key.

But what sort of data is required to locate a customer within the loyalty cube? “The conclusion was that the difference between customers existed in each shopper’s trolley: the choices, the brands, the preferences, the priorities and the trade-offs in managing a grocery budget.” (p129).

The shopping basket could tell a lot about two dimensions of the loyalty cube. Firstly, it could quantify contribution, simply by looking at the profit margins on the goods each customer chose. Secondly, by assessing the calories in a shopping basket, it could measure the headroom dimension. Just how much of a customer’s food needs does Tesco provide?

(Do you ever feel like you’re being watched…?;-)

“Products describe People” (p131): one way of categorising shoppers is to cluster them according to the things they buy, and identify relationships between the products that people buy (people who buy this, also tend to buy that). But the same product may have a different value to different people. (Thinking about this in terms of the OU Course Profiles app, I guess it’s like clustering people based on the similar courses they have chosen. And even there, different values apply. For example, I might dip into the OU web services course (T320) out of general interest, you might take it because it’s a key part of your professional development, and required for your next promotion).

Clustering based on every product line (or SKU – stock keeping unit) is too highly dimensional to be interesting, so enter “The Bucket” (p132): “any significant combination of products that appeared from the make up of a customer’s regular shopping baskets. Each Bucket was defined initially by a ‘marker’, a high volume product that had a particular attribute. It might typify indulgence, or thrift, or indicate the tendency to buy in bulk. … [B]y picking clusters of products that might be bought for a shared reason, or from a shared taste” the large number of Buckets required for the marker approach could be reduced to just 80 Buckets using the clustered products approach. “Every time a key item [an item in one of the clusters that identifies a Bucket] was scanned [at the till], it would link that Clubcard member with an appropriate Bucket. The combination of which shoppers bought from which Buckets, and how many items in those Buckets they bought, gave the first insight into their shopping preferences” (p133).
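A toy version of the Bucket idea might look like this – the Bucket names and key products here are invented for illustration, not taken from the book:

```python
# Each Bucket is defined by a cluster of key products (all invented).
buckets = {
    "indulgence": {"luxury ice cream", "single malt whisky"},
    "thrift": {"value baked beans", "own-brand bread"},
    "bulk buying": {"24-pack toilet roll", "catering-size coffee"},
}

def bucket_profile(basket):
    """Count how many items in a basket fall into each Bucket."""
    profile = {}
    for item in basket:
        for name, key_items in buckets.items():
            if item in key_items:
                profile[name] = profile.get(name, 0) + 1
    return profile

profile = bucket_profile(
    ["value baked beans", "own-brand bread", "luxury ice cream", "milk"])
```

Items that aren’t key items (the milk, here) simply don’t register, which is the point: only the tell-tale products link a shopper to a Bucket.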

By applying cluster analysis to the Buckets (i.e. trying to see which Buckets go together) the next step was to identify user lifestyles (p134-5). 27 of them… Things like “Loyal Low Spenders”, “Can’t Stay Aways”, “Weekly Shoppers”, “Snacking and Lunch Box” and “High Spending Superstore Families”.

Identifying people from the products they buy and clustering on that basis is one way of working. But how about defining products in terms of attributes, and then profiling people based on those attributes?

Take each product, and attach to it a series of appropriate attributes, describing what that product implicitly represented to Tesco customers. Then by scoring those attributes for each customer based on their shopping behaviour, and building those scores into an aggregate measurement per individual, a series of clusters should appear that would create entirely new segments. (p139)

(As a sort of example of this, brand tags has a service that lets you see what sorts of things people associate with corporate brands. I imagine a similar sort of thing applies to Kellogg’s cornflakes and Wispa chocolate bars ;-)

In the end, 20 attributes were chosen for each product (p142). Clustering people based on the attributes of the products they buy produces segments defined by their Shopping Habits. For these segments to be at their most useful, each customer should slot neatly into a single segment, each segment needs to be large enough to be viable for it to be acted on, as well as being distinctive and meaningful. Single person segments are too small to be exploited cost effectively (pp148-9).
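The attribute-scoring step described above can be sketched in a few lines – the attribute names, scores and products below are all invented (the book says 20 attributes per product were used; I’m using two):

```python
# Invented per-product attribute scores for illustration.
product_attributes = {
    "luxury ice cream": {"indulgence": 0.9, "price_sensitivity": 0.1},
    "value baked beans": {"indulgence": 0.1, "price_sensitivity": 0.9},
}

def customer_profile(basket):
    """Aggregate attribute scores over everything a customer buys."""
    totals = {}
    for item in basket:
        for attr, score in product_attributes.get(item, {}).items():
            totals[attr] = totals.get(attr, 0.0) + score
    return totals

profile = customer_profile(["luxury ice cream", "value baked beans"])
# Customers with similar aggregate profiles would then be clustered
# into the Shopping Habits segments.
```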

Here a few more insights that I vaguely seem to remember from the book, that you may or may not think are creepy and/or want to drop into conversation down the pub:-)

  • calorie count – on the food side, calorie sellers are the competition. We all need so many calories a day to live. If you do a calorie count on the goods in someone’s shopping basket, and you have an idea of the size of the household, you can find out whether someone is shopping elsewhere (you’re not buying enough calories to keep everyone fed) and maybe guess when a competitor has stolen some of your business or when someone has left home. (If lots of shoppers from a store stop buying pizza, maybe a new pizza delivery service has started up. If a particular family’s basket takes a 15% drop in calories, maybe someone has left home)?
  • life stage analysis – if you know the age, you can have a crack at the life stage. Pensioners probably don’t want to buy kids’ breakfast cereal, or nappies. This is about as crude as useful segmentation gets – but it’s easy to do…
  • Beer and nappies go together – young bloke has a baby, has to go shopping for the first time in his life, gets the nappies, sees the beers, knows he won’t be going anywhere for the next few months, and gets the tinnies in… (I think that was from this book!;-)

Anyway, time to go and read the Tesco Clubcard Charter I think?;-)

PS here’s an interesting, related, personal tale from a couple of years ago: Tesco stocks up on inside knowledge of shoppers’ lives (Guardian Business blog, Sept. 2005) [thanks, Tim W.]

PPS Here are a few more news stories about the Tesco Clubcard: Tesco’s success puts Clubcard firm on the map (The Sunday Times, Dec. 2004), Eyes in the till (FT, Nov 2006), and How Tesco is changing Britain (Economist, Aug. 2005) and Getting an edge (Irish Times, Oct 2007), which both require a login, so f**k off…

PPPS See also More remarks on the Tesco data play, although, having received a takedown notice at the time from Dunnhumby, the post is less informative than it was when originally posted…

So What Do You Think You’re Doing, Sonny?

A tweet from @benjamindyer alerted me to a trial being run in Portsmouth where “behavioural analytics” are being deployed on the city’s CCTV footage in order to “alert a CCTV operator to a potential crime in the making” (Portsmouth gets crime-predicting CCTV).

I have to say this reminded me a little, in equal measures, of Philip Kerr’s A Philosophical Investigation, and the film Minority Report, both of which explore, in different ways, the idea of “precrime”, or at least, the likelihood of a crime occurring, although I suspect the behavioural video analysis still has some way to go before it is reliable…!

When I chased the “crime predicting CCTV” story a little, it took me to Smart CCTV, the company behind the system being used in Portsmouth.

And seeing those screenshots, I wondered – wouldn’t this make for a brilliant bit of digital storytelling, in which the story is a machine interpretation of life going on, presented via a series of automatically generated, behavioural analysis subtitles, as we follow an unlikely suspect via the CCTV network?

See also: CCTV hacked by video artists, Red Road, Video Number Plate Recognition (VNPR) systems, etc. etc.

PS if you live in Portsmouth, you might as well give up on the idea of privacy. For example, add in a bit of Path Intelligence, “the only automated measurement technology that can continuously monitor the path that your shoppers or passengers take” which is (or at least, was) running in Portsmouth’s Gunwharf Quays shopping area (Shops track customers via mobile phone), and, err, erm… who knows?!

PPS it’s just so easy to feed paranoia, isn’t it? Gullible Twitter users hand over their usernames and passwords – did you get your Twitterank yet?! ;-)