OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tinkering’ Category

Simple Map Making With Google Fusion Tables

A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…

Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:

data in a fusion table

Now let’s map it:-)

First of all we need to tell the application which column contains the data we want to geocode – that is, the address we want Fusion Tables to find the latitude and longitude co-ordinates for…

tweak the column

Then we say we want the column to be recognised as a Location type column:

change name make location

Computer says yes, highlighting the location type cells with a yellow background:

fusion table.. yellow...

As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as part of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)

map tab...

Once the points have been geocoded, they’re placed onto a map:


We can now publish the map in preparation for sharing it with the world…

publish map

We need to change the visibility of the map to something folk can see!

privacy and link

Public on the web, or just via a shared link – your choice:

make seeable

Here’s my map:-)

The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.

Written by Tony Hirst

May 1, 2013 at 9:03 pm

A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering the data to only show rows containing the headers, which we can identify because those rows are the only rows to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column

Here’s the data – neat and tidy:-)

Neat and tidy...
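
For comparison, the same sequence of steps (drop the blank lines, copy the heading rows into a second column, fill down, delete the heading rows, strip the colon) can be sketched in a few lines of Python. This is only an illustration and not part of the OpenRefine workflow: the function name `tidy_polling_list` and the sample addresses are made up for the example, and it assumes, as above, that only the heading lines contain a colon.

```python
def tidy_polling_list(text):
    """Return (address, ward) pairs from pasted 'heading: / address' text."""
    rows = []
    ward = ""  # most recent heading seen, i.e. the "fill down" value
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # remove the blank lines
        if ":" in line:
            # heading row: remember it (minus the colon) and drop the row itself
            ward = line.replace(":", "").strip()
            continue
        rows.append((line, ward))
    return rows


# Invented sample data in the same shape as the pasted web page text
sample = """Ward One:
1 Example Street, Sometown

2 Sample Road, Sometown

Ward Two:
3 Demo Lane, Othertown
"""

for address, ward in tidy_polling_list(sample):
    print(address, "|", ward)
```

Doing it in one pass folds the filter, fill down and delete steps together; OpenRefine keeps them as separate, undoable operations, which is rather the point of using it.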

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)

Written by Tony Hirst

May 1, 2013 at 8:23 pm

A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context

Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help develop social ties between folk taking a MOOC, whether MOOC participants know each other prior to taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.

To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):

  • you can’t start following someone on Twitter until you join Twitter;
  • follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
  • starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.
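
The estimation step in the last bullet amounts to a running maximum over account creation dates. Here's a minimal sketch, assuming we already have the followers' creation dates in the order they started following (oldest follower first, that is, the follower list reversed). The function name and the dates are invented for illustration, and including each follower's own creation date in the running maximum is my reading of the first bullet.

```python
from datetime import date


def estimate_follow_dates(creation_dates):
    """Running maximum of creation dates, oldest follower first.

    Each estimate is a lower bound: the follow cannot have happened before
    the latest account creation date seen so far in the list.
    """
    estimates = []
    latest = None
    for created in creation_dates:
        if latest is None or created > latest:
            latest = created
        estimates.append(latest)
    return estimates


# Invented creation dates, in the order the accounts started following
creation = [date(2009, 3, 1), date(2008, 6, 15), date(2010, 1, 20), date(2009, 11, 5)]
for created, estimate in zip(creation, estimate_follow_dates(creation)):
    print(created, "-> followed no earlier than", estimate)
```

The second and fourth accounts were created before an earlier follower's account was, so they inherit that follower's creation date as their estimate.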

A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us a solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account… If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)

There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).

So with low follower numbers, where can we get our timestamps from?

In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).

If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.

This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.

(I’m making this up as I’m going along…;)

If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.

Since public friend/follower relationships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.

Got that?!;-) (I’m still making this up as I’m going along…!)

We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.

If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2; let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2”, or more simply “fo_i makes friends with B somewhen between t1 and t2”.
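
That bracketing step could be sketched as follows, assuming a friend list given in the order the friendships were made and a dictionary of timestamp estimates for those accounts we can calibrate. The function name, the account names and the integer "timestamps" are all hypothetical, and the sketch assumes the known timestamps are consistent with the list order.

```python
def bracket_unknowns(friends_in_order, known_times):
    """Map each friend to (earliest, latest) bounds on when it was friended.

    friends_in_order: account names, first friended first.
    known_times: dict of name -> timestamp for accounts we have estimates for.
    A bound is None where no known timestamp exists on that side.
    """
    n = len(friends_in_order)
    earliest = [None] * n
    latest = [None] * n

    last_seen = None
    for i, name in enumerate(friends_in_order):  # forward pass: lower bounds
        if name in known_times:
            last_seen = known_times[name]
        earliest[i] = last_seen

    next_seen = None
    for i in range(n - 1, -1, -1):  # backward pass: upper bounds
        name = friends_in_order[i]
        if name in known_times:
            next_seen = known_times[name]
        latest[i] = next_seen

    return {name: (earliest[i], latest[i]) for i, name in enumerate(friends_in_order)}


bounds = bracket_unknowns(["A", "B", "C"], {"A": 10, "C": 20})
print(bounds["B"])
```

With t1=10 and t2=20, account B comes out bracketed by both values, while A and C keep their point estimates as degenerate intervals.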

Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].

If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower counts and reasonably well defined acquisition curves to generate additional samples?)

We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).

For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:

if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]

Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure out how to write that down and solve it!;-)
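
For what it's worth, the general rule is just an interval intersection, which is easy enough to sketch. The function name and the numbers are made up, and raising an error when the two ranges don't overlap is my own addition: it would signal that at least one of the two estimates is wrong.

```python
def tighten(fr_interval, fo_interval):
    """Intersect two [earliest, latest] estimates of the same follow event."""
    t1, t2 = fr_interval
    t3, t4 = fo_interval
    start, end = max(t1, t3), min(t2, t4)  # greater_of(t1,t3), lesser_of(t2,t4)
    if start > end:
        raise ValueError("intervals do not overlap; the estimates are inconsistent")
    return (start, end)


# fr(A,B)=[10,20] and fo(B,A)=[14,25] tighten to a shared, narrower range
print(tighten((10, 20), (14, 25)))
```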

If this thought experiment does work out, then several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:

- set up your MOOC Twitter account close to the time you want to start using it so its creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts that can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.

I think I need to go and walk the dog again.

PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.

Written by Tony Hirst

April 22, 2013 at 2:35 pm

Posted in Thinkses, Tinkering

By Me, on the Scraperwiki Blog: Glue Logic and Flowable Data

Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:

As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.

Read the whole thing here: Glue Logic and Flowable Data.

PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.

Written by Tony Hirst

March 6, 2013 at 1:13 pm

Posted in elsewhere, Tinkering


Further Dabblings with the Cloudworks API

Picking up on A Couple of Proof of Concept Demos with the Cloudworks API, and some of the comments that came in around it (thanks Sheila et al:-), I spent a couple more hours tinkering around it and came up with the following…

A prettier view, stolen from Mike Bostock (I think?)

prettier view d3js force directed layout

I also added a slider to tweak the layout (opening it up by increasing the repulsion between nodes) [h/t @mhawksey for the trick I needed to make this work] but still need to figure this out a bit more…

I also added in some new parameterised ways of accessing various different views over Cloudworks data using the root https://views.scraperwiki.com/run/cloudworks_network_d3js_force_directed_view_pretti/

Firstly, we can make calls of the form: ?cloudscapeID=2451&viewtype=cloudscapecloudcloudscape

cloudworks cloudscapes by cloud

This grabs the clouds associated with a particular cloudscape (given the cloudscape ID), and then constructs the network containing those clouds and all the cloudscapes they are associated with.

The next view uses a parameter set of the form cloudscapeID=2451&viewtype=cloudscapecloudtags and displays the clouds associated with a particular cloudscape (given the cloudscape ID), along with the tags associated with each cloud:

cloudworks cloudscape cloud tags

Even though there aren’t many nodes or edges, this is quite a cluttered view, so I maybe need to rethink how best to visualise this information?

I’ve also done a couple of views that make use of follower data. For example, here’s how to call on a view that visualises how the folk who follow a particular cloudscape follow each other (this is actually the default if no viewtype is given) -

cloudworks cloudscape innerfollowers

And here’s how to call a view that grabs a particular user’s clouds, looks up the cloudscapes they belong to, then graphs those cloudscapes and the people who follow them: ?userID=1174&viewtype=usercloudcloudscapefollower

cloudworks followers of cloudscapes containing a user's clouds

Here’s another way of describing that graph – followers of cloudscapes containing a user’s clouds.

The optional argument filterNdegree=N (where N is an integer) will filter the displayed network to remove nodes with degree <=N. Here’s the above example, but filtered to remove the nodes that have degree 2 or less: ?userID=1174&viewtype=usercloudcloudscapefollower&filterNdegree=2

cloudworks graph filtered

That is, we prune the graph of people who follow no more than two of the cloudscapes to which the specified user has added a cloud. In other words, we depict folk who follow at least three of the cloudscapes to which the specified user has added a cloud.

(Note that on inspecting that graph it looks as if there is at least one node that has degree 2, rather than degree 3 and above. I’m guessing that it originally had degree 3 or more but that at least one of the nodes it was connected to was pruned out? If that isn’t the case, something’s going wrong…)
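
The guess in that note is easy to check on a toy example. The sketch below is not the code behind the view: it assumes the filter is applied once, to every node, using degrees measured in the unfiltered graph, and the edge list and function names are invented. Under that assumption a node that starts with degree 3 can end up with degree 2 once some of its neighbours have been pruned.

```python
def degrees(edges):
    """Count how many edges touch each node."""
    out = {}
    for a, b in edges:
        out[a] = out.get(a, 0) + 1
        out[b] = out.get(b, 0) + 1
    return out


def filter_by_degree(edges, n):
    """Single pass: drop every node whose ORIGINAL degree is <= n."""
    degree = degrees(edges)
    keep = {node for node, d in degree.items() if d > n}
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Invented data: people p, q, r each follow three cloudscapes
edges = [
    ("p", "c1"), ("p", "c2"), ("p", "c3"),
    ("q", "c1"), ("q", "c2"), ("q", "c4"),
    ("r", "c1"), ("r", "c2"), ("r", "c5"),
]

filtered = filter_by_degree(edges, 2)
print(degrees(edges)["p"], "->", degrees(filtered)["p"])
```

Here c3, c4 and c5 each have a single follower and are pruned, which drops p, q and r from degree 3 to degree 2 in the displayed graph. Repeating the filter until nothing changes would remove them too, which may or may not be what is wanted.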

Also note that it would be neater to pull in the whole graph and filter the d3.js rendered version interactively, but I don’t know how to do this?

However…I also added a parameter to the script that generates the JSON data files from data pulled from the Cloudworks API calls that allows me to generate a GEXF network file that can be saved as an XML file (.gexf suffix, cf. Visualising Networks in Gephi via a Scraperwiki Exported GEXF File) and then visualised using a tool such as Gephi. The trick? Add the URL parameter &format=gexf (the (optional) default is &format=json) [example].

gephiview of cloudworks graph

Gephi, of course, is a wonderful tool for the interactive exploration of graph-based data sets…. including a wide range of filters…

So, where are we at? The d3.js force directed layout is all very shiny but the graphs quickly get cluttered. I’m not sure if there are any interactive parameter controls I can add, but at the moment the visualisations border on the useless. At the very least, I need to squirt a header into the page from the supplied parameters so we know what the visualisation refers to. (The data I’ve played with to date – which has been very limited – doesn’t seem to be that interesting either from what I’ve seen? But maybe the rich structure isn’t there yet? Or maybe there is nothing to be had from these simple views?)

It may be worth exploring some other visualisation types to see if they are any more legible, at least, though it would be even more helpful if they were simply more informative ;-)

PS just in case, here’s a link to the Cloudworks API documentation.

PPS if there are any terms of service associated with the API, I didn’t read them. So if I broke them, oops. But that said – such is life; never ever trust that anybody you give data to will look after it;-)

Written by Tony Hirst

January 17, 2013 at 11:00 pm

A Couple of Proof of Concept Demos With the Cloudworks API

Via a tweet from @mhawksey in response to a tweet from @sheilmcn, or something like that, I came across a post by Sheila on the topic of Cloud gazing, maps and networks – some thoughts on #oldsmooc so far. The post mentioned a prototyped mindmap style browser for Cloudworks, created in part to test out the Cloudworks API.

Having tinkered with mindmap style presentations using the d3.js library in the browser before (Viewing OpenLearn Mindmaps Using d3.js; the app itself may well have rotted by now) I thought I’d have a go at exploring something similar for Cloudworks. With a promptly delivered API key by Nick Freear, it only took a few minutes to repurpose an old script to cast a test call to the Cloudworks API into a form that could easily be visualised using the d3.js library. The approach I took? To grab JSON data from the API, construct a tree using the Python networkx library, and drop a JSON serialisation of the network into a templated d3.js page. (networkx has a couple of JSON export functions that will create tree based and graph/network based JSON data structures that d3.js can feed from.)

Here’s the Python fragment:


import urllib2,json, networkx as nx
from networkx.readwrite import json_graph

id=cloudscapeID #need logic




#print entities

#I seem to remember issues with non-ascii before, though maybe that was for XML? Hmmm...
def ascii(s): return "".join(i for i in s.encode('utf-8') if ord(i)<128)

def graphRoot(DG,title,root=1):
    return DG,root

def gNodeAdd(DG,root,node,name):
    return DG,node


#This simple example just grabs a list of clouds associated with a cloudscape
for c in entities['items']:
#We're going to use the tree based JSON data format to feed the d3.js mindmap view
jdata = json_graph.tree_data(DG,root=1)
#print json.dumps(jdata)

#The page template is defined elsewhere.
#It loads the JSON from a declaration in the Javascript of the form: jsonData=%(jdata)s
print page_template % vars()

The rendered view is something along the lines of:


You can find the original code here.
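
As a dependency-free illustration of the tree-building idea in the fragment above, here's a sketch that turns a list of clouds into a nested name/children structure of the kind d3.js tree layouts commonly consume. It is not the original script: the function names are hypothetical, and the `entities` dictionary is made-up sample data in the shape the fragment appears to expect (an `items` list, with a `title` on each entry being my assumption), so the real API response may well differ.

```python
import json


def ascii_only(s):
    """Drop non-ASCII characters, much as the ascii() helper above does."""
    return "".join(ch for ch in s if ord(ch) < 128)


def cloudscape_tree(root_title, entities):
    """Build a root node with one child per item in entities['items']."""
    return {
        "name": ascii_only(root_title),
        "children": [{"name": ascii_only(item["title"])} for item in entities["items"]],
    }


# Invented stand-in for the JSON returned by the API call
entities = {"items": [{"title": "First cloud"}, {"title": "Café notes"}]}

jdata = cloudscape_tree("Example cloudscape", entities)
print(json.dumps(jdata, indent=2))
```

Growing the tree with further API calls would mean giving some of the child nodes a `children` list of their own.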

Now I know that: a) this isn’t very interesting to look at; and b) doesn’t even work as a navigation surface, but my intention was purely to demonstrate a recipe for getting data out of the Cloudworks API and into a d3.js mindmap view in the browser, and it does that. A couple of obvious next steps: i) add in additional API calls to grow the tree (easy); ii) linkify some of the nodes (I’m not sure I know how to do that at the moment?)

Sheila’s post ended with a brief reflection: “I’m also now wondering if a network diagram of cloudscape (showing the interconnectedness between clouds, cloudscapes and people) would be helpful ? Both in terms of not only visualising and conceptualising networks but also in starting to make more explicit links between people, activities and networks.”

So here’s another recipe, again using networkx but this time dropping the data into a graph based JSON format and using the d3.js force based layout to render it. What the script does is grab the followers of a particular cloudscape, grab each of their followers, and then graph how the followers of a particular cloudscape follow each other.

Because I had some problems getting the data into the template, I also used a slightly different wiring approach:

import urllib2,json,scraperwiki,networkx as nx
from networkx.readwrite import json_graph

id=cloudscapeID #need logic



def ascii(s): return "".join(i for i in s.encode('utf-8') if ord(i)<128)

def getUserFollowers(id):

    #print results
    for r in results['items']: f.append(r['user_id'])
    return f



#Seed graph with nodes corresponding to followers of a cloudscape
for c in entities['items']:

#construct graph of how followers of a cloudscape follow each other
for c in entities['items']:
    for followerid in followers:
        if followerid in followerIDs:

scraperwiki.utils.httpresponseheader("Content-Type", "text/json")

#Print out the json representation of the network/graph as JSON
jdata = json_graph.node_link_data(DG)
print json_graph.dumps(jdata)

In this case, I generate a JSON representation of the network that is then loaded into a separate HTML page that deploys the d3.js force directed layout visualisation, in this case how the followers of a particular cloudscape follow each other.


This hits the Cloudworks API once for the cloudscape, then once for each follower of the cloudscape, in order to construct the graph and then pass the JSON version to the HTML page.
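
The shape of that recipe, stripped of the API calls, might look something like the sketch below. It is an illustration under assumptions, not the script itself: `get_user_followers` is a hypothetical stand-in for the per-follower API lookup, the user ids are invented, links are drawn from the follower to the account being followed, and the output uses index-based `source`/`target` links in a nodes/links layout.

```python
import json


def inner_follower_graph(cloudscape_followers, get_user_followers):
    """Node-link graph of how a cloudscape's followers follow each other.

    cloudscape_followers: list of user ids following the cloudscape.
    get_user_followers: callable returning the follower ids of one user
        (called once per cloudscape follower).
    """
    ids = list(cloudscape_followers)
    index = {uid: i for i, uid in enumerate(ids)}
    links = []
    for uid in ids:
        for follower in get_user_followers(uid):
            if follower in index:  # ignore followers outside the cloudscape
                links.append({"source": index[follower], "target": index[uid]})
    return {"nodes": [{"id": uid} for uid in ids], "links": links}


# Invented follower data standing in for the API responses
fake_followers = {"u1": ["u2", "u9"], "u2": ["u1", "u3"], "u3": []}

graph = inner_follower_graph(["u1", "u2", "u3"], lambda uid: fake_followers.get(uid, []))
print(json.dumps(graph))
```

The user u9 follows u1 but not the cloudscape, so it never appears in the graph.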

Again, I’m posting it as a minimum viable recipe that could be developed as a way of building out Sheila’s idea (though the graph definition would probably need to be a little more elaborate, eg in terms of node labeling). Some work on the graph rendering probably wouldn’t go amiss either, eg in respect of node sizing, colouring and labeling.

Still, what do you expect in just a couple of hours?!;-)

Written by Tony Hirst

January 16, 2013 at 10:08 pm

Posted in Tinkering


WordPress Stats in R

A trackback from Martin Hawksey’s recent post on Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet), which demonstrates how to pull WordPress stats into a Google Spreadsheet and generate charts and reports therein, reminded me of the WordPress stats API.

So here’s a quick function for pulling WordPress reports into R.

#Wordpress Stats
#Wordpress Stats API docs (from http://stats.wordpress.com/csv.php)

#You can get a copy of your API key (required) from Akismet:
#Login with your WordPress account: http://akismet.com/account/
#Resend API key: https://akismet.com/resend/

#Required parameters: api_key, blog_id or blog_uri.
#Optional parameters: table, post_id, end, days, limit, summarize.

#api_key     String    A secret unique to your WordPress.com user account.
#blog_id     Integer   The number that identifies your blog. Find it in other stats URLs.
#blog_uri    String    The full URL to the root directory of your blog. Including the full path.
#table       String    One of views, postviews, referrers, referrers_grouped, searchterms, clicks, videoplays.
#post_id     Integer   For use with postviews table.
#end         String    The last day of the desired time frame. Format is 'Y-m-d' (e.g. 2007-05-01) and default is UTC date.
#days        Integer   The length of the desired time frame. Default is 30. "-1" means unlimited.
#period      String    For use with views table and the 'days' parameter. The desired time period grouping. 'week' or 'month'
#Use 'days' as the number of results to return (e.g. '&period=week&days=12' to return 12 weeks)
#limit       Integer   The maximum number of records to return. Default is 100. "-1" means unlimited. If days is -1, limit is capped at 500.
#summarize   Flag      If present, summarizes all matching records.
#format      String    The format the data is returned in, 'csv', 'xml' or 'json'. Default is 'csv'.
#NOTE: some of the report calls I tried didn't seem to work properly?
#Need to build up a list of tested calls to the API that actually do what you think they should?

wordpress.getstats.demo=function(apikey, blogurl, table='postviews', end=Sys.Date(), days='12', period='week', limit='', summarise=''){
  #default parameters gets back last 12 weeks of postviews aggregated by week
    '&',summarise, #set this to 'summarise=T' if required
  #Martin's post notes that JSON appears to work better than CSV
  #May be worth doing a JSON parsing version?

#Use the URL of a WordPress blog associated with the same account as the API key


getDomain=function(url) str_match(url, "^http[s]?://([^/]*)/.*?")[, 2]

#We can pull out the domains clicks were sent to or referrals came from


#Scruffy bar chart - is there a way of doing this sorted chart using geom_bar? How would we reorder x?
ggplot(c)+geom_bar(aes(x=reorder(Var1,Freq),y=Freq),stat='identity')+theme( axis.text.x=element_text(angle=-90))


(Code as a gist.)
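
For anyone who'd rather poke at the API from Python, here's a rough sketch of the two pieces the R code is built around: constructing the request URL from the documented parameters, and tallying the domains found in a column of URLs. The function names are hypothetical, the defaults simply mirror the R function's, and the column name and rows in the sample CSV are invented, since I haven't checked what headers the API actually returns. No request is made here.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlencode, urlparse

STATS_ENDPOINT = "http://stats.wordpress.com/csv.php"


def stats_url(api_key, blog_uri, table="postviews", days=12, period="week", **extra):
    """Build a stats request URL from the parameters listed in the comments above."""
    params = {"api_key": api_key, "blog_uri": blog_uri, "table": table,
              "days": days, "period": period, "format": "csv"}
    params.update(extra)  # e.g. limit=-1 or end="2013-01-09"
    return STATS_ENDPOINT + "?" + urlencode(params)


def domain_counts(csv_text, url_column):
    """Count rows per domain for the URLs in one column of a CSV report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    domains = (urlparse(row[url_column]).netloc for row in reader)
    return Counter(d for d in domains if d)


# Invented sample report; the real column names may differ
sample_csv = (
    "date,click,views\n"
    "2013-01-01,http://example.com/a,3\n"
    "2013-01-02,http://example.com/b,1\n"
    "2013-01-02,http://example.org/,2\n"
)

print(stats_url("YOUR_API_KEY", "http://example.wordpress.com/", table="clicks"))
print(domain_counts(sample_csv, "click").most_common())
```

Like `table()` in the R version, the tally counts rows per domain rather than summing the views.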

I guess there’s scope for coming up with a set of child functions that pull back specific report types? Also, if we pull in the blog XML archive and extract external links from each page, we could maybe start to analyse which pages are sending traffic where? (Of course, you could use Google Analytics to do this more efficiently, but hosted WordPress blogs don’t support Google Analytics, for no very good reason that I can tell…?)

PS for more WordPress tinkerings, see eg How OUseful.Info Posts Link to Each Other…, which links to a Python script for extracting data from WordPress blog export files that show how blog posts in a particular WordPress blog link to each other.

Written by Tony Hirst

January 9, 2013 at 11:50 am

Posted in Anything you want, Rstats, Tinkering


