Archive for the ‘Tinkering’ Category
To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here – these are columns:
select * from support where support in (select candinit from support)
and these are the table the columns are being pulled from:
select * from support where support in (select candinit from support)
Here’s the result of Runing the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:
More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?
In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)
This is what we’ll be working towards:
If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.
Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…
When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last. We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
We can download a CSV representation of the data from the scraper directly:
The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.
To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit as target and the column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.
Start Gephi, select Data Laboratory tab and then New Project from the File menu.
You should now see a toolbar that includes an “Import Spreadsheet option”:
Import the CSV file as such, identifying it as an Edges Table:
You should notice that the Source and Target columns have been identified as such and we have the choice to import the other column or not – let’s bring them in…
You should now see the data has been loaded in to Gephi…
If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.
Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:
If you look closely, you’ll be able to see that there are many separate groupings of connected circles – this represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two noes are connected when in fact they aren’t…)
However, there are also other groupings in which one candidate may support another:
These connections may allow us to see grouping of candidates supporting each other along party lines.
One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree one or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:
Hmmm.. how about if we extend out filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also rezise the nodes/labels so we can better see the candidates’ names. The Neigbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth two). Which is to say, it will being in the candidates who are supported by candidates, and their supporters:
That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:
We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:
Let’s reset the colours and go back to colourings along party lines:
If we go to the Preview view, we can generate a prettified view of the network:
In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:
Going back to the Graph view and zooming in, we see that Paul G. taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, I wonder, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)
To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:
Just that one then…
Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtereed network to date up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.
Here’s the result:
And again we can go into Preview mode to generate a nice vectorised version of the graph:
This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).
So there we have it a quick tour of how to use Gephi to look at the co-support structure of group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particular amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check (whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.
If this all seems like too much effort, remembers that there’s always the One-Liner, Thirty Second Route to the Info.
PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)
A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…
Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:
Now let’s map it:-)
First of all we need to tell the application which column contains the data we want to geocode – that is, the addrerss we want Fusion Tables to find the latitude and longitude co-ordinates for…
Then we say we want the column to be recognised as a column type:
Computer says yes, highlighting the location type cells with a yellow background:
As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as apart of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)
Once the points have been geocoded, they’re placed onto a map:
We can now publish the map in preparation for sharing it with the world…
We need to change the visibility of the map to something folk can see!
Public on the web, or just via a shared link – your choice:
The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.
Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.
Here’s how I want the data to look after I’ve cleaned it:
Can you see what needs to be done? Somehow, we need to:
- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.
If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:
We can also paste the data into a new OpenRefine Project:
We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:
But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?
Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.
Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.
If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.
Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.
If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.
Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…
…by replacing the colon with nothing (an empty string).
Here’s the data – neat and tidy:-)
To finish, let’s export the data.
How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).
And here it is:-)
So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)
A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context
Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help developing social ties between folk taking a MOOC, whether MOOC participants know each other prior taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.
To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):
- you can’t start following someone on Twitter until you join Twitter;
- follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
- starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.
A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account…If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)
There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).
So with low follower numbers, where can we get our timestamps from?
In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).
If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.
This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.
(I’m making this up as I’m going along…;)
If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.
Since public friend/follower relationsships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.
Got that?!;-) (I’m still making this up as I’m going along…!)
We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.
If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2, let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2″, or more simply “fo_i makes friends with B somewhen between t1 and t2″.
Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].
If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower accounts and reasonably well defined acquisition curves to generate additional samples?)
We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).
For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:
if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure hout how to write that down and solve it!;-)]
If this thought experiment does work out, then a several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:
- set up your MOOC Twitter account close to the time you want to start using it so it’s creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts than can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.
I think I need to go and walk the dog again.
PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.
Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:
As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.
Read the whole thing here: Glue Logic and Flowable Data.
PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.
Picking up on A Couple of Proof of Concept Demos with the Cloudworks API, and some of the comments that came in around it (thanks Sheila et al:-), I spent a couple more hours tinkering around it and came up with the following…
A prettier view, stolen from Mike Bostock (I think?)
I also added a slider to tweak the layout (opening it up by increasing the repulsion between nodes) [h/t @mhawksey for the trick I needed to make this work] but still need to figure this out a bit more…
I also added in some new parameterised ways of accessing various different views over Cloudworks data using the root https://views.scraperwiki.com/run/cloudworks_network_d3js_force_directed_view_pretti/
Firstly, we can make calls of the form: ?cloudscapeID=2451&viewtype=cloudscapecloudcloudscape
This grabs the clouds associated with a particular cloudscape (given the cloudscape ID), and then constructs the network containing those clouds and all the cloudscapes they are associated with.
The next view uses a parameter set of the form cloudscapeID=2451&viewtype=cloudscapecloudtags and displays the clouds associated with a particular cloudscape (given the cloudscape ID), along with the tags associated with each cloud:
Even though there aren’t many nodes or edges, this is quite a cluttered view, so I maybe need to rethink how best to visualise this information?
I’ve also done a couple of views that make use of follower data. For example, here’s how to call on a view that visualises how the folk who follow a particular cloudscape follow each other (this is actually the default if no viewtype is given) -
And here’s how to call a view that grabs a particular user’s clouds, looks up the cloudscapes they belong to, then graphs those cloudscapes and the people who follow them: ?userID=1174&viewtype=usercloudcloudscapefollower
Here’s another way of describing that graph – followers of cloudscapes containing a user’s clouds.
The optional argument filterNdegree=N (where N is an integer) will filter the diaplayed network to remove nodes with degree <=N. Here’s the above example, but filtered to remove the nodes that have degree 2 or less: ?userID=1174&viewtype=usercloudcloudscapefollower&filterNdegree=2
That is, we prune the graph of people who follow no more than two of the cloudscapes to which the specified user has added a cloud. In other words, we depict folk who follow at least three of the cloudscapes to which the specified user has added a cloud.
(Note that on inspecting that graph it looks as if there is at least one node that has degree 2, rather than degree 3 and above. I’m guessing that it originally had degree 3 or more but that at least one of the nodes it was connected to was pruned out? If that isn’t the case, something’s going wrong…)
Also note that it would be neater to pull in the whole graph and filter the d3.js rendered version interactively, but I don’t know how to do this?
However…I also added a parameter to the script that generates the JSON data files from data pulled from the Cloudworks API calls that allows me to generate a GEXF network file that can be saved as an XML file (.gexf suffix, cf. Visualising Networks in Gephi via a Scraperwiki Exported GEXF File) and then visualised using a tool such as Gephi. The trick? Add the URL parameter &format=gexf (the (optional) default is &format=json) [example].
Gephi, of course, is a wonderful tool for the interactive exploration of graph-based data sets…. including a wide range of filters…
So, where are we at? The d3.js force directed layout is all very shiny but the graphs quickly get cluttered. I’m not sure if there are any interactive parameter controls I can add, but at the moment the visualisations border on the useless. At the very least, I need to squirt a header into the page from the supplied parameters so we know what the visualisation refers to. (The data I’ve played with to date – which has been very limited – doesn’t seem to be that interesting either from what I’ve seen? But maybe the rich structure isn’t there yet? Or maybe there is nothing to be had from these simple views?)
It may be worth exploring some other visualisation types to see if they are any more legible, at least, though it would be even more helpful if they were simply more informative ;-)
PS just in case, here’s a link to the Cloudworks API documentation.
PPS if there are any terms of service associated with the API, I didn’t read them. So if I broke them, oops. But that said – such is life; never ever trust that anybody you give data to will look after it;-)