Archive for the ‘Tinkering’ Category
[An old post, rescued from the list of previously unpublished posts...]
Although I use OpenRefine from time time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.
The setting actually comes from OpenOil’s Johnny West:
1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions
Let’s see how far we can get…
To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:
- looking up company ID can be constructed around
http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post, correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t/ Larssen in the comments.]
- http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.
Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:
So that’s step 1…
We can run step 2 as follows* – create a new column from the company column:
* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…
Here’s how we construct the URL:
The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.
Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:
[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() ].join('::')
The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.
The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() – pulls out the company name from the first search result.
We place these two parts into an array and then join them with two colons: .join('::')
This keeps thing tidy and allows us to check by eye that sensible company names have been found from the original search strings.
We can now split the name/ID pair column into two separate columns:
And the result:
The next step, step 3, requires looking up the company IDs on purehelp. We’ve already see how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:
(We could probably reduce the throttle time by an order of magnitude!)
The next step, step 4, is to pull out shareholders and their percentages.
The first step is to grab the shareholder table and each of the rows, which in the original looked like this:
The following hack seems to get us the rows we require:
BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(
So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(
Need to search into table where “Shareholders” is specified in top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was taking because the HTML of the page is really horrible. Bah…. #ffs:-(
Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):
<div id="companyOwnership" class="box"> <div class="boxHeader">Shareholders:</div> <div class="boxContent"> <div class="row rowHeading"> <label class="fl" style="width: 70%;">Company name:</label> <label class="fl" style="width: 30%;">Percentage share (%)</label> <div class="cb"></div> </div> <div class="row odd"> <label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label> <div class="fr" style="width: 30%;">100.00%</div> <div class="cb"></div> </div> </div>
For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:
We can then generate one row for each shareholder in OpenRefine:
(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given it’s own row, which we’ll have to cope with at some point. But again, no need right now.)
For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…
value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')
Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):
We then need to tidy the two columns:
Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw the info away that links the company if it’s there…
value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()
We now need to do a bit more tidying – fill down on the empty columns in the shareholder company column and also in the original company name and ID [actually - this is not right, as we can see below for the line Altinex Oil Norway AS...? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Fimanavn).
This is what we get:
We can then export this file, before considering steps 5 and 6, using the custom exporter:
Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…
And then select the export format:
Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). LOoking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.
Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments... The original link will be the one used in the OpenRefine project...]
Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? ALso note that Andre is “Other”… maybe replace this too?
In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.
Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…
This means we possibily have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?
As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?
Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.
If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…
Ah ha – it’s an embedded map from a Google Fusion Table…
We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and poppiing it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:
The description data is curtailed, but we can see the full description on the card view:
Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.
This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.
Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:
And as if by magic, we have recreated a workable table:-)
If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?
If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?
What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself, (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as we all question that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?
Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:
Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of thes column creation.
If we cast these values to a number:
we can then use a numeric facet to help us explore the amounts.
In this case, we notice that there weren’t that many distinct factors containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:
- Summary: Lib Dems in Government have secured a £73,000 grant for …
- Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
- Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
- Summary: Lib Dems in Government have given £982,000 to …
Here’s an improved regular expression for parsing out some more of these amounts:
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a ([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)
So now we can start to identify some of the bigger grants…
More to add? eg around:
- ...have secured a £150,000 grant...
- Summary: Lib Dems have given a £1,571,000 grant...
- Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
- Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...
Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?
In order to address this question, we might to start have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:
- £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
- £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant
But that’s for another day…
If you want to play with the data yourself, you can find it here.
To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here – these are columns:
select * from support where support in (select candinit from support)
and these are the table the columns are being pulled from:
select * from support where support in (select candinit from support)
Here’s the result of Runing the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:
More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?
In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)
This is what we’ll be working towards:
If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.
Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…
When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last. We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
We can download a CSV representation of the data from the scraper directly:
The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.
To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit as target and the column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.
Start Gephi, select Data Laboratory tab and then New Project from the File menu.
You should now see a toolbar that includes an “Import Spreadsheet option”:
Import the CSV file as such, identifying it as an Edges Table:
You should notice that the Source and Target columns have been identified as such and we have the choice to import the other column or not – let’s bring them in…
You should now see the data has been loaded in to Gephi…
If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.
Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:
If you look closely, you’ll be able to see that there are many separate groupings of connected circles – this represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two noes are connected when in fact they aren’t…)
However, there are also other groupings in which one candidate may support another:
These connections may allow us to see grouping of candidates supporting each other along party lines.
One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree one or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:
Hmmm.. how about if we extend out filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also rezise the nodes/labels so we can better see the candidates’ names. The Neigbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth two). Which is to say, it will being in the candidates who are supported by candidates, and their supporters:
That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:
We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:
Let’s reset the colours and go back to colourings along party lines:
If we go to the Preview view, we can generate a prettified view of the network:
In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:
Going back to the Graph view and zooming in, we see that Paul G. taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, I wonder, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)
To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:
Just that one then…
Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtereed network to date up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.
Here’s the result:
And again we can go into Preview mode to generate a nice vectorised version of the graph:
This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).
So there we have it a quick tour of how to use Gephi to look at the co-support structure of group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particular amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check (whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.
If this all seems like too much effort, remembers that there’s always the One-Liner, Thirty Second Route to the Info.
PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)
A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…
Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:
Now let’s map it:-)
First of all we need to tell the application which column contains the data we want to geocode – that is, the addrerss we want Fusion Tables to find the latitude and longitude co-ordinates for…
Then we say we want the column to be recognised as a column type:
Computer says yes, highlighting the location type cells with a yellow background:
As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as apart of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)
Once the points have been geocoded, they’re placed onto a map:
We can now publish the map in preparation for sharing it with the world…
We need to change the visibility of the map to something folk can see!
Public on the web, or just via a shared link – your choice:
The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.
Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.
Here’s how I want the data to look after I’ve cleaned it:
Can you see what needs to be done? Somehow, we need to:
- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.
If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:
We can also paste the data into a new OpenRefine Project:
We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:
But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?
Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.
Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.
If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.
Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.
If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.
Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…
…by replacing the colon with nothing (an empty string).
Here’s the data – neat and tidy:-)
To finish, let’s export the data.
How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).
And here it is:-)
So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)
A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context
Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help developing social ties between folk taking a MOOC, whether MOOC participants know each other prior taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.
To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):
- you can’t start following someone on Twitter until you join Twitter;
- follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
- starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.
A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account…If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)
There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).
So with low follower numbers, where can we get our timestamps from?
In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).
If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.
This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.
(I’m making this up as I’m going along…;)
If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.
Since public friend/follower relationsships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.
Got that?!;-) (I’m still making this up as I’m going along…!)
We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.
If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2, let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2″, or more simply “fo_i makes friends with B somewhen between t1 and t2″.
Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].
If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower accounts and reasonably well defined acquisition curves to generate additional samples?)
We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).
For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:
if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure hout how to write that down and solve it!;-)]
If this thought experiment does work out, then a several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:
- set up your MOOC Twitter account close to the time you want to start using it so it’s creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts than can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.
I think I need to go and walk the dog again.
PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.
Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:
As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.
Read the whole thing here: Glue Logic and Flowable Data.
PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.
Picking up on A Couple of Proof of Concept Demos with the Cloudworks API, and some of the comments that came in around it (thanks Sheila et al:-), I spent a couple more hours tinkering around it and came up with the following…
A prettier view, stolen from Mike Bostock (I think?)
I also added a slider to tweak the layout (opening it up by increasing the repulsion between nodes) [h/t @mhawksey for the trick I needed to make this work] but still need to figure this out a bit more…
I also added in some new parameterised ways of accessing various different views over Cloudworks data using the root https://views.scraperwiki.com/run/cloudworks_network_d3js_force_directed_view_pretti/
Firstly, we can make calls of the form: ?cloudscapeID=2451&viewtype=cloudscapecloudcloudscape
This grabs the clouds associated with a particular cloudscape (given the cloudscape ID), and then constructs the network containing those clouds and all the cloudscapes they are associated with.
The next view uses a parameter set of the form cloudscapeID=2451&viewtype=cloudscapecloudtags and displays the clouds associated with a particular cloudscape (given the cloudscape ID), along with the tags associated with each cloud:
Even though there aren’t many nodes or edges, this is quite a cluttered view, so I maybe need to rethink how best to visualise this information?
I’ve also done a couple of views that make use of follower data. For example, here’s how to call on a view that visualises how the folk who follow a particular cloudscape follow each other (this is actually the default if no viewtype is given) -
And here’s how to call a view that grabs a particular user’s clouds, looks up the cloudscapes they belong to, then graphs those cloudscapes and the people who follow them: ?userID=1174&viewtype=usercloudcloudscapefollower
Here’s another way of describing that graph – followers of cloudscapes containing a user’s clouds.
The optional argument filterNdegree=N (where N is an integer) will filter the diaplayed network to remove nodes with degree <=N. Here’s the above example, but filtered to remove the nodes that have degree 2 or less: ?userID=1174&viewtype=usercloudcloudscapefollower&filterNdegree=2
That is, we prune the graph of people who follow no more than two of the cloudscapes to which the specified user has added a cloud. In other words, we depict folk who follow at least three of the cloudscapes to which the specified user has added a cloud.
(Note that on inspecting that graph it looks as if there is at least one node that has degree 2, rather than degree 3 and above. I’m guessing that it originally had degree 3 or more but that at least one of the nodes it was connected to was pruned out? If that isn’t the case, something’s going wrong…)
Also note that it would be neater to pull in the whole graph and filter the d3.js rendered version interactively, but I don’t know how to do this?
However…I also added a parameter to the script that generates the JSON data files from data pulled from the Cloudworks API calls that allows me to generate a GEXF network file that can be saved as an XML file (.gexf suffix, cf. Visualising Networks in Gephi via a Scraperwiki Exported GEXF File) and then visualised using a tool such as Gephi. The trick? Add the URL parameter &format=gexf (the (optional) default is &format=json) [example].
Gephi, of course, is a wonderful tool for the interactive exploration of graph-based data sets…. including a wide range of filters…
So, where are we at? The d3.js force directed layout is all very shiny but the graphs quickly get cluttered. I’m not sure if there are any interactive parameter controls I can add, but at the moment the visualisations border on the useless. At the very least, I need to squirt a header into the page from the supplied parameters so we know what the visualisation refers to. (The data I’ve played with to date – which has been very limited – doesn’t seem to be that interesting either from what I’ve seen? But maybe the rich structure isn’t there yet? Or maybe there is nothing to be had from these simple views?)
It may be worth exploring some other visualisation types to see if they are any more legible, at least, though it would be even more helpful if they were simply more informative ;-)
PS just in case, here’s a link to the Cloudworks API documentation.
PPS if there are any terms of service associated with the API, I didn’t read them. So if I broke them, oops. But that said – such is life; never ever trust that anybody you give data to will look after it;-)
Via a tweet from @mhawksey in response to a tweet from @sheilmcn, or something like that, I came across a post by Sheila on the topic of Cloud gazing, maps and networks – some thoughts on #oldsmooc so far. The post mentioned a prototyped mindmap style browser for Cloudworks, created in part to test out the Cloudworks API.
Having tinkered with mindmap style presentations using the d3.js library in the browser before (Viewing OpenLearn Mindmaps Using d3.js; the app itself may well have rotted by now) I thought I’d have a go at exploring something similar for Cloudworks. With a promptly delivered API key by Nick Freear, it only took a few minutes to repurpose an old script to cast a test call to the Cloudworks API into a form that could easily be visualised using the d3.js library. The approach I took? To grab JSON data from the API, construct a tree using the Python networkx library, and drop a JSON serialisation of the network into a templated d3.js page. (networkx has a couple of JSON export functions that will create tree based and graph/network based JSON data structures that d3.js can feed from.
Here’s the Python fragment:
The rendered view is something along the lines of:
You can find the original code here.
Now I know that: a) this isn’t very interesting to look at; and b) doesn’t even work as a navigation surface, but my intention was purely to demonstrate a recipe from getting data out of the Cloudworks API and into a d3.js mindmap view in the browser, and it does that. A couple of obvious next steps: i) add in additional API calls to grow the tree (easy); ii) linkify some of the nodes (I’m not sure I know who to do that at them moment?)
Sheila’s post ended with a brief reflection: “I’m also now wondering if a network diagram of cloudscape (showing the interconnectedness between clouds, cloudscapes and people) would be helpful ? Both in terms of not only visualising and conceptualising networks but also in starting to make more explicit links between people, activities and networks.”
So here’s another recipe, again using networkx but this time dropping the data into a graph based JSON format and using the d3.js force based layout to render it. What the script does is grab the followers of a particular cloudscape, grab each of their followers, and then graph how the followers of a particular cloudscape follow each other.
Because I had some problems getting the data into the template, I also used a slightly different wiring approach:
import urllib2,json,scraperwiki,networkx as nx from networkx.readwrite import json_graph id=cloudscapeID #need logic typ='cloudscape' urlstub="http://cloudworks.ac.uk/api/" urlcloudscapestub=urlstub+"cloudscapes/"+str(id) urlsuffix=".json?api_key="+str(key) ctyp="/followers" url=urlcloudscapestub+ctyp+urlsuffix entities=json.load(urllib2.urlopen(url)) def ascii(s): return "".join(i for i in s.encode('utf-8') if ord(i)<128) def getUserFollowers(id): urlstub="http://cloudworks.ac.uk/api/" urluserstub=urlstub+"users/"+str(id) urlsuffix=".json?api_key="+str(key) ctyp="/followers" url=urluserstub+ctyp+urlsuffix results=json.load(urllib2.urlopen(url)) #print results f= for r in results['items']: f.append(r['user_id']) return f DG=nx.DiGraph() followerIDs= #Seed graph with nodes corresponding of followers of a cloudscape for c in entities['items']: curruid=c['user_id'] DG.add_node(curruid,name=ascii(c['name']).strip()) followerIDs.append(curruid) #construct graph of how followers of a cloudscape follow each other for c in entities['items']: curruid=c['user_id'] followers=getUserFollowers(curruid) for followerid in followers: if followerid in followerIDs: DG.add_edge(curruid,followerid) scraperwiki.utils.httpresponseheader("Content-Type", "text/json") #Print out the json representation of the network/graph as JSON jdata = json_graph.node_link_data(DG) print json_graph.dumps(jdata)
In this case, I generate a JSON representation of the network that is then loaded into a separate HTML page that deploys the d3.js force directed layout visualisation, in this case how the followers of a particular cloudscape follow each other.
This hits the Cloudworks API once for the cloudscape, then once for each follower of the cloudscape, in order to construct the graph and then pass the JSON version to the HTML page.
Again, I’m posting it as a minimum viable recipe that could be developed as a way of building out Sheila’s idea (though the graph definition would probably need to be a little more elaborate, eg in terms of node labeling). Some work on the graph rendering probably wouldn’t go amiss either, eg in respect of node sizing, colouring and labeling.
Still, what do you expect in just a couple of hours?!;-)