Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships
Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.
One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.
Co-citation
The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.
In graph terms, we might also represent this as a simpler graph within which edges between two articles indicate that they have been co-cited by documents within a particular corpus, with the weight of each edge representing the number of documents within that corpus that have co-cited them.
Bibliographic coupling
Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.
Again, in graph terms, we might think of a simpler undirected network in which edges between two articles act as an indicator that they have cited or referenced the same work, with the weight of the edge representing the number of works that they have both cited.
A comparison of co-citation and bibliographic coupling networks shows one to be “retrospective” and the other to be “forward looking”. A bibliographic coupling network can be generated directly from the references contained in a corpus of articles, and to this extent bibliographic coupling looks to the past. In a co-citation network, the edges that connect two articles can only be generated when a future published article cites them both.
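To make the two edge weightings concrete, here is a minimal sketch in Python. The tiny corpus (articles A, B, C citing works X, Y, Z) is made up for illustration, and the function names are my own rather than anything from Newman's book.

```python
from collections import Counter
from itertools import combinations

# Hypothetical corpus: each citing article maps to the set of works it references.
references = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}

def cocitation_weights(references):
    """Edge between two cited works; weight = number of articles citing both."""
    weights = Counter()
    for cited in references.values():
        for pair in combinations(sorted(cited), 2):
            weights[pair] += 1
    return weights

def coupling_weights(references):
    """Edge between two citing articles; weight = number of works both reference."""
    weights = Counter()
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            weights[(a, b)] = shared
    return weights

print(dict(cocitation_weights(references)))
print(dict(coupling_weights(references)))
```

Note that the coupling weights depend only on the reference lists of A, B and C themselves, whereas the co-citation weight between X and Y can keep growing as new articles that cite both are added to the corpus.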
Co-citation, Bibliographic Coupling and Company Director Networks
For some time I’ve been tinkering with the notion of co-director networks, using OpenCorporates data as a data source (eg Mapping Corporate Networks With OpenCorporates). What I’ve tended to focus on are networks built up from active companies and their current directors, looking to see which companies are currently connected by virtue of currently sharing the same directors. On the to do list are timelines showing the companies that a particular director has been associated with, and when, as well as directorial appointments and terminations within a particular company.
In both co-citation and bibliographic coupling analyses, the nodes are the same type of thing (that is, works that are cited, such as articles). A work cites a work. (Note: does author co-citation analysis rely on mappings from works to cited authors, or citing authors to cited authors?) In company-director networks, we have a bipartite representation, with directors and companies representing the two types of node, and where edges connect companies and directors but not companies and companies or directors and directors (unless a company is a director, but we generally fudge the labelling there).
If we treat “companies that retain directors” as “articles that cite other articles”:
- under a “bibliographic coupling” style view, we generate links between companies that share common directors (the companies “cite” the same directors);
- under a “co-citation” style view, we generate links between directors of the same companies (the directors are “co-cited” by the same company).
I’ve been doing this anyway, but the bibliographic coupling/co-citation distinction may help me tighten it up a little, as well as improving ways of calculating and analysing these networks by reusing analyses described by the bibliometricians?
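The two one-mode projections of the bipartite company-director graph can be sketched as follows. This is a rough illustration only: the company and director names are invented, and `invert` and `project` are hypothetical helper names, not part of any OpenCorporates tooling.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical data: company -> set of current directors.
directors_of = {
    "Acme Ltd": {"Ann", "Bob"},
    "Bolt plc": {"Bob", "Cat"},
    "Core LLP": {"Ann", "Bob", "Cat"},
}

def invert(mapping):
    """Turn company -> directors into director -> companies."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)

def project(mapping):
    """Link every pair of keys, weighted by how many values they share."""
    weights = Counter()
    for a, b in combinations(sorted(mapping), 2):
        shared = len(mapping[a] & mapping[b])
        if shared:
            weights[(a, b)] = shared
    return weights

company_links = project(directors_of)            # companies sharing directors
director_links = project(invert(directors_of))   # directors sharing companies
print(dict(company_links))
print(dict(director_links))
```

The same `project` function serves both views; only the direction of the mapping changes, which is really all the distinction between the two projections amounts to.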
Pondering the “future vs. past” distinction, the following also comes to mind:
- at the moment, I am generating networks based on current directors of active companies;
- could we construct a dynamic (temporal?) hypergraph from hyperedges that connect all the directors associated with a particular company at a particular time? If so, what could we do with this graph?! (As an aside, it’s probably worth noting that I know absolutely nothing about hypergraphs!)
I’ve also started wondering about ‘director pathways’ in which we define directors as nodes (where all we require is that a person was a director of a company at some time) and directed “citation” edges. These edges would go from one director to other director nodes under the condition that the “citing” director was appointed to a particular company within a particular time period t1..t2 before the appointment to the same company of a “cited” director. If one director follows another director into more than one company, we increase the weight of the edge accordingly. (We could maybe also explore modes in which edge weights represent the amount of time that two directors are in the same company together.)
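As a rough sketch of how such pathway edges might be counted, the following assumes appointment records of the form (director, company, appointment date). The records, the function name `pathway_edges` and the default window of 1 to 365 days (standing in for t1..t2) are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical appointment records: (director, company, appointment date).
appointments = [
    ("Ann", "Acme Ltd", date(2010, 1, 10)),
    ("Bob", "Acme Ltd", date(2010, 6, 1)),
    ("Ann", "Bolt plc", date(2011, 3, 1)),
    ("Bob", "Bolt plc", date(2011, 9, 15)),
    ("Cat", "Bolt plc", date(2013, 5, 1)),
]

def pathway_edges(appointments, min_days=1, max_days=365):
    """Weight each (earlier, later) director pair by the number of companies
    in which the later director was appointed within the window after the earlier one."""
    weights = Counter()
    for d1, c1, t1 in appointments:
        for d2, c2, t2 in appointments:
            if c1 != c2 or d1 == d2:
                continue
            gap = (t2 - t1).days
            if min_days <= gap <= max_days:
                weights[(d1, d2)] += 1
    return weights

print(dict(pathway_edges(appointments)))
```

Here Bob follows Ann into two companies within a year, so the Ann to Bob edge gets weight 2; Cat arrives at Bolt plc too late to count under the default window, but would be picked up if the window were widened.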
The aim is… probably pointless and not that interesting. Unless it is… The sort of questions this approach would allow us to ask would be along the lines of: are there groups of directors whose directorial appointments follow similar trajectories through companies; or are there groups of directors who appear to move from one company to another along with each other?
Notes on Narrative Science and Automated Insights
In October 2009, the New York Times Media Decoder blog picked up on a story that had been doing the rounds about a research project called Stats Monkey from the Intelligent Information Laboratory at Northwestern University. The Robots Are Coming!, it declared, with the immediate rejoinder, Oh, They’re Here. Using play by play baseball data, Stats Monkey produced human readable reports of a baseball game, formulaic admittedly, but good enough, particularly when complemented by quotes from a post-match press conference report. Mechanical churnalism complementing data-driven analysis, cast into prose. (It’s worth noting that the Media Decoder post itself is little more than a restatement of what was presumably the Stats Monkey website blurb at the time.)
In April 2010, Bloomberg Businessweek Magazine asked Are Sportswriters Really Necessary?, describing how Narrative Science, a company that incorporated at the start of that year and spun out off the back of the Stats Monkey project, had teamed up with the Big Ten Network to produce automatically generated sports reports, a relationship that presumably continues to this day.
A year later, and Forbes magazine produced a report in June 2011 about GameChanger and Narrative Science: Fulfilling the Heretofore Unrealized Demand for Stilted Stories About Children’s Games, describing a tie-up between Narrative Science and GameChanger, a company that produces a scorekeeping app that allows sports fans, parents and coaches to capture data about a match.
(What other companies/apps are out there for crowdsourcing sports analytics in this way, I wonder?)
Using GameChanger data and Narrative Science story generation tools, it was possible to automate the creation of match reports for small audiences. I don’t know if these stories used to be freely accessible, but today the match reports appear to take the form of paywalled recap stories.
Paywall aside, examples of other stories generated by Narrative Science using GameChanger data can be found using a simple web search on the phrase “Powered by Narrative Science and GameChanger Media”
You can also just search for the byline, as for example it appears in this report:
In passing, it’ll be interesting to see how automatically generated stories start to feed into the glitch aesthetic (h/t @danmcquillan for introducing me to this phrase and the related notion of the new aesthetic in his presentation at #opentech last week).
September 2011 saw a media outlook report from Mediabistro’s Media Jobs Daily noting that Narrative Science’s ‘Robot Journalists’ Now Tackling Real Estate. The story links through to a page on Builder Online that provides a summary report of housing data for various US cities.
What this example, and the GameChanger example, show is how the generation of timely text stories can be automated on top of the regularly updated datasets. The use of natural language interpretive text to describe patterns observed in the underlying data presumably also has SEO benefits.
That same month, September 2011, saw another stats-to-insight company, again emerging from the automated interpretation of sports data, renaming itself from StatSheet to Automated Insights. Today, StatSheet continues to publish game recaps combining short natural language summaries with statistical charts, all of which are presumably automatically generated. Within a year, the parent company, Automated Insights, had scaled up and begun publishing recaps for Yahoo!’s fantasy sports matches.
More recently, Automated Insights have started producing realtime content feeds to support sports commentators – Real-time Insights for MLB – as well as feeding consumers via the stat.us powered Twitter feeds.
(See also: yseop, a French company that generates automated reports from data. [Any more?])
Fast forward to the start of 2013, and Narrative Science started publishing human readable prose reports based on US schools data (ProPublica: How To Edit 52,000 Stories at Once). They’re also doing a lot more work with financial reporting, for example with Forbes as well as for financial services clients, as this interview with Narrative Science’s Stuart Frankel describes.
Generating human readable reports from Google Analytics data and dashboards also appears to be a hot topic, with both Narrative Science (Automated Insight From Google Analytics With Quill) and Automated Insights (With Site Ai, Automated Insights Provides A Cliffs Notes Version Of Your Web Analytics) recently developing tools around this topic.
What I thought was particularly interesting about the ProPublica example was how it suggests a possible widespread future use of “automatically generated insight” pulling out headline interpretations from open data sets, as touched on in this great introductory technical presentation by Narrative Science’s Larry Adams (which also happens to mention the possibility of Narrative Science offering platform services via an API…? It also mentions work with the NHS?):
At one point during that presentation, Larry Adams suggests that Narrative Science use a small set of narrative templates or story types (“the horserace” for example, or “top 10”) to frame the construction of their stories, as well as mentioning the sorts of feature that they look for within a data set (trends and changes in trends, for example, or outliers). Another presentation, this time by Narrative Science’s Kris Hammond, also hints at some of the features they look for in data: “inflexion points, trends, correlations”.
So what sorts of techniques might we use ourselves to start generating the insights that we might be able to work up into simple narrative sentences, at least for starters?
Top 10, bottom 5 are easy pickings if we can rank the data somehow. I thought this trick for detecting inflexions by coding a time series symbolically and then using a regular expression to detect features was really interesting: Finding patterns in time series using regular expressions. And I wonder, how does the OpenSecrets anomaly tracker define the anomalies it detects?
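The symbolic coding trick can be sketched in a few lines of Python. This is my own minimal reading of the idea rather than the code from the linked post: each step in the series is coded as U (up), D (down) or F (flat), and a regular expression then looks for a run of rises followed by a run of falls (a peak) or the reverse (a trough). The function names and the default run length of 2 are assumptions.

```python
import re

def encode(series):
    """Code each step between consecutive values as U (up), D (down) or F (flat)."""
    symbols = []
    for prev, curr in zip(series, series[1:]):
        symbols.append("U" if curr > prev else "D" if curr < prev else "F")
    return "".join(symbols)

def find_turning_points(series, run=2):
    """Return (index, kind) for peaks/troughs flanked by runs of at least `run` steps."""
    coded = encode(series)
    patterns = {
        "peak": f"U{{{run},}}(?=D{{{run},}})",    # e.g. U{2,}(?=D{2,})
        "trough": f"D{{{run},}}(?=U{{{run},}})",  # e.g. D{2,}(?=U{2,})
    }
    points = []
    for kind, pattern in patterns.items():
        for match in re.finditer(pattern, coded):
            # coded[i] describes the step series[i] -> series[i + 1], so the
            # turning point sits at the index where the first run ends.
            points.append((match.end(), kind))
    return sorted(points)

series = [1, 2, 3, 5, 4, 2, 1, 3, 6, 6]
for index, kind in find_turning_points(series):
    print(f"There was a {kind} of {series[index]} at position {index}.")
```

The print statement at the end hints at how a detected feature might be dropped straight into a simple narrative sentence template.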
Are We Just Google’s Lab Rats?
There are some interesting comments relating to my previous post on Google Lock-In Lock-Out in a comment thread on OSnews: Why Google gets so much credit. Here are some of my own lazy Sunday morning notes/thoughts relating to that, and other comments…
- killing Google Reader does not kill RSS/there was no “malicious intent” mapping out the Reader/RSS strategy:
A nice phrase in an #opentech talk yesterday was that we (technologists and engineers and data scientists, for example) have to “act responsibly”. Google Reader helped popularise feed reading when some of us were hopeful for its future (“We ignore RSS at Our Peril”), and as such attracted many readers away from other clients (myself included), with the result that competition was harder (“compete against Google? Hmm… maybe not…”). Google Reader’s infrastructure and unofficial APIs enabled folk to build services off the back of the Google Reader infrastructure, turning it into de facto infrastructure for other people’s applications and services. (Remember: the Google Maps API was unofficial at first.) There aren’t many OPML bundlers out there, for example, but for hackers into appropriating tech Google Reader is one. Since I moved away from Google Reader (to theoldreader) I haven’t used Flipboard so much, which as far as I was concerned was using Reader essentially as infrastructure. Caveat emptor, I guess, for developers building on top of other companies’ services (as many Twitter and Facebook app developers keep discovering).
With Feedburner, Google bought up a service that acted as a proxy, taking public syndication feeds, instrumenting them with analytics, and then encouraging the people taking up the syndicated content to subscribe to the Feedburner feed. Where RSS and Atom were designed to support syndication between independent parties, Feedburner – and then Google – insinuated itself between those parties. By replacing self-controlled feeds as the subscription endpoint with Google controlled endpoints, publishers gave up control of their syndication infrastructure. With Google losing interest in open syndication feeds as it pursues its own closed content network agenda, we are faced with a situation whereby Google can potentially trash a widespread syndication infrastructure that would have remained resilient if Google hadn’t insinuated itself into it. Or if we hadn’t been so stupid as to simplistically accept its overtures.
Hmmm… thinks… do we need a Google users’ motto? Don’t be stupid perhaps…?!
I applaud Google for developing the services it does, getting them to scale and opening up API access. But as these services become de facto infrastructure, the question of how Google acknowledges any responsibility that flows from this (even if this responsibility is incorrectly assumed) becomes an issue. Responsibilities arise in other areas too, of course. Such as taxation and corporate transparency. But that’s another issue. (Would Google act differently if its motto was “Be responsible” or “Act responsibly” rather than “Don’t be evil”? It strikes me that “Act responsibly” could work as a motto for both companies and their users?)
It seems to me that with Google+, Google is not adopting open syndication standards in two ways: not using them “internally”, and not making feeds publicly available. There may be good technical reasons for the first, but by the second Google is *not allowing* its community members to participate in an open content syndication network/system. Google’s choice, but I’m not playing.
Google is not killing the open standards by closing off access to them in commercial licensing terms, but it may contribute to stifling their adoption by adopting alternative standards that others feel they have to adopt because of the influence Google has on web traffic.
Consider this other way of looking at it – Google is presumably trying to get other parties to adopt WebP by developing it as an open standard. Google assumes that it can drive adoption of this as a web standard by adopting it itself. In terms of argumentation, it doesn’t follow that by not adopting something Google can prevent it being adopted (i.e. that by not adopting a standard, or by stopping its own use of one, Google kills it generally), but people follow bad logic all the time (and if they follow Google for their technology choices, or have a technology model based on being parasitic on Google infrastructure, Google’s dropping of a standard effectively kills it for those people) …
- control of what we see
Google makes money by putting ad-links in front of eyeballs that people click on. By presenting “relevant” ads, Google presumably tries to maximise the click-thru rate so that it can make more money per displayed link.
To encourage you to spend your attention on pages that Google controls, Google has adopted the idea that by presenting you (and me; us) with “relevant” content, we are likely to remain engaged. With Google web search, the relevance of search results supposedly attracts us back to the Google search tool. With services such as Google Now, Google pre-emptively tries to present you with information it thinks you need, presumably based on predictive models of sequences of action that other people (or you yourself) have demonstrated in the past.
I’m not really up on behavioural psychology models, but I have a vague memory that intermittent reinforcement schedules were demonstrated to be one of the more effective modes of behaviourist training/operant conditioning. So I wonder: how effective are predictive intermittent positive reinforcement schedules? (You get the idea, right? We’re pigeons that peck at Android phones and Google is the experimenter trying to get us to peck the right way, by reinforcing us every now and again by satisfying our intent. That is, has there been a flip away from Google using us to provide reinforcement training signals to its algorithms, into a situation in which we have become Google’s experimental lab rats that are coupled in a series of ongoing experiments that train us and its algorithms, jointly, together, to maximise… something…?)
There is a danger, I think, in Google chasing the “relevance” thing too far, seeing the maximisation of whatever conversion metrics it decides on as being a sign that it has “got things right” for us, that it is satisfying our “intent”. And if operant conditioning does influence the way we behave, maybe we do actually need to start thinking about what the machine algorithms are training us to do. Are training us to do. Training us.
Google’s stated aim is to “organize the world’s information and make it universally accessible and useful”.
- Through web search, it started to organise the information it presented to us through search results that were more appealingly ranked (seemed “more relevant”) than those of the other search engines.
- Through personalised search, it started to organise the way it presented results to each of us individually.
- Through web tracking, it presents us with information – adverts – organised in a way it presumably thinks is more personally meaningful to us (but maximising what metric exactly? More likely to cause us to act in a particular way, as measured by whether we click the link, or linger on a page, or engage in a particular behaviour that can be captured – for model building and exploitation purposes – by web tracking algorithms?)
- Through Google Now, and the new Google image gallery tools, Google is seeking to organise our information (we’re part of the world, right?) on our behalf and present it back to us in a way that the Google algorithms decide.
The old photos in a drawer back at my family home are sorted howsoever (by whatever algorithm “use” and random access results in). Now they’ll be sorted by Google. Maybe the algorithms are similar. Or maybe they’re not. What would be evil, I think, would be if the ranking algorithms that are used to decide the order in which organic information is presented to us start to be influenced by the algorithms that are tied to advertising or marketing, that is, to algorithms that are used to try to maximise the extent to which we are influenced in accord with the goals, beliefs, desires and intents of others (with a hat tip there to agent logic and the theories of intelligent software agents).
At the moment I believe that Google believes it is trying to develop algorithms that benefit us personally, in a utilitarian way. But I’m not sure what function it is they are maximising or how they think it maps onto any personal theories or preferences we may have about what is “accessible” and “useful”. I guess we might also ask whether “accessible” and “useful” are the road to a Good Life (because in the end this comes down to philosophy and ethics, doesn’t it?) or whether we should be “organising the world’s information” with some other purpose in mind?
PS Just by the by, it’s worth noting that the educational arena is seeking to use learning analytics to instrumentalise our behaviour and engagement within learning systems and contexts for our, erm, learning benefit. (Measured how?)
Google Lock-In Lock-Out
As John Naughton feels obliged to remind folk every now and again, the web is not the internet. Because we all know that for many people, Facebook apparently is. Or Google is.
And as anyone following my tweets over the last year or two will know, I’ve started finding Google more and more irksome.
It’s not just that the one or two people I know who use Google Plus (Google+?) are now all but lost to me as sources of neat ideas because I don’t do Gooplus and it doesn’t do RSS…
It’s not just because Google is shutting down the Google Reader backbone that powers a lot of RSS and Atom syndication feed services (and leaves me wondering: how long is Feedburner for this world? Maybe it’s time to start moving your feeds and trying to get folk off that piece of infrastructure…)…
It’s not just that geocoding done within Fusion Tables is not exported – if you look at a KML feed from Google Fusion Tables, you’ll find there’s no lat-long data there. To get a geo-view, you need to stay within Google Fusion Tables or wire the feed into Google Earth, which will then “initiate geocoding of location descriptions while viewing [the] KML file”…
It’s not just that Google is deprecating gadgets from spreadsheets, which as Martin points out means that if I want to visualise data in a spreadsheet all I’m going to be left with is Google’s crappy charts…
It’s not just that Google moved away from using CalDav to support calendar interoperability… (announcement: “CalDAV API will become available for whitelisted developers, and will be shut down for other developers on September 16, 2013. Most developers’ use cases are handled well by Google Calendar API, which we recommend using instead.”)
It’s not just that Google is moving away from using the XMPP instant messaging protocol (nor, I think, is it making a move towards using MQTT?)…
It’s not just that Google will be using your photos to create photos you never took and presumably offer them up via your image gallery in favour of photos it thinks aren’t up to scratch…
Though I’m sure that Google wouldn’t start pushing images in just the WebP image format so that you’d feel obliged to use Chrome…
And also in the browser, I’m sure Google wouldn’t start using Google Public DNS as a Chrome default setting. (Is the same true of Chromebook? Presumably folk connected to Google Fiber use Google Public DNS?) But does it use SPDY as a default? How about on Android?
It’s not just that Google will tag your social media posts using tags you might never use yourself, and as it does so altering the externalised memory embodied by that post…
It’s not just that as web search gets increasingly personalised and localised, we lose any sense of Google ground truth; I’m not quite sure how the info-skills trainers are going to address this when training a motley crew of different learners to discover a particular resource other than by using known-item search strategies (which sort of misses the point). Or maybe it’s right that a cohort of students should all get different results when they run ostensibly the same search?
Hmmm.. thinks: if personalised/localised search could be reduced to raw search phrase (whatever I put in the search box) plus a set of invisible search limits that reflect the personalisation/localisation tweaks applied to my search, how might my hidden/invisible search limits compare with yours?
It’s not just that Google uses tax efficient corporate structures to minimise its tax bill, because lots of companies do that…
It’s not just any one of these things, taken on its own merits… it’s all of them taken together…
“Embrace, extend, extinguish”… where have we heard that before?
Drip; drip; drip…
PS see also M. Wunsch on The Great Google Goat Rodeo
PPS Although not an open standard, I forgot this one – Google dropped support for the closed Microsoft ActiveSync protocol (see also Google Sync End of Life)
Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language
There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners, we can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.
To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).
The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned in OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.
To query the data, I’ve popped up a simple query form on Scraperwiki: Google Spreadsheet Explorer
To use the explorer, you need to:
- provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
- preview the table headings;
- construct a query using the column letters;
- select the output format;
- run the query.
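If you would rather call the query endpoint directly than go through the explorer form, something like the following sketch builds the request URL. The URL pattern shown (`/gviz/tq` with `tq`, `tqx` and `gid` parameters) is my assumption about the current form of the Google Visualization endpoint, the helper name `query_url` is made up, and an old-style spreadsheet key such as the one above may no longer resolve.

```python
from urllib.parse import urlencode

def query_url(key, query, out="csv", gid=0):
    """Build a query URL for a spreadsheet key (assumed endpoint pattern)."""
    params = {
        "tq": query,          # the query itself, e.g. "SELECT A,D,E,F"
        "tqx": f"out:{out}",  # requested output format
        "gid": gid,           # sheet identifier within the spreadsheet
    }
    return f"https://docs.google.com/spreadsheets/d/{key}/gviz/tq?{urlencode(params)}"

print(query_url("0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc", "SELECT A,D,E,F"))
```

The sketch only constructs the URL; fetching it (and checking that the spreadsheet is actually public) is left out.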
So what sort of questions might we want to ask of the data? Let’s build some up.
We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F
We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC
Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').
How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC
Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to and including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the numbers for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results (FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'
Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.
How about the three electoral divisions with the lowest turn out? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…)
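One way to avoid tripping over clause order is to let a small helper assemble the query. The sketch below uses the clause order given in the Google Visualization query language reference (select, where, group by, pivot, order by, limit, offset, label, format, options); the helper name `build_query` and its keyword-argument interface are my own invention.

```python
# Clause order as listed in the Google Visualization query language reference.
CLAUSE_ORDER = ["select", "where", "group by", "pivot", "order by",
                "limit", "offset", "label", "format", "options"]

def build_query(**clauses):
    """Assemble clauses in the required order, whatever order they were passed in.

    Keyword names use underscores for spaces, e.g. order_by, group_by.
    """
    normalised = {name.replace("_", " "): value for name, value in clauses.items()}
    unknown = set(normalised) - set(CLAUSE_ORDER)
    if unknown:
        raise ValueError(f"Unknown clause(s): {sorted(unknown)}")
    return " ".join(f"{name.upper()} {normalised[name]}"
                    for name in CLAUSE_ORDER if name in normalised)

query = build_query(
    limit=3,
    format="100*F/B '#,#0.0'",
    label="100*F/B 'Percentage'",
    select="A, 100*F/B",
    where="D CONTAINS 'Total'",
    order_by="100*F/B ASC",
)
print(query)
```

Even though the clauses are passed in a jumbled order, the assembled string puts LIMIT after ORDER BY and before LABEL and FORMAT, matching the lowest turnout query.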
The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A
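To see why SUM(F)*COUNT(B)/SUM(B) works, here is the same calculation run over a toy table using Python's built-in sqlite3 module. The division names and numbers are made up, SQLite's LIKE stands in for CONTAINS, and I have added the factor of 100 to give a percentage.

```python
import sqlite3

# Made-up rows: (A: division, B: electoral roll, D: row label, F: count).
rows = [
    ("Northtown", 2000, "Total votes cast", 790),
    ("Northtown", 2000, "Papers spoilt", 10),
    ("Southville", 3000, "Total votes cast", 1485),
    ("Southville", 3000, "Papers spoilt", 15),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (A TEXT, B INTEGER, D TEXT, F INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# SUM(B)/COUNT(B) recovers the roll size (it appears once per summed row),
# so SUM(F) / (SUM(B)/COUNT(B)) is the same as SUM(F) * COUNT(B) / SUM(B).
turnout = conn.execute("""
    SELECT A, 100.0 * SUM(F) * COUNT(B) / SUM(B) AS pct
    FROM results
    WHERE D LIKE '%Total%' OR D LIKE '%spoil%'
    GROUP BY A
    ORDER BY A
""").fetchall()

for division, pct in turnout:
    print(division, pct)
```

For Northtown, 790 good votes plus 10 spoilt papers over a roll of 2000 gives 40%, and the roll is correctly counted once even though it appears in both of the summed rows.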
In terms of popularity, who were the top 5 candidates by number of votes received? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5
How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5
Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC
How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC
To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.
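One possible workaround for the lack of nested SELECTs is to run a flat query and do the second step client-side. As a sketch, suppose we want the winner in each electoral division, which in SQL would typically need a subquery. The rows below are a made-up stand-in for the output of SELECT A,D,E,F WHERE E != 'NA', and `winners` is a hypothetical helper.

```python
# Hypothetical output of: SELECT A,D,E,F WHERE E != 'NA'
# Columns: division, candidate, party, votes.
rows = [
    ("Northtown", "Jo Bloggs", "Red", 450),
    ("Northtown", "Sam Smith", "Blue", 340),
    ("Southville", "Al Jones", "Blue", 800),
    ("Southville", "Di Brown", "Red", 685),
]

def winners(rows):
    """Keep the highest-polling candidate seen so far for each division."""
    best = {}
    for division, candidate, party, votes in rows:
        if division not in best or votes > best[division][2]:
            best[division] = (candidate, party, votes)
    return best

for division, (candidate, party, votes) in sorted(winners(rows).items()):
    print(f"{division}: {candidate} ({party}) with {votes} votes")
```

This takes no account of ties or of the two-seat divisions, so treat it as a starting point rather than a finished recipe.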
To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column), and just return those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here. The support that follows where, along with candinit, names a column:
select * from support where support in (select candinit from support)
whereas the support that follows each from names the table the columns are being pulled from:
select * from support where support in (select candinit from support)
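To try the query away from Scraperwiki, here is a minimal sketch using Python's built-in sqlite3 module. The two-column table layout and the names in it are invented for illustration; the real support table may well have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: the candidate (candinit) and one of their assentors (support).
conn.execute("CREATE TABLE support (candinit TEXT, support TEXT)")
conn.executemany("INSERT INTO support VALUES (?, ?)", [
    ("Jo A. Bloggs", "Sam B. Smith"),   # Smith is also a candidate
    ("Jo A. Bloggs", "Pat C. Green"),
    ("Sam B. Smith", "Lee D. White"),
    ("Al E. Jones", "Jo A. Bloggs"),    # Bloggs is also a candidate
])

rows = conn.execute(
    "select * from support where support in (select candinit from support)"
).fetchall()

for candidate, assentor in rows:
    print(f"{assentor} (a candidate) supported {candidate}")
```

Only the two rows in which the assentor is also a candidate come back; each corresponds to one edge in the candidate-supports-candidate network.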
Here’s the result of running the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:
Simples:-)
More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?
In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each other’s applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)
This is what we’ll be working towards:
If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.
Here’s a reminder of what the original PDF doc looked like (archive copy):
Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…
When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last). We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
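To make that concrete, here is a small Python sketch of exact matching plus the sort of name recasting described above. The function name is hypothetical, and the input layout ("LASTNAME, Firstname Middle") is my assumption for illustration rather than a record of what the scraper actually does.

```python
def presentation_form(raw_name):
    """Recast 'LASTNAME, Firstname Middle' as 'Firstname M. Lastname'.

    The comma-separated input layout is an assumption, not taken from the scraper.
    """
    last, _, given = raw_name.partition(",")
    parts = given.split()
    if not parts:
        # No given names found: fall back to the tidied surname alone
        return last.strip().title()
    first, middles = parts[0], parts[1:]
    initials = [m[0].upper() + "." for m in middles]
    return " ".join([first.capitalize()] + initials + [last.strip().title()])


# Exact matching is case sensitive
same = "This string" == "This string"
different = "This string" == "this string"

# Made-up names: only the identically formatted assentor name matches
candidate = presentation_form("SMITH, Jane Anne")
assentors = ["Jane A. Smith", "jane a. smith"]
matches = [name for name in assentors if name == candidate]
```

The catch with exact matching is that it is only as good as the normalisation: a stray space or a differently cased surname is enough to miss a match.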
We can download a CSV representation of the data from the scraper directly:
The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.
To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit column as Target and the support column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.
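If you'd rather not edit the header row by hand, the renaming can be sketched in a few lines of Python. The helper name, the sample row and the party column are all hypothetical; the sketch also assumes the CSV has a header row.

```python
import csv
import io


def rename_columns(csv_text, mapping):
    """Return csv_text with header names replaced according to mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Only the first (header) row is rewritten; data rows pass through untouched
    rows[0] = [mapping.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample data in the assumed column layout
sample = "candinit,support,party\nJane A. Smith,Bob C. Jones,Red\n"
renamed = rename_columns(sample, {"candinit": "Target", "support": "Source"})
print(renamed)
```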
Start Gephi, select Data Laboratory tab and then New Project from the File menu.
You should now see a toolbar that includes an “Import Spreadsheet” option:
Import the CSV file as such, identifying it as an Edges Table:
You should notice that the Source and Target columns have been identified as such and we have the choice to import the other columns or not – let’s bring them in…
You should now see the data has been loaded in to Gephi…
If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.
Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:
If you look closely, you’ll be able to see that there are many separate groupings of connected circles – these represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two nodes are connected when in fact they aren’t…)
However, there are also other groupings in which one candidate may support another:
These connections may allow us to see grouping of candidates supporting each other along party lines.
One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network. I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree 1 or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out-degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:
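The logic of that degree filter can be sketched outside Gephi in a few lines of Python. The edge list is made up, with each pair read as (assentor, candidate); this is my approximation of what the filter selects, not Gephi's own implementation.

```python
from collections import Counter

# Hypothetical (source, target) edges: assentor -> candidate
edges = [("A", "B"), ("B", "C"), ("D", "C"), ("E", "A")]

in_degree = Counter(target for _, target in edges)   # times supported
out_degree = Counter(source for source, _ in edges)  # times supporting

# Candidates (in-degree >= 1) who also support someone (out-degree >= 1)
supporting_candidates = sorted(
    node for node in in_degree if out_degree[node] >= 1
)
print(supporting_candidates)
```

Here C is a candidate who supports nobody, and D and E are assentors who are not candidates, so only A and B survive the filter.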
Hmmm… how about if we extend our filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also resize the nodes/labels so we can better see the candidates’ names? The Neighbors Network filter takes the nodes we have and then also finds the nodes that are connected to them, to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth 2)). Which is to say, it will bring in the candidates who are supported by candidates, and their supporters:
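A rough Python equivalent of that "neighbours to depth N" idea is a breadth-first search outwards from the nodes we already have. I'm assuming here that edge direction is ignored when finding neighbours; the function name and the sample edges are made up.

```python
from collections import deque


def neighbours_within(edges, seeds, depth):
    """Return the seed nodes plus every node within `depth` hops of a seed."""
    # Assumption: treat each edge as undirected when looking for neighbours
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    seen = set(seeds)
    frontier = deque((node, 0) for node in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return seen


edges = [("A", "B"), ("B", "C"), ("D", "C"), ("E", "A"), ("X", "Y")]
depth_one = neighbours_within(edges, {"B"}, 1)
depth_two = neighbours_within(edges, {"B"}, 2)
```

With B as the starting node, depth 1 pulls in A and C, and depth 2 also pulls in their supporters D and E, while the unconnected X and Y pair stays out.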
That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:
We can also experiment with other colourings – if we go to the Statistics panel, we can run the Connected Components statistic, which finds groups of nodes that are connected to each other but not to the rest of the network. We can then colour each of the separate groups uniquely:
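For the curious, finding connected components by hand is not too painful. This is a minimal sketch that ignores edge direction (so it finds what are usually called weakly connected components); the function name and edges are illustrative only, and it makes no claim to match how Gephi does it internally.

```python
def connected_components(edges):
    """Group nodes into sets that are linked by some path, ignoring direction."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    components = []
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(group)
    return components


edges = [("A", "B"), ("B", "C"), ("X", "Y")]
groups = sorted(sorted(group) for group in connected_components(edges))
print(groups)
```

Each group could then be given its own colour, which is essentially what the colouring step does.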
Let’s reset the colours and go back to colourings along party lines:
If we go to the Preview view, we can generate a prettified view of the network:
In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:
Going back to the Graph view and zooming in, we see that Paul G. Taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)
To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:
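The same check can be sketched in Python by collecting the distinct candidates each name assents to. The names and edges are invented for illustration. Bear in mind that because we match on exact name strings, two different people who share a name would be merged into a single node here, just as in the Gephi view.

```python
from collections import defaultdict

# Hypothetical (assentor, candidate) pairs
edges = [
    ("Sam T. Reed", "Ann B. Cole"),
    ("Sam T. Reed", "Gus H. Ives"),
    ("Jo K. Lane", "Ann B. Cole"),
]

supported = defaultdict(set)
for assentor, candidate in edges:
    supported[assentor].add(candidate)

# Names that support two or more distinct candidates (out-degree >= 2)
multiple_supporters = {
    assentor: sorted(candidates)
    for assentor, candidates in supported.items()
    if len(candidates) >= 2
}
print(multiple_supporters)
```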
Just that one then…
Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). Gephi filters support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the network as filtered so far up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.
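One way to think about chained filters is as a pipeline of functions, each taking an edge list and handing a smaller one to the next. The sketch below is my own loose analogy in Python, with hypothetical function names and data, rather than a description of how Gephi implements its filter tree.

```python
def touching(nodes):
    """Build a filter that keeps edges with at least one endpoint in nodes."""
    def run(edges):
        return [(s, t) for s, t in edges if s in nodes or t in nodes]
    return run


def drop_pure_assentors(edges):
    """Keep only edges whose source is itself a candidate (has an incoming edge)."""
    candidates = {target for _, target in edges}
    return [(s, t) for s, t in edges if s in candidates]


def chain(*filters):
    """Compose filters so each one acts on the output of the one before it."""
    def run(edges):
        for one_filter in filters:
            edges = one_filter(edges)
        return edges
    return run


edges = [("A", "B"), ("B", "C"), ("D", "C"), ("E", "A"), ("X", "Y")]
pipeline = chain(touching({"A", "B", "C"}), drop_pure_assentors)
filtered = pipeline(edges)
print(filtered)
```

The first filter throws away the unrelated X and Y pair, and the second then strips out D and E, who only ever appear as assentors.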
Here’s the result:
And again we can go into Preview mode to generate a nice vectorised version of the graph:
This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).
So there we have it: a quick tour of how to use Gephi to look at the co-support structure of a group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particularly amongst the independents. Along the way we saw how to make use of filters, and spotted something we need to check: whether the same person supported two candidates (if that is even allowed), or whether they are two different people sharing the same name.
If this all seems like too much effort, remember that there’s always the One-Liner, Thirty Second Route to the Info.
PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)