Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere), I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library’s commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:
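
By way of a sketch, the plotting calls look roughly like this, assuming a data frame counts with one row per commonly followed account, a screenName column, and one “number of MPs following” count column per party (the column names here are placeholders, not the ones in my actual script):

library(wordcloud)

#Build a term matrix: one row per followed account, one column per party
m=as.matrix(counts[,c('Conservative','Labour','LibDem')])
rownames(m)=counts$screenName

#Folk followed in significant numbers by MPs of all three parties
commonality.cloud(m, max.words=100, random.order=FALSE)

#Folk followed significantly and differentially by the MPs of each party
comparison.cloud(m, max.words=200, random.order=FALSE)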

There’s still a fair bit to do to make the methodology robust. For example, it needs to be able to cope with comparing folk commonly followed by different sets of users where the sizes of the sets differ to a significant extent (there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe it hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or a particular perspective, about the way different groups differentially follow folk on Twitter.

A couple of other quirks I’ve noticed about the comparison.cloud as currently defined. Firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear); there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud. For example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman, and @davegorman appeared as a small label in the friends part of the comparison.cloud (notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do…). What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed there because it would be too large).

That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets? Suggestions welcome as to what they might be…:-)

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…

Sketching Substantial Council Spending Flows to Serco Using OpenlyLocal Aggregated Spending Data

An article in today’s Guardian (Serco investigated over claims of ‘unsafe’ out-of-hours GP service) about services provided by Serco to various NHS Trusts got me thinking about how much local councils spend with Serco companies. OpenlyLocal provides a patchy(?) aggregating service over local council spending data (I don’t think there’s an equivalent aggregator for NHS organisations’ spending, or police authority spending?) so I thought I’d have a quick peek at how the money flows from councils to Serco.

If we search the OpenlyLocal Spending Dashboard, we can get a summary of spend with various Serco companies from local councils whose spending data has been logged by the site:

Using the local spend on corporates scraper I used to produce Inter-Council Payments Network Graph, I grabbed details of payments to companies returned by a search on OpenlyLocal for suppliers containing the keyword serco, and then generated a directed graph with edges defined: a) from council nodes to company nodes; b) from company nodes to canonical company nodes. (Where possible, OpenlyLocal tries to reconcile companies identified for payment by councils with canonical company identifiers so that we can start to get a feeling for how different councils make payments to the same companies.)

I then exported the graph as a json node/edge list so that it could be displayed by Mike Bostock’s d3.js Sankey diagram library:

(Note that I’ve filtered the edges to only show ones above a certain payment amount (£10k).)
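
The heavy lifting was done by the Python scraper linked above, but purely as an illustration of the shape of data the d3.js Sankey plugin expects (a list of named nodes plus source/target/value links indexed into that list), here’s a hedged R sketch using jsonlite; the edges data frame and its column names are assumptions:

require(jsonlite)

#Assumed: 'edges' has columns from, to (node labels) and amount (payment value)
edges=subset(edges, amount>=10000)    #only keep flows of £10k and up

nodes=unique(c(edges$from, edges$to))   #one entry per council/company
links=data.frame(source=match(edges$from, nodes)-1,   #d3 uses 0-based indices
                 target=match(edges$to, nodes)-1,
                 value=edges$amount)

sankey=list(nodes=data.frame(name=nodes), links=links)
writeLines(toJSON(sankey, dataframe='rows'), 'sercoSankey.json')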

As a presentation graphic, it’s really tatty, doesn’t include amount labels (though they could be added) and so on. But as a sketch, it provides an easy to digest view over the data as a starting point for a deeper conversation with the data. We might also be able to use the diagram as a starting point for a data quality improvement process, by identifying the companies that we really should try to reconcile.

Here are the flows associated with spend to G4S-identified companies:

I also had a quick peek at which councils were spending £3,500 and up (in total) with the OU…

Digging into OpenlyLocal spending data a little more deeply, it seems we can get a breakdown of how total payments from council to supplier are made up, such as by spending department.

Which suggests to me that we could introduce another “column” in the Sankey diagram that joins councils with payees via spending department (I suspect the Category column would result in data that’s a bit too fine grained).

See also: University Funding – A Wider View

Sketching Sponsor Partners Running UK Clinical Trials

Using data from the clinicaltrials.gov registry (search for UK clinical trials), I grabbed all records relating to trials that have at least in part run in the UK as an XML file download, then mapped links between project lead sponsors and their collaborators. Here’s a quick sketch of the result:

ukClinicalTrialPartners (PDF)

The XML data schema defines the corresponding fields as follows:

<!-- === Sponsors ==================================================== -->

  <xs:complexType name="sponsors_struct">
    <xs:sequence>
      <xs:element name="lead_sponsor" type="sponsor_struct"/>
      <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

Here’s the Python script I used to extract the data and generate the graph representation of it (requires networkx), which I then exported as a GEXF file that could be loaded into Gephi and used to generate the sketch shown above.

import os
from lxml import etree
import networkx as nx
import networkx.readwrite.gexf as gf
from xml.etree.cElementTree import tostring


def flatten(el):
	#Flatten an XML element (and any children) down to its text content
	if el is not None:
		result = [ (el.text or "") ]
		for sel in el:
			result.append(flatten(sel))
			result.append(sel.tail or "")
		return "".join(result)
	return ''

def graphOut(DG):
	#Serialise the graph as GEXF so it can be loaded into Gephi
	writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
	writer.add_graph(DG)
	#print tostring(writer.xml)
	f = open('workfile.gexf', 'w')
	f.write(tostring(writer.xml))
	f.close()

def sponsorGrapher(DG,xmlRoot,sponsorList):
	#Add nodes for the lead sponsor and each collaborator on a trial,
	#with edges from the lead sponsor to each collaborator
	sponsors_xml=xmlRoot.find('.//sponsors')
	lead=flatten(sponsors_xml.find('./lead_sponsor/agency'))
	if lead !='':
		if lead not in sponsorList:
			sponsorList.append(lead)
			DG.add_node(sponsorList.index(lead),label=lead,name=lead)

	for collab in sponsors_xml.findall('./collaborator/agency'):
		collabname=flatten(collab)
		if collabname !='':
			if collabname not in sponsorList:
				sponsorList.append(collabname)
				DG.add_node( sponsorList.index( collabname ), label=collabname, name=collabname, Label=collabname )
			if lead !='':
				DG.add_edge( sponsorList.index(lead), sponsorList.index(collabname) )
	return DG, sponsorList

def parsePage(path,fn,sponsorGraph,sponsorList):
	#Parse a single clinicaltrials.gov XML record and add its sponsors to the graph
	fnp='/'.join([path,fn])
	xmldata=etree.parse(fnp)
	xmlRoot = xmldata.getroot()
	sponsorGraph,sponsorList = sponsorGrapher(sponsorGraph,xmlRoot,sponsorList)
	return sponsorGraph,sponsorList

XML_DATA_DIR='./ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)

sponsorDG=nx.DiGraph()
sponsorList=[]
for page in listing:
	if os.path.splitext( page )[1] =='.xml':
		sponsorDG, sponsorList = parsePage(XML_DATA_DIR,page, sponsorDG, sponsorList)

graphOut(sponsorDG)

Once the file is loaded in to Gephi, you can hover over nodes to see which organisations partnered which other organisations, etc.

One thing the graph doesn’t show directly is links between co-collaborators – edges go simply from lead partner to each collaborator. It would also be possible to generate a graph that represents pairwise links between every sponsor of a particular trial, as sketched below.
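
Purely as an illustration of the pairwise idea (sketched in R/igraph rather than by extending the Python script above, and assuming the sponsor names have already been pulled out of the XML into a list, trial_sponsors, with one character vector of sponsor names per trial):

require(igraph)

#Build an edge for every pair of co-sponsors on each trial
edgeList=do.call(rbind, lapply(trial_sponsors, function(sponsors){
  sponsors=unique(sponsors)
  if (length(sponsors)<2) return(NULL)
  t(combn(sponsors, 2))
}))

g=graph_from_edgelist(edgeList, directed=FALSE)
write_graph(g, 'ukClinicalTrialPairs.graphml', format='graphml')   #loads into Gephi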

The XML data download also includes information about the locations of trials (sometimes at the city level, sometimes giving postcode level data). So the next thing I may look at is a map to see where sponsors tend to run trials in the UK, and maybe even see whether different sponsors tend to favour different trial sites…

Further down the line, I wonder whether any linkage can be made across to things like GP practice level prescribing behaviour, or grant awards from the MRC?

PS these may be handy too – World Health Organisation Clinical Trials Registry portal, EU Clinical Trials Register

PPS looks like we can generate a link to the clinicaltrials.gov download file easily enough. Original URL is:
http://clinicaltrials.gov/ct2/results?cntry1=EU%3AGB&show_flds=Y&show_down=Y#down
Download URL is:
http://clinicaltrials.gov/ct2/results/download?down_stds=all&down_typ=study&down_flds=shown&down_fmt=plain&cntry1=EU%3AGB&show_flds=Y&show_down=Y
So now I wonder: can Scraperwiki accept a zip file, unzip it, then parse all the resulting files? Answers, with code snippets, via the comments, please:-) DONE – example here: Scraperwiki: clinicaltrials.gov test

Visualising F1 Telemetry Data and Plotting Latitude and Longitude with ggplot Map Projections in R

Why don’t X-Y plots of latitude and longitude data look “right” compared to traditional map views?

For example, here’s an X-Y scatterplot of some of Jenson Button’s McLaren telemetry data from the 2010 Australian Formula One Grand Prix:

The image was generated, from a data file hosted on Google Spreadsheets, using the following R script, and the ggplot2 library:

require(ggplot2)
require(RCurl)

#gsqAPI is a helper function that loads data in from a Google Spreadsheet that has been shared as public.
gsqAPI = function(key,query,gid=0){ return( read.csv( paste( sep="",'http://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid) ) ) }

#Provide the spreadsheet key
#Data was originally grabbed from the McLaren F1 Live Dashboard during the race and is Copyright (c) McLaren Marketing Ltd 2010 (I think? Or possibly Vodafone McLaren Mercedes F1 2010(?)). I believe that speed, throttle and brake data were sponsored by Vodafone.
key='0AmbQbL4Lrd61dER5Qnl3bHo4MkVNRlZ1OVdicnZnTHc'
#We can write SQL like queries over the spreadsheet, as described in https://blog.ouseful.info/2009/05/18/using-google-spreadsheets-as-a-databace-with-the-google-visualisation-api-query-language/
q='select *'

#Run the query on the database
df=gsqAPI(key,q)

#Sanity check - preview the imported data
head(df)

#Example circuit map - sort of - showing the gLat (latitudinal 'g-force') values around the circuit (point size is absolute value of gLat, colour has two values, one for + and one for - values (swing to left and swing to right)).
g=ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat)))
print(g)

What’s lacking is a projection from the everyday Cartesian coordinate system to something like a Mercator based projection. Fortunately, the Grammar of Graphics model that underpins ggplot allows us to write the necessary co-ordinate system transformation into our chart generating command:

ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat))) + coord_map(project="mercator")

Here’s the result:

(Note: I haven’t totally got my head round what the different co-ordinate transforms do, or how they relate to any sort of ‘reality’! But they’re another thing I’m now aware of…;-)

As to what the chart shows? It’s a plot of how the latitudinal (left-right) ‘g-force’ acts on Button as he tours the circuit. The points are coloured according to whether the force acts from the left or the right (+ or -) and sized according to the magnitude of the force (away from normal?). So it points out left and right hand corners and how tight they are, essentially;-)

To show just how easy it is to write simple graphics and even statistical charts using ggplot, here are a few more examples:

#Example "driver DNA" trace, showing low gear  throttle usage (distance round track on x-axis, lap number on y axis, node size is inversely proportional to gear number (low gear, large point size), colour relativ to throttlepedal depression
g=ggplot(df) + geom_point(aes(x=sLap,y=Lap,col=rThrottlePedal,size=-NGear)) + scale_colour_gradient(low='red',high='green')
print(g)

I started calling things like the above chart “Driver DNA” charts – the x-axis represents distance round the track, the y-axis is lap number. In this case, nodes are sized inversely proportionally to the gear (so low gear, large pointsize) and coloured by throttle pedal pressure. You’ll notice how consistent Button is lap on lap. The idea behind the colouring/sizing for this chart was that it would provide a glimpse into behaviour around low gear turns.

#Example of gear value around the track
g=ggplot(df) + geom_line(aes(x=sLap,y=NGear))
print(g)

This chart soooo reminds me of simple op art…:-) I’m not sure how useful it is for showing gear selection according to distance round the track, but I just love the line of it:-)

#We can also show a trace for a single lap, such as speed coloured by gear
g=ggplot(subset(df,Lap==22)) + geom_line(aes(x=sLap,y=vCar,colour=NGear))
print(g)

ggplot can, of course, do line charts. In the above example, I make an initial exploration into how line segment colour can be used to highlight gear selection that allows the car to reach the speed (y-axis) it does as it goes round the circuit (x is distance round lap).

#We can also do statistical graphics - like a boxplot showing the distribution of speed values by gear
g = ggplot(df) + geom_boxplot(aes(factor(NGear),vCar))
print(g)

ggplot isn’t just about literal graphing from data elements directly to marks on a canvas. It can also do stats as part of the mapping; in this case, we generate a boxplot that summarises the range of speeds achieved for different gear values.

#Footwork - brake and throttle pedal depression based on gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),rThrottlePedal),colour='darkgreen') + geom_jitter(aes(factor(NGear),pBrakeF),colour='darkred')
print(g)

Some more dots… in the above case, I try to explore Button’s footwork in a little more detail, seeing how he applies brake and throttle pressure according to gear selection (throttle depression is green, brake is red).

#Forces on the driver
#gLong by brake and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=pBrakeF)) + scale_colour_gradient(low='red',high='green')
print(g)

In the diagrams immediately above and below, I try to show what sorts of longitudinal forces are typically experienced by Button according to gear selection, the idea being that we may get to see whether gears are used for linear acceleration or deceleration.

#gLong by throttle and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=rThrottlePedal)) + scale_colour_gradient(low='red',high='green')
print(g)

#gLong boxplot
ggplot(df) + geom_boxplot(aes(factor(NGear),gLong))+ geom_jitter(aes(factor(NGear),gLong),size=1)

Here, I use a boxplot to try to see whether or not the longitudinal g-force is typically experienced under acceleration or braking by gear. Note that the points are scattered according to random jitter about their actual, integer values.

Finally, here’s a look at how engine RPM and the car speed relate to gear selection. Would you be able to work out how to write this diagram?

Here’s how I did it…

#How do engine revs and speed relate to gear selction?
ggplot(df)+geom_point(aes(x=nEngine,y=vCar,col=factor(NGear)))

Hopefully what this quick tour of ggplot has illustrated is how easy it can be to generate a wide range of charts from the same data set. May all your charts be written, and then generated directly from their source data;-)

PS I did try to generate these images via CloudStat, but for some reason the images didn’t generate properly [WORKS NOW – THANKS @cloudstatorg] (it seemed to work fine when I tried just a couple?). Here’s the link anyway: Cloudstat: F1 telemetry demo. The plots were generated in RStudio and saved using ggsave().

As to how this fits in with other things? Regular readers may remember the occasional rant I’ve had about the importance of providing the queries that map from data sets onto summary data tables so that the means by which the summaries were generated from open raw data are made transparent. In a similar way, data cleansing tools such as Google Refine or Stanford Data Wrangler allow you to log the transformations applied to a messy raw data set in order to get it into a state where you can actually work with it. The same is true of images. It’s far too easy to generate complex graphics from even more complex datasets, and then forget how the image was actually created, maybe what it even represents. By writing the diagram, essentially generating a query that maps from the data onto the visual representation provided by the generated diagram or chart, we preserve the audit trail from data to the chart output.

In the same way we might imagine the word equation:

DATA + QUERY = REPORT

we might also imagine:

DATA + GRAPHER = CHART

(You get the idea? The words are probably not the right words, but the sentiment is there…)

In an academic research setting, where it’s common to find lists of figures presented separately in books or theses, it would also make sense to include a ‘figure appendix’ which gives, for example, the ggplot commands required to generate each statistical graphic presented from the actual data source.

Sigh…if only my first name was Damien and I could persuade my minions to do the Lichtenstein thing and paint-by-numbers over some large projections of some of the more aesthetically interesting charts;-)

Visual Conversations With Data

Just noticed that I didn’t post a copy of the second of my three presentations last week, Visual Conversations With Data, delivered to the eSTeEM Colloquium Pictures to Help People Think and Act on diagramming, et al., in education.

Something I meant to say, but didn’t, is that one of the problems with folks’ prior expectations or assumptions about data visualisations is that they reveal a single truth. I’m not sure that’s the case (I’ve started pondering the phrase “no truth, many truths” in this respect) which is another reason I see the role of visualisation as being a participant in a conversation where you explore questions and ideas and try to actively tease out different perspectives around a hypothesis based on the data.

For a related take on this idea of “many possible truths”, see Paul Bradshaw’s The Straw Man of Data Journalism’s “Scientific” Claim.

More Thoughts on a Content Strategy for Data – Many Eyes and Google Fusion Tables

It’s one thing publishing data just to comply with a formal requirement to make it public, quite another if you’re publishing it because you want folk to do something with it.

But if you decide you’re publishing data because you want folk to do something with it, what does that mean exactly?

[Not quite related but relevant: Pete Sefton (@ptsefton) on Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.]

One answer might be that you want them to be able to refer to the data for their own purposes, simply by cutting and pasting a results datatable out of one of your spreadsheets so they can paste it into one of theirs and refer to it as “evidence”:

Reference summary data

Another might be that you want folk to be able to draw on your data as part of their own decision making process. And so on. (For a couple of other use cases, see First Thoughts On A Content Strategy for Data.)

A desire that appears to have gained some traction over the last couple of years is to publish data so that folk can produce visualisations based on it. This is generally seen as a Good Thing, although I’m not sure I know exactly why…? Perhaps it’s because visualisations are shiny objects and folk can sometimes be persuaded to share (links to) shiny objects across their social networks; this in turn may help raise wider awareness about the existence of your data, and potentially bring it to the attention of somebody who can actually make some use of it, or extract some value from it, possibly in combination with one or more other datasets that you may or may not be aware of.

Something that I’ve become increasingly aware of over the last couple of years is that people respond to graphics and data visualisations in very different ways. The default assumption seems to be that a graphic should expose some truth in very obvious way without any external input. (I’m including things like axis labels, legends, and captions in the definition of a graphic.) That is, the graphic should be a self-contained, atomic object, meaningful in its own right. I think this view is borne out of the assumption that graphics are used to communicate something that is known by the author who used it to their audience. The graphic is chosen because it does “self-evidently” make some point that makes the author’s case. Let’s call these “presentation graphics”. Presentation graphics are shiny objects, designed to communicate something in particular, to a particular audience, in (if at all possible) a self-contained way.

Another way of using visualisations is as part of a visual analysis process. In this case, visual representations of the data are generated by the analyst as part of a conversation they are having with the data. One aim of this conversation (or maybe we should call it an interrogation?!) may be to get the data to reveal something about its structure, or meaningful patterns contained within it. Visual analysis is therefore less to do with the immediate requirement of producing meaningful presentation graphics, and more to do with getting the data to tell its story. Scripted speeches contain soundbites – presentation graphics. Conversations can ramble all over the place and are often so deeply situated in a particular context they are meaningless to onlookers – as visualisations produced during a visual analysis activity may be. (Alternatively, the visual analyst spends their time trying to learn how to ride a bike. Chris Hoy and Victoria Pendleton show how it’s done with the presentation graphics…)

It could be that I’m setting up something of a false dichotomy between extrema here, because sometimes a simple, “directly generated” chart may be effective as both a simple 1-step visual analysis view, and as a presentation graphic. But I’m trying to think through my fingers and type my way round to what I actually believe about all this stuff, and arguing to limits is one lazy way of doing this! The distinction is also not just mine… For example, Iliinsky and Steele’s Designing Data Visualizations identifies the following:

Explanatory visualization: Data visualizations that are used to transmit information or a point of view from the designer to the reader. Explanatory visualizations typically have a specific “story” or information that they are intended to transmit.
Exploratory visualization: Data visualizations that are used by the designer for self-informative purposes to discover patterns, trends, or sub-problems in a dataset. Exploratory visualizations typically don’t have an already-known story.

They also define data visualizations as “[v]isualizations that are algorithmically generated and can be easily regenerated with different data, are usually data-rich, and are often aesthetically shallow.” Leaving aside the aesthetics, the notion that data visualisations can be “algorithmically generated” is important here.

A related insight I picked up from the New York Times’ Amanda Cox is the use of statistical charts and visual analysis as sketches that help us engage with data en route to understanding some of the stories it contains, stories that may then be told by whatever means are appropriate (which may or may not include graphical representations or visualisations).

So when it comes to publishing data in the hope that folk will do something visual with it, does that mean we want to provide them with the data that can be directly used to convey some known truth in an appealing way, or do we want to provide them with data in such a way that they can engage with it in a (visual) analytic way and then communicate their insight through a really beautiful presentation graphic? (Note that it may often be the case that something discovered through a visual analysis step may actually best be communicated through a simple set of ranked, tabulated data presented as text…) Maybe this explains why folk are publishing the data in the hope that it will be “visualised”? They are conflating visual analysis with presentation graphics, and hoping that novel visualisation (visual analysis) techniques will: 1) provide new insights (new sense made from) the data, that: 2) also work as shiny, shareable and insightful presentation graphics? Hmmm…

Publishing Data to Support Visual Engagement

So we have our data set, but how can we publish it in a way that supports generative visual engagement with it (generative in the sense that we want the user to have at least some role in creating their own visual representations of the data)?

The easiest route to engagement is to publish an interactive visualisation on top of your data set so that the only way folk can engage with the data is through the interactive visualisation interface. So for example, interactive visualisations published by the BBC or New York Times. These typically support the generation of novel views over the data by allowing the user to construct queries over the data through interactive form elements (drop down lists, radio buttons, sliders, checkboxes, etc.); these queries are then executed to filter or analyse the data and provide a view over it that can be visually displayed in a predetermined way. The publisher may also choose to provide alternative ways of visualising the data (for example, scatter plot or bar chart) based on preidentified ways of mapping from the data to various graphical dimensions within particular chart types. In the case of the interactive visualisation hosted on the publisher’s website, the user is thus typically isolated from the actual data.

An alternative approach is to publish the data in an environment that supports the creation of visualisations at what we might term the row and column level. This is where ideas relating to a content strategy for data start to come in to play. An example of this is IBM’s Many Eyes data visualisation tool. Long neglected by IBM, the Many Eyes website provides an environment for: 1) uploading tabular datasets; 2) generating preconfigured interactive visualisations on top of the datasets; 3) providing embeddable versions of visualisations; 4) supporting discussions around visualisations. Note that a login is required to upload data and generate new visualisations.

As an example of what’s possible, I uploaded a copy of the DCMS CASE data relating to Capital Investment to Many Eyes (DCMS CASE data – capital investment (modified)):

CASE Data on Many Eyes

Once the data is uploaded, the user has the option of generating one or more interactive visualisations of the data from a wide range of visualisation types. For example, here’s a matrix chart view (click through to see the interactive version; note: Java required).

Many Eyes example

And here’s a bubble chart:

Many Eyes Bubblechart

In an ideal world, getting the data into Many Eyes should just(?) have been a case of copying and pasting data from the original spreadsheet. (Note that this requires access to an application that can open the spreadsheet, either on the desktop or online.) In the case of the DCMS CASE data, this required opening the spreadsheet, finding the correct sheet, then identifying the cell range containing the data we want to visualise:

DCMS CASE data - raw

Things are never just that simple, of course… Whilst it is possible to define columns as “text” or “number” in Many Eyes, the date field was recognised by the Many Eyes visualisation tools as a “number”, which may lead to some visualisations incorrectly aggregating data from the same region across several date ranges. In order to force the Many Eyes visualisations to recognise the date column as “text”, I had to edit the original data file (before uploading it to Many Eyes) by prepending the date ranges with an alphabetic character (so for example I replaced instances of 2004/05 with Y2004/05).
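
For the record, that tweak is a one liner if you do it programmatically rather than by hand (the file and column names here are placeholders):

case=read.csv('caseCapitalInvestment.csv', stringsAsFactors=FALSE)
case$Date=paste('Y', case$Date, sep='')   #2004/05 -> Y2004/05
write.table(case, 'caseCapitalInvestment_manyeyes.txt', sep='\t', row.names=FALSE, quote=FALSE)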

Recommendation In terms of a “content strategy for data”, then, we need to identify possible target domains where we wish to syndicate or republish our data and then either: 1) publish the data to that domain, possibly under our own branding, or “official” account on that domain – this approach also allows the publisher to add provenance metadata, a link to the homepage for the data or its original source, and so on; or, 2) publish the data on our site in such a way that we know it will work on the target domain (which means testing it/trying it out…). If you expect users to upload your data to services like Many Eyes themselves, it would make sense to provide easily cut and pastable example text of the sort you might expect to see appear in the metadata fields of the data page on the target site and encourage users to make use of that text.

Recommendation A lot of the CASE data spreadsheets contain multiple cell ranges corresponding to different tables within a particular sheet. Many cut and paste tools support data that can be cut and pasted from appropriately highlighted cell ranges. However, other tools require data in comma separated (rather than tab separated) format, which means the user must copy and paste the data into another sheet and then save it as CSV. Although a very simple format, there is a lot to be said for publishing very simple CSV files containing your data. Provenance and explanatory data often gets separated from data contained in CSV files, but you can always add a manifest text file to the collection of CSV data files to explain the contents of each one.

Whilst services such as Many Eyes do their best in trying to identify numeric versus categorical data columns, unless the user is familiar with the sorts of data a particular visualisation type requires and how it should be presented, it can sometimes be hard to understand why Many Eyes has automatically identified particular values for use in drop down list boxes, and at times hard to interpret what is actually being displayed. (This is a good reason to limit the use of Many Eyes to a visual analysis role, and use it to discover things that look interesting/odd and then go off and dig in the data a little more to see if there really is anything interesting there…)

In some cases, it may be possible to reshape the data and get it into a form that Many Eyes can work with. (Remember the Iliinsky and Steele definition of a data visualisation as something “algorithmically generated”? If the data is presented in the right way, then Many Eyes can do something with it. But if it’s presented in the wrong way, no joy…) As an example, if we look at the CASE Capital Investment data, we see it has columns for Region, Local Authority, Date, as well as columns relating to the different investment types. Presented this way, we can easily group data across different years within an LA or Region. Alternatively, we might have selected Region, Local Authority, and Asset type columns, with separate columns for each date range. This different combination of rows and columns may provide a different basis for the sorts of visualisations we can generate within Many Eyes and the different summary views we can present over the data.
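
To make the reshaping idea a little more concrete, here’s a rough sketch using the reshape2 library (the column names are assumptions based on the description above):

require(reshape2)

#Melt the investment type columns into a single AssetType/Amount pair...
long=melt(case, id.vars=c('Region','LocalAuthority','Date'),
          variable.name='AssetType', value.name='Amount')

#...then recast with one column per date range rather than per asset type
wideByDate=dcast(long, Region+LocalAuthority+AssetType~Date,
                 value.var='Amount', fun.aggregate=sum)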

Recommendation The shape in which the data is published may have an effect on the range of visualisations that can be directly generated from the data, without reshaping by the user. It may be appropriate to publish the data in a variety of shapes, or provide tools for reshaping data for use with particular target services. Tools such as the Stanford Data Wrangler are making it easier for people to clean and reshape messy data sets, but that is out of scope for this review. In addition, it is worth considering the data type or physical form in which data is published. For example, in columns relating to financial amounts, prepending each data element in a cell with a £ may break cut and paste visualisation tools such as Many Eyes, which will recognise the element as a character string. Some tools are capable of recognising datetime formats, so in some cases it may be appropriate to publish date/datetime values in a standardised way. Many tools choke on punctuation characters from Windows character sets, and despite best efforts, rogue characters and undeclared or incorrect character encodings often find their way in to datasets, which prevents them from working correctly in third party applications. Some tools will automatically strip out leading and trailing whitespace characters, others will treat them as actual characters. Where string matching operations are applied (for example, grouping data elements), a word with a trailing space and a word without a trailing space may be treated as defining different groups. (Which is to say, try to strip leading and trailing whitespace in your data. Experts know to check for this; novices don’t.)
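
As a sketch of the sort of cleaning steps this implies (the column names are placeholders):

spend=read.csv('spendData.csv', stringsAsFactors=FALSE)

#Strip currency symbols and thousands separators so amounts parse as numbers
spend$Amount=as.numeric(gsub('[£,]', '', spend$Amount))

#Trim leading and trailing whitespace so string matching and grouping behave
trim=function(x) gsub('^[[:space:]]+|[[:space:]]+$', '', x)
spend$LocalAuthority=trim(spend$LocalAuthority)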

If the expectation is that users will make use of a service such as Many Eyes, it may be worth providing an FAQ area that describes what shape the different visualisations expect the data to be in, with examples from your own data sets. Services such as Number Picture, which provide a framework for visualising data by means of visualisation templates that accept data in a specified shape and form, provide helpful user prompts that explain what the (algorithmic) visualisation expects in terms of the shape and form of input data:

Number picture - describes the shape and form the data needs to be in

Custom Filter Visualisations – Google Fusion Tables

Google Fusion Tables are like spreadsheets on steroids. They combine features of traditional spreadsheets with database like query support and access to popular chart types. Google Fusion Tables can be populated by importing data from Google Spreadsheets or uploading data from CSV files (example Fusion Table).

Google Fusion Table - data import

Google Fusion Tables can also be used to generate new tables based on the fusion of two (or, by chaining, more than two) tables that share a common column. So for example, given three CSV data files containing different data sets (for example, file A has LA codes, and Regions, file B has LA codes and arts spend by LA, and file C has LA codes and sports engagement data) we can merge the files on the common columns to give a “fused” data set (for example, a single table containing four columns: LA codes, Regions, arts spend, sports engagement data). Note that the data may need to be appropriately shaped before it can be fused in a meaningful way with other data sets.
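
(The same sort of “fusion” can be sketched locally, in R for example, as a quick sanity check that the files really do share a usable common column; the file and column names below are made up:)

regions=read.csv('laRegions.csv')          #LAcode, Region
arts=read.csv('laArtsSpend.csv')           #LAcode, ArtsSpend
sports=read.csv('laSportsEngagement.csv')  #LAcode, SportsEngagement

fused=merge(merge(regions, arts, by='LAcode'), sports, by='LAcode')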

As with many sites that support data upload/import, it’s typically down to the user to add appropriate metadata to the data file. This metadata is important for a variety of reasons: firstly, it provides context around a dataset; secondly, it may aid in discovery of the data set if the data and its metadata is publicly indexed; thirdly, it may support tracking, which can be useful if the original publisher needs to demonstrate how widely a dataset has been (re)used.

Google spreadsheets provenance metadata

If there are too many steps involved in getting the data from the download site into the target environment (for example, if it needs downloading, a cell range copying, saving into another data file, cleaning, then uploading) the distance from the original data source to the file that is uploaded may result in the user not adding much metadata at all. As before, if it is anticipated that a service such as Google Fusion Tables is a likely locus for (re)use of a dataset, the publisher should consider publishing the data directly through the service, with high quality metadata in place, or provide obvious cues and cribs to users about the metadata they might wish to add to their data uploads.

A nice feature of Google Fusion Tables is the way it provides support for dynamic and compound queries over a data set. So for example, we can filter rows:

Google fusion table query filters

Or generate summary/aggregate views:

Generating aggregate views

A range of standard visualisation types are available:

Google Fusion tables visualisation options

Charts can be used to generate views over filtered data:

Google Fusion Tables Filters and charts

Or filtered and aggregated data:

Google Fusion Tables Filtered and aggregated views

Note that these charts may not be of publishable quality/useful as presentation graphics, but they may be useful as part of a visual analysis of the data. To this extent, the lack of detailed legends and titles/captions for the chart does not necessarily present a problem – the visual analyst should be aware of what the data they are viewing actually represents (and they can always check the filter and aggregate settings if they are unsure, as well as dropping in to the tabular data view to check actual numerical values if anything appears to be “odd”). However, the lack of explanatory labeling is likely to be an issue if the intention is to produce a presentation graphic, in which case the user will need to grab a copy of the image and maybe postprocess it elsewhere.

Note that Google Fusion Tables is capable of geo-coding certain sorts of location related data such as placenames or postcodes and rendering associated markers on a map. It is also possible to generate thematic maps based on arbitrary geographical shapefiles (eg Thematic Maps with Google Fusion Tables [PDF]).

Helping Data Flow – Treat It as a Database

Services such as Google Spreadsheets provide online spreadsheets that support traditional spreadsheet operations, including chart generation (using standard chart types familiar to spreadsheet users) and support for interactive graphical widgets (including more exotic chart types, such as tree maps), powered by spreadsheet data, that can be embedded in third party webpages. Simple aggregate reshaping of data is provided in the form of support for Pivot Tables. (Note however that Google Spreadsheet functionality is sometimes a little bug ridden…) Google Spreadsheets also provides a powerful query API (the Google Visualisation API) that allows the spreadsheet to be treated as a database. For an example in another government domain, see Government Spending Data Explorer; see also Guardian Datastore MPs’ Expenses Spreadsheet as a Database.

Publishing data in this way has the following benefits: 1) treating the data as a spreadsheet allows query based views to be generated over it; 2) these views can be visualised directly in the page (this includes dynamic visualisations, as for example described in Google Chart Tools – dynamic controls, and gallery); 3) queries can be used to generate CSV based views over the data that can be (re)used in third party applications.
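
The query pattern is the same one used by the gsqAPI helper function in the F1 telemetry example above; as a rough sketch (the spreadsheet key and column letters here are placeholders):

require(RCurl)

key='SPREADSHEET_KEY'
q='select A, C where B > 10000 order by C desc'
url=paste(sep='', 'http://spreadsheets.google.com/tq?tqx=out:csv',
          '&tq=', curlEscape(q), '&key=', key)
df=read.csv(url)   #a CSV view over the spreadsheet, filtered by the query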

Geographical Data

Sometimes it makes sense to visualise data in a geographical way. One service that provides a quick way of generating choropleth/thematic maps from simple two or three column data keyed by UK administrative geography labels or identifiers is OpenHeatmap. Data can be uploaded from a simple CSV file or imported from a Google spreadsheet using predetermined column names (a column identifying geographical areas according to one of a fixed number of geographies, a numeric value column for colouring each geographical area, and an optional date column for animation purposes, so that a map can be viewed in an animated way over consecutive time periods):

Openheatmap

Once generated, links to an online version of the map are available.

The code for OpenHeatmap is available as open source software, so without too much effort it should be possible to modify the code in order to host a local instance of the software and tie it in to a set of predetermined Google spreadsheets, local CSV files, or data views generated from queries over a predetermined datasource, so that only the publisher’s data can be visualised using the particular instance of OpenHeatmap.

Other services for publishing and visualising geo-related data are available (eg Geocommons) and could play a role as a possible outlet in a content strategy for data with a strong geographical bias.

Power Tools – R

A further class of tools that can be used to generate visual representations of arbitrary datasets is the fully programmatic tools, such as the R statistical programming language. Developed for academic use, R is currently increasing in popularity on the coat tails of “big data” and the growing interest in analysis of activity data (“paradata”) that is produced as a side-effect of our digital activities. R is capable of importing data in a wide variety of formats from local files as well as via a URL from an online source. The R data model supports a range of powerful transformations that allow data to be shaped as required. Merging data that shares common columns (in whole or part) from separate sources is also supported.

In order to reduce the overheads in getting data into a useful shape within the R environment, it may make sense to publish datafile “wrappers” that act as a simple API to data contained within one or more published spreadsheets or datafiles. By providing an object type and, where appropriate, access methods for the data, the data publisher can provide a solid framework on top of which third parties can build their own analysis and statistical charts. R is supported by a wide range of third party extension libraries for generating a wide range of statistical charts and graphics, including maps. (Of particular note are ggplot2, for generating graphics according to the Grammar of Graphics model, and googleVis, which provides a range of functions that support the rapid generation of Google Charts.) Many of the charts can be generated from a single R command if the data is in the correct shape and format.
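
A wrapper here needn’t be anything grand; something along the following lines (with a made-up URL and column names) would already take a lot of the friction out of the getting-started step:

#Loader function a publisher might ship alongside a dataset
caseCapitalInvestment=function(){
  df=read.csv('http://example.org/data/caseCapitalInvestment.csv',
              stringsAsFactors=FALSE)
  df$Date=factor(df$Date)   #tidy up column types on the way in
  df
}

#...and a simple access method over it
laSpend=function(df, la) subset(df, LocalAuthority==la)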

As well as running as a local, desktop application, R can also be run as a hosted webservice (for example, cloudstat.org; the RStudio cross-platform desktop application can also be accessed as a hosted online service, and could presumably be used to provide a robust, online hosted analysis environment tied in to a set of locked down data sources). It is also possible to use R to power online hosted statistical charting services; see for example the hosted ggplot service referred to below.

Uploading data to ggplot2

Some cleaning of the data may be required before uploading to the ggplot service. For example, empty cells marked as such by a “-” should be replaced by empty cells; numeric values containing a “,” may be misinterpreted as character strings (factor levels) rather than numbers (in which case the data needs cleaning by removing commas). Again, if it is known that a service such as ggplot2 is likely to be a target for data reuse, publishing the data in a format that is known to work “just by loading the data in” to R with default import settings will reduce friction/overheads and keep the barriers to reusing the data within that environment to a minimum.
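
In code terms, those two fixes are trivial if you know to make them (a sketch, with an assumed column name):

df=read.csv('caseData.csv', stringsAsFactors=FALSE)

df$Amount[df$Amount=='-']=NA                     #'-' really means 'no value'
df$Amount=as.numeric(gsub(',', '', df$Amount))   #'1,234' -> 1234

write.csv(df, 'caseDataClean.csv', row.names=FALSE)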

Observation Most of the time, most people don’t get past default settings on any piece of software. If someone tries to load your data into an application, odds on they will use default, factory settings. If you know that users are likely to want to use your data in a particular package, make at least a version of your data available in a format that will load into that package under the default settings in a meaningful way.

Finally, a couple of wild card possibilities.

Firstly, Wolfram Alpha. Wolfram Alpha provides access to a “computational search engine” that accepts natural language queries about facts or data and attempts to provide reasoned responses to those queries, including graphics. Wolfram Alpha is based around a wide range of curated data sets, so an engagement strategy with them may, in certain circumstances, be appropriate (for example, working with them on the publication of data sets and then directing users to Wolfram Alpha in return). Wolfram Alpha also offers a “Pro” service (Wolfram Alpha Pro) that allows users to visualise and query their own data.

Secondly, the Google Refine Reconciliation API. Google Refine is a cross-platform tool for cleaning datasets, with the ability to reconcile the content of data columns with canonical identifiers published elsewhere. For example, it is possible to reconcile the names of local authorities with canonical Linked Data identifiers via the Kasabi platform (UK Administrative Geography codes and identifiers).

Google refine reconciliation

By anchoring cell values to canonical identifiers, it becomes possible to aggregate data from different sources around those known, uniquely identified items in a definite and non-arbitrary way. By publishing: a) a reconciliation service (eg for LEP codes); and b) data that relates to identifiers returned by the reconciliation service (for example, sports data by LEP), the data publisher provides a focus for third parties who want to reconcile their own data against the published identifiers, as well as a source of data values that can be used to annotate records referencing those identifiers. (So for example, if you make it easy for me to get Local Authority codes based on local authority names from your reconciliation service, and also publish data linked to those identifiers (sports engagement data, say), then if I reconcile my data against your codes, I will also be provided with the opportunity to annotate my data with your data – so I can annotate my local LEP spend data with your LEP sports engagement data; [probably a bad example… need something more convincing?!])

Although uptake of the reconciliation API (and the associated possibility of providing annotation services) is still a minority interest, there are some signs of interest in it (for example, Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data; note that data published on the Kasabi platform also exposes a Google Refine reconciliation service endpoint). In my opinion, there are potentially significant benefits to be had by publishing reconciliation service endpoints with associated annotation services if a culture of use grows up around this protocol.

Not covered: as part of this review, I have not covered applications such as Microsoft Excel or Tableau Desktop (the latter being a Windows only data visualisation environment that is growing in popularity). Instead, I have tried to focus on applications that are freely available either via the web or on a cross-platform basis. There is also a new kid on the block – datawrapper.de – but it’s still early days for this tool…

Do Retweeters Lack Commitment to a Hashtag?

I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.

The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x-axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).
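
By way of a recipe, the chart amounts to something like the following sketch (the data frame and column names – tweets, screenName, created, text – are assumptions, not necessarily what the original script uses):

require(ggplot2)
require(plyr)

tweets$isRT=grepl('^RT @', tweets$text)   #old-style retweets

#Order users on the y-axis by the time of their first tweet in the sample
firstUse=ddply(tweets, .(screenName), summarise, first=min(created))
tweets$screenName=factor(tweets$screenName,
                         levels=firstUse$screenName[order(firstUse$first)])

g=ggplot(tweets) + geom_point(aes(x=created, y=screenName, col=isRT)) +
  scale_colour_manual(values=c('blue','red'))
print(g)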

So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:

  • For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
  • Many of the users whose first use is via an RT don’t use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
  • A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies, or “open” tweets. Replies (that is, tweets beginning @username) are likely to represent conversational threads within a tag context rather than “general” tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
  • “New style” retweets are diaplayed as retweets by colouring… I need to check whether or nor newstyle RT information is available that I could use to colour such tweets appropriately. (or alternatively, I’d have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-(

(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. This meant that folk who only used the tag as part of an RT, and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as a member of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)

So what else might this chart tell us? Looking at vertical slices, it seems that new entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn’t tell us for sure that this is happening, but it does highlight areas of the timeline that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make it more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)

And as a final point – this stuff may be all very interesting, but is it useful? And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with multiple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread, or adding to a pre-existing one, for example. I’m sure the literature is filled with dozens of examples of how we might visualise forum activity, so if you know of any good references/links…?! ;-) #lazyacademic)

What is the Potential Audience Size for a Hashtag Community?

What’s the potential audience size around, or ‘reach’ associated with, a Twitter hashtag?

Way back when, in the early days of web stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented with the notion of Visits and Unique Visits: Visits, as tracked by cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”) represent the number of different people who appear to have visited the site in any given period.

What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.

Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.

Something both myself and Martin Hawksey have been thinking about on and off for some time are ways of reporting activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)

One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:

  • what counts as a use of the hashtag? If I retweet a tweet of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you that contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
  • the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
    Note there is also a lower bound – the largest follower count amongst the tag users of the hashtag (whatever “user” means…). Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, because it assumes that each of the tag users is actually a follower of at least one of the other tag users. (A sketch of these bounds appears just after this list.) (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.)
  • the potential number of views of a tag, for example based on the product of the number of times a user tweets the tag and their follower count?
  • the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take an action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able to construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on.
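To make the bounds argument a little more concrete, here’s the toy worked example promised above (the follower sets are entirely made up, and real follower lists would of course come back from the Twitter API as lists of user IDs):

#Toy example of the bounds on the potential audience of a tag (made-up follower sets)
followers = { 'A': set(['w','x','y','B']),  #note that tag user B follows tag user A
              'B': set(['x','y','z']),
              'C': set(['y','z']) }

#Upper bound approximation: just sum the follower counts
upper = sum(len(f) for f in followers.values())  #4 + 3 + 2 = 9

#Actual potential audience: the union of the follower sets
actual = len(set.union(*followers.values()))  #{'w','x','y','z','B'} = 5

#Lower bound on the non-tagging audience: the largest single follower count,
#revised down by (number of tag users - 1) in case the other taggers all appear
#amongst that user's followers
lower = max(len(f) for f in followers.values()) - (len(followers) - 1)  #4 - 2 = 2

print upper, actual, lower  #9 5 2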

To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, a Python script, runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates a running cumulative (upper bound approximation) count of the tag user follower numbers, as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, an R script, plots the values.

The first thing we can do is look at the incidence of new users of the hashtag over time:

(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)

More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:

In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.

The middle, red line shows a count of the number of unique followers to date, based on the followers of the users of the tag so far.

The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.

Here’s a view over the number of new unique potential audience members at each time step (the use of a line chart here may be a mistake… bars/lineranges would probably be more appropriate…):

In the following chart, I overplot one line with another. The lower layer (a red line) is the total follower count for each new tag user. The blue line is the increase in the potential audience count (that is, the number of the new user’s followers who haven’t potentially seen the tag so far). The visible extent of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)

Here are the scripts (such as they are!)

import newt,csv,tweepy
import networkx as nx

#the term we're going to search for
tag='ddj'
#how many tweets to search for (max 1500)
num=500

##Something along lines of:
#CONSUMER_KEY, CONSUMER_SECRET, SKEY and SSECRET are your Twitter app credentials
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
#cachetime is the cache lifetime in seconds - set it to whatever suits
cachetime=3600
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)

#You need to do some work here to search the Twitter API
tweeters, tweets=yourSearchTwitterFunction(api,tag,num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...

#Put tweets into chronological order
tweets.reverse()

#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
tweepFo={}
seenToDate=set([])
uniqSourceFo=[]
#runtot is crude and doesn't measure overlap
runtot=0
oldseentodate=0

#Construct a digraph from folk using the tag to their followers
DG=nx.DiGraph()

for tweet in tweets:
	user=tweet['from_user']
	if user not in tweepFo:
		tweepFo[user]=[]
		print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
		mi=tweepy.Cursor(api.followers_ids,id=user).items()
		userID=tweet['from_user_id'] #check
		DG.add_node(userID,label=user)
		for m in mi:
			tweepFo[user].append(m)
			#construct graph
			DG.add_edge(userID,m,weight=1)
			DG.node[m]['label']=''
		ufc=len(tweepFo[user])
		runtot=runtot+ufc
		#seen to date is all people who have seen so far, plus new ones, so it's the union
		oldseentodate=len(seenToDate)
		seenToDate=seenToDate.union(set(tweepFo[user]))
		uniqSourceFo.append((tweet['created_at'],len(seenToDate),user,runtot,ufc,oldseentodate))
	else:
		#Repeat use of the tag by a user we've already seen
		userID=tweet['from_user_id']
		#Bump the weight on each of their follower edges so we can count
		#how many times each follower potentially sees the hashtag
		for fromN,toN in DG.edges(userID):
			DG[fromN][toN]['weight']+=1


fo='reports/tmp/'+tag+'_ncount.csv'
f=open(fo,'wb+')
writer=csv.writer(f)
writer.writerow(['datetime','count','newuser','crudetot','userFoCount','previousCount'])
for ts,l,u,ct,ufc,ols in uniqSourceFo:
	print ts,l
	writer.writerow([ts,l,u,ct,ufc,ols])

f.close()

print "Writing graph.."
filter=[]
for n in DG:
	if DG.degree(n)>1: filter.append(n)
filter=set(filter)
H=DG.subgraph(filter)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')

Here’s the R script…

#plyr (for arrange) and ggplot2 are needed by what follows
require(plyr)
require(ggplot2)

ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')

#Order the newuser factor levels into the order in which they first use the tag
dda=subset(ddj_ncount,select=c('ttime','newuser'))
dda=arrange(dda,-desc(ttime))
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = dda$newuser)

#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))

#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)

#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))

#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))

#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

In the next couple of posts in this series, I’ll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of “views” of a tag across the unique potential audience members…
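By way of a taster, here’s a minimal sketch (not part of the scripts above, and only lightly tested!) of one way we might start to pull that sort of “potential views per audience member” distribution out of the weighted graph file the Python script writes out:

import networkx as nx
from collections import Counter

#Load the graph written by the Python script above (taggers point at their followers,
#with edge weights counting the number of tagged tweets by that tagger)
DG = nx.read_graphml('reports/tmp/ddj_ncount.graphml')

#For each potential audience member, sum the weights of the edges pointing at them:
#a crude count of the number of tagged tweets they could potentially have seen
potentialViews = {}
for fromN, toN, d in DG.edges(data=True):
    potentialViews[toN] = potentialViews.get(toN, 0) + d.get('weight', 1)

#Distribution of potential view counts across the unique potential audience members
print Counter(potentialViews.values())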

PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities

Visualising Activity Around a Twitter Hashtag or Search Term Using R

I think one of the valid criticisms of a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around them. This is partly a result of the way I selfishly use my blog posts to document the evolution of my own practice, but not necessarily the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of the data set; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)

An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.

@mediaczar visualisation of engagement around facebook flamewars

The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? ;-)

One takeaway from the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time-proportional x-axis, we can also see engagement over time.

In a Twitter context, it’s likely that a rapid increase in the number of folk engaging with a hashtag might be the result of an RT-related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backchannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.

To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.

require(twitteR)
#Pull in a search around a hashtag.
searchTerm='#ukgc12'
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets

#Plot of tweet behaviour by user over time
#Based on @mediaczar's http://blog.magicbeanlab.com/networkanalysis/how-should-page-admins-deal-with-flame-wars/
#Make use of a handy dataframe creating twitteR helper function
tw.df=twListToDF(rdmTweets)
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user [ http://stackoverflow.com/a/4189904/454773 ]
require(plyr)
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))

We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic “RT”, and then colouring each tweet appropriately (note there may be some overplotting/masking of points… I’m not sure how big the x-axis time bins are…)

#Identify and colour the RTs...
library(stringr)
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
ggplot(tw.df)+geom_point(aes(x=created,y=screenName,col=rtt))

So now we can see when folk entered into the hashtag community via a classic RT.

We can also start to explore who was classically retweeted when:

#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...
ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))

Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):

#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! ;-)

PS see also: https://blog.ouseful.info/2012/01/21/a-quick-view-over-a-mashe-google-spreadsheet-twitter-archive-of-ukgc2012-tweets/

TV Audience Social Interest Mapping – Shameless vs. Newsnight vs. Masterchef

How easy is it to differentiate between audiences of different types of TV programme based on their socially signalled interests?

This evening, I ran a couple of Twitter searches against the #shameless and #newsnight hashtags. In each case, I grabbed 1500 of the most recent tweets and generated lists of folk who had tweeted the corresponding hashtag at least twice in the sample set. I then grabbed the lists of all the friends of the folk in each list to generate a projection map of the friends of recent hashtaggers. The final preprocessing step was to filter each network to contain only nodes that had at least an indegree or outdegree of 25 (that is, I filtered the network to only include folk who had at least 25 friends, or were linked to by at least 25 of the folk in the corresponding hashtaggers list).
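For what it’s worth, here’s a minimal sketch of that recipe (not the actual code I used, which is tangled up in my own rambling library; it assumes an authenticated tweepy api object and a placeholder search function along the same lines as the scripts in the previous post):

import tweepy
import networkx as nx
from collections import Counter

def hashtagProjection(api, tag, num=1500, minuse=2, mindegree=25):
    #As before, the search itself is a placeholder - tweets are dicts with a 'from_user' key
    tweeters, tweets = yourSearchTwitterFunction(api, tag, num)

    #Folk who tweeted the tag at least minuse times in the sample
    counts = Counter(tweet['from_user'] for tweet in tweets)
    taggers = [u for u, c in counts.items() if c >= minuse]

    #Projection map from each tagger to the folk they follow
    DG = nx.DiGraph()
    for user in taggers:
        for friendID in tweepy.Cursor(api.friends_ids, id=user).items():
            DG.add_edge(user, friendID)

    #Only keep nodes with an indegree or outdegree of at least mindegree
    keep = [n for n in DG if DG.in_degree(n) >= mindegree or DG.out_degree(n) >= mindegree]
    return DG.subgraph(keep)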

Here’s the resulting map generated around the #shameless tag – it gives an impression of folk who tend to be followed by folk using the #shameless tag:

Positioning #shameless

Music, celebrities, footballers and comedians, err, I think?!

By way of comparison, here’s a sketch of who the folk using the #newsnight tag follow:

Positioning the #newsnight audience

MPs, political hacks, and a higher-brow level of comedian, maybe?! ;-)

PS for what it’s worth, here’s another map for tweets grabbed now around #masterchef, which aired a few hours ago:

Positioning #masterchef

So that’ll be a cooking show with some high profile talent (in the narrower scheme of things) then?!