<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Rstats</title>
	<atom:link href="http://blog.ouseful.info/category/syndication/rstats/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Sun, 19 May 2013 12:53:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Rstats</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics</title>
		<link>http://blog.ouseful.info/2013/04/21/evaluating-event-impact-through-social-media-follower-histories-with-possible-relevance-for-mooc-learning-analytics/</link>
		<comments>http://blog.ouseful.info/2013/04/21/evaluating-event-impact-through-social-media-follower-histories-with-possible-relevance-for-mooc-learning-analytics/#comments</comments>
		<pubDate>Sun, 21 Apr 2013 17:40:13 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Rstats]]></category>
		<category><![CDATA[Thinkses]]></category>
		<category><![CDATA[learnalytics]]></category>
		<category><![CDATA[learning analytics]]></category>
		<category><![CDATA[learningAnalytics]]></category>
		<category><![CDATA[MOOC]]></category>
		<category><![CDATA[unMOOC]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=10354</guid>
		<description><![CDATA[Last year I sat on a couple of panels organised by I&#8217;m a Scientist&#8217;s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=10354&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Last year I sat on a couple of panels organised by I&#8217;m a Scientist&#8217;s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking <a href="http://about.imascientist.org.uk/2013/who-are-you-twitter/">Who are you Twitter?</a>, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account. </p>
<p>Using the technique described in <a href="http://blog.ouseful.info/2013/04/05/estimated-follower-accession-charts-for-twitter/">Estimated Follower Accession Charts for Twitter</a>, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/imascientist.png"><img src="http://ouseful.files.wordpress.com/2013/04/imascientist.png?w=700" alt="imascientist"   class="alignnone size-full wp-image-10355" /></a></p>
<p>I&#8217;ve already noted how we may be able to use &#8220;spikes&#8221; in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (<a href="http://blog.ouseful.info/2013/03/04/what-happened-then-using-approximated-twitter-follower-accession-to-identify-political-events/">What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events</a>).</p>
<p>Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a <em>post hoc</em> analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different to profiles of followers who signed up at different times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign ups from a particular geographical area?)</p>
<p>In other words, through generating the follower acquisition curve, can we use it to filter down to folk who started following around a particular time in order to then see whether there is a possibility that they started following as a result of a particular event, and if so can count as some sort of &#8220;conversion&#8221;? (I appreciate that there are a lot of caveats in there!;-)</p>
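<p>As a rough sketch of what that filtering step might look like (the follower names, dates and event window below are all made up for illustration, and the estimated dates are only &#8220;no earlier than&#8221; bounds, so this gives a list of candidates at best):</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical followers with an estimated "no earlier than" follow date
followers = data.frame(
  name = c("f1", "f2", "f3", "f4", "f5"),
  no_earlier_than = as.Date(c("2013-01-10", "2013-03-12", "2013-03-13",
                              "2013-03-15", "2013-04-02"))
)

# Hypothetical event window: the day of the event plus the two days after it
event_window = seq(as.Date("2013-03-12"), by = "day", length.out = 3)

# Followers whose estimated follow date falls inside the window
candidates = subset(followers, no_earlier_than %in% event_window)
print(candidates)
</pre>
<p>The profiles of the <tt>candidates</tt> could then be compared with those of the remaining followers.</p>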
<p>A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs&#8230; If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user&#8217;s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/04/21/evaluating-event-impact-through-social-media-follower-histories-with-possible-relevance-for-mooc-learning-analytics/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/imascientist.png" medium="image">
			<media:title type="html">imascientist</media:title>
		</media:content>
	</item>
		<item>
		<title>Estimated Follower Accession Charts for Twitter</title>
		<link>http://blog.ouseful.info/2013/04/05/estimated-follower-accession-charts-for-twitter/</link>
		<comments>http://blog.ouseful.info/2013/04/05/estimated-follower-accession-charts-for-twitter/#comments</comments>
		<pubDate>Fri, 05 Apr 2013 10:31:40 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Rstats]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=10236</guid>
		<description><![CDATA[Just over a year or so ago, Mat Morrison/@mediaczar introduced me to a visualisation he&#8217;d been working on (How should Page Admins deal with Flame Wars?) that I started to refer to as an accession chart (Visualising Activity Around a Twitter Hashtag or Search Term Using R). The idea is that we provide each entrant [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=10236&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Just over a year or so ago, Mat Morrison/@mediaczar introduced me to a visualisation he&#8217;d been working on (<a href="http://blog.magicbeanlab.com/networkanalysis/how-should-page-admins-deal-with-flame-wars/">How should Page Admins deal with Flame Wars?</a>) that I started to refer to as an accession chart (<a href="http://blog.ouseful.info/2012/02/06/visualising-activity-round-a-twitter-hashtag-or-search-term-using-r/">Visualising Activity Around a Twitter Hashtag or Search Term Using R</a>). The idea is that we provide each entrant into a conversation or group with an accession number: the first person has accession number 1, the second person accession number 2 and so on. The accession number is plotted in rank order on the vertical y-axis, with ranked/time ordered &#8220;events&#8221; along the horizontal x-axis: utterances in a conversation for example, or posts to a forum.</p>
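<p>A minimal sketch of how accession numbers might be assigned (the <tt>posts</tt> data frame and its <tt>user</tt> column are invented for illustration, and are not taken from either of the linked posts): each participant&#8217;s accession number is the position of their first appearance in a time-ordered list of events.</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical time-ordered events: one row per utterance, earliest first
posts = data.frame(user = c("ann", "bob", "ann", "cat", "bob", "dan"))

# unique() keeps first appearances in order, so match() returns
# 1 for the first entrant, 2 for the second, and so on
posts$acc = match(posts$user, unique(posts$user))

# Rank order of each event, for the horizontal axis
posts$event = seq_len(nrow(posts))

print(posts)
</pre>
<p>Plotting <tt>acc</tt> against <tt>event</tt> would then give the basic accession chart.</p>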
<p>A couple of months ago, I wondered whether this approach might also be used to estimate when folk started following an individual on Twitter. My reasoning went something like this:</p>
<blockquote><p>One of the things I <em>think</em> is true of the Twitter API call for the followers of an account is that it returns lists of followers in reverse accession order. So the person who followed an account most recently will be at the top of the list (the first to be returned) and the person who followed first will be at the end of the list. Unfortunately, we don&#8217;t know <em>when</em> followers joined, so it&#8217;s hard to spot bursty growth in the number of followers of an account. However, it struck me that we may be able to get a bound on this by looking at the dates at which followers joined Twitter, along with their &#8216;accession order&#8217; as followers of an account. If we get the list of followers and reverse it, and assume that this gives an ordered list of followers (with the follower that started following the longest time ago first), we can then work through this list and keep track of the most recent &#8216;created_at&#8217; date seen so far. This gives us a bound (the earliest possible date) for when followers that far through the list started following. (You can&#8217;t start following until you join Twitter&#8230;)</p>
<p>So for example, if followers A, B, C, D in that accession order (ie started following target in that order) have user account creation dates 31/12/09, 1/1/09, 15/6/12, 5/5/10 then:<br />
- A started following no earlier than 31/12/09 (because that&#8217;s when they joined Twitter and it&#8217;s the most recent creation date we&#8217;ve seen so far)<br />
- B started following no earlier than 31/12/09 (because they started following after A)<br />
- C started following no earlier than 15/6/12 (because that&#8217;s when they joined Twitter and it&#8217;s the most recent creation date we&#8217;ve seen so far)<br />
- D started following no earlier than 15/6/12 (because they started following after C, which gave us the most recent creation date seen so far)</p>
<p>That&#8217;s probably confused you enough, so here&#8217;s a chart &#8211; accession number is along the bottom (i.e. the x-axis), joining date (in days ago) is on the y-axis:</p></blockquote>
<p><a href="http://ouseful.files.wordpress.com/2013/02/recencyvacc1.png"><img src="http://ouseful.files.wordpress.com/2013/02/recencyvacc1.png?w=700" alt="recencyVacc"   class="alignnone size-full wp-image-9881" /></a></p>
<p><em>NOTE: this diverges from the accession graph described above, where accession number goes on the y-axis and rank ordered event along the x-axis.</em></p>
<p>What the chart shows is an estimate (the red line) of how many days ago a follower with a particular accession number started to follow a particular Twitter account. </p>
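<p>The A to D example in the quoted passage can be checked with a short sketch. The data frame below is made up to match that example; the bound is simply a running maximum over the account creation dates, taken in accession order:</p>
<pre class="brush: r; title: ; notranslate"># Followers A-D in accession order, with the creation dates from the example
followers = data.frame(
  name = c("A", "B", "C", "D"),
  created_at = as.Date(c("31/12/09", "1/1/09", "15/6/12", "5/5/10"),
                       format = "%d/%m/%y")
)

# cummax() is not defined for Date objects, so work on day counts and convert back
bound = cummax(as.numeric(followers$created_at))
followers$no_earlier_than = as.Date(bound, origin = "1970-01-01")

print(followers)
</pre>
<p>A running minimum over &#8220;days ago&#8221; values is the same calculation expressed the other way round.</p>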
<p>As described in <a href="http://blog.ouseful.info/2013/02/19/sketches-around-twitter-followers/">Sketches Around Twitter Followers</a>, we see a clear break at 1500 days ago when Twitter started to get popular. This approach also suggests a technique for creating &#8220;follower probes&#8221; that we can use to date a follower record: if you know which day a particular user followed a target account, you can use that follower to put a datestamp into the follower record (assuming the Twitter API returned followers in reverse accession order).</p>
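<p>One way a &#8220;follower probe&#8221; might be folded into the estimate, as a sketch only: the followers, the probe&#8217;s accession number and its follow date are all invented, and the approach still assumes the list really is in accession order.</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical followers in accession order, with account creation dates
followers = data.frame(
  acc = 1:5,
  created_at = as.Date(c("2009-03-01", "2008-07-15", "2010-01-10",
                         "2009-11-30", "2011-02-02"))
)

# Hypothetical probe: suppose we know follower 3 started following on 2010-06-01
probe_acc = 3
probe_date = as.Date("2010-06-01")

# Bound from creation dates alone (running maximum, in days since 1970-01-01)
lower = cummax(as.numeric(followers$created_at))

# Nobody at or after the probe can have followed before the probe did
after_probe = followers$acc %in% (probe_acc:nrow(followers))
lower[after_probe] = pmax(lower[after_probe], as.numeric(probe_date))

followers$no_earlier_than = as.Date(lower, origin = "1970-01-01")
print(followers)
</pre>
<p>By the same argument, followers with a lower accession number than the probe must have followed no later than the probe date.</p>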
<p>Here&#8217;s an example of the code I used based on Twitter follower data grabbed for @ChrisPincher (whose follower profile appeared to be out of sorts from the analysis sketched in <a href="http://blog.ouseful.info/2012/02/06/visualising-activity-round-a-twitter-hashtag-or-search-term-using-r/">Visualising Activity Around a Twitter Hashtag or Search Term Using R</a>). I&#8217;ve corrected the x/y axis ordering so follower accession number is now the vertical, y-component.</p>
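<pre class="brush: r; title: ; notranslate">require(ggplot2)

processUserData = function(data) {
    # Parse each follower's account creation time
    data$tz = as.POSIXct(data$created_at)
    # Age of each follower account, in whole days before now
    data$days = as.integer(difftime(Sys.time(), data$tz, units = &quot;days&quot;))
    # Assumes the file lists the most recent follower first, so reverse into accession order
    data = data[rev(rownames(data)), ]
    # Accession number: 1 for the earliest follower
    data$acc = seq_len(nrow(data))
    # Running minimum of account age, i.e. the most recent creation date seen so far
    data$recency = cummin(data$days)

    data
}

# Follower data previously grabbed from the Twitter API and saved as CSV
mp_cp &lt;- read.csv(&quot;~/code/MPs/ChrisPincher_fo_0__2013-02-16-01-29-28.csv&quot;, row.names = NULL)

# Black points: when each follower joined Twitter; red points: estimated follow date
ggplot(processUserData(mp_cp)) +  geom_point(aes(x = -days, y = acc), size = 0.4) + geom_point(aes(x = -recency, y = acc), col = &quot;red&quot;, size = 1)+xlim(-2000,0)
</pre>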
<p>Here&#8217;s @ChrisPincher&#8217;s chart:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/cp_demo.png"><img src="http://ouseful.files.wordpress.com/2013/04/cp_demo.png?w=700" alt="cp_demo"   class="alignnone size-full wp-image-10239" /></a></p>
<p>The black dots reveal how many days ago a particular follower joined Twitter. The red line is the estimate of when a particular follower started following the account, estimated based on the most recently created account seen to date amongst the previously acceded followers.</p>
<p>We see steady growth in follower numbers to start with, and then the account appears to have been spam followed? (Can you spot when?!;-) The clumping of creation dates of accounts during the attack also suggests they were created programmatically.</p>
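<p>A quick way to look for that sort of clumping might be to count how many follower accounts were created on each day. This is a sketch over invented dates, not the @ChrisPincher data:</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical creation dates: a spread of ordinary accounts plus a clump
created_at = as.Date(c("2010-02-03", "2011-07-19", "2012-01-05",
                       "2012-11-20", "2012-11-20", "2012-11-20",
                       "2012-11-21", "2012-11-21"))

# Number of follower accounts created on each day, busiest days first
per_day = sort(table(created_at), decreasing = TRUE)
print(head(per_day, 3))
</pre>
<p>Days with unusually high counts would be the ones to inspect more closely; what counts as &#8220;unusually high&#8221; is a judgement call that depends on the account.</p>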
<p>[In the "next" in this series of posts [<a href="http://blog.ouseful.info/2013/03/04/what-happened-then-using-approximated-twitter-follower-accession-to-identify-political-events/">What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events</a>], I&#8217;ll show how spikes in follower acquisition on a particular day can often be used to &#8220;detect&#8221; historical news events.]</p>
<p><em>PS after coming up with this recipe, I did a little bit of &#8220;scholarly research&#8221; and I learned that a similar approach for estimating Twitter follower acquisition times had already been described at least once, at the opening of this paper: <a href="http://research.microsoft.com/en-us/um/people/jchayes/papers/timestamps.pdf">We Know Who You Followed Last Summer: Inferring Social Link Creation Times In Twitter</a> – “We estimate the edge creation time for any follower of a celebrity by positing that it is equal to the greatest lower bound that can be deduced from the edge orderings and follower creation times for that celebrity”.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/04/05/estimated-follower-accession-charts-for-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/recencyvacc1.png" medium="image">
			<media:title type="html">recencyVacc</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/cp_demo.png" medium="image">
			<media:title type="html">cp_demo</media:title>
		</media:content>
	</item>
		<item>
		<title>Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column</title>
		<link>http://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/</link>
		<comments>http://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/#comments</comments>
		<pubDate>Wed, 03 Apr 2013 08:54:57 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Rstats]]></category>
		<category><![CDATA[ddj]]></category>
		<category><![CDATA[schoolofdata]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=10197</guid>
		<description><![CDATA[One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with &#8220;everyday&#8221; desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=10197&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with &#8220;everyday&#8221; desktop tools. When I was <a href="http://blog.ouseful.info/2013/04/02/revisiting-mps-expenses/">Revisiting MPs’ Expenses</a>, the expenses data I <a href="http://www.parliamentary-standards.org.uk/DataDownloads.aspx">downloaded from IPSA</a> (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.</p>
<p>In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here&#8217;s one way using a handy little R script in <a href="http://rstudio.org/">RStudio</a>&#8230;</p>
<p>Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into <tt>mpExpenses2012</tt>). Previewing it, we see there is a column <tt>MP.s.Name</tt> identifying which MP each expense claim line item corresponds to:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png"><img src="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png?w=700&#038;h=278" alt="mpexpenses2012R" width="700" height="278" class="alignnone size-full wp-image-10186" /></a></p>
<p>We can easily pull out the unique values of the MP names using the <tt>levels</tt> command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:</p>
<pre class="brush: r; title: ; notranslate">mpExpenses2012 = read.csv(&quot;~/Downloads/DataDownload_2012.csv&quot;)
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
#(wrapping the column in factor() means levels() also works if read.csv loaded it as plain strings)
for (name in levels(factor(mpExpenses2012$MP.s.Name))){
  #Subset the data by MP
  tmp=subset(mpExpenses2012,MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),'.csv',sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp,fn,row.names=FALSE)
}</pre>
<p>Simples:-)</p>
<p>This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.</p>
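<p>As a sketch of that more general case, the recipe might be wrapped up as a function. <tt>splitByColumn</tt>, its arguments and the toy data are illustrative names rather than anything from a package, and the function adds a <tt>.csv</tt> extension to each filename:</p>
<pre class="brush: r; title: ; notranslate"># Split a data frame into one CSV file per unique value of a named column
splitByColumn = function(df, column, outDir) {
  # Create the output folder if it does not already exist
  dir.create(outDir, showWarnings = FALSE, recursive = TRUE)
  # split() returns a named list of data frames, one per unique value
  parts = split(df, as.character(df[[column]]))
  for (value in names(parts)) {
    # Strip spaces from the value to build the filename
    fn = file.path(outDir, paste(gsub(" ", "", value), ".csv", sep = ""))
    write.csv(parts[[value]], fn, row.names = FALSE)
  }
  invisible(names(parts))
}

# Toy example written to a temporary directory
toy = data.frame(MP.s.Name = c("A Smith", "B Jones", "A Smith"),
                 Amount = c(10, 20, 30))
out = file.path(tempdir(), "split_demo")
splitByColumn(toy, "MP.s.Name", out)
print(list.files(out))
</pre>
<p>Note that two different values that only differ in their spacing would end up in the same file, so the filename rule may need adjusting for other datasets.</p>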
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png" medium="image">
			<media:title type="html">mpexpenses2012R</media:title>
		</media:content>
	</item>
		<item>
		<title>Revisiting MPs&#8217; Expenses</title>
		<link>http://blog.ouseful.info/2013/04/02/revisiting-mps-expenses/</link>
		<comments>http://blog.ouseful.info/2013/04/02/revisiting-mps-expenses/#comments</comments>
		<pubDate>Tue, 02 Apr 2013 23:20:19 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Anything you want]]></category>
		<category><![CDATA[Rstats]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=10183</guid>
		<description><![CDATA[I couldn&#8217;t but notice the chatter about Iain Duncan Smith claiming he&#8217;d have no problem &#8220;living on 53 pounds a week&#8221;, which made me wonder not only how many meal-catered events he attends each week (and how many of his scheduled meetings also have complimentary tea and biscuits, a bellwether of the extent of [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=10183&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I couldn&#8217;t but notice the chatter about Iain Duncan Smith claiming he&#8217;d have no problem &#8220;living on 53 pounds a <s>day</s><em>week</em>&#8221;, which made me wonder not only how many meal-catered events he attends each week (and how many of his scheduled meetings also have complimentary tea <em>and biscuits</em>, a bellwether of the extent of cuts in many institutions&#8230;;-), but also how he fares on the expenses stakes&#8230;</p>
<p>For the last couple of years, details about MPs&#8217; expense claims have been published via the <a href="http://www.parliamentary-standards.org.uk/DataDownloads.aspx">Independent Parliamentary Standards Authority (IPSA) website</a>. The data seems to be most easily grabbed as CSV files containing all MPs&#8217; claims for a parliamentary session (or tax year?) &#8211; eg 2012/13 or 2011/2012. As you might expect, that means the files are relatively large &#8211; 20MB (~100,000 rows) for 12/13, or just over 40MB (~190k rows) for 2011/12.</p>
<p>Files of this size are fine if you&#8217;re happy working with files of this size (?!), but can be a pain if you aren&#8217;t&#8230; So here are a couple of ways of getting the data into a more manageable form, starting from raw data files that look something like this&#8230;</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/mp-expenses-raw.png"><img src="http://ouseful.files.wordpress.com/2013/04/mp-expenses-raw.png?w=700&#038;h=67" alt="mp expenses raw" width="700" height="67" class="alignnone size-full wp-image-10185" /></a></p>
<p>The file is made up of a series of rows, one per expense entry, with common columns. If we loaded this data into a spreadsheet application such as Excel or Google Spreadsheets, we&#8217;d see a single sheet containing however many tens of thousands of rows of data. Assuming that the application could cope with loading such a large amount of data of course&#8230; which it might not be able to do&#8230;which means we may need to make the data file a bit more manageable, somehow&#8230;</p>
<p>Let&#8217;s start with grabbing data relating to Iain Duncan Smith&#8217;s expense claims. On a Linux box, or a Mac, this is easy enough from a <em>Terminal</em> command line. (On Windows, something like <a href="http://sourceware.org/cygwin/">cygwin</a> should provide you with equivalents of some of the more useful unix tools.) For example, the <em>grep</em> command lets us pull just the rows that contain the phrase <em>Iain Duncan Smith</em>:</p>
<p><tt>grep "Iain Duncan Smith" DataDownload_2012.csv &gt; IDS_expenses_2012.csv</tt></p>
<p>This reads along the lines of: <em>search through the file &#8220;DataDownload_2012.csv&#8221; looking for rows that contain &#8220;Iain Duncan Smith&#8221;, then copy those rows and only those rows into the file &#8220;IDS_expenses_2012.csv&#8221;</em></p>
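<p>One thing to watch with this approach is that the header row does not contain the MP&#8217;s name, so it is not copied into the new file. Here is a sketch of one way to keep it, written in R and run over a small made-up file (the column names and amounts are invented):</p>
<pre class="brush: r; title: ; notranslate"># Toy stand-in for the downloaded file
src = tempfile(fileext = ".csv")
writeLines(c("MP's Name,Amount",
             "Iain Duncan Smith,12.50",
             "Another Member,8.00",
             "Iain Duncan Smith,30.00"), src)

rows = readLines(src)

# fixed = TRUE treats the pattern as a literal string rather than a regular expression
hits = grep("Iain Duncan Smith", rows[-1], fixed = TRUE, value = TRUE)

# Write the header line followed by the matching rows
dest = tempfile(fileext = ".csv")
writeLines(c(rows[1], hits), dest)
print(readLines(dest))
</pre>
<p>Like <em>grep</em>, this matches the phrase anywhere on a line, not just in the name column.</p>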
<p>For what it&#8217;s worth, I&#8217;ve posted IDS&#8217; expenses data from 2011 and 2012 to a couple of Google Fusion Tables: <a href="https://www.google.com/fusiontables/DataSource?docid=1yKM6VUovYKw0gI8w1j61aEKsuK68CFVcqyCTyiQ">2011/12</a>, <a href="https://www.google.com/fusiontables/DataSource?docid=1IIzuS2zRNtopW0w_sfpTl2K2kV7WdBF9M45foWk">2012/13</a></p>
<p>Another way of wrangling the data into a state where we can start to play with it is to load it into <a href="http://www.rstudio.com/">RStudio</a>, where we can start applying magical R incantations to it.</p>
<p><tt>mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")</tt></p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png"><img src="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png?w=700&#038;h=278" alt="mpexpenses2012R" width="700" height="278" class="alignnone size-full wp-image-10186" /></a></p>
<p>We can then generate a subset of the data containing just IDS&#8217; data:</p>
<p><tt>ids2012=subset(mpExpenses2012, MP.s.Name=="Iain Duncan Smith")</tt></p>
<p>We can also generate a combined data set of IDS&#8217; expense claims from both 2011/12 and 2012/13, for example:</p>
<p><tt>mpExpenses2011 = read.csv("~/Downloads/DataDownload_2011.csv")<br />
ids2011=subset(mpExpenses2011, MP.s.Name=="Iain Duncan Smith")<br />
ids11and12=rbind(ids2011,ids2012)</tt></p>
<p>However, all is not well&#8230;</p>
<p>On loading the 2011 data into R, 158320 observations are loaded in. The actual number of rows (including the header &#8211; so one more row than the number of &#8220;observations&#8221;) can be given by running a line count (from the terminal/command line) over the original file:</p>
<p><tt>wc -l DataDownload_2011.csv<br />
  187447 DataDownload_2011.csv</tt></p>
<p>That is, <strong>187447</strong> rows&#8230;</p>
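<p>The same comparison can be made from within R. The sketch below uses a small made-up file in which one quoted field contains a line break, which is one possible way (an assumption on my part, not a diagnosis of the IPSA file) for a file to have more lines than parsed records:</p>
<pre class="brush: r; title: ; notranslate"># Toy file in which one quoted field is split over two lines
path = tempfile(fileext = ".csv")
writeLines(c("name,detail",
             "A,\"one\"",
             "B,\"two",
             "lines\"",
             "C,\"three\""), path)

# Physical lines in the file, which is what wc -l counts
n_lines = length(readLines(path))

# Records as parsed by read.csv, plus one for the header
n_records = nrow(read.csv(path)) + 1

print(c(lines = n_lines, records = n_records))
</pre>
<p>If the two numbers differ, the file is not a simple one-record-per-line CSV, at least as far as <tt>read.csv</tt> with its default settings is concerned.</p>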
<p>If we try to pull out a list of MPs&#8217; names using the <em>levels</em> command:</p>
<p><tt>mpNames=data.frame(name=levels(mpExpenses2011$MP.s.Name))</tt></p>
<p>we find that as well as the expected MPs names, there&#8217;s some &#8220;bad&#8221; data:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/etl-messup.png"><img src="http://ouseful.files.wordpress.com/2013/04/etl-messup.png?w=700" alt="etl messup"   class="alignnone size-full wp-image-10187" /></a></p>
<p>(What we expect to see in the name column is a list of MP names, one unique name per row.)</p>
<p>This is, of course, the way of the world. Folk who publish data rarely, if ever, provide a demonstration of how to actually open it cleanly into an application (typically because data publishers think that once they have published the data, it&#8217;s bound to be all right and doesn&#8217;t need checking). This is, however, rarely true&#8230; although, for the 2012/13 data, there are 99071 loaded observations against 99072 (including 1 header) rows in the download file, which does <em>seem to be</em> correct.</p>
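<p>A first check on a file that does not import cleanly might be to count the fields found on each line using <tt>count.fields</tt>. The sketch below runs over a small made-up file rather than the IPSA download, so it only illustrates the sort of thing to look for:</p>
<pre class="brush: r; title: ; notranslate"># Toy file in which one quoted field is split over two lines
path = tempfile(fileext = ".csv")
writeLines(c("name,detail",
             "A,\"one\"",
             "B,\"two",
             "lines\"",
             "C,\"three\""), path)

# Number of comma-separated fields found on each line, respecting double quotes
fields = count.fields(path, sep = ",", quote = "\"")

# Lines on which a quoted field is left open are reported as NA
print(which(is.na(fields)))

# Tabulate the field counts, including any NA entries
print(table(fields, useNA = "ifany"))
</pre>
<p>Line numbers flagged in this way give somewhere to start looking in the raw file.</p>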
<p>What we should do now, of course, is go into an ETL (or at least, TL) debug mode on the 2011 data and try to figure out what&#8217;s going wrong with the simple import&#8230; or we just work with the data we have and try to work around the dodgy rows&#8230;</p>
<p>&#8230;or we limit ourselves to the 2012 data, which does seem okay&#8230;</p>
<p>So if we do that, what other sorts of investigation come to mind?</p>
<p>One thing that came to mind after skimming the data&#8230;</p>
<p><a href="http://ouseful.files.wordpress.com/2013/04/mp-cost-of-travel.png"><img src="http://ouseful.files.wordpress.com/2013/04/mp-cost-of-travel.png?w=700&#038;h=268" alt="mp cost of travel" width="700" height="268" class="alignnone size-full wp-image-10188" /></a></p>
<p>was a &#8220;rail travel fares according to MPs&#8217; expenses&#8221; lookup table.</p>
<p>So for example, we might start by creating a subset of the data based on expenses categorised as &#8220;Travel&#8221; and then look to see what sorts of travel classification fall within that Category:</p>
<p><tt>travel=droplevels(subset(mpExpenses2012,Category=="Travel"))<br />
levels(travel$Expense.Type)</tt></p>
<p>Here&#8217;s what we get:</p>
<pre>[1] "Car Hire"                       "Car Hire Fuel"                  "Car Hire Fuel MP"              
 [4] "Car Hire Fuel MP Staff"         "Car Hire Insurance MP Staff"    "Car Hire MP"                   
 [7] "Car Hire MP Staff"              "Congest. Zone/Toll Seas Ticket" "Congestion Zone/Toll"          
[10] "Congestion Zone/Toll Dependant" "Congestion Zone/Toll MP"        "Congestion Zone/Toll MP Staff" 
[13] "Food &amp; Drink"                   "Food &amp; Drink @ Parliament"      "Food &amp; Drink @ Parlmnt OFF Est"
[16] "Food &amp; Drink MP"                "Food &amp; Drink MP Staff"          "Hotel Late Sitting"            
[19] "Hotel Late Sitting &gt; 1.00"      "Hotel London Area MP Staff"     "Hotel NOT London Area (Travel)"
[22] "Hotel NOT London Area MP Staff" "Hotel Outside UK"               "Own Bicycle MP"                
[25] "Own Car Dependant"              "Own Car MP"                     "Own Car MP Staff"              
[28] "Own Vehicle Bicycle"            "Own Vehicle Bicycle MP Staff"   "Own Vehicle Car"               
[31] "Own Vehicle Car Dependant"      "Own Vehicle Car MP Staff"       "Own Vehicle Mot Cycle MP Staff"
[34] "Parking"                        "Parking Dependant"              "Parking MP Staff"              
[37] "Parking Season Ticket"          "Public Tr AIR"                  "Public Tr AIR Dependant"       
[40] "Public Tr AIR MP Staff"         "Public Tr BUS"                  "Public Tr BUS MP Staff"        
[43] "Public Tr COACH"                "Public Tr COACH MP Staff"       "Public Tr FERRY"               
[46] "Public Tr FERRY MP Staff"       "Public Tr OTHER"                "Public Tr OTHER Dependant"     
[49] "Public Tr OTHER MP Staff"       "Public Tr RAIL - RTN"           "Public Tr RAIL - SGL"          
[52] "Public Tr RAIL Dependant - RTN" "Public Tr RAIL Dependant - SGL" "Public Tr RAIL Foreign"        
[55] "Public Tr RAIL MP Staff - RTN"  "Public Tr RAIL MP Staff - SGL"  "Public Tr RAIL Other"          
[58] "Public Tr RAIL Other Dependant" "Public Tr RAIL Other MP Staff"  "Public Tr RAIL Railcard"       
[61] "Public Tr RAIL Railcd MP Staff" "Public Tr RAIL Sleeper Suppl"   "Public Tr Season Ticket"       
[64] "Public Tr UND"                  "Public Tr UND Dependant"        "Public Tr UND MP Staff"        
[67] "Public Tr Underground MP"       "Taxi"                           "Taxi After Late Sitting"       
[70] "Taxi after Late Sitting 11pm"   "Taxi Dependant"                 "Taxi MP"                       
[73] "Taxi MP Staff"                  "Taxi Working Late After 9pm"    "Taxi Working Late Before 9pm"</pre>
<p>We might further pull out just the rows relating to rail travel (almost 10,000 rows from the 2012/13 dataset):</p>
<p><tt>rail=droplevels(subset(travel,grepl("Rail",Expense.Type,ignore.case=TRUE)))</tt></p>
<p>and then we might start looking to see who&#8217;s travelling First vs. who&#8217;s travelling Standard, as well as building up a database of rail fares between locations as claimed on expenses. But that&#8217;s for another day&#8230;</p>
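<p>As a quick way of seeing what those subsetting steps do, here&#8217;s a minimal sketch run against a tiny made-up data frame. The <tt>toy</tt> rows and the <tt>Amount</tt> column are invented for illustration and are not taken from the real expenses data; only the <tt>Category</tt> and <tt>Expense.Type</tt> column names follow the code above:</p>
<pre class="brush: r; title: ; notranslate">#Toy stand-in for the expenses data - the rows and values are made up
toy=data.frame(
  Category=c("Travel","Travel","Travel","Accommodation"),
  Expense.Type=c("Public Tr RAIL - RTN","Taxi","Public Tr RAIL Railcard","Rent"),
  Amount=c(54.20,12.00,30.00,1200.00),
  stringsAsFactors=TRUE
)

#Keep the Travel rows, then drop factor levels that no longer appear
toyTravel=droplevels(subset(toy,Category=="Travel"))
levels(toyTravel$Expense.Type)

#Case-insensitive match on "rail" anywhere in the expense type
toyRail=droplevels(subset(toyTravel,grepl("Rail",Expense.Type,ignore.case=TRUE)))
nrow(toyRail)</pre>
<p>Without <tt>droplevels</tt>, the &#8220;Rent&#8221; level would still be listed by <tt>levels()</tt> even though no Travel row uses it, which is why each subsetting step is wrapped in it.</p>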
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/04/02/revisiting-mps-expenses/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/mp-expenses-raw.png" medium="image">
			<media:title type="html">mp expenses raw</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/mpexpenses2012r.png" medium="image">
			<media:title type="html">mpexpenses2012R</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/etl-messup.png" medium="image">
			<media:title type="html">etl messup</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/04/mp-cost-of-travel.png" medium="image">
			<media:title type="html">mp cost of travel</media:title>
		</media:content>
	</item>
		<item>
		<title>Publishing Stats for Analytic Reuse &#8211; FAOStat Website and R Package</title>
		<link>http://blog.ouseful.info/2013/03/08/publishing-stats-for-analytics-reuse-faostat-website-and-r-package/</link>
		<comments>http://blog.ouseful.info/2013/03/08/publishing-stats-for-analytics-reuse-faostat-website-and-r-package/#comments</comments>
		<pubDate>Fri, 08 Mar 2013 14:45:29 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Rstats]]></category>
		<category><![CDATA[eurostat]]></category>
		<category><![CDATA[FAOstat]]></category>
		<category><![CDATA[opendata]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9917</guid>
		<description><![CDATA[How can stats and data publishers, from NGOs and (inter)national statistics agencies to scientific researchers, publish their data in a way that supports its analysis directly, as well as in combination with other datasets? Here&#8217;s one approach I learned about from Michael Kao of the UN Food and Agriculture Organisation statistics division, FAOStat. At first [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>How can stats and data publishers, from NGOs and (inter)national statistics agencies to scientific researchers, publish their data in a way that supports its analysis directly, as well as in combination with other datasets?</p>
<p>Here&#8217;s one approach I learned about from Michael Kao of the UN Food and Agriculture Organisation statistics division, <a href="http://faostat.fao.org/">FAOStat</a>.</p>
<p>At first glance, the FAOStat website is a rich site that supports data downloads, previews and simple analysis tools around a wide variety of international food-related datasets:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/faostat-website.png"><img src="http://ouseful.files.wordpress.com/2013/03/faostat-website.png?w=700&#038;h=500" alt="FAOStat website" width="700" height="500" class="alignnone size-full wp-image-10062" /></a></p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/faostat-graphical-tools.png"><img src="http://ouseful.files.wordpress.com/2013/03/faostat-graphical-tools.png?w=700&#038;h=657" alt="FAOstat - graphical tools" width="700" height="657" class="alignnone size-full wp-image-10061" /></a></p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/faostat-inline-data-preview.png"><img src="http://ouseful.files.wordpress.com/2013/03/faostat-inline-data-preview.png?w=700&#038;h=673" alt="faostat - inline data preview" width="700" height="673" class="alignnone size-full wp-image-10060" /></a></p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/faostat-ddata-analysis.png"><img src="http://ouseful.files.wordpress.com/2013/03/faostat-ddata-analysis.png?w=700&#038;h=480" alt="FAOStat - ddata analysis" width="700" height="480" class="alignnone size-full wp-image-10059" /></a></p>
<p>One problem with having so many controls and fields available is that it can be hard to know where (or how) to get started &#8211; a bit like the problem of being presented with an empty SPARQL query box&#8230;</p>
<p>It would be quite handy to be able to set &#8211; and save with meaningful labels &#8211; preference sets about the countries you&#8217;re interested in, so you don&#8217;t have to keep scrolling through long country lists looking for the countries you want to generate reports for. (Support for &#8220;standard&#8221; groupings of countries might also be useful?) Being able to share URLs to predefined reports might also be handy? But this would possibly make the site even more complex to use!</p>
<p>One easier way of working with FAOStat data, particularly if you access the FAO datasets regularly, might be to take a programmatic route using the <a href="http://cran.r-project.org/web/packages/FAOSTAT/index.html">FAOStat R package</a>. Making datasets available in ways that bring that data directly into a desktop analysis environment where they can be worked on without requiring cleaning or other forms of tidying up (which is often needed when data is made available via Excel spreadsheets or CSV files) is a trend I hope we see more of. (That is not to say that data shouldn&#8217;t also be published in &#8220;generic&#8221; document formats&#8230;). If you are using a reproducible research strategy, queries to original datasources provide implicit, self-describing metadata about the data source and the query used to return a particular dataset, metadata that is all too easy to lose, or otherwise detach from a dataset, when working with downloaded files.</p>
<p>I haven&#8217;t had a chance to play with this package yet &#8211; it&#8217;s still in testing anyway, I think? &#8211; but it looks quite handy at a first glance (I need to do a proper review&#8230;). As well as providing a way of running data grab queries over the FAO FAOSTAT and World Bank WDI APIs, it seems to provide support for &#8220;linkage&#8221;. As the <a href="http://cran.r-project.org/web/packages/FAOSTAT/vignettes/FAOSTAT.pdf">draft vignette</a> suggests, &#8220;Merge is a typical data manipulation step in daily work yet a non-trivial exercise especially when working with different data sources. The built in <tt>mergeSYB</tt> function enables one to merge data from different sources as long as the country coding system is identified. &#8230; Data from any source with [a] classification [supported by the package] can be supplied to <tt>mergeSYB</tt> in order to obtain a single merged data. <em>(sic)</em>&#8221;. Supported formats currently include: <em>United Nations M49 country standard [<a href="http://unstats.un.org/unsd/methods/m49/m49.htm">UN_CODE</a>]; FAO country code scheme [<a href="http://termportal.fao.org/faonocs/appl/">FAOST_CODE</a>]; FAO Global Administrative Unit Layers (GAUL) [ADM0_CODE]; ISO 3166-1 alpha-2 [<a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">ISO2_CODE</a>]; ISO 3166-1 alpha-2 (World Bank) [<a href="http://data.worldbank.org/node/18">ISO2_WB_CODE</a>]; ISO 3166-1 alpha-3 [<a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3">ISO3_CODE</a>]; ISO 3166-1 alpha-3 (World Bank) [<a href="http://data.worldbank.org/node/18">ISO3_WB_CODE</a>]</em>.</p>
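<p>To give a feel for why an identified country coding scheme matters when merging, here&#8217;s a rough sketch using nothing but base R&#8217;s <tt>merge</tt>. This is <em>not</em> the <tt>mergeSYB</tt> function and makes no claims about how that function works; the data frames, the <tt>wheat</tt> and <tt>gdp</tt> values, and the <tt>codes</tt> lookup table are all hypothetical, with the country codes included purely for illustration:</p>
<pre class="brush: r; title: ; notranslate">#Two toy datasets keyed on different country coding schemes (values are made up)
fao=data.frame(FAOST_CODE=c(229,68,79),wheat=c(15,38,22))
wb=data.frame(ISO2_WB_CODE=c("GB","FR","DE"),gdp=c(2.6,2.8,3.5))

#A hypothetical lookup table translating between the two schemes
codes=data.frame(FAOST_CODE=c(229,68,79),ISO2_WB_CODE=c("GB","FR","DE"))

#Translate one dataset into the other's scheme, then merge on the shared key
fao2=merge(fao,codes,by="FAOST_CODE")
merged=merge(fao2,wb,by="ISO2_WB_CODE")
merged</pre>
<p>The fiddly part in practice is maintaining that lookup table, which is presumably the sort of chore a package-level merge helper is intended to take off your hands.</p>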
<p>By releasing an &#8220;official&#8221; R package to access the FAOStat API, it occurs to me that this makes it much easier to start building sector specific <a href="http://www.rstudio.com/shiny/">Shiny applications</a> around particular datasets? I wonder whether the FAOstat folk have considered whether there is a possibility of developing a small Shiny app or custom client ecosystem around their data, even if it just takes the form of a curated set of gists that can be downloaded directly into RStudio, for example, using <a href="http://www.inside-r.org/node/168838"><tt>runGist</tt></a>?</p>
<p>I don&#8217;t know whether the <a href="http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/">Eurostat EC Statistics database</a> has an associated R package too? (If so, it could be quite interesting trying to tie them together?!) I do note, however, that <a href="http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/bulk_download">Eurostat data is available for download</a> (though I haven&#8217;t read the terms/license conditions&#8230;).</p>
<p>I also note that a Linked Data/SPARQL way in to Eurostat data appears to be available? <a href="http://eurostat.linked-statistics.org/">Eurostat Linked Data</a>.</p>
<p>[Man flu, hence the brevity of the post... skulks back off to sick bed...]</p>
<p>PS By the by, I notice that the <a href="http://www.england.nhs.uk/statistics/2013/03/08/nhs-111-statistics-january-2013/">NHS are experimenting with making some data releases available via Google Public Data Explorer</a> [scroll down...]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/03/08/publishing-stats-for-analytics-reuse-faostat-website-and-r-package/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/faostat-website.png" medium="image">
			<media:title type="html">FAOStat website</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/faostat-graphical-tools.png" medium="image">
			<media:title type="html">FAOstat - graphical tools</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/faostat-inline-data-preview.png" medium="image">
			<media:title type="html">faostat - inline data preview</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/faostat-ddata-analysis.png" medium="image">
			<media:title type="html">FAOStat - ddata analysis</media:title>
		</media:content>
	</item>
		<item>
		<title>What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events</title>
		<link>http://blog.ouseful.info/2013/03/04/what-happened-then-using-approximated-twitter-follower-accession-to-identify-political-events/</link>
		<comments>http://blog.ouseful.info/2013/03/04/what-happened-then-using-approximated-twitter-follower-accession-to-identify-political-events/#comments</comments>
		<pubDate>Mon, 04 Mar 2013 21:42:58 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Anything you want]]></category>
		<category><![CDATA[Rstats]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=10028</guid>
		<description><![CDATA[Following a chat with @andypryke, I thought I&#8217;d try out a simple bit of feature detection around approximated follower acquisition charts (e.g. Estimated Follower Accession Charts for Twitter) to see if I could detect dates around which there were spikes in follower acquisition. So for example, here&#8217;s the follower acquisition chart for Seema Malhotra: We [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Following a chat with <a href="http://twitter.com/andypryke">@andypryke</a>, I thought I&#8217;d try out a simple bit of feature detection around approximated follower acquisition charts (e.g. <a href="http://blog.ouseful.info/2013/04/05/estimated-follower-accession-charts-for-twitter/">Estimated Follower Accession Charts for Twitter</a>) to see if I could detect dates around which there were spikes in follower acquisition.</p>
<p>So for example, here&#8217;s the follower acquisition chart for Seema Malhotra:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/seemamalhotra.png"><img src="http://ouseful.files.wordpress.com/2013/03/seemamalhotra.png?w=700" alt="seemaMalhotra"   class="alignnone size-full wp-image-10029" /></a></p>
<p>We see a spike in follower count about 440 days ago, with an increased daily follower acquisition rate thereafter. What happened 440 days or so ago? We can easily look this up on something like Wolfram Alpha (query on /440 days ago/) or directly in R:</p>
<pre class="brush: r; title: ; notranslate">as.Date(Sys.time())-440
[1] &quot;2011-12-20&quot;</pre>
<p>So what happened in December 2011? A quick search on /&#8221;Seema Malhotra&#8221; December 2011/ turns up the news that she won a by-election on 16 December 2011. The spike in followers matches the by-election date well, and the increased rate in daily follower acquisition since then is presumably related to the fact that Seema Malhotra is now an MP.</p>
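<p>Because <tt>Sys.time()</tt> moves on every day, the lookup above only returns that date on the day it was run. A reproducible variant pins the reference date; in this sketch the <tt>chartDate</tt> name is made up, and its value assumes the chart was generated on the day the post was published:</p>
<pre class="brush: r; title: ; notranslate">#Assumed reference date: the publication date of the post
chartDate=as.Date("2013-03-04")

#Date arithmetic on Date objects works in whole days
chartDate-440
#[1] "2011-12-20"

#Going the other way: how many days before the chart date was the by-election date?
as.integer(chartDate-as.Date("2011-12-16"))
#[1] 444</pre>
<p>The few days&#8217; difference between the two figures is well within what you&#8217;d expect from reading a day count off a chart by eye.</p>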
<p>So what&#8217;s the new line on the chart (the black, stepped line along the bottom)? It&#8217;s actually a 5 point moving average of the first difference in follower count over time, that is, sort of a smoothed version of a crude approximation to the gradient of the follower acquisition curve. The <em>firstdiff</em> curve is normalised by taking the difference in accumulated follower count between consecutive time samples and dividing it by the number of days between those samples, so it&#8217;s a sort of gradient rather than a first difference. (If the samples were all a day apart, it would be a first difference&#8230;) I also only label points on the line where there was a &#8220;significant jump&#8221; in follower count, arbitrarily set at a 5 sample moving average of more than 50 new followers per day. Note the scaling of the moving average values too &#8211; the numerical y-axis scale is 1:1 for the cumulative follower number, but the moving average values are plotted at 10x. The numerical value labels that annotate the line chart give the number of days ago (relative to the date the chart was generated) that the peak corresponds to. For any chart critics out there &#8211; this is a &#8220;working chart&#8221;, rather than a polished presentation graphic;-)</p>
<pre class="brush: r; title: ; notranslate">#Process Twitter user data file
processUsersData=function(data){
  data$tz=as.POSIXct(data$created_at)
  data$days=as.integer(difftime(Sys.time(),data$tz,units='days'))
  data=data[rev(rownames(data)),]
  data$acc=1:length(data$days)
  data$recency=cummin(data$days)
  data$frfo=data$friends_count/data$followers_count
  data$stfo=data$statuses_count/data$followers_count
  data$foperday=data$followers_count/data$days
  data$stperday=data$statuses_count/data$days
  data$fost=data$followers_count/(1+data$statuses_count)
  
  data
}

#The TTR library includes various moving average functions
require(TTR)
#ggplot2 is needed for the example plotter below
require(ggplot2)

differ_a=function(d){
  d=processUsersData(d)

  #Find the users who are used to approximate the accession date
  d2=subset(d,days==recency)
  #Dedupe these rows (need to check if I grab the first or the last...)
  d3=d2[!duplicated(d2$recency),]

  #First difference
  d3$accdiff=c(0,diff(d3$acc))
  d3$daysdiff=c(0,-diff(d3$days))
  d3$firstdiff=d3$accdiff/d3$daysdiff

  #First difference smoothed over 5 values - note we do dodgy things against time here - just look for signal!
  d3$SMA5=SMA(d3$firstdiff,5)

  #Second difference
  d3$fdd=c(0,diff(d3$firstdiff))
  d3$seconddiff=d3$fdd/d3$daysdiff
  
  d3
}

#An example plotter - sm is the user data
g= ggplot(processUsersData(sm))
g=g+geom_point(aes(x=-days,y=acc),size=1) #The black acc/days dots
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1) #The red acc/days  acquisition date estimate dots
g=g+geom_line(data=differ_a(sm),aes(x=-days,y=10*SMA5)) #The firstdiff moving average line
g=g+geom_text(data=subset(differ_a(sm),SMA5&gt;50),aes(x=-days,y=10*SMA5,label=days),size=3) #Feature label
g=g+ggtitle(&quot;Seema Malhotra&quot;) #Chart title
</pre>
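<p>To see what the <em>firstdiff</em> normalisation and the smoothing do to the numbers, here&#8217;s a small worked sketch on made-up samples. It follows the same difference-then-divide steps as <tt>differ_a</tt>, but as a simplification it swaps in base R&#8217;s <tt>stats::filter</tt> for a trailing moving average (so TTR isn&#8217;t needed) and uses a 3 sample window because the toy data is so short; the <tt>toy</tt> values and the <tt>MA3</tt> column name are invented:</p>
<pre class="brush: r; title: ; notranslate">#Toy accession samples: days ago for each sample and cumulative follower count (made up)
toy=data.frame(days=c(100,90,85,80,79,70,60),acc=c(10,30,40,50,250,270,290))

#Followers gained and days elapsed between consecutive samples
toy$accdiff=c(0,diff(toy$acc))
toy$daysdiff=c(0,-diff(toy$days))

#Gain per day between samples (the first row has no predecessor, so 0/0 gives NaN)
toy$firstdiff=toy$accdiff/toy$daysdiff

#Trailing 3-sample moving average using base R rather than TTR
toy$MA3=as.numeric(stats::filter(toy$firstdiff,rep(1/3,3),sides=1))
toy

#The sample with the steepest gain per day
toy$days[which.max(toy$firstdiff)]</pre>
<p>The jump of 200 followers in a single day at 79 days ago dominates <tt>firstdiff</tt>, and because the average is trailing, the smoothed value stays elevated for the following samples too, so it&#8217;s the leading edge of a bump that marks the event.</p>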
<p>Here&#8217;s Chris Pincher:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/chrispincher.png"><img src="http://ouseful.files.wordpress.com/2013/03/chrispincher.png?w=700" alt="chrispincher"   class="alignnone size-full wp-image-10030" /></a></p>
<p>This account got hit about 79 days ago (December 15th 2012) &#8211; we need to ignore the width of the moving average curve and just focus on the leading edge, as a zoom into the chart, with a barchart depicting firstdiff replacing the first diff moving average line, demonstrates.</p>
<pre class="brush: r; title: ; notranslate">#Got a rogue datapoint in there somehow?
g=ggplot(subset(processUsersData(cpmp),days&lt;5000))
g=g+geom_point(aes(x=-days,y=acc),size=1)
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1)
#Part of the original filter was garbled in publishing; only the days&lt;5000 bound is recoverable
#stat='identity' makes the bar heights the firstdiff values rather than counts
g=g+geom_bar(data=subset(differ_a(cpmp),days&lt;5000),aes(x=-days,y=firstdiff),stat='identity')
g=g+ggtitle(&quot;Chris Pincher&quot;)+xlim(-200,-25)</pre>
<p><a href="http://ouseful.files.wordpress.com/2013/03/chrispincherzoom.png"><img src="http://ouseful.files.wordpress.com/2013/03/chrispincherzoom.png?w=700" alt="chrispincherzoom"   class="alignnone size-full wp-image-10031" /></a></p>
<p>The spam followers that were signed up to the account look like they were created in batches several months prior to what I presume was an attack? Could this have been in response to his <a href="http://www.tamworthconservatives.co.uk/2012/12/christopher-pincher-speaks-out-about-the-collapse-of-drive-assist/">Speaking Out about the Collapse of Drive Assist</a> on Thursday, December 13th, 2012, his <a href="http://www.huffingtonpost.co.uk/chris-pincher/securing-the-uks-energy-n_b_2275881.html" rel="nofollow">Huffpo post</a> on the 11th, or his <a href="http://conservativehome.blogs.com/parliament/christopher-pincher-mp/">vote against the Human Rights Act</a> as reported on the 5th?</p>
<p>Who else has an odd follower acquisition chart? How about Aidan Burley?</p>
<p><a href="http://ouseful.files.wordpress.com/2013/03/aidanburley.png"><img src="http://ouseful.files.wordpress.com/2013/03/aidanburley.png?w=700" alt="AidanBurley"   class="alignnone size-full wp-image-10032" /></a></p>
<p>219 days ago &#8211; 28th July 2012&#8230;</p>
<p><a href="http://www.guardian.co.uk/politics/2012/jul/28/olympics-opening-ceremony-multicultural-crap-tory-mp"><img src="http://ouseful.files.wordpress.com/2013/03/aidanburleycrap.png?w=700&#038;h=589" alt="aidanBurleycrap" width="700" height="589" class="alignnone size-full wp-image-10035" /></a></p>
<p>I guess that caused something of a Twitter storm, and a resulting growth in his follower count&#8230; Diane Abbott&#8217;s <a href="http://www.guardian.co.uk/politics/2012/jan/05/diane-abbott-accused-racism-twitter">racist tweet row</a> from January 2012 also grew her Twitter following&#8230; Top tip for follower acquisition, there;-)</p>
<p>Nadine Dorries&#8217; outspoken comments in May 2012 around David Cameron&#8217;s party leadership, and then same sex marriage, were good for her Twitter follower count, which received another push when she joined I&#8217;m a Celebrity and was suspended from the Parliamentary Conservative party.</p>
<p>Showing your emotions in Parliament looks like a handy trick too&#8230; Here&#8217;s a spike around about October 20th, 2011&#8230;</p>
<p><a href="http://www.liverpoolecho.co.uk/liverpool-news/local-news/2011/10/19/hillsborough-debate-alison-mcgovern-on-why-she-had-to-make-her-emotionally-charged-speech-video-100252-29620042/"><img src="http://ouseful.files.wordpress.com/2013/03/alisonmcgovern.png?w=700&#038;h=379" alt="alisonMcgovern" width="700" height="379" class="alignnone size-full wp-image-10036" /></a></p>
<p>(There also looks to be a gradient change around 200 days ago maybe? The second diff calculations might pull this out?)</p>
<p>Chris Bryant&#8217;s <a href="http://www.guardian.co.uk/commentisfree/2011/jul/07/phone-hacking-chris-bryant">speech on the phone hacking saga</a> in July 2011 showed that publicly well-received parliamentary speeches can be good for numbers too; not surprisingly, the phone hacking scandal was also good for Tom Watson&#8217;s follower count around the end of July 2011. Election victories can be good too: Andy Sawford got a jump in followers when he was announced as a PPC (10th August 2012) and then when he won his seat (November 7th 2012); Ben Bradshaw&#8217;s numbers also jumped around the time of his May 2010 election victory, as did Lynne Featherstone&#8217;s, particularly with her <a href="http://www.guardian.co.uk/theguardian/2012/mar/10/lynne-featherstone-equality-women-feminism">appointment to a government position</a>. Jesse Norman appeared to get a bump after the Prime Minister <a href="http://www.bbc.co.uk/news/uk-politics-18795720">confronted him on July 11th 2012</a>; Nick de Bois saw a leap in followers following the riots in his constituency in early August 2011, and the riots also seem to have bumped David Lammy&#8217;s and Diane Abbott&#8217;s numbers up.</p>
<p>A <a href="http://www.guardian.co.uk/uk/2011/sep/17/welsh-miners-families-face-loss">tragedy</a> on September 17th, 2011 looks like it may have pushed Peter Hain&#8217;s numbers, but he was in the news a reasonable amount around that time &#8211; maybe getting your name in the press for several days in a row is good for Twitter follower counts? Steve Rotheram also benefited from another recalled tragedy, the Hillsborough disaster, when, in October 2011, <a href="http://www.liverpoolecho.co.uk/liverpool-news/local-news/2011/10/20/liverpool-mp-steve-rotheram-to-challenge-ex-sun-editor-kelvin-mackenzie-over-paper-s-hillsborough-lies-100252-29631849/">he called the ex-Sun editor out</a> over its original coverage; he seems to have received another boost in followers when he <a href="http://www.liverpoolecho.co.uk/liverpool-news/local-news/2012/09/18/liverpool-walton-mp-steve-rotheram-leads-commons-debate-on-internet-trolls-100252-31855396/">led a debate on internet trolls</a> in September 2012.</p>
<p>Personal misfortune didn&#8217;t do Michael Fabricant any harm &#8211; his <a href="http://www.dailymail.co.uk/news/article-2219554/Michael-Fabricant-I-guilty-I-pleb-Tory-MP-caught-speeding-30mph-zone-says-hell-train.html">speeding conviction</a> and <a href="http://conservativehome.blogs.com/parliament/2012/10/mayday-mayday-whips-office-calling-stop-mike_fabricant-now-we-repeat-stop-mike_fabricant-now-.html">colourful Twitter baiting</a> in October 2012 caused his follower count to fly and achieve an elevated rate of daily growth it&#8217;s maintained ever since.</p>
<p>A Dispatches special on ticket touts got a bounce in followers for Sharon Hodgson, who was sponsoring a <a href="http://www.sharonhodgson.org/ticket-touting-private-members-bill">private member&#8217;s bill on ticket touts</a> at the time; winning a social media award seemed to do Kevin Brennan a favour in terms of his daily follower acquisition rate, as this ramp increase around the start of December 2010 shows:</p>
<p><a href="http://www.guardian.co.uk/cardiff/2010/nov/18/cardiff-mp-wins-social-media-award"><img src="http://ouseful.files.wordpress.com/2013/03/kevinbrennan.png?w=700&#038;h=412" alt="kevinBrennan" width="700" height="412" class="alignnone size-full wp-image-10037" /></a></p>
<p>So there we have it; political life as seen through the lens of Twitter follower acquisition bursts:-)</p>
<p>But what now? I guess one thing to do would be to have a go at estimating the daily growth rates of the various twittering MPs, and see if they bear any relation to things like ministerial (or Shadow Minister) responsibility? Where rates seem to change (sustained kinks in the curve), it might be worth looking to see whether we can identify any signs of changes in tweeting behaviour &#8211; or maybe news stories that come to associate the MP with Twitter in some way?</p>
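<p>One crude way of putting a number on a daily growth rate would be to fit a straight line through a stretch of the accession curve and read off the slope. The following is only a sketch of that idea on invented data (the <tt>day</tt> and <tt>acc</tt> values are made up), not an analysis of any real account:</p>
<pre class="brush: r; title: ; notranslate">#Toy accession curve: cumulative follower count against days since the first sample (made up)
day=c(0,10,20,30,40,50)
acc=c(100,150,200,250,300,350)

#The slope of a straight-line fit is a crude followers-per-day estimate
fit=lm(acc~day)
rate=unname(coef(fit)["day"])
rate</pre>
<p>Fitting separate lines either side of a suspected kink, and comparing the two slopes, would be one way of checking whether a change in rate is sustained.</p>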
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/03/04/what-happened-then-using-approximated-twitter-follower-accession-to-identify-political-events/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/seemamalhotra.png" medium="image">
			<media:title type="html">seemaMalhotra</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/chrispincher.png" medium="image">
			<media:title type="html">chrispincher</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/chrispincherzoom.png" medium="image">
			<media:title type="html">chrispincherzoom</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/aidanburley.png" medium="image">
			<media:title type="html">AidanBurley</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/aidanburleycrap.png" medium="image">
			<media:title type="html">aidanBurleycrap</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/alisonmcgovern.png" medium="image">
			<media:title type="html">alisonMcgovern</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/03/kevinbrennan.png" medium="image">
			<media:title type="html">kevinBrennan</media:title>
		</media:content>
	</item>
		<item>
		<title>Sketches Around Twitter Followers</title>
		<link>http://blog.ouseful.info/2013/02/19/sketches-around-twitter-followers/</link>
		<comments>http://blog.ouseful.info/2013/02/19/sketches-around-twitter-followers/#comments</comments>
		<pubDate>Tue, 19 Feb 2013 14:09:14 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Infoskills]]></category>
		<category><![CDATA[Rstats]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9868</guid>
		<description><![CDATA[I&#8217;ve been doodling&#8230; Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I&#8217;d have a quick look at some stats around follower groupings&#8230; I started off with a data grab, pulling down the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been doodling&#8230; Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I&#8217;d have a quick look at some stats around follower groupings&#8230;</p>
<p>I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the &#8220;location&#8221;, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).</p>
<p>So what can we do with that data? Simple scatter plots, for one thing &#8211; here&#8217;s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/ukmps_frfo_scatter.png"><img src="http://ouseful.files.wordpress.com/2013/02/ukmps_frfo_scatter.png?w=700&#038;h=266" alt="ukMPS_frfo_scatter" width="700" height="266" class="alignnone size-full wp-image-9869" /></a></p>
<p>We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in&#8230;:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/ukmps_fo_dist.png"><img src="http://ouseful.files.wordpress.com/2013/02/ukmps_fo_dist.png?w=700&#038;h=233" alt="ukMPS_fo_dist" width="700" height="233" class="alignnone size-full wp-image-9872" /></a></p>
<p>We can also calculate ratios and then plot them &#8211; followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/ukmps_foday_fost.png"><img src="http://ouseful.files.wordpress.com/2013/02/ukmps_foday_fost.png?w=700&#038;h=257" alt="ukMPs_foday_fost" width="700" height="257" class="alignnone size-full wp-image-9873" /></a></p>
<p>This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a &#8216;change of use&#8217;.</p>
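<p>The ratios themselves are simple column arithmetic. Here&#8217;s a minimal sketch on made-up account summaries; the column names mirror the Twitter user fields mentioned above, the values are invented, and adding 1 to the status count is just one assumed way of guarding against accounts that have never tweeted:</p>
<pre class="brush: r; title: ; notranslate">#Toy account summaries (made-up values); days is the account age in days
accounts=data.frame(
  followers_count=c(1200,50000,300),
  statuses_count=c(400,9999,0),
  days=c(600,1000,30)
)

#Followers per day since the account was registered
accounts$foperday=accounts$followers_count/accounts$days

#Followers per status update; adding 1 avoids dividing by zero
accounts$fost=accounts$followers_count/(1+accounts$statuses_count)
accounts</pre>
<p>The third toy account, with 300 followers and no tweets after only 30 days, is the sort of row that would stand out as an outlier in a view like this.</p>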
<p>As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof &#8211; I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.</p>
<p>So for example, things like this &#8211; a scatterplot of friends/follower counts similar to the one above:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/friendsfollowers.png"><img src="http://ouseful.files.wordpress.com/2013/02/friendsfollowers.png?w=700" alt="friendsfollowers"   class="alignnone size-full wp-image-9876" /></a></p>
<p>&#8230;sort of. There&#8217;s something obviously odd about that graph, isn&#8217;t there? The &#8220;step up&#8221; at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.</p>
<p>How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/foperday_days.png"><img src="http://ouseful.files.wordpress.com/2013/02/foperday_days.png?w=700" alt="foperday_days"   class="alignnone size-full wp-image-9877" /></a></p>
<p>Alternatively, we can do counts by number of days the followers have been on Twitter:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/rplot.png"><img src="http://ouseful.files.wordpress.com/2013/02/rplot.png?w=700" alt="Rplot"   class="alignnone size-full wp-image-9878" /></a></p>
<p>The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/gtrends.png"><img src="http://ouseful.files.wordpress.com/2013/02/gtrends.png?w=700&#038;h=501" alt="gtrends" width="700" height="501" class="alignnone size-full wp-image-9879" /></a></p>
<p>Sometimes, you get a distribution that is very, very wrong&#8230; Suppose we draw a histogram whose bins along the x-axis correspond to ranges of follower counts (0-100 followers, 500-600 followers, and so on). We take each follower of a particular account, pop them into the bin that matches the number of followers they have themselves, and then count the number of people in each bin once we have allocated them all. Normally, we might expect to see something like this:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/normally-log-followers.png"><img src="http://ouseful.files.wordpress.com/2013/02/normally-log-followers.png?w=700" alt="normally log followers"   class="alignnone size-full wp-image-9875" /></a></p>
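<p>The binning step described above can be sketched in a few lines of Python. This is only an illustration under assumed names (<em>bin_follower_counts</em>, a fixed bin width of 100) and with invented follower counts; the charts here were produced in R.</p>

```python
from collections import Counter

def bin_follower_counts(follower_counts, bin_width=100):
    """Count how many followers fall into each fixed-width bin of follower counts."""
    bins = Counter()
    for count in follower_counts:
        # Integer division gives the bin index; multiplying back gives the bin's lower edge
        bins[(count // bin_width) * bin_width] += 1
    # Return the bins ordered by their lower edge
    return dict(sorted(bins.items()))

# Invented follower counts for the followers of some account
sample = [3, 42, 99, 100, 150, 560, 575, 2300]
histogram = bin_follower_counts(sample)
```

<p>Each key is the lower edge of a bin and each value is the number of followers that landed in it.</p>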
<p>However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/a-dodgy-follower-distribution.png"><img src="http://ouseful.files.wordpress.com/2013/02/a-dodgy-follower-distribution.png?w=700" alt="a dodgy follower distribution"   class="alignnone size-full wp-image-9874" /></a></p>
<p>There&#8217;s obviously something not quite, erm, normal(?!) about this account (at least, <em>at the time I grabbed the data, there was something not quite normal etc etc&#8230;</em>).</p>
<p>When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list &#8211; for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats &#8211; but which ones does it make sense to look at?</p>
<p>Here&#8217;s one&#8230; a plot of the median followers per status ratio versus the median friend/follower ratio:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/fostvfrfo.png"><img src="http://ouseful.files.wordpress.com/2013/02/fostvfrfo.png?w=700" alt="fostvfrfo"   class="alignnone size-full wp-image-9883" /></a></p>
<p>Spot the outlier ;-) </p>
<p>So that&#8217;s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I&#8217;ve also been doing for some time:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/davidevennett.png"><img src="http://ouseful.files.wordpress.com/2013/02/davidevennett.png?w=700&#038;h=700" alt="davidevennett" width="700" height="700" class="alignnone size-full wp-image-9882" /></a></p>
<p>It&#8217;s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(</p>
<p>[In the next post in this series I'll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable "events" surrounding the person being followed...]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/02/19/sketches-around-twitter-followers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/ukmps_frfo_scatter.png" medium="image">
			<media:title type="html">ukMPS_frfo_scatter</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/ukmps_fo_dist.png" medium="image">
			<media:title type="html">ukMPS_fo_dist</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/ukmps_foday_fost.png" medium="image">
			<media:title type="html">ukMPs_foday_fost</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/friendsfollowers.png" medium="image">
			<media:title type="html">friendsfollowers</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/foperday_days.png" medium="image">
			<media:title type="html">foperday_days</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/rplot.png" medium="image">
			<media:title type="html">Rplot</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/gtrends.png" medium="image">
			<media:title type="html">gtrends</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/normally-log-followers.png" medium="image">
			<media:title type="html">normally log followers</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/a-dodgy-follower-distribution.png" medium="image">
			<media:title type="html">a dodgy follower distribution</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/fostvfrfo.png" medium="image">
			<media:title type="html">fostvfrfo</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/davidevennett.png" medium="image">
			<media:title type="html">davidevennett</media:title>
		</media:content>
	</item>
		<item>
		<title>Reshaping Horse Import/Export Data to Fit a Sankey Diagram</title>
		<link>http://blog.ouseful.info/2013/02/18/reshaping-horse-importexport-data-to-fit-a-sankey-diagram/</link>
		<comments>http://blog.ouseful.info/2013/02/18/reshaping-horse-importexport-data-to-fit-a-sankey-diagram/#comments</comments>
		<pubDate>Mon, 18 Feb 2013 10:31:35 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Infoskills]]></category>
		<category><![CDATA[Rstats]]></category>
		<category><![CDATA[ddj]]></category>
		<category><![CDATA[schoolofdata]]></category>
		<category><![CDATA[toolchain]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9840</guid>
		<description><![CDATA[As the food labeling and substituted horsemeat saga rolls on, I&#8217;ve been surprised at how little use has been made of &#8220;data&#8221; to put the structure of the food chain into some sort of context* (or maybe I&#8217;ve just missed those stories?). One place that can almost always be guaranteed to post a few related [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=9840&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>As the food labeling and substituted horsemeat saga rolls on, I&#8217;ve been surprised at how little use has been made of &#8220;data&#8221; to put the structure of the food chain into some sort of context* (or maybe I&#8217;ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use <a href="https://docs.google.com/spreadsheet/ccc?key=0ArwVnOqE20IkdGFRU3ZxREg4NUttRUp5YllHY095X1E&amp;usp=sharing#gid=3">EU horse import/export data</a> to produce an <a href="http://www.guardian.co.uk/uk/datablog/interactive/2013/feb/15/europe-trade-horsemeat-map-interactive">interactive map of the European trade in horsemeat</a>.</p>
<p><small>*One for the to do list &#8211; a round up of &#8220;#ddj&#8221; stories around the episode.</small></p>
<p><a href="http://www.guardian.co.uk/uk/datablog/interactive/2013/feb/15/europe-trade-horsemeat-map-interactive"><img src="http://ouseful.files.wordpress.com/2013/02/guardian-datablog-eu-trade-in-horsemeat.png?w=700&#038;h=715" alt="Guardian datablog - EU trade in horsemeat" width="700" height="715" class="alignnone size-full wp-image-9841" /></a></p>
<p>(The article describes the source of the data as the <a href="http://epp.eurostat.ec.europa.eu/newxtweb/">European Union Eurostat statistics website</a>, although I couldn&#8217;t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he&#8217;d come by the data, he <a href="https://twitter.com/smfrogers/statuses/302577122705293312">suggested</a> putting questions into the Eurostat press office;-)</p>
<p>The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:</p>
<p><a href="https://docs.google.com/spreadsheet/ccc?key=0ArwVnOqE20IkdGFRU3ZxREg4NUttRUp5YllHY095X1E&amp;usp=sharing#gid=3"><img src="http://ouseful.files.wordpress.com/2013/02/guardian-datablog-horsemeat-importexport-data.png?w=700&#038;h=377" alt="Guardian Datablog horsemeat importexport data" width="700" height="377" class="alignnone size-full wp-image-9842" /></a></p>
<p>One way of viewing this data structure is as an <em>edge weighted adjacency matrix</em> that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals <em>from</em> one country <em>to</em> another.</p>
<p>Thinking about trade as <em>flow</em> suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of &#8220;stuff&#8221; flowing through that connection. (The Guardian map above also uses edge thickness to identify trade volumes.) Here&#8217;s an example of a Sankey diagram I created around the horse export data:</p>
<p><a href="https://views.scraperwiki.com/run/eu_horse_imports_sankey_diagram/?"><img src="http://ouseful.files.wordpress.com/2013/02/horse-exports-eu-sankey-demo.png?w=700&#038;h=403" alt="Horse exports - EU - Sankey demo" width="700" height="403" class="alignnone size-full wp-image-9844" /></a></p>
<p>(The layout is a little rough and ready &#8211; I was more interested in finding a recipe for creating the base graphic &#8211; <em>sans</em> design tweaks;-) &#8211; from the data as supplied.)</p>
<p>So how did I get to the diagram from the data?</p>
<p>As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the <a href="https://github.com/d3/d3-plugins/tree/master/sankey">Sankey diagram plugin</a> to Mike Bostock&#8217;s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that the Python networkx library can <a href="http://networkx.github.com/documentation/latest/reference/readwrite.json_graph.html">generate an appropriate data object</a> from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram &#8220;for free&#8221;.</p>
<p>So how can we create the graph from the data?</p>
<p>The networkx documentation describes a method &#8211; <a href="http://networkx.github.com/documentation/latest/reference/generated/networkx.readwrite.edgelist.read_weighted_edgelist.html">read_weighted_edgelist</a> &#8211; for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the &#8220;import to&#8221; and &#8220;export from&#8221; nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a <em>weighted edge list</em> in which each row has three columns: <em>exporting country, importing country, amount</em>.</p>
<p>So how can we do that?</p>
<p>One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a <a href="https://dl.dropbox.com/u/1156404/horseexportsEU.txt">source data file</a> that looks something like this:</p>
<p><a href="https://dl.dropbox.com/u/1156404/horseexportsEU.txt"><img src="http://ouseful.files.wordpress.com/2013/02/horse-export-source-data.png?w=700" alt="horse export source data"   class="alignnone size-full wp-image-9846" /></a> </p>
<p>In contrast, the edge list looks something like this:<br />
<a href="http://ouseful.files.wordpress.com/2013/02/reshaped-horse-data.png"><img src="http://ouseful.files.wordpress.com/2013/02/reshaped-horse-data.png?w=700" alt="reshaped horse data"   class="alignnone size-full wp-image-9845" /></a></p>
<p>So how do we get from one to the other?</p>
<p>Here&#8217;s the R script I used &#8211; it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the <em>melt</em> function from the <em>reshape</em> library to generate the edge list, finally filtering out edges where there were no exports:</p>
<pre class="brush: r; title: ; notranslate">#R code

horseexportsEU &lt;- read.delim(&quot;~/Downloads/horseexportsEU.txt&quot;)
require(reshape)
#Get a &quot;long&quot; edge list
x=melt(horseexportsEU,id='COUNTRY')
#Turn the numbers into numbers by removing the comma, then casting to an integer
x$value2=as.integer(as.character(gsub(&quot;,&quot;, &quot;&quot;, x$value, fixed = TRUE) ))
#If we have an NA (null/empty) value, make it -1
x$value2[ is.na(x$value2) ] = -1
#Column names for countries that originally contained spaces have had the spaces converted to dots. Undo that.
x$variable=gsub(&quot;.&quot;, &quot; &quot;, x$variable, fixed = TRUE)
#I want to export a subset of the data
xt=subset(x,value2&gt;0,select=c('COUNTRY','variable','value2'))
#Generate a text file containing the edge list
write.table(xt, file=&quot;foo.csv&quot;, row.names=FALSE, col.names=FALSE, sep=&quot;,&quot;)</pre>
<p>(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as a weighted edge list. R also has graph libraries available, such as <em>igraph</em>, that can do similar things. But then, I wouldn&#8217;t have got to show the &#8220;melt&#8221; method of reshaping data;-)</p>
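<p>For comparison, the same wide-to-long reshaping can be sketched in plain Python using only the standard library. This is an assumption-laden illustration rather than part of the recipe: the function name <em>matrix_to_edge_list</em> is hypothetical, the input is assumed to be tab separated with the row label in the first column, and the tiny demo matrix uses invented numbers, not the real trade figures.</p>

```python
import csv
import io

def matrix_to_edge_list(text, delimiter='\t'):
    """Reshape an adjacency matrix (first column = row label) into (row, column, value) edges."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows[0]
    edges = []
    for row in rows[1:]:
        for column_name, cell in zip(header[1:], row[1:]):
            # Strip thousands separators; treat an empty cell as zero
            cleaned = cell.replace(',', '').strip()
            value = int(cleaned) if cleaned else 0
            # Keep only the cells that actually record something
            if value:
                edges.append((row[0], column_name, value))
    return edges

# Tiny invented matrix in the same shape as the source data file
demo = 'COUNTRY\tAUSTRIA\tBELGIUM\nAUSTRIA\t\t"1,200"\nBELGIUM\t300\t\n'
edges = matrix_to_edge_list(demo)
```

<p>Each tuple in <em>edges</em> corresponds to one row of the long format edge list that <em>melt</em> produces.</p>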
<p>Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:</p>
<pre class="brush: python; title: ; notranslate">#python code

import io
import csv
import json

import networkx as nx
from networkx.readwrite import json_graph

#Bring in the edge list explicitly (shown commented out here; uncomment to define rawdata)
#rawdata = '''&quot;SLOVENIA&quot;,&quot;AUSTRIA&quot;,1200
#&quot;AUSTRIA&quot;,&quot;BELGIUM&quot;,134600
#&quot;BULGARIA&quot;,&quot;BELGIUM&quot;,181900
#&quot;CYPRUS&quot;,&quot;BELGIUM&quot;,200600
#... etc
#&quot;ITALY&quot;,&quot;UNITED KINGDOM&quot;,12800
#&quot;POLAND&quot;,&quot;UNITED KINGDOM&quot;,129100'''

#We convert the rawdata string into a filestream
f = io.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')

def gNodeAdd(DG,nodelist,name):
    node=len(nodelist)
    DG.add_node(node,name=name)
    nodelist.append(name)
    return DG,nodelist

nodelist=[]

DG = nx.DiGraph()

#Here's where we build the graph
for item in reader:
    #Even though import and export countries have the same name, we create a unique version depending on
    # whether the country is the importer or the exporter.
    importTo=item[0]+'.'
    exportFrom=item[1]
    #The csv reader returns strings, so cast the amount to a number
    amount=int(item[2])
    if importTo not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,importTo)
    if exportFrom not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
    DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)

#Use a separate variable name so that we don't overwrite the json module
sankey_json = json.dumps(json_graph.node_link_data(DG))
#The &quot;sankey_json&quot; serialisation can then be passed to a d3js containing web page...
</pre>
<p>Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: <a href="https://scraperwiki.com/views/eu_horse_imports_sankey_diagram/">EU Horse imports Sankey Diagram</a>.</p>
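<p>To make the shape of that JSON object a little more concrete, here is a sketch that builds a node-link structure by hand, without networkx. It assumes the plugin wants a list of named nodes plus a list of links that refer to nodes by their position in that list, which is how the plugin&#8217;s own example data is laid out; the function name <em>edges_to_node_link</em> and the two example edges are invented.</p>

```python
import json

def edges_to_node_link(edges):
    """Build a {'nodes': [...], 'links': [...]} dict from (source, target, value) triples."""
    names = []   # node names, in order of first appearance
    index = {}   # node name -> position in the names list
    links = []
    for source, target, value in edges:
        # Suffix the target name so a country that appears on both sides gets two distinct nodes
        for name in (source, target + '.'):
            if name not in index:
                index[name] = len(names)
                names.append(name)
        links.append({'source': index[source], 'target': index[target + '.'], 'value': value})
    return {'nodes': [{'name': n} for n in names], 'links': links}

# Two invented edges between the same pair of countries, one in each direction
graph = edges_to_node_link([('AUSTRIA', 'BELGIUM', 1200), ('BELGIUM', 'AUSTRIA', 300)])
serialised = json.dumps(graph)
```

<p>Because the two sides are given different names, flows always run from one side of the diagram to the other rather than looping back on themselves.</p>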
<p>What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation without too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.</p>
<p>PS In passing, as well as the data tables that can be searched on Eurostat, I also found the <a href="http://epp.eurostat.ec.europa.eu/portal/page/portal/publications/eurostat_yearbook_2012">Eurostat Yearbook</a>, which (for the most recent release at least), includes data tables relating to reported items:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/eurostat-yearbook.png"><img src="http://ouseful.files.wordpress.com/2013/02/eurostat-yearbook.png?w=700&#038;h=510" alt="Eurostat Yearbook" width="700" height="510" class="alignnone size-full wp-image-9843" /></a></p>
<p>So it seems that the more I look, the more places seem to be making the data that appears in reports available <em>as data</em>&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/02/18/reshaping-horse-importexport-data-to-fit-a-sankey-diagram/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/guardian-datablog-eu-trade-in-horsemeat.png" medium="image">
			<media:title type="html">Guardian datablog - EU trade in horsemeat</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/guardian-datablog-horsemeat-importexport-data.png" medium="image">
			<media:title type="html">Guardian Datablog horsemeat importexport data</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/horse-exports-eu-sankey-demo.png" medium="image">
			<media:title type="html">Horse exports - EU - Sankey demo</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/horse-export-source-data.png" medium="image">
			<media:title type="html">horse export source data</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/reshaped-horse-data.png" medium="image">
			<media:title type="html">reshaped horse data</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/eurostat-yearbook.png" medium="image">
			<media:title type="html">Eurostat Yearbook</media:title>
		</media:content>
	</item>
		<item>
		<title>F1Stats &#8211; Correlations Between Qualifying, Grid and Race Classification</title>
		<link>http://blog.ouseful.info/2013/02/09/f1stats-correlations-between-qualifying-grid-and-race-classisification/</link>
		<comments>http://blog.ouseful.info/2013/02/09/f1stats-correlations-between-qualifying-grid-and-race-classisification/#comments</comments>
		<pubDate>Sat, 09 Feb 2013 23:17:15 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Rstats]]></category>
		<category><![CDATA[f1datajunkie]]></category>
		<category><![CDATA[f1stats]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9825</guid>
		<description><![CDATA[Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing in my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing, here&#8217;s [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=9825&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Following directly on from <a href="http://blog.ouseful.info/2013/01/30/f1stats-visually-comparing-qualifying-and-grid-positions-with-race-classification/">F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification</a>, and continuing in my attempt to replicate some of the methodology and results used in <a href="http://newton.uor.edu/FacultyFolder/Silva/NASCARvF1.pdf">A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing</a>, here&#8217;s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.</p>
<p>I&#8217;ve already done a brief review of what correlation means (sort of) in <a href="http://blog.ouseful.info/2013/01/25/f1stats-a-prequel-to-getting-started-with-rank-correlations/">F1Stats – A Prequel to Getting Started With Rank Correlations</a>, so I&#8217;m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications.</p>
<p>Here&#8217;s the answer from the <s>back of the book</s> paper that we&#8217;re aiming for&#8230;</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f1vnascarcorrelation.png"><img src="http://ouseful.files.wordpress.com/2013/02/f1vnascarcorrelation.png?w=700" alt="F1VNASCARcorrelation"   class="alignnone size-full wp-image-9826" /></a></p>
<p>Here&#8217;s what I got:</p>
<p><small>
<pre>&gt; corrs.df[order(corrs.df$V1),]
              V1   p3pos.int    qpos.int     grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.30075188  0.01503759  0.087218045           1 7.143421e-01 9.518408e-01 0.197072158
13      MALAYSIA  0.42706767  0.57293233  0.630075188           1 3.584362e-03 9.410805e-03 0.061725312
6          CHINA -0.26015038  0.57443609  0.514285714           1 2.183596e-02 9.193214e-03 0.266812583
3        BAHRAIN  0.13082707  0.73233083  0.739849624           1 2.900250e-04 3.601434e-04 0.581232598
16         SPAIN  0.25112782  0.80451128  0.804511278           1 2.179221e-05 2.179221e-05 0.284231482
14        MONACO  0.51578947  0.48120301  0.476691729           1 3.513870e-02 3.326706e-02 0.021403708
17        TURKEY  0.52330827  0.73082707  0.730827068           1 3.756531e-04 3.756531e-04 0.019344720
9  GREAT BRITAIN  0.65413534  0.83007519  0.830075188           1 8.921842e-07 8.921842e-07 0.002260234
8        GERMANY  0.32030075  0.46917293  0.452631579           1 4.657539e-02 3.844275e-02 0.168419054
10       HUNGARY  0.49649123  0.37017544  0.370175439           1 1.194050e-01 1.194050e-01 0.032293715
7         EUROPE  0.28120301  0.72030075  0.720300752           1 4.997719e-04 4.997719e-04 0.228898214
4        BELGIUM  0.06766917  0.62105263  0.621052632           1 4.222076e-03 4.222076e-03 0.777083014
11         ITALY  0.52932331  0.52481203  0.524812030           1 1.895282e-02 1.895282e-02 0.017815489
15     SINGAPORE  0.50526316  0.58796992  0.715789474           1 5.621214e-04 7.414170e-03 0.024579520
12         JAPAN  0.34912281  0.74561404  0.849122807           1 0.000000e+00 3.739715e-04 0.143204045
5         BRAZIL -0.51578947 -0.02105263 -0.007518797           1 9.771776e-01 9.316030e-01 0.021403708
1      ABU DHABI  0.42556391  0.66466165  0.628571429           1 3.684738e-03 1.824565e-03 0.062722332</pre>
<p></small></p>
<p>The paper mistakenly reports the grid values as the qualifying positions, so if we look down the grid.int column that I use to contain the correlation values between the <em>grid</em> and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values and they seem to be a little bit off, but of the right order.</p>
<p>And here&#8217;s the R-code I used to get those results&#8230; The first chunk is just the loader, a refinement of the code I have used previously:</p>
<pre class="brush: r; title: ; notranslate">require(RSQLite)
require(reshape)
#ddply() and mutate(), used in the tidying step, come from plyr
require(plyr)

#Data downloaded from my f1com scraper on scraperwiki
f1 = dbConnect(drv=&quot;SQLite&quot;, dbname=&quot;f1com_megascraper.sqlite&quot;)

getRacesData.full=function(year='2012'){
  #Data query
  results.combined=dbGetQuery(f1,
                              paste('SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==',year,' and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;',sep=''))
  
  #Data tidying
  results.combined=ddply(results.combined,.(race),mutate,racepos.raw=1:length(race))
  for (i in c('racepos','grid','qpos','p3pos','driverNum'))
    results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]]))
  results.combined$race=reorder(results.combined$race,results.combined$raceNum)
  
  results.combined
}

f1 = dbConnect(drv=&quot;SQLite&quot;, dbname=&quot;f1com_megascraper.sqlite&quot;)

results.combined=getRacesData.full(2009)
corrs.df[order(corrs.df$V1),]</pre>
<p>Here&#8217;s the actual correlation calculation &#8211; I use the <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html"><tt>cor</tt> function</a>:</p>
<pre class="brush: r; title: ; notranslate">#The cor() function returns data that looks like:
#            p3pos.int   qpos.int   grid.int racepos.raw
#p3pos.int   1.0000000 0.31578947 0.28270677  0.30075188
#qpos.int    0.3157895 1.00000000 0.97744361  0.01503759
#grid.int    0.2827068 0.97744361 1.00000000  0.08721805
#racepos.raw 0.3007519 0.01503759 0.08721805  1.00000000
#Row/col 4 relates to the correlation with the race classification, so for now just return that

corr.rank.race=function(results.combined,cmethod='spearman'){
  ##Correlations
  corrs=NULL
  #Run through the races
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined,
                                 race==i,
                                 select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #print(i)
    #print( results.classified)
    cp=cor(results.classified,method=cmethod,use=&quot;complete.obs&quot;)
    #print(cp[4,])
    corrs=rbind(corrs,c(i,cp[4,]))
  }
  corrs.df=as.data.frame(corrs)
  
  signif=data.frame()
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined,
                                 race==i,
                                 select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #p.value
    pval.grid=cor.test(results.classified$racepos.raw,results.classified$grid.int,method=cmethod,alternative = &quot;two.sided&quot;)$p.value
    pval.qpos=cor.test(results.classified$racepos.raw,results.classified$qpos.int,method=cmethod,alternative = &quot;two.sided&quot;)$p.value
    pval.p3pos=cor.test(results.classified$racepos.raw,results.classified$p3pos.int,method=cmethod,alternative = &quot;two.sided&quot;)$p.value

    signif=rbind(signif,data.frame(race=i,pval.grid=pval.grid,pval.qpos=pval.qpos,pval.p3pos=pval.p3pos))
  }

  corrs.df$qpos.int=as.numeric(as.character(corrs.df$qpos.int))
  corrs.df$grid.int=as.numeric(as.character(corrs.df$grid.int))
  corrs.df$p3pos.int=as.numeric(as.character(corrs.df$p3pos.int))
  
  corrs.df=merge(corrs.df,signif,by.y='race',by.x='V1')
  corrs.df$V1=factor(corrs.df$V1,levels=levels(results.combined$race))
  corrs.df
}

corrs.df=corr.rank.race(results.combined)</pre>
<p>It&#8217;s then trivial to plot the result:</p>
<pre class="brush: r; title: ; notranslate">require(ggplot2)
xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab)

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int))
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g</pre>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f12009gridfinalcorr.png"><img src="http://ouseful.files.wordpress.com/2013/02/f12009gridfinalcorr.png?w=700" alt="f12009gridfinalcorr"   class="alignnone size-full wp-image-9829" /></a></p>
<p><a href="http://blog.ouseful.info/2013/01/25/f1stats-a-prequel-to-getting-started-with-rank-correlations/">Recalling that</a> there are different types of rank correlation function, specifically &#8220;Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series)&#8221;, I wondered whether it would make sense to look at correlations under this measure, to see whether there were any obvious-looking differences compared to Spearman&#8217;s rho that might prompt us to look at the actual grid/race classifications and see which score appears to be more meaningful.</p>
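<p>To make the idea of concordance a little more concrete, here&#8217;s a minimal sketch using a made-up grid and race classification for six drivers (the numbers are invented for illustration and are not taken from the results data). It counts concordant and discordant pairs by hand, then checks the answer against R&#8217;s built-in <code>cor()</code>:</p>
<pre class="brush: r; title: ; notranslate"># Toy data (made up for illustration): grid and race positions for six drivers, no ties
grid = c(1, 2, 3, 4, 5, 6)
race = c(2, 1, 3, 6, 4, 5)

# All unordered pairs of drivers, one pair per column
pairs = combn(length(grid), 2)

# +1 for a concordant pair (rank differences have the same sign in both series),
# -1 for a discordant pair
s = sign(grid[pairs[1,]] - grid[pairs[2,]]) * sign(race[pairs[1,]] - race[pairs[2,]])

# With no ties, tau = (concordant - discordant) / number of pairs
tau.byhand = sum(s) / ncol(pairs)

# The built-in versions, for comparison
tau.cor = cor(grid, race, method='kendall')
rho.cor = cor(grid, race, method='spearman')

c(tau.byhand=tau.byhand, tau.cor=tau.cor, rho.cor=rho.cor)</pre>
<p>For this toy example the hand count gives 12 concordant and 3 discordant pairs out of 15, so tau comes out at 0.6, whereas Spearman&#8217;s rho for the same data is larger, at about 0.77.</p>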
<p>The easiest way to spot the difference is probably graphically:</p>
<pre class="brush: r; title: ; notranslate">corrs.df2=corr.rank.race(results.combined,'kendall')
corrs.df2[order(corrs.df2$V1),]

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int),col='red',size=4)
g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g</pre>
<p><small>
<pre>corrs.df2[order(corrs.df2$V1),]
              V1   p3pos.int    qpos.int    grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.17894737 -0.01052632  0.04210526           1 8.226829e-01 9.744669e-01 0.288378196
13      MALAYSIA  0.26315789  0.41052632  0.46315789           1 3.782665e-03 1.110136e-02 0.112604127
6          CHINA -0.20000000  0.41052632  0.35789474           1 2.832863e-02 1.110136e-02 0.233266557
3        BAHRAIN  0.07368421  0.51578947  0.52631579           1 8.408301e-04 1.099522e-03 0.677108239
16         SPAIN  0.17894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.288378196
14        MONACO  0.38947368  0.35789474  0.35789474           1 2.832863e-02 2.832863e-02 0.016406081
17        TURKEY  0.37894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.019784403
9  GREAT BRITAIN  0.46315789  0.63157895  0.63157895           1 3.622261e-05 3.622261e-05 0.003782665
8        GERMANY  0.23157895  0.31578947  0.30526316           1 6.380788e-02 5.475355e-02 0.164976406
10       HUNGARY  0.36842105  0.36842105  0.36842105           1 2.860214e-02 2.860214e-02 0.028602137
7         EUROPE  0.21052632  0.62105263  0.62105263           1 5.176962e-05 5.176962e-05 0.208628398
4        BELGIUM  0.02105263  0.46315789  0.46315789           1 3.782665e-03 3.782665e-03 0.923502331
11         ITALY  0.35789474  0.36842105  0.36842105           1 2.373450e-02 2.373450e-02 0.028328627
15     SINGAPORE  0.35789474  0.45263158  0.55789474           1 3.589956e-04 4.748310e-03 0.028328627
12         JAPAN  0.26315789  0.57894737  0.69590643           1 6.491222e-06 3.109641e-04 0.124796908
5         BRAZIL -0.37894737 -0.05263158 -0.04210526           1 8.226829e-01 7.732195e-01 0.019784403
1      ABU DHABI  0.34736842  0.61052632  0.55789474           1 3.589956e-04 7.321900e-05 0.033643947</pre>
<p></small></p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f12009gridracecorrspearmanredvkendallblue.png"><img src="http://ouseful.files.wordpress.com/2013/02/f12009gridracecorrspearmanredvkendallblue.png?w=700" alt="f12009gridracecorrspearmanredvkendallblue"   class="alignnone size-full wp-image-9831" /></a></p>
<p>Hmm.. Kendall gives lower values for all races except Hungary &#8211; maybe put that on the &#8220;must look at Hungary compared to the other races&#8221; pile&#8230;;-)</p>
<p>One thing that did occur to me was that I have access to race data from other years, so it shouldn&#8217;t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).</p>
<pre class="brush: r; title: ; notranslate">testYears=function(years=2009:2012){
  bd=NULL
  for (year in years) {
    d=getRacesData.full(year)
    corrs.df=corr.rank.race(d)
    bd=rbind(bd,cbind(year,corrs.df))
  }
  bd
}

a=testYears(2006:2012)
ggplot(a)+geom_point(aes(x=year,y=grid.int))+facet_wrap(~V1)+ylim(0,1)

g=ggplot(a)+geom_boxplot(aes(x=V1,y=grid.int))
g=xRot(g)
g
</pre>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f1cirr2006_12.png"><img src="http://ouseful.files.wordpress.com/2013/02/f1cirr2006_12.png?w=700" alt="f1cirr2006_12"   class="alignnone size-full wp-image-9832" /></a></p>
<p>So Spain and Turkey look like they tend to the processional? Let&#8217;s see if a boxplot bears that out:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f12006_12boxplotbycct.png"><img src="http://ouseful.files.wordpress.com/2013/02/f12006_12boxplotbycct.png?w=700" alt="f12006_12boxplotbycct"   class="alignnone size-full wp-image-9835" /></a></p>
<p>How predictable have the years been, year on year?</p>
<pre class="brush: r; title: ; notranslate">g=ggplot(a)+geom_point(aes(x=V1,y=grid.int))+facet_wrap(~year)+ylim(0,1)
g=xRot(g)
g

ggplot(a)+geom_boxplot(aes(x=factor(year),y=grid.int))</pre>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f12006_12corrbyyear.png"><img src="http://ouseful.files.wordpress.com/2013/02/f12006_12corrbyyear.png?w=700" alt="f12006_12corrbyyear"   class="alignnone size-full wp-image-9833" /></a></p>
<p>And as a boxplot:</p>
<p><a href="http://ouseful.files.wordpress.com/2013/02/f12006_12processional.png"><img src="http://ouseful.files.wordpress.com/2013/02/f12006_12processional.png?w=700" alt="f12006_12processional"   class="alignnone size-full wp-image-9834" /></a></p>
<p>From a betting point of view (eg <a href="http://blog.ouseful.info/2013/01/28/getting-started-with-f1-betting-data/">Getting Started with F1 Betting Data</a> and <a href="http://blog.ouseful.info/2013/01/16/the-basics-of-betting-as-a-way-of-keeping-score/">The Basics of Betting as a Way of Keeping Score…</a>), it possibly also makes sense to look at the correlation between the P3 times and the qualifying classification, to see whether there is a testable edge in the data when it comes to betting on quali.</p>
<p>I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that&#8217;ll have to wait for another day&#8230;</p>
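<p>As a placeholder, here&#8217;s a minimal sketch of what such a tweak might look like. The helper name <code>corr.cols.race</code> and the toy data frame are hypothetical; the only things assumed from the code above are a <code>race</code> column and the integer position columns:</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical helper: rank correlation between two named columns, one row per race
corr.cols.race = function(df, x, y, cmethod='spearman'){
  out = data.frame()
  for (i in levels(factor(df$race))){
    d = df[df$race==i, ]
    ct = cor.test(d[[x]], d[[y]], method=cmethod, alternative='two.sided')
    out = rbind(out, data.frame(race=i, corr=unname(ct$estimate), pval=ct$p.value))
  }
  out
}

# Made-up data for two races, reusing column names from the code above
toy = data.frame(
  race = rep(c('A','B'), each=5),
  p3pos.int = c(1,2,3,4,5, 1,2,3,4,5),
  qpos.int  = c(1,2,3,4,5, 5,4,3,2,1)
)

res = corr.cols.race(toy, 'p3pos.int', 'qpos.int')
res</pre>
<p>Passing the column names in as strings means the same function could be pointed at P3 versus quali, quali versus grid, and so on, without hardwiring a particular row of the correlation matrix.</p>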
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/02/09/f1stats-correlations-between-qualifying-grid-and-race-classisification/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f1vnascarcorrelation.png" medium="image">
			<media:title type="html">F1VNASCARcorrelation</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f12009gridfinalcorr.png" medium="image">
			<media:title type="html">f12009gridfinalcorr</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f12009gridracecorrspearmanredvkendallblue.png" medium="image">
			<media:title type="html">f12009gridracecorrspearmanredvkendallblue</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f1cirr2006_12.png" medium="image">
			<media:title type="html">f1cirr2006_12</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f12006_12boxplotbycct.png" medium="image">
			<media:title type="html">f12006_12boxplotbycct</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f12006_12corrbyyear.png" medium="image">
			<media:title type="html">f12006_12corrbyyear</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2013/02/f12006_12processional.png" medium="image">
			<media:title type="html">f12006_12processional</media:title>
		</media:content>
	</item>
		<item>
		<title>Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers</title>
		<link>http://blog.ouseful.info/2013/01/31/sparqling-with-r/</link>
		<comments>http://blog.ouseful.info/2013/01/31/sparqling-with-r/#comments</comments>
		<pubDate>Thu, 31 Jan 2013 11:56:23 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Rstats]]></category>
		<category><![CDATA[Thinkses]]></category>
		<category><![CDATA[sparql]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9744</guid>
		<description><![CDATA[A handful of open Linked Data posts have appeared through my feeds in the last couple of days, including (via R-bloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and then Leigh Dodds&#8217; Brief Review of the Land Registry Linked Data. I was going to post a [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.ouseful.info&#038;blog=325417&#038;post=9744&#038;subd=ouseful&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A handful of open Linked Data posts have appeared through my feeds in the last couple of days, including (via R-bloggers) <a href="http://www.programmingr.com/content/sparql-with-r/">SPARQL with R in less than 5 minutes</a>, which shows how to query US data.gov Linked Data, and then Leigh Dodds&#8217; <a href="http://blog.ldodds.com/2013/01/29/a-brief-review-of-the-land-registry-linked-data/">Brief Review of the Land Registry Linked Data</a>.</p>
<p>I was going to post a couple of examples merging those two posts &#8211; showing how to access Land Registry data via Leigh&#8217;s example queries in R, then plotting some of the results using ggplot2 &#8211; but another post of Leigh&#8217;s today, <a href="http://blog.ldodds.com/2013/01/30/sparql-doc/">SPARQL-doc</a>, which describes a simple convention for documenting individual SPARQL queries, has sparked another thought&#8230;</p>
<p>For some time I&#8217;ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question &#8211; it can get a dataset to reveal something valuable that may otherwise have lain hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, and in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.</p>
<p>(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the &#8220;query source code&#8221; to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)</p>
<p>A few other things were also in mind as I pondered Leigh&#8217;s SPARQL-doc post: another post via R-bloggers, <a href="http://pirategrunt.com/2013/01/30/building-a-package-in-rstudio-is-actually-very-easy/">Building a package in RStudio is actually very easy</a> (which describes how to package a set of R files for distribution via github); <a href="http://www.asdfree.com/">asdfree (analyze survey data for free)</a>, a site that &#8220;announces obsessively-detailed instructions to analyze us government survey data with free tools&#8221; (and which includes R bundles to get you started quickly&#8230;); the resource listing <a href="http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html">Documentation for package ‘datasets’ version 2.15.2</a>, which describes a bundled package of datasets for R; and the <a href="http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison">Linked Data API</a>, which sought to provide a simple RESTful API over SPARQL endpoints. Putting them together, I wondered the following:</p>
<p><em>How about developing and sharing commented <strong>query libraries</strong> around Linked Data endpoints that could be used in arbitrary Linked Data clients?</em></p>
<p>(By &#8220;Linked Data clients&#8221;, I mean different user agent contexts. So for example, calling a query from Python, or R, or <a href="http://blog.ouseful.info/2010/02/17/using-data-from-linked-data-datastores-the-easy-way/">Google Spreadsheets</a>.) That&#8217;s it&#8230; Simple.</p>
<p>One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and &#8220;compile&#8221; a programme library/set of &#8220;API&#8221; calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).</p>
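<p>By way of a minimal sketch of the R side of that, assuming a made-up convention of one query per <code>.rq</code> file with <code>{{name}}</code> style placeholders for parameters (the file name, the placeholder syntax and the <code>loadQueryLibrary</code> helper are all hypothetical), a helper library might &#8220;compile&#8221; a directory of queries into a list of functions, each of which returns the filled-in query text:</p>
<pre class="brush: r; title: ; notranslate"># Set up a throwaway query directory containing one illustrative query file
qdir = file.path(tempdir(), 'querylib')
dir.create(qdir, showWarnings=FALSE)
writeLines(c('# Sample some triples from the endpoint (illustrative query only)',
             'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT {{limit}}'),
           file.path(qdir, 'sample_triples.rq'))

# Build one function per query file, named after the file;
# each function substitutes its named arguments into the query template
loadQueryLibrary = function(path){
  files = list.files(path, pattern='[.]rq$', full.names=TRUE)
  lib = lapply(files, function(f){
    template = paste(readLines(f), collapse='\n')
    function(...){
      args = list(...)
      q = template
      for (n in names(args)){
        q = gsub(paste0('{{', n, '}}'), as.character(args[[n]]), q, fixed=TRUE)
      }
      q
    }
  })
  names(lib) = sub('[.]rq$', '', basename(files))
  lib
}

lib = loadQueryLibrary(qdir)
q = lib$sample_triples(limit=10)
cat(q)</pre>
<p>The returned string could then be handed to whatever SPARQL client is being used to talk to the endpoint; nothing here actually runs the query, and a real helper would also want to escape or validate the parameter values rather than pasting them in blindly.</p>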
<p>(This reminds me of a Twitter exchange with Nick Jackson/@jacksonj04 a couple of days ago around &#8220;self-assembling&#8221; API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. <a href="http://developers.helloreverb.com/swagger/">Swagger</a> (<a href="http://www.slideshare.net/fehguy/swagger-for-startups">presentation</a>), which I haven&#8217;t had time to look at yet.)</p>
<p>The idea, then, is this:</p>
<ol>
<li>Define a simple file format for declaring documented SPARQL queries</li>
<li>Define a simple packaging format for bundling separate SPARQL queries</li>
<li>The simply packaged set of queries defines a simple &#8220;raw query&#8221; API over a Linked Data dataset</li>
<li>Describe a simple protocol for creating programming language specific library wrappers around the API from the query bundle package.</li>
</ol>
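<p>As a rough sketch of the first step, suppose each query file carries its documentation in leading comment lines. The <code>@title</code> and <code>@param</code> tags used here are an assumed convention for the sake of the example, not necessarily the ones SPARQL-doc itself defines, and <code>getTag</code> is a hypothetical helper. A minimal manifest entry could then be pulled out of the query text itself:</p>
<pre class="brush: r; title: ; notranslate"># Hypothetical annotated query; the '@title' and '@param' tags are an assumed convention
qtext = c('# @title Sample triples',
          '# @param limit maximum number of rows to return',
          'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT {{limit}}')

# Pull the value(s) of a given tag out of the comment lines
getTag = function(lines, tag){
  pat = paste0('^# @', tag, ' +')
  sub(pat, '', lines[grepl(pat, lines)])
}

# Minimal manifest entry: a title, plus the declared parameter names and descriptions
params = getTag(qtext, 'param')
manifest = list(
  title = getTag(qtext, 'title'),
  params = data.frame(name = sub(' .*$', '', params),
                      description = sub('^[^ ]+ +', '', params),
                      stringsAsFactors = FALSE)
)
manifest</pre>
<p>A wrapper generator in any language could read the same comment lines, which is part of the appeal of keeping the documentation inside the query file rather than alongside it.</p>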
<p>So.. I guess two questions arise: 1) would this be useful? 2) how hard could it be?</p>
<p>[See also: @ldodds again, on <a href="http://blog.ldodds.com/2013/01/31/publishing-sparql-queries-and-documentation-using-github/">Publishing SPARQL queries and documentation using github</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2013/01/31/sparqling-with-r/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>
	</item>
	</channel>
</rss>
