OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Rstats’ Category

Quoting Tukey on Visual Storytelling with Data

Time was when I used to be a reasonably competent scholar, digging into the literature chasing down what folk actually said, and chasing forward to see whether claims had been refuted. Then I fell out of love with the academic literature – too many papers that said nothing, too many papers that contained errors, too many papers…

…but as we start production on a new OU course on “data”, I’m having to get back into the literature so I can defuse any claims that what I want to say is wrong by showing that someone else has said it before (which, if what is said has been peer reviewed, makes it right…).

One thing I’d forgotten about chasing thought lines through the literature was that at times it can be quite fun… and that it can quite often turn up memorable, or at least quotable, lines.

For example, last night I had a quick skim through some papers by folk hero in the statistics world, John Tukey, and turned up the following:

The habit of building one technique on another — of assembling procedures like something made of erector-set parts — can be especially useful in dealing with data. So too is looking at the same thing in many ways or many things in the same way; an ability to generalize in profitable ways and a liking for a massive search for order. Mathematicians understand how subtle assumptions can make great differences and are used to trying to trace the paths by which this occurs. The mathematician’s great disadvantage in approaching data is his—or her—attitude toward the words “hypothesis” and “hypotheses”.

When you come to deal with real data, formalized models for its behavior are not hypotheses in the mathematician’s sense… . Instead these formalized models are reference situations—base points, if you like — things against which you compare the data you actually have to see how it differs. There are many challenges to all the skills of mathematicians — except implicit trust in hypotheses — in doing just this.

Since no model is to be believed in, no optimization for a single model can offer more than distant guidance. What is needed, and is never more than approximately at hand, is guidance about what to do in a sequence of ever more realistic situations. The analyst of data is lucky if he has some insight into a few terms of this sequence, particularly those not yet mathematized.

Picturing of Data

Picturing of data is the extreme case. Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best at is revealing the unanticipated; crucially, often as a way of making it easier to perceive and understand things that would otherwise be painfully complex. These are the important uses of pictures.

We can, and too often do, use picturing unimportantly, often wastefully, as a way of supporting the feeble in heart in their belief that something we have just found is really true. For this last purpose, when and if important, we usually need to look at a summary.

Sometimes we can summarize the data neatly with a few numbers, as when we report:
– a fitted line—two numbers,
– an estimated spread of the residuals around the “best” line—one more number,
– a confidence interval for the slope of the “best” line—two final numbers.

When we can summarize matters this simply in numbers, we hardly need a picture and often lose by going to it. When the simplest useful summary involves many more numbers, a picture can be very helpful. To meet our major commitment of asking what lies beyond, in the example asking “What is happening beyond what the line describes?”, a picture can be essential.

The main tasks of pictures are then:
– to reveal the unexpected,
– to make the complex easier to perceive.

Either may be effective for that which is important above all: suggesting the next step in analysis, or offering the next insight. In doing either of these there is much room for mathematics and novelty.

How do we decide what is a “picture” and what is not? The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph, or is a list of a few numbers is not important. Tangibility is important—what we strive for most.

[Tukey, John W. "Mathematics and the picturing of data." In Proceedings of the international congress of mathematicians, vol. 2, pp. 523-531. 1975.]

Wonderful!
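(As a quick aside, the “few numbers” summary Tukey describes there – fitted line, residual spread, slope confidence interval – drops straight out of a simple model fit in R; a minimal sketch on made-up data, nothing more:)

#Tukey's "few numbers" summary of a simple fit, on illustrative data only
x = 1:20
y = 3 + 0.5 * x + rnorm(20, sd = 1)
fit = lm(y ~ x)

coef(fit)           #a fitted line - two numbers (intercept and slope)
summary(fit)$sigma  #estimated spread of the residuals - one more number
confint(fit, 'x')   #confidence interval for the slope - two final numbers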

Or how about these quotes from Tukey, John W. “Data-based graphics: visual display in the decades to come.” Statistical Science 5, no. 3 (1990): 327-339?

Firstly, on exploratory versus explanatory graphics:

I intend to treat making visual displays as something done by many people who want to communicate – often, on the one hand, to communicate identified phenomena to others, and often, on the other, to communicate unidentified phenomena to themselves. This broad clientele needs a “consumer product,” not an art course. To focus on a broad array of users is not to deny the existence of artists of visual communication, only to recognize how few they are and how small a share in the total volume of communication they can contribute. For such artists, very many statements that follow deserve escape clauses or caveats.

More thoughts on exploration and explanation (transfer of recognition), as well as a distinction between exploration and prospecting:

We all need to be clear that visual display can be very effective in serving two quite different functions, but only if used in correspondingly different ways. On the one hand, it can be essential in helping – or, even, in permitting – us to search in some data for phenomena, just as a prospector searches for gold or uranium.

Our task differs from the usual prospector’s task, in that we are concerned both with phenomena that do occur and with those that might occur but do not. On the other hand, visual display can be very helpful in transferring (to reader, viewer or listener) a recognition of the appearances that indicate the phenomena that deserve report. Indeed, when sufficient precomputation drives appropriately specialized displays, visual display can even also convey the statistical significances or non-significances of these appearances.

There is no reason why a good strategy for prospecting will also be a good strategy for transfer. We can expect some aspects of prospecting strategy (and most techniques) to carry over, but other aspects may not. (We say prospecting because we are optimistic that we may in due course have good lists of possible phenomena. If we do not know what we seek, we are “exploring” not “prospecting.”)

One major difference is in prospecting’s freedom to use multiple pictures. If it takes 5 or 10 kinds of pictures to adequately explore one narrow aspect of the data, the only question is: Will 5 or 10 pictures be needed, or can we condense this to 3 or 4 pictures, without appreciable loss? If “yes” we condense; if “no” we stick to the 5 or 10.

If it takes 500 to 1000 (quite different) pictures, however, our choice can only be between finding a relatively general way to do relatively well with many fewer pictures and asking the computer to sort out some number, perhaps 10 or 20 pictures of “greatest interest.”

In doing transfer, once we have one set of pictures to do what is needed, economy of paper (or plastic) and time (to mention or to read) push even harder toward “no more pictures than needed.” But, even here, we must be very careful not to insist, as a necessity not a desideratum, that a single picture can do it all. If it takes two pictures to convey the message effectively, we must use two.

For prospecting, we will want a bundle of pictures, probably of quite different kinds, so chosen that someone will reveal the presence of any one of possible phenomena, of potentially interesting behaviors, which will often have to be inadequately diverse. Developing, and improving, bundles of pictures for selected combinations of kinds of aspects and kinds of situations will be a continuing task for a long time.

For transfer, we will need a few good styles to transfer each phenomenon of possible importance, so that, when more than one phenomenon deserves transfer, we can choose compatible styles and try to transfer two, or even all, of these phenomena in a single picture. (We can look for the opportunity to do this, but, when we cannot find it, we will use two, or more, pictures as necessary.)

For prospecting, we look at long lists of what might occur – and expect to use many pictures. For transfer, we select short lists of what must be made plain – and use as few pictures as will serve us well.

Then on the power of visual representations, whilst recognising they may not always be up to the job…:

The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why, that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table, which can easily provide so much more precision. (Here, as elsewhere, artists may deserve an escape clause.)

On design:

Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.

(For a great example of how to progressively refine a graphic to support the making of a particular point, see this Storytelling With Data post on multifaceted data and story.)

Tukey, who we must recall was writing at a time when powerful statistical graphics tools such as ggplot were yet to be implemented, also suggests that lessons are to be learned from graphic design for the production of effective statistical charts:

The art of statistical graphics was for a long time a pen-and-pencil cottage industry, with the top professionals skilled with the drafting or mapping pen. In the meantime, graphic designers, especially for books, have had access to different sorts of techniques (the techniques of graphic communication), such as grays of different screen weights, against which, for instance, both white and black lines (and curves) are effective. They also have a set of principles shared in part with the Fine Arts (some written down by Leonardo da Vinci). I do not understand all this well enough to try to tell you about “visual centers” and how attention usually moves when one looks at a picture, but I do know enough to find this area important – and to tell you that more of us need to learn a lot more about it.

Data – what is it good for?

Almost everything we do with data involves comparison – most often between two or more values derived from the data, sometimes between one value derived from the data and some mental reference or standard. The dedication of Richard Hamming’s book on numerical analysis reads “The purpose of computation is insight, not numbers.” We need a book on visual display that at least implies “The purpose of display is comparison (recognition of phenomena), not numbers.”

Tukey also encourages us to think about what data represents, and how it is represented:

Much of what we want to know about the world is naturally expressed as phenomena, as potentially interesting things that can be described in non numerical words. That an economic growth rate has been declining steadily throughout President X’s administration, for example, is a phenomenon, while the fact that the GNP has a given value is a number. With exceptions like “I owe him 27 dollars!” numbers are, when we look deeply enough, mainly of interest because they can be assembled, often only through analysis, to describe phenomena. To me phenomena are the main actors, numbers are the supporting cast. Clearly we most need help with the main actors.

If you really want numbers, presumably for later assembly into a phenomenon, a table is likely to serve you best. The graphic map of Napoleon’s incursion into Russia that so stirs Tufte’s imagination and admiration does quite well in showing the relevant phenomena, in giving the answers to “About where?”, “About when?” and “With roughly what fraction of the original army left?” It serves certain phenomena well. But if we want numbers, we can do better either by reading the digits that may be attached to the graphic – a simple but often effective form of table – or by going to a conventional table.

The questions that visual display (in some graphic mode) answers best are phenomenological (in the sense of the first sentence of this section). For instance:

* Is the value small, medium or large?
* Is the difference, or change, up, down or neutral?
* Is the difference, or change, small, medium or large?
* Do the successive changes grow, shrink or stay roughly constant?
* What about change in ratio terms, perhaps thought of as percent of previous?
* Does the vertical scatter change, as we move from left to right?
* Is the scatter pattern doughnut-shaped?

One way that we will enhance the usefulness of visual display is to find new phenomena of potential interest and then learn how to make displays that will be likely to reveal them, when they are present.

The absence of a positive phenomenon is itself a phenomenon! Such absences as:

* the values are all about the same
* there does not seem to be any definite curvature
* the vertical scatter does not seem to change, as we go from left to right!

are certainly potentially interesting. (We can all find instances where they are interesting.) Thus they are, honestly, phenomena in themselves. We need to be able to view apparent absence of specific phenomena effectively as well as noticing them when they are present! This is one of the reasons why fitting scatter plots with summarizing devices like middle traces (Tukey, 1977a, page 279 ff.) can be important.

Phenomena are also picked up later in the paper:

A graph or chart should not be just another form of table, in which we can look up the facts. If it is to do its part effectively, its focus – or so I believe – will have to be one or more phenomena.

Indeed, the requirement that we can directly read values from a chart seems to be something Tukey takes issue with:

As one who denies “reading off numbers” as the prime purpose of visual display, I can only denounce evaluating displays in terms of how well (given careful study) people read numbers off. If such an approach were to guide us well, it would have to be a very unusual accident.

He even goes so far as to suggest that we might consider being flexible in the way we geometrically map from measurement scales to points on a canvas, moving away from proportionality if that helps us see a phenomenon better:

The purpose of display is to make messages about phenomena clear. There is no place for a doctrinaire approach to “truth in geometry.” We must be honest and say what we did, but this need not mean plotting raw data.

The point is that we have a choice, not only in (x, y)-plots, but more generally. Planned disproportionality needs to be a widely-available option, one that requires the partnership of computation and display.

Tukey is also willing to rethink how we use familiar charts:

Take simple bar charts as an example, which I would define much more generally than many classical authors. Why have they survived? Not because they are geometrically true, and not because they lead to good numerical estimates by the viewer! In my thoughts, their virtue lies in the fact that we can all compare two bars, perhaps only roughly, in two quite different ways, both “About how much difference?” and “About what ratio?” The latter, of course, is often translated into “About how much percent change?” (Going on to three or more successive bars, we can see globally whether the changes in amount are nearly the same, but asking the same question about ratios – rather than differences – requires either tedious assessment of ratios between adjacent bars for one adjacent pair after another…

So – what other Tukey papers should I read?

Written by Tony Hirst

January 23, 2014 at 1:53 pm

Posted in Rstats, School_Of_Data


Using One Programming Language In the Context of Another – Python and R

Over the last couple of years, I’ve settled into using R and python as my languages of choice for doing stuff:

  • R, because RStudio is a nice environment, I can blend code and text using R markdown and knitr, ggplot2 and rCharts make generating graphics easy, and reshapers such as plyr make wrangling with data relatively easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
  • python, because it’s an all-round general purpose thing with lots of handy libraries, good for scraping, and a joy to work with in IPython notebook…

Sometimes, however, you know – or remember – how to do one thing in one language that you’re not sure how to do in another. Or you find a library that is just right for the task at hand, but it’s in the other language to the one in which you’re working, and routing the data out and back again can be a pain.

How handy would it be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven’t tried any of these recipes yet…):

Using R inside Python Programs

Whilst python has a range of plotting tools available for it, such as matplotlib, I haven’t found anything quite as expressive as R’s ggplot2 (there is a python port of ggplot underway but it’s still early days and the syntax, as well as the functionality, is still far from complete as compared to the original [though not as far as it was, given the recent update;-)]). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?

The Rpy and Rpy2 libraries support exactly that, allowing you to run R code within a python programme. For an example, see this Example of using ggplot2 from IPython notebook.

There also seems to be some magic help for running R in IPython notebooks and some experimental integration work going on in pandas: pandas: rpy2 / R interface.

(See also: ggplot2 in Python: A major barrier broken.)

Using python Inside R

Whilst one of the things I often want to do in python is plot R style ggplots, one of the hurdles I often encounter in R is getting the data in there in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and Windows support appears to be moot. What would be nice would be for RStudio to include some magic, or be able to support python based chunks…

(See also: Calling Python from R with rPython.)
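For what it’s worth, here’s roughly the sort of thing I have in mind – though note this is an untested sketch based on my reading of the rPython docs (python.exec()/python.get()), so treat it as provisional:

#Untested sketch - based on my reading of the rPython docs; check before relying on it
library(rPython)

#Run a toy bit of python - imagine a scraper or API wrapper call here instead
python.exec("
scraped = [{'name': 'A', 'value': 1}, {'name': 'B', 'value': 2}]
")

#Pull the python variable back into R (as a list of lists) and coerce to a data frame
scraped = python.get("scraped")
df = do.call(rbind, lapply(scraped, as.data.frame))
df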

(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists use code for – as a tool that comes to hand and that can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)

Written by Tony Hirst

January 22, 2014 at 12:11 pm

Posted in Infoskills, Rstats


Setting Axis Limits on ggplot Charts

I’ve been doodling a chart in R/ggplot, using geom_text() to generate a labelled scatterplot.

The chart actually builds up several layers using different datasets, so it’s not obvious how to set the ranges cleanly: I know the lower bound I want for the y-axis (y=0), but I want to let the upper bound float.

There’s also an issue with the labels overflowing the edges left and right.

[Figure: exampleChart_badLimits]

So here are a couple of lines to make everything better (chart is in g):

#Find the current ymax value for upper bound
#(via http://stackoverflow.com/questions/7705345/how-can-i-extract-plot-axes-ranges-for-a-ggplot2-object#comment24184444_8167461 )
gy=ggplot_build(g)$panel$ranges[[1]]$y.range[2]
g=g+ylim(0,gy)

#Handle the overflow by expanding the x-axis
g=g+scale_x_continuous(expand=c(0.1,0))

[Figure: plot_neater]
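(In passing: depending on your version of ggplot2, you may be able to get the same “fixed lower bound, floating upper bound” effect without digging into ggplot_build() at all, by passing NA as a limit – a sketch I haven’t tested against every version:)

#Possible alternative (version dependent) - an NA limit lets that end of the axis float
g=g+ylim(0,NA)

#Or, in more recent ggplot2, zoom the view without dropping out-of-range points
g=g+coord_cartesian(ylim=c(0,NA))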

Written by Tony Hirst

December 3, 2013 at 12:41 pm

Posted in Rstats

Data Textualisation – Making Human Readable Sense of Data

A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)

An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.

Here’s a quick example of the sort of thing I mean – the generation of this piece of text:

The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.

from a data table that can be sliced like this:

[Figure: slicing the nomis JSA figures data table]

In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.

The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.

There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.

There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements (“up 94” is a representation (both in the sense of rep-resentation and re-presentation) of a month on month change of +94; “down 377” is generated mechanically from a year on year comparison).

Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.

The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis job figures.)
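(To give a flavour of how mechanical the mechanical text can be, the JSA sentence above needs little more than a templated string plus a sign-dependent word choice – a simplified sketch, with made-up column names rather than the actual nomis ones:)

#Simplified sketch - column names and layout are illustrative, not the actual nomis ones
jsa = data.frame(month = c('October, 2012', 'September, 2013', 'October, 2013'),
                 count = c(3158, 2687, 2781),
                 stringsAsFactors = FALSE)

#Describe a change in words: direction plus absolute size
describeChange = function(current, previous)
  paste(ifelse(current >= previous, 'up', 'down'), abs(current - previous))

latest = jsa$count[3]; lastMonth = jsa$count[2]; lastYear = jsa$count[1]

paste0("The total number of people claiming Job Seeker's Allowance (JSA) ",
       "on the Isle of Wight in October was ", latest,
       ", ", describeChange(latest, lastMonth), " from ", lastMonth,
       " in ", jsa$month[2], ", and ", describeChange(latest, lastYear),
       " from ", lastYear, " in ", jsa$month[1], ".")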

PS why “data textualisation“, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is a more general term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting, but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…

Written by Tony Hirst

November 18, 2013 at 4:36 pm

Local Council Spending Data – Time Series Charts

In What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? I started wondering about the extent to which local spending transparency data might play a role in supporting consultation around new budgets.

As a first pass, I’ve popped a quick application up at http://glimmer.rstudio.com/psychemedia/iwspend2013_14/ [if that's broken, try this one] (shiny code here). You can pass form items in via the URL (except to set the Directorate – oops!), and also search using regular expressions. NOTE – there’s a little bug: you still need to hit the Search button to actually run the search and get it to show any data; and selecting All directorates with no filter terms can be a bit slow to display anything…

Examples:

- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?expensesType=(oil)|(gas)|(electricity)
- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?serviceArea=mainland
- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?supplierName=capita
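(For what it’s worth, passing form items in via the URL like that boils down to reading the query string server side – something along these lines, sketched rather than lifted from the actual app code, with illustrative input ids:)

#Sketch of picking up URL query parameters in a shiny server function
#(input ids are illustrative - this is not the actual app code)
library(shiny)

shinyServer(function(input, output, session) {
  observe({
    query = parseQueryString(session$clientData$url_search)
    if (!is.null(query$expensesType))
      updateTextInput(session, 'expensesType', value = query$expensesType)
    if (!is.null(query$supplierName))
      updateTextInput(session, 'supplierName', value = query$supplierName)
  })
})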

I’ve started exploring various views over the data, but these need thinking through properly (in particular with respect to finding views that may actually be useful!).

[Figure: IW spend (music) by expenses type]

Hmm… did the budget change directorate?!

[Figure: IW spend (music) by service area]

[Figure: IW spend (music) by supplier]

I also started experimenting with some tabular views in the suppliers tab…

[Figure: IW spend (music) suppliers table 1]

[Figure: IW spend (music) suppliers table 2]

This is all very “shiny” of course, but is it useful? From these early glimpses over the data, can you think of any ways that a look at the spending data might help support budget consultations? What views over the data, in particular, might support such an aim, and what sort of stories might we be able to tell around this sort of data?

Written by Tony Hirst

November 6, 2013 at 11:27 am

Posted in Policy, Rstats


Generating d3js Motion Charts from rCharts

Remember Gapminder, the animated motion chart popularised by Hans Rosling in his TED Talks and Joy of Stats TV programme? Well it’s back on TV this week in Don’t Panic – The Truth About Population, a compelling piece of OU/BBC co-produced stats theatre featuring Hans Rosling, and a Pepper’s Ghost illusion brought into the digital age courtesy of the Musion projection system:

Whilst considering what materials we could use to support the programme, we started looking for ways to make use of the Gapminder visualisation tool that makes several appearances in the show. Unfortunately, neither Gapminder (requires Java?), nor the Google motion chart equivalent of it (requires Flash?), appear to work with a certain popular brand of tablet that is widely used as a second screen device…

Looking around the web, I noticed that Mike Bostock had produced a version of the motion chart using d3.js: The Wealth & Health of Nations. Hmmm…

Playing with that rendering on a tablet, I had a few problems when trying to highlight individual countries – the interaction interfered with an invisible date slider control – but a quick shout out to my OU colleague Pete Mitton resulted in a tweaked version of the UI with the date control moved to the side. I also added a tweak to allow specified countries to be highlighted. You can find an example here (source).

Looking at how the data was pulled into the chart, it seems to be quite a convoluted form of JSON. After banging my head against a wall for a bit, a question on Stack Overflow about how to wrangle the data from something that looked like this:

Country Region  Year    V1  V2
AAAA    XXXX    2001    12  13
BBBB    YYYY    2001    14  15
AAAA    XXXX    2002    36  56
AAAA    XXXX    1999    45  67

to something that looked like this:

[
  {"Country": "AAAA",
   "Region":"XXXX",
    "V1": [ [1999,45], [2001,12] , [2002,36] ],
    "V2":[ [1999,67], [2001,13] , [2002,56] ]
  },
  {"Country": "BBBB",
   "Region":"YYYY",
   "V1":[ [2001,14] ],
   "V2":[ [2001,15] ]
  }
]

resulted in a handy function from Ramnath Vaidyanathan that fitted the bill.

One of the reasons that I wanted to use R for the data transformation step, rather than something like Python, was that I was keen to try to get a version of the motion charts working with the rCharts library. Such is the way of the world, Ramnath is the maintainer of rCharts, and with his encouragement I had a go at getting the motion chart to work with that library, heavily cribbing from @timelyportfolio’s rCharts Extra – d3 Horizon Conversion tutorial on getting things to work with rCharts along the way.

For what it’s worth, my version of the code is posted here: rCharts_motionchart.

I put together a couple of demos that seem to work, including the one shown below that pulls data from the World Bank indicators API and then chucks it onto a motion chart…

UPDATE: I’ve made things a bit easier compared to the original recipe included in this post… we can now generate a fertility/GDP/population motion chart for a range of specified countries using data pulled directly from the World Bank development indicators API with just the following two lines of R code:

test.data=getdata.WDI(c('GB','US','ES','BD'))
rChartMotionChart(test.data,'Country','Year','Fertility','GDP','Region','Population')

It’s not so hard to extend the code to pull in other datasets, either…

Anyway, here’s the rest of the original post… Remember, it’s easier now;-) [Code: https://github.com/psychemedia/rCharts_motionchart See example/demo1.R]

To start with, here are a couple of helper functions:

require('WDI')

#A handy helper function for getting country data - this doesn't appear in the WDI package?
#---- https://code.google.com/p/google-motion-charts-with-r/source/browse/trunk/demo/WorldBank.R?r=286
getWorldBankCountries <- function(){
  require(RJSONIO)
  wbCountries <- fromJSON("http://api.worldbank.org/country?per_page=300&format=json")
  wbCountries <- data.frame(t(sapply(wbCountries[[2]], unlist)))
  wbCountries$longitude <- as.numeric(wbCountries$longitude)
  wbCountries$latitude <- as.numeric(wbCountries$latitude)
  levels(wbCountries$region.value) <- gsub("\\(all income levels\\)", "", levels(wbCountries$region.value))
  return(wbCountries)
}


#----http://stackoverflow.com/a/19729235/454773
pluck_ = function (element){
  function(x) x[[element]]
}

#' Zip two vectors
zip_ <- function(..., names = F){
  x = list(...)
  y = lapply(seq_along(x[[1]]), function(i) lapply(x, pluck_(i)))
  if (names) names(y) = seq_along(y)
  return(y)
}

#' Sort a vector based on elements at a given position
sort_ <- function(v, i = 1){
  v[sort(sapply(v, '[[', i), index.return = T)$ix]
}

library(plyr)

This next bit still needs some refactoring, and a bit of work to get it into a general form:

#I chose to have a go at putting all the motion chart parameters into a list

params=list(
  start=1950,
  end=2010,
  x='Fertility',
  y='GDP',
  radius='Population',
  color='Region',
  key='Country',
  yscale='log',
  xscale='linear',
  rmin=0,
  xmin=0

  )

##This bit needs refactoring - grab some data; the year range is pulled from the motion chart config;
##It would probably make sense to pull countries and indicators etc into the params list too?
##That way, we can start to make this block a more general function?

tmp=getWorldBankCountries()[,c('iso2Code','region.value')]
names(tmp)=c('iso2Code','Region')

data <- WDI(indicator=c('SP.DYN.TFRT.IN','SP.POP.TOTL','NY.GDP.PCAP.CD'),start = params$start, end = params$end,country=c("BD",'GB'))
names(data)=c('iso2Code','Country','Year','Fertility','Population','GDP')

data=merge(data,tmp,by='iso2Code')

#Another bit of Ramnath's magic - http://stackoverflow.com/a/19729235/454773
dat2 <- dlply(data, .(Country, Region), function(d){
  list(
    Country = d$Country[1],
    Region = d$Region[1],
    Fertility = sort_(zip_(d$Year, d$Fertility)),
    GDP = sort_(zip_(d$Year, d$GDP)),
    Population=sort_(zip_(d$Year, d$Population))
  )
})

#cat(rjson::toJSON(setNames(dat2, NULL)))

To minimise the amount of motion chart configuration, can we start to set limits based on the data values?

#This really needs refactoring/simplifying/tidying/generalising
#I'm not sure how good the range finding heuristics I'm using are, either?!
paramsTidy=function(params){
  if (!('ymin' %in% names(params))) params$ymin= signif(min(0.9*data[[params$y]]),3)
  if (!('ymax' %in% names(params))) params$ymax= signif(max(1.1*data[[params$y]]),3)
  if (!('xmin' %in% names(params))) params$xmin= signif(min(0.9*data[[params$x]]),3)
  if (!('xmax' %in% names(params))) params$xmax= signif(max(1.1*data[[params$x]]),3)
  if (!('rmin' %in% names(params))) params$rmin= signif(min(0.9*data[[params$radius]]),3)
  if (!('rmax' %in% names(params))) params$rmax= signif(max(1.1*data[[params$radius]]),3)
  params
}

params=paramsTidy(params)

This is the function that generates the rChart:

require(rCharts)

#We can probably tidy the way that the parameters are mapped...
#I wasn't sure whether to try to maintain the separation between params and rChart$params?
rChart.generator=function(params, h=400,w=800){
  rChart <- rCharts$new()
  rChart$setLib('../motionchart')
  rChart$setTemplate(script = "../motionchart/layouts/motionchart_Demo.html")

  rChart$set(
  
   countryHighlights='',
   yearMin= params$start,
   yearMax=params$end,
  
   x=params$x,
   y=params$y,
   radius=params$radius,
   color=params$color,
   key=params$key,
  
   ymin=params$ymin,
   ymax=params$ymax,
   xmin=params$xmin,
   xmax=params$xmax,
   rmin=params$rmin,
   rmax=params$rmax,
  
   xlabel=params$x,
   ylabel=params$y,
  
   yscale=params$yscale,
   xscale=params$xscale,
   
   width=w,
   height=h
 )

 rChart$set( data= rjson::toJSON(setNames(dat2, NULL)) )

 rChart
}

rChart.generator(params,w=1000,h=600)

Aside from tidying – and documenting/commenting – the code, the next thing on my to do list is to see whether I can bundle this up in a Shiny app. I made a start sketching a possible UI, but I’ve run out of time to do much more for a day or two… (I was also thinking of country checkboxes for either pulling in just that country data, or highlighting those countries.)

items=c("Fertility","GDP","Population")
names(items)=items

shinyUI(pageWithSidebar(
  headerPanel("Motion Chart demo"),
  
  sidebarPanel(
    selectInput(inputId = 'x',
                label = "X",
                choices = items,
                selected = 'Fertility'),
    selectInput(inputId = 'y',
                label = "Y",
                choices = items,
                selected = 'GDP'),
    selectInput(inputId = 'r',
                label = "Radius",
                choices = items,
                selected = 'Population')
  ),
  mainPanel(
    #The next line throws an error (a library is expected? But I don't want to use one?)
    showOutput("motionChart",'')
  )
))
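For completeness, I’m guessing the server side would look something like the following – emphasis on guessing, because I haven’t wired it up yet and the rCharts/Shiny glue (renderChart2() and friends) may well need tweaking:

#Untested sketch - assumes getdata.WDI() and rChartMotionChart() from the
#rCharts_motionchart repo, and rCharts' renderChart2()/showOutput() pairing
library(shiny)
library(rCharts)

shinyServer(function(input, output) {
  output$motionChart <- renderChart2({
    test.data = getdata.WDI(c('GB', 'US', 'ES', 'BD'))
    rChartMotionChart(test.data, 'Country', 'Year',
                      input$x, input$y, 'Region', input$r)
  })
})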

As ever, we’ve quite possibly run out of time on getting much up on the OpenLearn website by Thursday to support the programme as it airs, which is partly why I’m putting this code out now… If you manage to do anything with it that would allow folk to dynamically explore a range of development indicators over the next day or two (especially GDP, fertility, mortality, average income, income distributions (this would require different visualisations?)), we may be able to give it a plug from OpenLearn, and maybe via any tweetalong campaign that’s running as the programme airs…

If you do come up with anything, please let me know via the comments, or twitter (@psychemedia)…

Written by Tony Hirst

November 4, 2013 at 6:34 pm

Posted in OBU, Rstats


Negative Payments in Local Spending Data

In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and provide access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced payments”, that is, multiple payments where the absolute value of a negative payment matches that of an above zero payment (so, for example, -£269.72 and £269.72 would be balanced payments).

The data I’ve grabbed is Isle of Wight Council transparency (spending) data for the financial year 2012/2013. The data was pulled from the Isle of Wight Council website and cleaned using OpenRefine, broadly according to the recipe described in Using OpenRefine to Clean Multiple Documents in the Same Way, with a couple of additions: a new SupplierNameClustered column, originally created from the SupplierName column but then cleaned to remove initials (e.g. [SD]) and direct debit prefixes (DD- etc.) and clustered using a variety of clustering algorithms; and a new unique index column, created with values '_'+row.index. You can find a copy of the data here.
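(If the copy of the data you grab doesn’t already include that unique index column, an equivalent is trivial to recreate once the data has been loaded into R – that is, into the iw data frame used below:)

#Recreate the unique index column ('_' + 0-based row index) if your copy lacks it
iw$index = paste0('_', seq_len(nrow(iw)) - 1)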

To start with, let’s see how we can identify “popular” transaction amounts:

#We're going to use some rearrangement routines...
library(plyr)

#I'm loading in the data from a locally downloaded copy
iw <- read.csv("~/code/Rcode/eduDemos1/caseStudies/IWspending/IWccl2012_13_TransparencyTest4.csv")

#Display most common payment values
commonAmounts=function(df,bcol='Amount',num=5){
  head(arrange(data.frame(table(df[[bcol]])),-Freq),num)
}
ca=commonAmounts(iw)

ca
#     Var1 Freq
#1 1567.72 2177
#2 1930.32 1780
#3  1622.6 1347
#4 1998.08 1253
#5  1642.2 1227

This shows that the most common payment by value is for £1567.72. An obvious question to ask here is: does this correspond to some sort of “standard payment” or tariff? And if so, can we reconcile this against the extent of the delivery of a particular level of service, perhaps as disclosed elsewhere?

We can also generate a summary report to show what transactions correspond to this amount, or the most popular amounts.

#We can then get the rows corresponding to these common payments
commonPayments=function(df,bcol,commonAmounts){
  df[abs(df[[bcol]]) %in% commonAmounts,]
}
cp.df=commonPayments(iw,"Amount",ca$Var1)

More usefully, we might generate “pivot table” style summaries of how the popular payments break down with respect to service area, expenses type, supplier, or some other combination of factors. So for example, we can learn that there were 1261 payments of £1567.72 booked to the EF Residential Care services area, and SOMERSET CARE LTD [SB] received 231 payments of £1567.72. (Reporting by the clustered supplier name is often more useful…)

#R does pivot tables, sort of...
#We can also run summary statistics over those common payment rows.
zz1=aggregate(index ~ ServiceArea + Amount, data =cp.df, FUN="length")
head(arrange(zz1,-index))

#                       ServiceArea  Amount index
#1              EF Residential Care 1567.72  1261
#2             EMI Residential Care 1642.20   659
#3             EMI Residential Care 1930.32   645
#4              EF Residential Care 1930.32   561
#5              EF Residential Care 1622.60   530
#6 Elderly Frail Residential Income 1622.60   36

#Another way of achieving a similar result is using the plyr count() function:
head(arrange(count(cp.df, c('ServiceArea','ExpensesType','Amount')),-freq))
#That is:
zz1.1=count(cp.df, c('ServiceArea','ExpensesType','Amount'))

zz2=aggregate(index ~ ServiceArea +ExpensesType+ Amount, data =cp.df, FUN="length")
head(arrange(zz2,-index))
#                       ServiceArea                       ExpensesType  Amount
#1              EF Residential Care                Chgs from Ind Provs 1567.72
#2             EMI Residential Care                Chgs from Ind Provs 1930.32
#3             EMI Residential Care                Chgs from Ind Provs 1642.20
#4              EF Residential Care Charges from Independent Providers 1622.60
#...

zz3=aggregate(index ~ SupplierName+ Amount, data =cp.df, FUN="length")
head(arrange(zz3,-index))
#                SupplierName  Amount index
#1     REDACTED PERSONAL DATA 1567.72   274
#2     SOMERSET CARE LTD [SB] 1567.72   231
#3 ISLAND HEALTHCARE LTD [SB] 1930.32   214
#...

zz4=aggregate(index ~ SupplierNameClustered+ Amount, data =cp.df, FUN="length")
head(arrange(zz4,-index))

If I was a suspicious type, I suppose I might look for suppliers who received one of these popular amounts only once…
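(Which, for the record, is a one-liner given the plyr count() summaries used above:)

#Suppliers who received one of the "popular" amounts exactly once
onceOnly=count(cp.df, c('SupplierNameClustered','Amount'))
onceOnly[onceOnly$freq==1,]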

Let’s have a search for items declared as overpayments or refunds, perhaps as preliminary work in an investigation about why overpayments are being made…:

#Let's look for overpayments
overpayments=function(df,fcol,bcol='Amount',num=5){
  df=subset(df, grepl('((overpay)|(refund))', df[[fcol]],ignore.case=T))
  df[ order(df[,bcol]), ]
}
rf=overpayments(iw,'ExpensesType')

#View the largest "refund" transactions
head(subset(arrange(rf,Amount),select=c('Date','ServiceArea','ExpensesType','Amount')))
#        Date                           ServiceArea         ExpensesType    Amount
#1 2013-03-28 Elderly Mentally Ill Residential Care     Provider Refunds -25094.16
#2 2012-07-18                   MH Residential Care Refund of Overpaymts -24599.12
#3 2013-03-25 Elderly Mentally Ill Residential Care     Provider Refunds -23163.84

#We can also generate a variety of other reports on the "refund" transactions
head(arrange(aggregate(index ~ SupplierName+ ExpensesType+Amount, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType   Amount index
#1 THE ORCHARD HOUSE CARE HOME Provider Refunds  -434.84     8
#2 THE ORCHARD HOUSE CARE HOME Provider Refunds  -729.91     3
#3         SCIO HEALTHCARE LTD Provider Refunds  -434.84     3
#...
##Which is to say: The Orchard House Care home made 8 refunds of £434.84 and 3 of £729.91

head(arrange(aggregate(index ~ SupplierName+ ExpensesType, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType index
#1         SCIO HEALTHCARE LTD Provider Refunds    31
#2           SOMERSET CARE LTD Provider Refunds    31
#3 THE ORCHARD HOUSE CARE HOME Provider Refunds    22
#...

Another area of investigation might be “balanced” payments. Here’s one approach for finding those, based around first identifying negative payments. Let’s also refine the approach in this instance by looking for balanced payments involving the supplier with the largest number of negative payments.

##Find which suppliers were involved in most number of negative payments
#Identify negative payments
nn=subset(iw,Amount<=0)
#Count the number of each unique supplier
nnd=data.frame(table(nn$SupplierNameClustered))
#Display the suppliers with the largest count of negative payments
head(arrange(nnd,-Freq))

#                    Var1 Freq
#1 REDACTED PERSONAL DATA 2773
#2      SOUTHERN ELECTRIC  313
#3   BRITISH GAS BUSINESS  102
#4      SOMERSET CARE LTD   75
#...

Let’s look for balanced payments around SOUTHERN ELECTRIC…

#Specific supplier search
#Limit transactions to just transactions involving this supplier
se=subset(iw, SupplierNameClustered=='SOUTHERN ELECTRIC')

##We can also run partial matching searches...
#sw=subset(iw,grepl('SOUTHERN', iw$SupplierNameClustered))

#Now let's search for balanced payments
balanced.items=function(df,bcol){
  #Find the positive amounts
  positems=df[ df[[bcol]]>0, ]
  #Find the negative amounts
  negitems=df[ df[[bcol]]<=0, ]
  
  #Find the absolute unique negative amounts
  uniqabsnegitems=-unique(negitems[[bcol]])
  #Find the unique positive amounts
  uniqpositems=unique(positems[[bcol]])
  
  #Find matching positive and negative amounts
  balitems=intersect(uniqabsnegitems,uniqpositems)
  #Subset the data based on balanced positive and negative amounts
  #bals=subset(se,abs(Amount) %in% balitems)
  bals=df[abs(df[[bcol]]) %in% balitems,]
  #Group the data by sorting, largest absolute amounts first
  bals[ order(-abs(bals[,bcol])), ]
}

dse=balanced.items(se,'Amount')

head(subset(dse,select=c('Directorate','Date','ServiceArea','ExpensesType','Amount')))
#                              Directorate       Date                ServiceArea ExpensesType   Amount
#20423               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity -2770.35
#20424               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity  2770.35
#52004 Chief Executive, Schools & Learning 2013-01-30 Haylands - Playstreet Lane  Electricity  2511.20
#52008 Chief Executive, Schools & Learning 2013-01-31 Haylands - Playstreet Lane  Electricity -2511.20

We can also use graphical techniques to try to spot balanced payments.

#Crude example plot to highlight matched +/- amounts
se1a=subset(se,ServiceArea=='Ryde Town Hall' & Amount>0)
se1b=subset(se,ServiceArea=='Ryde Town Hall' & Amount<=0)

require(ggplot2)
g=ggplot() + geom_point(data=se1a, aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

In this first image, we see payments over time – the red markers are “minus” amounts on the negative payments. Notice that in some cases balanced payments seem to appear on the same day. If the y-value of a red and a black marker are the same, they are balanced in value. The x-axis is time. Where there is a period of equally spaced marks over x with the same y-value, this may represent a regular scheduled payment.

[Figure: Ryde Town Hall – energy payments to Southern Electric]

g=ggplot(se1a)+geom_point( aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+facet_wrap(~ExpensesType)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')
g=g+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

We can also facet out the payments by expense type.

[Figure: Ryde Town Hall – energy payments to Southern Electric, faceted by expenses type]

Graphical techniques can often help us spot patterns in the data that might be hard to spot just by looking at rows and columns worth of data. In many cases, it is worthwhile developing visual analysis skills to try out quick analyses “by eye” before trying to implement them as more comprehensive analysis scripts.

Written by Tony Hirst

August 17, 2013 at 11:08 pm

Posted in Rstats, School_Of_Data

