OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Polling the News…


Towards the end of last week I attended a two-day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.

Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian Open Platform content API.) But I also started to wonder about how we could map the fan-out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
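To make that concrete, here’s the sort of quick count I have in mind, sketched against the Guardian Open Platform content API (untested, and hedged accordingly: the query terms, the date range and the placeholder API key are all my own assumptions, and result totals are at best a crude proxy for “poll derived copy”):

```python
# Rough-and-ready sketch: use result counts from the Guardian content API as a
# crude proxy for how much copy mentions polls or surveys. Needs the `requests`
# library and a (free) Open Platform API key.
import requests

SEARCH_URL = "https://content.guardianapis.com/search"

def result_count(api_key, query=None, from_date="2014-01-01"):
    """Return the total number of matching articles reported by the API."""
    params = {"page-size": 1, "from-date": from_date, "api-key": api_key}
    if query:
        params["q"] = query
    return requests.get(SEARCH_URL, params=params).json()["response"]["total"]

api_key = "YOUR-API-KEY"  # placeholder
poll_articles = result_count(api_key, "poll OR survey")
all_articles = result_count(api_key)
print("Articles mentioning a poll or survey: {} of {} ({:.1%})".format(
    poll_articles, all_articles, poll_articles / float(all_articles)))
```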

This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.

[Screenshot: University of Aberdeen press releases and related news stories combined in a single timeline view]

I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.

So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).

Here’s part of the setup, showing the page URL patterns to be searched over.

[Screenshot: CSE setup, listing the sites you want to search over]

I added additional refinements to the tab that searches over the news organisations so that it only pulls out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (e.g. in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page containing an irrelevant story because a poll happens to be linked to in the sidebar).

[Screenshot: adding refinements to the news-site tab of the CSE]

From way back when, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…

Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:

[Screenshot: tuition fees poll search results from the pollsters’ sites]

And an example of results from the news orgs:

[Screenshot: tuition fees search results from the news organisations’ sites]

For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):

[Screenshot: the Churnalism Times (PR wires edition) search engine]

The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…

[Screenshot: searching the PR wires for a quoted phrase]
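If you’d rather generate that sort of exact-phrase search programmatically – to check a batch of quotes, say – the CSE’s public URL can be built directly. A quick sketch (the example quote is invented; only the cx identifier, which also turns up in the bookmarklet below, comes from the search engine itself):

```python
# Build a Churnalism Times (PR wires edition) search URL for an exact-phrase query.
from urllib.parse import quote

CSE_URL = "https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q="

def pr_search_url(snippet):
    # wrap the snippet in double quotes so the CSE does an exact-phrase search
    return CSE_URL + quote('"' + snippet + '"')

print(pr_search_url("we are committed to delivering value for money"))
```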

To make life easier, an old bookmarklet generator I produced way back when, on an Arcadia fellowship at the Cambridge University Library, can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.

Give it a sensible title; then this is the URL chunk you need to add:

https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q=

[Screenshot: the bookmarklet generator]

Sigh… I used to have so much fun…

PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:

javascript:(function()%7Bvar t%3Dwindow.getSelection%3Fwindow.getSelection().toString()%3Adocument.selection.createRange().text%3Bwindow.location%3D%27https%3A%2F%2Fwww.google.com%2Fcse%2Fpublicurl%3Fcx%3D016419300868826941330%3Awvfrmcn2oxc%26q%3D"%27%2Bt%2B%27"%27%3B%7D)()

PPS I’ve started to add additional search domains to the PR search engine to include political speeches.

Written by Tony Hirst

February 6, 2014 at 7:26 pm

Baudelaire, Intellectual Talisman Along the Way to Impressionism


During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.

… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … [sic] there is a rapidity of movement which calls for an equal speed of execution from the artist’. …

… Baudelaire passionately believed that it was incumbent upon living artists to document their time, recognizing the unique position that a talented painter or sculptor finds him or herself in: ‘Few men are gifted with the capacity of seeing; there are fewer still who possess the power of expression …’ … He challenged artists to find in modern life ‘the eternal from the transitory’. That, he thought, was the essential purpose of art – to capture the universal in the everyday, which was particular to their here and now: the present.

And the way to do that was by immersing oneself in the day-to-day of metropolitan living: watching, thinking, feeling and finally recording.

Will Gompertz, What Are You Looking At?, pp.28-29

Written by Tony Hirst

February 6, 2014 at 10:31 am

Posted in Anything you want


Quoting Tukey on Visual Storytelling with Data


Time was when I used to be a reasonably competent scholar, digging into the literature chasing down what folk actually said, and chasing forward to see whether claims had been refuted. Then I fell out of love with the academic literature – too many papers that said nothing, too many papers that contained errors, too many papers…

…but as we start production on a new OU course on “data”, I’m having to get back in to the literature so I can defuse any claims that what I want to say is wrong by showing that someone else has said it before (which, if what is said has been peer reviewed, makes it right…).

One thing I’d forgotten about chasing thought lines through the literature was that at times it can be quite fun… and that it can quite often turn up memorable, or at least quotable, lines.

For example, last night I had a quick skim through some papers by John Tukey, a folk hero in the statistics world, and turned up the following:

The habit of building one technique on another — of assembling procedures like something made of erector-set parts — can be especially useful in dealing with data. So too is looking at the same thing in many ways or many things in the same way; an ability to generalize in profitable ways and a liking for a massive search for order. Mathematicians understand how subtle assumptions can make great differences and are used to trying to trace the paths by which this occurs. The mathematician’s great disadvantage in approaching data is his—or her—attitude toward the words “hypothesis” and “hypotheses”.

When you come to deal with real data, formalized models for its behavior are not hypotheses in the mathematician’s sense… . Instead these formalized models are reference situations—base points, if you like — things against which you compare the data you actually have to see how it differs. There are many challenges to all the skills of mathematicians — except implicit trust in hypotheses — in doing just this.

Since no model is to be believed in, no optimization for a single model can offer more than distant guidance. What is needed, and is never more than approximately at hand, is guidance about what to do in a sequence of ever more realistic situations. The analyst of data is lucky if he has some insight into a few terms of this sequence, particularly those not yet mathematized.

Picturing of Data

Picturing of data is the extreme case. Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best at is revealing the unanticipated; crucially, often as a way of making it easier to perceive and understand things that would otherwise be painfully complex. These are the important uses of pictures.

We can, and too often do, use picturing unimportantly, often wastefully, as a way of supporting the feeble in heart in their belief that something we have just found is really true. For this last purpose, when and if important, we usually need to look at a summary.

Sometimes we can summarize the data neatly with a few numbers, as when we report:
– a fitted line—two numbers,
– an estimated spread of the residuals around the “best” line—one more number,
– a confidence interval for the slope of the “best” line—two final numbers.

When we can summarize matters this simply in numbers, we hardly need a picture and often lose by going to it. When the simplest useful summary involves many more numbers, a picture can be very helpful. To meet our major commitment of asking what lies beyond, in the example asking “What is happening beyond what the line describes?”, a picture can be essential.

The main tasks of pictures are then:
– to reveal the unexpected,
– to make the complex easier to perceive.

Either may be effective for that which is important above all: suggesting the next step in analysis, or offering the next insight. In doing either of these there is much room for mathematics and novelty.

How do we decide what is a “picture” and what is not? The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph, or is a list of a few numbers is not important. Tangibility is important—what we strive for most.

[Tukey, John W. "Mathematics and the picturing of data." In Proceedings of the international congress of mathematicians, vol. 2, pp. 523-531. 1975.]

Wonderful!

Or how about these quotes from Tukey, John W. “Data-based graphics: visual display in the decades to come.” Statistical Science 5, no. 3 (1990): 327-339?

Firstly, on exploratory versus explanatory graphics:

I intend to treat making visual displays as something done by many people who want to communicate – often, on the one hand, to communicate identified phenomena to others, and often, on the other, to communicate unidentified phenomena to themselves. This broad clientele needs a “consumer product,” not an art course. To focus on a broad array of users is not to deny the existence of artists of visual communication, only to recognize how few they are and how small a share in the total volume of communication they can contribute. For such artists, very many statements that follow deserve escape clauses or caveats.

More thoughts on exploration and explanation (transfer of recognition), as well as a distinction between exploration and prospecting:

We all need to be clear that visual display can be very effective in serving two quite different functions, but only if used in correspondingly different ways. On the one hand, it can be essential in helping – or, even, in permitting – us to search in some data for phenomena, just as a prospector searches for gold or uranium.

Our task differs from the usual prospector’s task, in that we are concerned both with phenomena that do occur and with those that might occur but do not. On the other hand, visual display can be very helpful in transferring (to reader, viewer or listener) a recognition of the appearances that indicate the phenomena that deserve report. Indeed, when sufficient precomputation drives appropriately specialized displays, visual display can even also convey the statistical significances or non-significances of these appearances.

There is no reason why a good strategy for prospecting will also be a good strategy for transfer. We can expect some aspects of prospecting strategy (and most techniques) to carry over, but other aspects may not. (We say prospecting because we are optimistic that we may in due course have good lists of possible phenomena. If we do not know what we seek, we are “exploring” not “prospecting.”)

One major difference is in prospecting’s freedom to use multiple pictures. If it takes 5 or 10 kinds of pictures to adequately explore one narrow aspect of the data, the only question is: Will 5 or 10 pictures be needed, or can we condense this to 3 or 4 pictures, without appreciable loss? If “yes” we condense; if “no” we stick to the 5 or 10.

If it takes 500 to 1000 (quite different) pictures, however, our choice can only be between finding a relatively general way to do relatively well with many fewer pictures and asking the computer to sort out some number, perhaps 10 or 20 pictures of “greatest interest.”

In doing transfer, once we have one set of pictures to do what is needed, economy of paper (or plastic) and time (to mention or to read) push even harder toward “no more pictures than needed.” But, even here, we must be very careful not to insist, as a necessity not a desideratum, that a single picture can do it all. If it takes two pictures to convey the message effectively, we must use two.

For prospecting, we will want a bundle of pictures, probably of quite different kinds, so chosen that someone will reveal the presence of any one of possible phenomena, of potentially interesting behaviors, which will often have to be inadequately diverse. Developing, and improving, bundles of pictures for selected combinations of kinds of aspects and kinds of situations will be a continuing task for a long time.

For transfer, we will need a few good styles to transfer each phenomenon of possible importance, so that, when more than one phenomenon deserves transfer, we can choose compatible styles and try to transfer two, or even all, of these phenomena in a single picture. (We can look for the opportunity to do this, but, when we cannot find it, we will use two, or more, pictures as necessary.)

For prospecting, we look at long lists of what might occur – and expect to use many pictures. For transfer, we select short lists of what must be made plain – and use as few pictures as will serve us well.

Then on the power of visual representations, whilst recognising they may not always be up to the job…:

The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why, that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table, which can easily provide so much more precision. (Here, as elsewhere, artists may deserve an escape clause.)

On design:

Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.

(For a great example of how to progressively refine a graphic to support the making of a particular point, see this Storytelling With Data post on multifaceted data and story.)

Tukey, who we must recall was writing at a time when powerful statistical graphics tools such as ggplot had yet to be implemented, also suggests that lessons are to be learned from graphic design for the production of effective statistical charts:

The art of statistical graphics was for a long time a pen-and-pencil cottage industry, with the top professionals skilled with the drafting or mapping pen. In the meantime, graphic designers, especially for books, have had access to different sorts of techniques (the techniques of graphic communication), such as grays of different screen weights, against which, for instance, both white and black lines (and curves) are effective. They also have a set of principles shared in part with the Fine Arts (some written down by Leonardo da Vinci). I do not understand all this well enough to try to tell you about “visual centers” and how attention usually moves when one looks at a picture, but I do know enough to find this area important – and to tell you that more of us need to learn a lot more about it.

Data – what is it good for?

Almost everything we do with data involves comparison – most often between two or more values derived from the data, sometimes between one value derived from the data and some mental reference or standard. The dedication of Richard Hamming’s book on numerical analysis reads “The purpose of computation is insight, not numbers.” We need a book on visual display that at least implies “The purpose of display is comparison (recognition of phenomena), not numbers.”

Tukey also encourages us to think about what data represents, and how it is represented:

Much of what we want to know about the world is naturally expressed as phenomena, as potentially interesting things that can be described in non numerical words. That an economic growth rate has been declining steadily throughout President X’s administration, for example, is a phenomenon, while the fact that the GNP has a given value is a number. With exceptions like “I owe him 27 dollars!” numbers are, when we look deeply enough, mainly of interest because they can be assembled, often only through analysis, to describe phenomena. To me phenomena are the main actors, numbers are the supporting cast. Clearly we most need help with the main actors.

If you really want numbers, presumably for later assembly into a phenomenon, a table is likely to serve you best. The graphic map of Napoleon’s incursion into Russia that so stirs Tufte’s imagination and admiration does quite well in showing the relevant phenomena, in giving the answers to “About where?”, “About when?” and “With roughly what fraction of the original army left?” It serves certain phenomena well. But if we want numbers, we can do better either by reading the digits that may be attached to the graphic – a simple but often effective form of table – or by going to a conventional table.

The questions that visual display (in some graphic mode) answers best are phenomenological (in the sense of the first sentence of this section). For instance:

* Is the value small, medium or large?
* Is the difference, or change, up, down or neutral?
* Is the difference, or change, small, medium or large?
* Do the successive changes grow, shrink or stay roughly constant?
* What about change in ratio terms, perhaps thought of as percent of previous?
* Does the vertical scatter change, as we move from left to right?
* Is the scatter pattern doughnut-shaped?

One way that we will enhance the usefulness of visual display is to find new phenomena of potential interest and then learn how to make displays that will be likely to reveal them, when they are present.

The absence of a positive phenomenon is itself a phenomenon! Such absences as:

* the values are all about the same
* there does not seem to be any definite curvature
* the vertical scatter does not seem to change, as we go from left to right!

are certainly potentially interesting. (We can all find instances where they are interesting.) Thus they are, honestly, phenomena in themselves. We need to be able to view apparent absence of specific phenomena effectively as well as noticing them when they are present! This is one of the reasons why fitting scatter plots with summarizing devices like middle traces (Tukey, 1977a, page 279 ff.) can be important.

Phenomena are also picked up later in the paper:

A graph or chart should not be just another form of table, in which we can look up the facts. If it is to do its part effectively, its focus – or so I believe – will have to be one or more phenomena.

Indeed, the requirement that we can directly read values from a chart seems to be something Tukey takes issue with:

As one who denies “reading off numbers” as the prime purpose of visual display, I can only denounce evaluating displays in terms of how well (given careful study) people read numbers off. If such an approach were to guide us well, it would have to be a very unusual accident.

He even goes so far as to suggest that we might consider being flexible in the way we geometrically map from measurement scales to points on a canvas, moving away from proportionality if that helps us see a phenomenon better:

The purpose of display is to make messages about phenomena clear. There is no place for a doctrinaire approach to “truth in geometry.” We must be honest and say what we did, but this need not mean plotting raw data.

The point is that we have a choice, not only in (x, y)-plots, but more generally. Planned disproportionality needs to be a widely-available option, one that requires the partnership of computation and display.

Tukey is also willing to rethink how we use familiar charts:

Take simple bar charts as an example, which I would define much more generally than many classical authors. Why have they survived? Not because they are geometrically true, and not because they lead to good numerical estimates by the viewer! In my thoughts, their virtue lies in the fact that we can all compare two bars, perhaps only roughly, in two quite different ways, both “About how much difference?” and “About what ratio?” The latter, of course, is often translated into “About how much percent change?” (Going on to three or more successive bars, we can see globally whether the changes in amount are nearly the same, but asking the same question about ratios – rather than differences – requires either tedious assessment of ratios between adjacent bars for one adjacent pair after another…

So – what other Tukey papers should I read?

Written by Tony Hirst

January 23, 2014 at 1:53 pm

Posted in Rstats, School_Of_Data


Using One Programming Language In the Context of Another – Python and R


Over the last couple of years, I’ve settled into using R and python as my languages of choice for doing stuff:

  • R, because RStudio is a nice environment, I can blend code and text using R markdown and knitr, ggplot2 and Rcharts make generating graphics easy, and reshapers such as plyr make wrangling with data relatively easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
  • python, because it’s an all round general purpose thing with lots of handy libraries, good for scraping, and a joy to work with in iPython notebook…

Sometimes, however, you know – or remember – how to do one thing in one language that you’re not sure how to do in another. Or you find a library that is just right for the task at hand, but it’s in the other language to the one in which you’re working, and routing the data out and back again can be a pain.

How handy would it be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven’t tried any of these recipes yet…):

Using R inside Python Programs

Whilst python has a range of plotting tools available for it, such as matplotlib, I haven’t found anything quite as expressive as R’s ggplot2 (there is a python port of ggplot underway, but it’s still early days and the syntax, as well as the functionality, is still far from complete compared to the original [though not as far as it was, given the recent update ;-)]). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?

The Rpy and Rpy2 libraries support exactly that, allowing you to run R code within a python programme. For an example, see this Example of using ggplot2 from IPython notebook.
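As a flavour of what that might look like – an untested sketch on my part, assuming rpy2 2.x and its pandas2ri conversion layer, rather than code lifted from that notebook – something along these lines should push a pandas frame into R and pull a ggplot back out as an image file:

```python
# Minimal sketch: hand a pandas data frame over to R and render it with ggplot2 via rpy2.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # auto-convert pandas DataFrames to R data.frames on assignment

df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})
ro.globalenv["df"] = df  # the frame is now visible to R code as `df`

# Build the plot in R and save it to a file we can pick up again on the python side
ro.r('library(ggplot2)')
ro.r('ggsave("scatter.png", ggplot(df, aes(x=x, y=y)) + geom_point(), width=5, height=4)')
```

Saving to a file is just one way of getting the graphic back out, of course; the notebook magics mentioned below smooth over the round trip.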

There also seems to be some magic help for running R in iPython notebooks, and some experimental integration work going on in pandas: pandas: rpy2 / R interface.

(See also: ggplot2 in Python: A major barrier broken.)

Using python Inside R

Whilst one of the things I often want to do in python is plot R style ggplots, one of the hurdles I often encounter in R is getting data in in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and Windows support appears to be moot. What would be nice would be for RStudio to include some magic, or be able to support python based chunks…

(See also: Calling Python from R with rPython.)

(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists use code for – as a tool that comes to hand and that can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)

Written by Tony Hirst

January 22, 2014 at 12:11 pm

Posted in Infoskills, Rstats


Is the UK Government Selling You Off?


Not content with selling off public services, is the government doing all it can to monetise us by means other than taxation, looking for ways of selling off aggregated data harvested from our interactions as users of public services?

For example, “Better information means better care” (door drop/junk mail flyer) goes the slogan that masks the notice that informs you of the right to opt out [how to opt out] of a system in which your care data may be sold on to commercial third parties, in a suitably anonymised form of course… (as per this, perhaps?).

The intention is presumably laudable – better health research? – but when you sell to one person you tend to sell to another… So when I saw this story – Data Broker Was Selling Lists Of Rape Victims, Alcoholics, and ‘Erectile Dysfunction Sufferers’ – I wondered whether care.data could end up going the same way?

Despite all the stories about the care.data release, I have no idea which bit of legislation covers it (thanks, reporters…not); so even if I could make sense of the legalese, I don’t actually know where to read what the legislation says the HSCIC (presumably) can do in relation to sale of care data, how much it can charge, any limits on what the data can be used for etc.

I did think there might be a clause or two in the Health and Social Care Act 2012, but if there is it didn’t jump out at me. (What am I supposed to do next? Ask a volunteer librarian? Ask my MP to help me find out which bit of law applies, and then how to interpret it, as well as game it a little to see how far the letter if not the spirit of the law could be pushed in commercially exploiting the data? Could the data make it as far as Experian, or Wonga, for example, and if so, how might it in principle be used there? Or how about in ad exchanges?)

A little more digging around the HSCIC Data flows transition model turned up some block diagrams showing how data used for commissioning could flow around, but I couldn’t find anything similar as far as sale of care.data to arbitrary third parties goes.

[Diagram: NHS commissioning data flows]

(That’s another reason to check the legislation – there may be a list of what sorts of company are allowed to access care.data for now, but the legislation may also use Henry VIII clauses or other schedule devices to define by what ministerial whim additional recipients or classes of recipient can be added to the list…)

What else? Over on the Open Knowledge Foundation blog (disclaimer: I work for the Open Knowledge Foundation’s School of Data for 1 day a week), I see a guest post from Scraperwiki’s Francis Irving/@frabcus about the UK Government Performance Platform (The best data opens itself on UK Gov’s Performance Platform). The platform reports the number of applications for tax discs over time, for example, or the claims for carer’s allowance. But these headline reports make me think: there is presumably much finer grained data below the level of these reports, presumably tied (for digital channel uptake of these services at least) to Government Gateway IDs. And to what extent is this aggregated personal data sellable? Is the release of this data any different in kind to the release of the other national statistics or personal information containing registers (such as the electoral roll) that the government publish either freely or commercially?

Time was when putting together a jigsaw of the bits and pieces of information you could find out about a person meant doing a big jigsaw with little pieces. Are we heading towards a smaller jigsaw with much bigger pieces – Google, Facebook, your mobile operator, your broadband provider, your supermarket, your government, your health service?

PS related, in the selling off stakes? Sale of mortgage style student loan book completed. Or this ill thought out (by me) post – Confused by Government Spending, Indirectly… – around government encouraging home owners to take out shared ownership deals with UK gov so it can sell that loan book off at a later date?

Written by Tony Hirst

January 21, 2014 at 12:26 pm

Revisiting Emergent Social Positioning


Prompted by an email request, I’ve revisited the code I used to generate emergent social positioning maps in Twitter, as an iPython notebook that reuses chunks of code from, as well as the virtual machine used to support, Matthew A. Russell’s Mining the Social Web (2nd edition) [code].

You can see the notebook here: emergent social positioning iPython notebook [src].

As a reminder, the social positioning maps show folk commonly followed by the followers of a particular twitter user.
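For anyone who wants the gist without firing up the notebook, the core of the recipe boils down to something like the following (a hedged sketch of the idea rather than the notebook’s actual code – the tweepy calls, sample sizes and error handling are all my own assumptions):

```python
# Sketch of the ESP idea: sample the followers of a target user, then count which
# accounts those followers commonly follow. Assumes `api` is an authenticated
# tweepy.API instance and that rate limiting is handled elsewhere.
from collections import Counter

def esp_counts(api, screen_name, follower_sample=50, top_n=20):
    follower_ids = api.followers_ids(screen_name=screen_name)[:follower_sample]
    commonly_followed = Counter()
    for fid in follower_ids:
        try:
            commonly_followed.update(api.friends_ids(user_id=fid))
        except Exception:  # protected accounts, rate limit hiccups, etc.
            continue
    return commonly_followed.most_common(top_n)
```

The accounts that bubble up to the top of that count are the ones the map positions the target user’s community around.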

Written by Tony Hirst

January 14, 2014 at 11:56 am

Posted in Anything you want


The Artist’s Role…


As far as [Duchamp] was concerned, the role in society of an artist was akin to that of a philosopher; it didn’t even matter if he or she could paint or draw. An artist’s job was not to give aesthetic pleasure – designers could do that; it was to step back from the world and attempt to make sense of, or comment on, it through the presentation of ideas that had no functional purpose other than themselves.

– Will Gompertz, What Are You Looking At? p. 10

Written by Tony Hirst

January 13, 2014 at 10:12 pm

Posted in Anything you want

