OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Socially Mapping the Isle of Wight – @onthewight Twitter ESP


Having dusted off and reversioned my Twitter emergent social positioning (ESP) code, and in advance of starting to think about what sorts of analyses I might properly start running, here’s a look back at what I was doing before in terms of charting where particular Twitter accounts sat amongst the other accounts commonly followed by the target account’s followers.

No longer having a whitelisted Twitter API key means the sample sizes I'm running are smaller than they used to be, so maybe that's a good thing because it means I'll have to start working properly on the methodology…

Anyway, here’s a quick snapshot of where I think hyperlocal news bloggers @onthewight might be situated on Twitter…

[Figure: @onthewight Twitter ESP map]

The view aims to map out accounts that are followed by 10 or more people from a sample of about 200 or so followers of @onthewight. The network is laid out according to a force directed layout algorithm with a dash of aesthetic tweaking; nodes are coloured based on community grouping as identified using the Gephi modularity statistic (which has its issues, but it's a start). The nodes are sized in the first case according to PageRank.

The quick take home from this original sketchmap is that there are a bunch of key information providers in the middle group, local accounts on the left, and slebs on the right.
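For the record, here's roughly what the data collection step behind the map looks like. This is a minimal sketch, assuming the tweepy library (pre-v4 method names) and networkx, with placeholder credentials and no rate-limit housekeeping beyond tweepy's built-in wait; the exported GEXF file is then loaded into Gephi for the layout, modularity and PageRank steps described above.

# Sketch: sample some followers of the target account, find the accounts they
# commonly follow, and export the graph for Gephi. Assumes tweepy (pre-v4
# method names) and networkx; credentials are placeholders.
from collections import Counter
import tweepy
import networkx as nx

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

TARGET = 'onthewight'
SAMPLE_SIZE = 200
THRESHOLD = 10  # keep accounts followed by at least this many of the sample

# Grab a sample of the target account's followers
followers = api.followers_ids(screen_name=TARGET)[:SAMPLE_SIZE]

# For each sampled follower, grab the accounts they follow ("friends")
friend_lists = {}
for uid in followers:
    try:
        friend_lists[uid] = api.friends_ids(user_id=uid)
    except tweepy.TweepError:
        continue  # skip protected or otherwise inaccessible accounts

# Count how many of the sample follow each account; keep the commonly followed ones
counts = Counter(f for friends in friend_lists.values() for f in friends)
common = set(f for f, c in counts.items() if c >= THRESHOLD)

# Build a directed follower -> commonly-followed-account graph and save it for
# Gephi, where the layout, modularity and PageRank steps are applied
G = nx.DiGraph()
for uid, friends in friend_lists.items():
    for f in friends:
        if f in common:
            G.add_edge(uid, f)

nx.write_gexf(G, 'onthewight_esp.gexf')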

If we look more closely at the key information providers, they seem to make sense…

[Figure: key information providers on the Isle of Wight]

These folk are likely to be either competitors of @onthewight, or prospects who might be worth approaching for advertising on the basis that @onthewight's followers also follow the target account. (Of course, you could argue that because they share followers, there's no point also using @onthewight as a channel. Except that @onthewight also has a popular blog presence, which would be where any ads were placed; the @onthewight Twitter feed is generally news announcements and live reporting.) A better case could probably be made by looking at the follower profiles of the prospects, along with the ESP maps for the prospects, to see how well the audiences match, what additional reach could be offered, and so on.

A broad brush view over the island community is a bit more cluttered:

[Figure: broad brush view of the island community]

If we tweak the layout a little and rerun PageRank to resize the nodes (note this will no longer take into account contributions from the other communities), again using a force directed algorithm, we get a bit less of a mess, though the map is still hard to read. Arts to the top, perhaps, Cowes to the right?

[Figure: island community map, relaid out]
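(As an aside, recomputing PageRank over just one community is straightforward if the graph and community assignments are to hand outside Gephi. A hypothetical networkx sketch, assuming a graph G and a node-to-community lookup already exist:)

# Sketch: rerun PageRank within a single community, so node sizes reflect
# within-community influence only. G and community_of (node -> community id)
# are assumed to exist already, e.g. from the data collection step above.
import networkx as nx

def community_pagerank(G, community_of, community_id):
    nodes = [n for n in G if community_of.get(n) == community_id]
    H = G.subgraph(nodes)   # induced subgraph: within-community links only
    return nx.pagerank(H)   # PageRank restricted to that subgraph

# e.g. sizes = community_pagerank(G, community_of, 2)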

Again, with a bit more data, or perhaps a bit more of a think about what sort of map would be useful (and hence, what sort of data to collect), this sort of map might become useful for B2B marketing purposes on the Island. (I'm not really interested in, erm, the plebs such as myself… i.e. people rather than bizs or slebs; though a pleb interest/demographic/reach analysis would probably be the one that would be most useful to take to prospects?)

If we look at the celebrity common follows, again resized and re-laid out, we see what I guess is a typical spread (it's some time since I looked at these – not sure what the baseline is, though @stephenfry still seems to feature high up in the background radiation count).

[Figure: Isle of Wight celebrity common follows]

For bigger companies with their own marketing $, I guess this sort of map is the sort of place to look for potential celebrity endorsements to reinforce a message (folk in the sample who follow these accounts are already aware of @onthewight, because the sample is drawn from @onthewight's followers), as well as to potentially widen reach. But I guess the endorsement as reinforcement is more valuable as a legitimising thing?

Hmm…

Just got to work out what to do next, now, and how to start tightening this up and making it useful rather than just of passing interest…

PS A related chart that could be plotted using Facebook data would be to grab down all the likes of the friends of a person or company on Facebook, though I'm not sure how that would work if their account is a page as opposed to a "person"? I'm not so hot on Facebook API/permissions etc, or what sort of information page owners can get about their audience? Also, I'm not sure about the extent to which I can get likes from folk who aren't my friends or who haven't granted me app permissions? I used to be able to grab lists of people from groups and trawl through their likes, but I'm not sure default Facebook permissions make that as easy pickings now compared to a year or two ago? (The advantage of Twitter is that the friend/follow data is open on most accounts…)

Written by Tony Hirst

February 7, 2014 at 6:32 pm

Posted in Anything you want


Progress Tracking Google Docs as Tasks?


As part of a new course I'm working on, the course team has been making use of shared Google docs for working up the course proposal and "D0" (zeroth draft; key topics to be covered in each of the weekly sessions). Although the course production hasn't been approved yet, we've started drafting the actual course materials, with an agreement to share them for comment via Google docs.

The approach I've taken is to create a shared folder with the rest of the course team, and set up documents for each of the weekly sessions I've taken the lead on.

[Figure: TM351 shared folder file listing]

The documents in this folder are all available to other members of the course team – for reference and/or comment – at any time, and represent the "live"/most current version of each document I'm working on. I suspect that others in the course team may take a more cautious approach, only sharing a doc when it's in a suitable state for handover – or at least, comment – but that's fine too. My docs can of course be used that way as well – no-one has to look at them until I do "hand them over" for full comment at the end of the first draft stage.

But what if others, such as the course team chair or course manager, do want to keep a check on progress over the coming weeks?

The file listing shown above doesn't give a lot away about the state of each document – not even a file size, only when it was last worked on. So it struck me that it might be useful to have a visual indicator (such as a horizontal progress bar) of the progress on each document, so that someone looking at the listing would know whether there was any point opening a document to have a look inside at all…

…because at the current time, a lot of the docs are just stubs, identifying tasks to be done.

Progress could be measured by proxy indicators, such as file size, “page count” equivalent, or line count. In these cases, the progress meter could be updated automatically. Additional insight could be provided by associating a target line count or page length metadata element, providing additional feedback to the author about progress with respect to that target. If a document exceeds the planned length, the progress meter should carry on going, possibly with a different colour denoting the overrun.
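By way of a crude illustration of the proxy indicator idea (nothing to do with the Google Docs interface itself – just a sketch over drafts exported as plain text files, with made-up filenames and word count targets):

# Sketch: crude progress bars for a set of draft documents exported as plain
# text, using word count against a per-document target as the proxy indicator.
# Folder name, filenames and targets are made up for illustration.
import os

FOLDER = 'drafts'
TARGETS = {'week_01.txt': 4000, 'week_02.txt': 4000, 'week_03.txt': 5000}
BAR_WIDTH = 20

for fname, target in sorted(TARGETS.items()):
    path = os.path.join(FOLDER, fname)
    words = len(open(path).read().split()) if os.path.exists(path) else 0
    frac = words / float(target)
    filled = min(int(frac * BAR_WIDTH), BAR_WIDTH)
    overrun = ' (overrun)' if frac > 1 else ''
    print('{0:<14} [{1}{2}] {3:>4.0%}{4}'.format(
        fname, '#' * filled, '-' * (BAR_WIDTH - filled), frac, overrun))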

There are at least a couple of problems with this approach: documents that are being worked on may blend scruffy working notes with actual "finished" text; several versions of the same paragraph may exist as authors try out different approaches, all adding to the line count; long copied chunks from other sources may be in the text as working references; and so on.

So how about an additional piece of metadata for docs tagged as "task" type, in which a user can set a quick progress percentage estimate (a slider widget would make this easy to update) that is displayed as a bar on the file listing? Anyone checking the folder could then – at a glance – see which docs were worth looking at based on progress within the document-as-task. (Of course, having metadata available also opens up the possibility of additional mission-creep features – rulesets for generating alerts when a doc hits a particular percentage completion, for example.)

I'm not looking for more project management tools to take time away from a task, but in this case I think the simple addition of a "progress" metadata element could weave an element of project management support into this sort of workflow? (Changing the title of the doc would be another way – eg adding "(20% done)" to the title…)

Thinks: hmm, I'm procrastinating, aren't I? I should really be working on one of those docs… ;-)

Written by Tony Hirst

February 7, 2014 at 10:55 am

Posted in Infoskills

Polling the News…


Towards the end of last week I attended a two day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.

Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian Open Platform content API.) But I also started to wonder about how we could map the fan out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
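On the counting side, a back of an envelope number could probably be pulled from the Guardian Content API along the following lines. Treat this as a sketch only: you'd need your own API key, the query parameters are as I remember them, and equating "mentions a poll or survey" with "derived from a poll or survey" is obviously a gross simplification.

# Sketch: roughly what fraction of Guardian content over a period mentions a
# poll or survey? A very crude proxy for "copy derived from polls/surveys".
# Assumes the requests library and a (free) Guardian Open Platform API key.
import requests

API_KEY = 'YOUR-GUARDIAN-API-KEY'
ENDPOINT = 'http://content.guardianapis.com/search'

def total_results(query=None, from_date='2014-01-01', to_date='2014-01-31'):
    params = {'api-key': API_KEY, 'from-date': from_date,
              'to-date': to_date, 'page-size': 1}
    if query:
        params['q'] = query
    return requests.get(ENDPOINT, params=params).json()['response']['total']

all_items = total_results()
poll_items = total_results('poll OR survey')
print('{0} of {1} items mention a poll or survey ({2:.1%})'.format(
    poll_items, all_items, poll_items / float(all_items)))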

That sort of fan-out mapping is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.

[Figure: Aberdeen press release feed alongside related news stories]

I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.

So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).

Here's part of the setup, showing the page URL patterns to be searched over.

List the sites you want to search over

I added additional refinements to the tab that searches over the news organisations so that it only pulls out pages where "poll" or "survey" is mentioned. Note that if these words are indexed in the chrome around the news story (eg in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page where an irrelevant story is mentioned because a poll is linked to in the sidebar).

Add refinements

From way back when, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I'm not so sure any more…

Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:

[Figure: tuition fees poll search results]

And an example of results from the news orgs:

Tuition fees media

For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):

PR search

The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…

PR search for quote

To make life easier, an old bookmarklet generator I produced way back when on an Arcadia fellowship at the Cambridge University Library can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.

Give it a sensible title; then this is the URL chunk you need to add:

https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q=

[Figure: bookmarklet generator]

Sigh.. I used to have so much fun…

PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:

javascript:(function()%7Bvar t%3Dwindow.getSelection%3Fwindow.getSelection().toString()%3Adocument.selection.createRange().text%3Bwindow.location%3D%27https%3A%2F%2Fwww.google.com%2Fcse%2Fpublicurl%3Fcx%3D016419300868826941330%3Awvfrmcn2oxc%26q%3D"%27%2Bt%2B%27"%27%3B%7D)()
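The same URL construction is easy enough from a script too, if you want to fire off lookups for a batch of quotes rather than via a browser selection. A quick sketch (the cx value is the custom search engine id given above):

# Sketch: build the custom search engine query URL for a chunk of quoted text,
# wrapping it in double quotes for an exact-phrase search.
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2

CSE_ID = '016419300868826941330:wvfrmcn2oxc'

def churnalism_search_url(text):
    query = '"{0}"'.format(text)
    return 'https://www.google.com/cse/publicurl?cx={0}&q={1}'.format(
        CSE_ID, quote_plus(query))

print(churnalism_search_url('part of a quote from a news story'))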

PPS I’ve started to add additional search domains to the PR search engine to include political speeches.

Written by Tony Hirst

February 6, 2014 at 7:26 pm

Baudelaire, Intellectual Talisman Along the Way to Impressionism


During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.

… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … [sic] there is a rapidity of movement which calls for an equal speed of execution from the artist’. …

… Baudelaire passionately believed that it was incumbent upon living artists to document their time, recognizing the unique position that a talented painter or sculptor finds him or herself in: ‘Few men are gifted with the capacity of seeing; there are fewer still who possess the power of expression …’ … He challenged artists to find in modern life ‘the eternal from the transitory’. That, he thought, was the essential purpose of art – to capture the universal in the everyday, which was particular to their here and now: the present.

And the way to do that was by immersing oneself in the day-to-day of metropolitan living: watching, thinking, feeling and finally recording.

Will Gompertz, What Are You Looking At?, pp.28-29

Written by Tony Hirst

February 6, 2014 at 10:31 am

Posted in Anything you want


Quoting Tukey on Visual Storytelling with Data


Time was when I used to be a reasonably competent scholar, digging into the literature chasing down what folk actually said, and chasing forward to see whether claims had been refuted. Then I fell out of love with the academic literature – too many papers that said nothing, too many papers that contained errors, too many papers…

…but as we start production on a new OU course on “data”, I’m having to get back in to the literature so I can defuse any claims that what I want to say is wrong by showing that someone else has said it before (which, if what is said has been peer reviewed, makes it right…).

One thing I’d forgotten about chasing thought lines through the literature was that at times it can be quite fun… and that it can quite often turn up memorable, or at least quotable, lines.

For example, last night I had a quick skim through some papers by folk hero in the statistics world, John Tukey, and turned up the following:

The habit of building one technique on another — of assembling procedures like something made of erector-set parts — can be especially useful in dealing with data. So too is looking at the same thing in many ways or many things in the same way; an ability to generalize in profitable ways and a liking for a massive search for order. Mathematicians understand how subtle assumptions can make great differences and are used to trying to trace the paths by which this occurs. The mathematician’s great disadvantage in approaching data is his—or her—attitude toward the words “hypothesis” and “hypotheses”.

When you come to deal with real data, formalized models for its behavior are not hypotheses in the mathematician’s sense… . Instead these formalized models are reference situations—base points, if you like — things against which you compare the data you actually have to see how it differs. There are many challenges to all the skills of mathematicians — except implicit trust in hypotheses — in doing just this.

Since no model is to be believed in, no optimization for a single model can offer more than distant guidance. What is needed, and is never more than approximately at hand, is guidance about what to do in a sequence of ever more realistic situations. The analyst of data is lucky if he has some insight into a few terms of this sequence, particularly those not yet mathematized.

Picturing of Data

Picturing of data is the extreme case. Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best at is revealing the unanticipated; crucially, often as a way of making it easier to perceive and understand things that would otherwise be painfully complex. These are the important uses of pictures.

We can, and too often do, use picturing unimportantly, often wastefully, as a way of supporting the feeble in heart in their belief that something we have just found is really true. For this last purpose, when and if important, we usually need to look at a summary.

Sometimes we can summarize the data neatly with a few numbers, as when we report:
– a fitted line—two numbers,
– an estimated spread of the residuals around the “best” line—one more number,
– a confidence interval for the slope of the “best” line—two final numbers.

When we can summarize matters this simply in numbers, we hardly need a picture and often lose by going to it. When the simplest useful summary involves many more numbers, a picture can be very helpful. To meet our major commitment of asking what lies beyond, in the example asking “What is happening beyond what the line describes!”, a picture can be essential.

The main tasks of pictures are then:
– to reveal the unexpected,
– to make the complex easier to perceive.

Either may be effective for that which is important above all: suggesting the next step in analysis, or offering the next insight. In doing either of these there is much room for mathematics and novelty.

How do we decide what is a “picture” and what is not? The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph, or is a list of a few numbers is not important. Tangibility is important—what we strive for most.

[Tukey, John W. "Mathematics and the picturing of data." In Proceedings of the international congress of mathematicians, vol. 2, pp. 523-531. 1975.]

Wonderful!

Or how about these quotes from Tukey, John W. “Data-based graphics: visual display in the decades to come.” Statistical Science 5, no. 3 (1990): 327-339?

Firstly, on exploratory versus explanatory graphics:

I intend to treat making visual displays as something done by many people who want to communicate – often, on the one hand, to communicate identified phenomena to others, and often, on the other, to communicate unidentified phenomena to themselves. This broad clientele needs a “consumer product,” not an art course. To focus on a broad array of users is not to deny the existence of artists of visual communication, only to recognize how few they are and how small a share in the total volume of communication they can contribute. For such artists, very many statements that follow deserve escape clauses or caveats.

More thoughts on exploration and explanation (transfer of recognition), as well as a distinction between exploration and prospecting:

We all need to be clear that visual display can be very effective in serving two quite different functions, but only if used in correspondingly different ways. On the one hand, it can be essential in helping – or, even, in permitting – us to search in some data for phenomena, just as a prospector searches for gold or uranium.

Our task differs from the usual prospector’s task, in that we are concerned both with phenomena that do occur and with those that might occur but do not. On the other hand, visual display can be very helpful in transferring (to reader, viewer or listener) a recognition of the appearances that indicate the phenomena that deserve report. Indeed, when sufficient precomputation drives appropriately specialized displays, visual display can even also convey the statistical significances or non-significances of these appearances.

There is no reason why a good strategy for prospecting will also be a good strategy for transfer. We can expect some aspects of prospecting strategy (and most techniques) to carry over, but other aspects may not. (We say prospecting because we are optimistic that we may in due course have good lists of possible phenomena. If we do not know what we seek, we are “exploring” not “prospecting.”)

One major difference is in prospecting’s freedom to use multiple pictures. If it takes 5 or 10 kinds of pictures to adequately explore one narrow aspect of the data, the only question is: Will 5 or 10 pictures be needed, or can we condense this to 3 or 4 pictures, without appreciable loss? If “yes” we condense; if “no” we stick to the 5 or 10.

If it takes 500 to 1000 (quite different) pictures, however, our choice can only be between finding a relatively general way to do relatively well with many fewer pictures and asking the computer to sort out some number, perhaps 10 or 20 pictures of “greatest interest.”

In doing transfer, once we have one set of pictures to do what is needed, economy of paper (or plastic) and time (to mention or to read) push even harder toward “no more pictures than needed.” But, even here, we must be very careful not to insist, as a necessity not a desideratum, that a single picture can do it all. If it takes two pictures to convey the message effectively, we must use two.

For prospecting, we will want a bundle of pictures, probably of quite different kinds, so chosen that some one of them will reveal the presence of any one of possible phenomena, of potentially interesting behaviors, which will often have to be inadequately diverse. Developing, and improving, bundles of pictures for selected combinations of kinds of aspects and kinds of situations will be a continuing task for a long time.

For transfer, we will need a few good styles to transfer each phenomenon of possible importance, so that, when more than one phenomenon deserves transfer, we can choose compatible styles and try to transfer two, or even all, of these phenomena in a single picture. (We can look for the opportunity to do this, but, when we cannot find it, we will use two, or more, pictures as necessary.)

For prospecting, we look at long lists of what might occur – and expect to use many pictures. For transfer, we select short lists of what must be made plain – and use as few pictures as will serve us well.

Then on the power of visual representations, whilst recognising they may not always be up to the job…:

The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why, that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table, which can easily provide so much more precision. (Here, as elsewhere, artists may deserve an escape clause.)

On design:

Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.

(For a great example of how to progressively refine a graphic to support the making of a particular point, see this Storytelling With Data post on multifaceted data and story.)

Tukey, who we must recall was writing at a time when powerful statistical graphics tools such as ggplot were yet to be implemented, also suggests that lessons are to be learned from graphic design for the production of effective statistical charts:

The art of statistical graphics was for a long time a pen-and-pencil cottage industry, with the top professionals skilled with the drafting or mapping pen. In the meantime, graphic designers, especially for books, have had access to different sorts of techniques (the techniques of graphic communication), such as grays of different screen weights, against which, for instance, both white and black lines (and curves) are effective. They also have a set of principles shared in part with the Fine Arts (some written down by Leonardo da Vinci). I do not understand all this well enough to try to tell you about “visual centers” and how attention usually moves when one looks at a picture, but I do know enough to find this area important – and to tell you that more of us need to learn a lot more about it.

Data – what is it good for?

Almost everything we do with data involves comparison – most often between two or more values derived from the data, sometimes between one value derived from the data and some mental reference or standard. The dedication of Richard Hamming’s book on numerical analysis reads “The purpose of computation is insight, not numbers.” We need a book on visual display that at least implies “The purpose of display is comparison (recognition of phenomena), not numbers.”

Tukey also encourages us to think about what data represents, and how it is represented:

Much of what we want to know about the world is naturally expressed as phenomena, as potentially interesting things that can be described in non numerical words. That an economic growth rate has been declining steadily throughout President X’s administration, for example, is a phenomenon, while the fact that the GNP has a given value is a number. With exceptions like “I owe him 27 dollars!” numbers are, when we look deeply enough, mainly of interest because they can be assembled, often only through analysis, to describe phenomena. To me phenomena are the main actors, numbers are the supporting cast. Clearly we most need help with the main actors.

If you really want numbers, presumably for later assembly into a phenomenon, a table is likely to serve you best. The graphic map of Napoleon’s incursion into Russia that so stirs Tufte’s imagination and admiration does quite well in showing the relevant phenomena, in giving the answers to “About where?”, “About when?” and “With roughly what fraction of the original army left?” It serves certain phenomena well. But if we want numbers, we can do better either by reading the digits that may be attached to the graphic – a simple but often effective form of table – or by going to a conventional table.

The questions that visual display (in some graphic mode) answers best are phenomenological (in the sense of the first sentence of this section). For instance:

* Is the value small, medium or large?
* Is the difference, or change, up, down or neutral?
* Is the difference, or change, small, medium or large?
* Do the successive changes grow, shrink or stay roughly constant?
* What about change in ratio terms, perhaps thought of as percent of previous?
* Does the vertical scatter change, as we move from left to right?
* Is the scatter pattern doughnut-shaped?

One way that we will enhance the usefulness of visual display is to find new phenomena of potential interest and then learn how to make displays that will be likely to reveal them, when they are present.

The absence of a positive phenomenon is itself a phenomenon! Such absences as:

* the values are all about the same
* there does not seem to be any definite curvature
* the vertical scatter does not seem to change, as we go from left to right!

are certainly potentially interesting. (We can all find instances where they are interesting.) Thus they are, honestly, phenomena in themselves. We need to be able to view apparent absence of specific phenomena effectively as well as noticing them when they are present! This is one of the reasons why fitting scatter plots with summarizing devices like middle traces (Tukey, 1977a, page 279 ff.) can be important.

Phenomena are also picked up later in the paper:

A graph or chart should not be just another form of table, in which we can look up the facts. If it is to do its part effectively, its focus – or so I believe – will have to be one or more phenomena.

Indeed, the requirement that we can directly read values from a chart seems to be something Tukey takes issue with:

As one who denies “reading off numbers” as the prime purpose of visual display, I can only denounce evaluating displays in terms of how well (given careful study) people read numbers off. If such an approach were to guide us well, it would have to be a very unusual accident.

He even goes so far as to suggest that we might consider being flexible in the way we geometrically map from measurement scales to points on a canvas, moving away from proportionality if that helps us see a phenomenon better:

The purpose of display is to make messages about phenomena clear. There is no place for a doctrinaire approach to “truth in geometry.” We must be honest and say what we did, but this need not mean plotting raw data.

The point is that we have a choice, not only in (x, y)-plots, but more generally. Planned disproportionality needs to be a widely-available option, one that requires the partnership of computation and display.

Tukey is also willing to rethink how we use familiar charts:

Take simple bar charts as an example, which I would define much more generally than many classical authors. Why have they survived? Not because they are geometrically true, and not because they lead to good numerical estimates by the viewer! In my thoughts, their virtue lies in the fact that we can all compare two bars, perhaps only roughly, in two quite different ways, both “About how much difference?” and “About what ratio?” The latter, of course, is often translated into “About how much percent change?” (Going on to three or more successive bars, we can see globally whether the changes in amount are nearly the same, but asking the same question about ratios – rather than differences – requires either tedious assessment of ratios between adjacent bars for one adjacent pair after another…

So – what other Tukey papers should I read?

Written by Tony Hirst

January 23, 2014 at 1:53 pm

Posted in Rstats, School_Of_Data


Using One Programming Language In the Context of Another – Python and R


Over the last couple of years, I've settled into using R and Python as my languages of choice for doing stuff:

  • R, because RStudio is a nice environment, I can blend code and text using R markdown and knitr, ggplot2 and Rcharts make generating graphics easy, and reshapers such as plyr make wrangling with data relatively easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
  • python, because it’s an all round general purpose thing with lots of handy libraries, good for scraping, and a joy to work with in iPython notebook…

Sometimes, however, you know – or remember – how to do one thing in one language that you're not sure how to do in another. Or you find a library that is just right for the task at hand but it's in the other language to the one in which you're working, and routing the data out and back again can be a pain.

How handy would it be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven't tried any of these recipes yet…):

Using R inside Python Programs

Whilst python has a range of plotting tools available for it, such as matplotlib, I haven't found anything quite as expressive as R's ggplot2 (there is a python port of ggplot underway but it's still early days and the syntax, as well as the functionality, is still far from complete as compared to the original [though not as far as it was given the recent update;-)]). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?

The Rpy and Rpy2 libraries support exactly that, allowing you to run R code within a python programme. For an example, see this Example of using ggplot2 from IPython notebook.
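From the rpy2 documentation, a minimal round trip looks something like the following. Treat it as a sketch rather than tested code, not least because the pandas conversion helpers seem to vary between rpy2 versions:

# Sketch: push a pandas data frame into an embedded R session via rpy2 and
# plot it with ggplot2. Conversion helpers differ across rpy2 versions; this
# follows the rpy2 2.x / pandas2ri style and is untested.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()   # enable pandas <-> R data.frame conversion

df = pd.DataFrame({'x': range(10), 'y': [v * v for v in range(10)]})
ro.globalenv['df'] = pandas2ri.py2ri(df)   # hand the frame to the R workspace

# Run ordinary R/ggplot2 code against the converted frame and save the output
ro.r('''
library(ggplot2)
p <- ggplot(df, aes(x=x, y=y)) + geom_point()
ggsave('pandas_via_ggplot.png', p)
''')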

There also seems to be some magic help for running R in iPython notebooks and some experimental integration work going on in pandas: pandas: rpy2 / R interface.

(See also: ggplot2 in Python: A major barrier broken.)

Using python Inside R

Whilst one of the things I often want to do in python is plot R style ggplots, one of the hurdles I often encounter in R is getting data in in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and Windows support appears to be moot. What would be nice would be for RStudio to include some magic, or be able to support python based chunks…

(See also: Calling Python from R with rPython.)

(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists use code for – as a tool that comes to hand and that can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)

Written by Tony Hirst

January 22, 2014 at 12:11 pm

Posted in Infoskills, Rstats


Is the UK Government Selling You Off?


Not content with selling off public services, is the government doing all it can to monetise us by means other than taxation, looking for ways of selling off aggregated data harvested from our interactions as users of public services?

For example, “Better information means better care” (door drop/junk mail flyer) goes the slogan that masks the notice that informs you of the right to opt out [how to opt out] of a system in which your care data may be sold on to commercial third parties, in a suitably anonymised form of course… (as per this, perhaps?).

The intention is presumably laudable – better health research? – but when you sell to one person you tend to sell to another… So when I saw this story – Data Broker Was Selling Lists Of Rape Victims, Alcoholics, and ‘Erectile Dysfunction Sufferers’ – I wondered whether care.data could end up going the same way?

Despite all the stories about the care.data release, I have no idea which bit of legislation covers it (thanks, reporters…not); so even if I could make sense of the legalese, I don’t actually know where to read what the legislation says the HSCIC (presumably) can do in relation to sale of care data, how much it can charge, any limits on what the data can be used for etc.

I did think there might be a clause or two in the Health and Social Care Act 2012, but if there is it didn’t jump out at me. (What am I supposed to do next? Ask a volunteer librarian? Ask my MP to help me find out which bit of law applies, and then how to interpret it, as well as game it a little to see how far the letter if not the spirit of the law could be pushed in commercially exploiting the data? Could the data make it as far as Experian, or Wonga, for example, and if so, how might it in principle be used there? Or how about in ad exchanges?)

A little more digging around the HSCIC Data flows transition model turned up some block diagrams showing how data used for commissioning could flow around, but I couldn’t find anything similar as far as sale of care.data to arbitrary third parties goes.

NHS commissioning data flows

(That's another reason to check the legislation – there may be a list of what sorts of company are allowed to access care.data for now, but the legislation may also use Henry VIII clauses or other schedule devices to define by what ministerial whim additional recipients or classes of recipient can be added to the list…)

What else? Over on the Open Knowledge Foundation blog (disclaimer: I work for the Open Knowledge Foundation's School of Data for 1 day a week), I see a guest post from Scraperwiki's Francis Irving/@frabcus about the UK Government Performance Platform (The best data opens itself on UK Gov's Performance Platform). The platform reports the number of applications for tax discs over time, for example, or the claims for carer's allowance. But these headline reports make me think: there is presumably much finer grained data below the level of these reports, presumably tied (for digital channel uptake of these services at least) to Government Gateway IDs. And to what extent is this aggregated personal data sellable? Is the release of this data any different in kind to the release of the other national statistics or personal-information-containing registers (such as the electoral roll) that the government publishes either freely or commercially?

Time was when putting together a jigsaw of the bits and pieces of information you could find out about a person meant doing a big jigsaw with little pieces. Are we heading towards a smaller jigsaw with much bigger pieces – Google, Facebook, your mobile operator, your broadband provider, your supermarket, your government, your health service?

PS related, in the selling off stakes? Sale of mortgage style student loan book completed. Or this ill thought out (by me) post – Confused by Government Spending, Indirectly… – around government encouraging home owners to take out shared ownership deals with UK gov so it can sell that loan book off at a later date?

Written by Tony Hirst

January 21, 2014 at 12:26 pm
