OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept


A few weeks ago I came across IPythonBlocks, a Python library developed to support the teaching of Python programming. The library provides an HTML grid that can be manipulated using simple programming constructs, presenting the outcome of the operations in a visually meaningful way.

As part of a new third level OU course we’re putting together on databases and data wrangling, I’ve been getting to grips with the Python pandas library. This library provides a dataframe-based framework for data analysis and data-styled programming that bears a significant resemblance to R’s notion of dataframes and vectorised computing. pandas also provides a range of dataframe-based operations that resemble SQL-style operations – joining tables, for example, and performing grouping-style summary operations.

One of the things we’re quite keen to do as a course team is identify visually appealing ways of illustrating a variety of data manipulating operations; so I wondered whether we might be able to use ipythonblocks as a basis for visualising – and debugging – pandas dataframe operations.

I’ve posted a demo IPython notebook here: ipythonblocks/pandas proof of concept [nbviewer preview]. In it, I’ve started to sketch out some simple functions for visualising pandas dataframes using ipythonblocks blocks.

For example, the following minimal function finds the size and shape of a pandas dataframe and uses it to configure a simple block:

from ipythonblocks import BlockGrid

def pBlockGrid(df):
    #Size the grid from the dataframe's (rows, columns) shape
    (y,x)=df.shape
    return BlockGrid(x,y)

We can also colour individual blocks – the following example uses colour to reveal the different datatypes of columns within a dataframe:

ipythonblocks pandas type colour
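
By way of illustration, here’s a minimal sketch of how a datatype-colouring helper might look. It isn’t the notebook’s actual code: the pTypeBlockGrid name and DTYPE_COLOURS mapping are made up, and it assumes the BlockGrid(width, height) constructor and a settable block .rgb attribute from ipythonblocks.

import pandas as pd
from ipythonblocks import BlockGrid

#Hypothetical mapping from numpy dtype "kind" codes to RGB colours
DTYPE_COLOURS = {'i': (0, 100, 200),    #integer columns
                 'f': (0, 180, 100),    #float columns
                 'O': (200, 100, 0)}    #object (eg string) columns

def pTypeBlockGrid(df, default=(150, 150, 150)):
    #Colour each grid column by the dtype of the matching dataframe column
    (y, x) = df.shape
    grid = BlockGrid(x, y)
    for col, dtype in enumerate(df.dtypes):
        colour = DTYPE_COLOURS.get(dtype.kind, default)
        for row in range(y):
            grid[row, col].rgb = colour
    return grid

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3], 'c': ['x', 'y', 'z']})
pTypeBlockGrid(df)   #renders the coloured grid when run in an IPython notebook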

A more elaborate function attempts to visualise the outcome of merging two data frames:

ipythonblocks pandas demo

The green colour identifies key columns, the red and blue cells identify data elements from the left and right joined dataframes respectively, and the black cells mark NA/NaN values.
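
Again purely as a sketch rather than the notebook’s actual implementation, a merge visualiser along those lines might look something like the following (the function name and colour values are made up, and it glosses over the column-name suffixes pandas adds when the two dataframes share non-key column names):

import pandas as pd
from ipythonblocks import BlockGrid

def pMergeBlockGrid(left, right, on):
    #Visualise an outer merge: key columns green, left/right columns red/blue,
    #missing values black
    on = [on] if isinstance(on, str) else list(on)
    merged = pd.merge(left, right, on=on, how='outer')
    (y, x) = merged.shape
    grid = BlockGrid(x, y)
    for col_idx, col in enumerate(merged.columns):
        if col in on:
            base = (0, 180, 0)        #key column
        elif col in left.columns:
            base = (200, 0, 0)        #column from the left dataframe
        else:
            base = (0, 0, 200)        #column from the right dataframe
        for row in range(y):
            if pd.isnull(merged.iloc[row, col_idx]):
                grid[row, col_idx].rgb = (0, 0, 0)   #NA/NaN cell
            else:
                grid[row, col_idx].rgb = base
    return grid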

One thing I started wondering about, and which I have to admit quite excited me (?!;-), was whether it would be possible to extend the pandas dataframe itself with methods for producing ipythonblocks visual representations of the state of a dataframe, or of the effect of dataframe based operations such as .concat() and .merge() on source dataframes.

If you have any comments on this approach, suggestions for additional or alternative ways of visualising dataframe transformations, or thoughts about how to extend pandas dataframes with ipythonblocks style visualisations of those datastructures and/or the operations that can be applied to them, please let me know via the comments:-)

PS some thoughts on a possible pandas interface (a rough sketch of the first of these follows the list):

  • DataFrame().blocks() to show the blocks
  • .cat(blocks=True) and .merge(blocks=True) to return (df, blocks)
  • DataFrame().blocks(blockProperties={}) and eg .merge(blocks=True, blockProperties={})
  • blockProperties: showNA=True|False, color_base=(), color_NA=(), color_left=(), color_right=(), color_gradient=[] (eg for a .cat() on many dataframes), colorView=structure|datatypes|missing (structure shows the origins of cells returned from a .merge() or .cat(), datatypes reveals the datatypes of the columns, and missing highlights NA/NaN cells over a base color), colorTypes={} (to set the colors for different datatypes)
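
Purely as a thought experiment, the first of those could be prototyped by monkey-patching the DataFrame class. Everything here (the _df_blocks helper, the ignored blockProperties argument) is hypothetical rather than a real pandas or ipythonblocks API:

import pandas as pd
from ipythonblocks import BlockGrid

def _df_blocks(self, blockProperties=None):
    #Return an ipythonblocks grid sized to this dataframe;
    #blockProperties is accepted but ignored in this minimal sketch
    (y, x) = self.shape
    return BlockGrid(x, y)

#Bolt the method onto DataFrame - purely exploratory, not a real pandas API
pd.DataFrame.blocks = _df_blocks

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.blocks()   #displays a 2x2 grid in a notebook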

Written by Tony Hirst

March 26, 2014 at 11:37 pm

First Signs (For Me) of Linked Data Being Properly Linked…?!


As anyone who’s followed this blog for some time will know, my relationship with Linked Data has been an on-again, off-again one over the years. At the current time, it’s largely off – all my OpenRefine installs seem to have given up the ghost as far as reconciliation and linking services go, and I have no idea where the problem lies (whether with the plugins, the installs, with Java, with the endpoints, with the reconciliations or linkages I’m trying to establish).

My dabblings with pulling data in from Wikipedia/DBpedia to Gephi (eg as described in Visualising Related Entries in Wikipedia Using Gephi and the various associated follow-on posts) continue to be hit and miss due to the vagaries of DBpedia and the huge gaps in infobox structured data across Wikipedia itself.

With OpenRefine not doing its thing for me, I haven’t been able to use that app as the glue to bind together queries made across different Linked Data services, albeit in piecemeal fashion. Because from the occasional sideline view I have of the Linked Data world, I haven’t seen any obvious way of actually linking data sets other than by pulling identifiers in to a new OpenRefine column (or wherever) from one service, then using those identifiers to pull in data from another endpoint into another column, and so on…

So all is generally not well.

However, a recent post by the Ordnance Survey’s John Goodwin (aka @gothwin) caught my eye the other day: Federating SPARQL Queries Across Government Linked Data. It seems that federated queries can now be made across several endpoints.

John gives an example using data from the Ordnance Survey SPARQL endpoint and an endpoint published by the Environment Agency:

The Environment Agency has published a number of its open data offerings as linked data … A relatively straight forward SPARQL query will get you a list of bathing waters, their name and the district they are in.

[S]uppose we just want a list of bathing water areas in South East England – how would we do that? This is where SPARQL federation comes in. The information about which European Regions districts are in is held in the Ordnance Survey linked data. If you hop over to the Ordnance Survey SPARQL endpoint explorer you can run [a] query to find all districts in South East England along with their names …

Using the SERVICE keyword we can bring these two queries together to find all bathing waters in South East England, and the districts they are in:

And here’s the query John shows, as run against the Environment Agency SPARQL endpoint (the SERVICE clause federates out to the Ordnance Survey endpoint):

SELECT ?x ?name ?districtname WHERE {
  ?x a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
  ?x <http://www.w3.org/2000/01/rdf-schema#label> ?name .
  ?x <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
  SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
    ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
    ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  }
} ORDER BY ?districtname
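
If you want to poke at a query like that from Python rather than a web form, a minimal sketch using the SPARQLWrapper library might look like the following. Note that the Environment Agency endpoint URL shown is an assumption on my part, so check the actual endpoint address before running it.

from SPARQLWrapper import SPARQLWrapper, JSON

#NB the endpoint URL below is a placeholder/assumption - check the Environment
#Agency's current SPARQL endpoint address before running this
sparql = SPARQLWrapper("http://environment.data.gov.uk/sparql/bwq/query")
sparql.setQuery("""
SELECT ?x ?name ?districtname WHERE {
  ?x a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
  ?x <http://www.w3.org/2000/01/rdf-schema#label> ?name .
  ?x <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
  SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
    ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
    ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  }
} ORDER BY ?districtname
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

#Print each bathing water name alongside its district
for row in results["results"]["bindings"]:
    print("%s (%s)" % (row["name"]["value"], row["districtname"]["value"]))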

In a follow on post, John goes even further “by linking up data from Ordnance Survey, the Office of National Statistics, the Department of Communities and Local Government and Hampshire County Council”.

So that’s four endpoints – the original one against which the query is first fired, and three others…

SELECT ?districtname ?imdrank ?changeorder ?opdate ?councilwebsite ?siteaddress WHERE {
  ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000017765> .
  ?district a <http://data.ordnancesurvey.co.uk/ontology/admingeo/District> .
  ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  SERVICE <http://opendatacommunities.org/sparql> {
    ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?district .
    ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 
    ?authority <http://opendatacommunities.org/def/local-government/governs> ?district .
    ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
  }
  ?district <http://www.w3.org/2002/07/owl#sameAs> ?onsdist .
  SERVICE <http://statistics.data.gov.uk/sparql> {
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/originatingChangeOrder> ?changeorder .
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/operativedate> ?opdate .
  }
  SERVICE <http://linkeddata.hants.gov.uk/sparql> {
    ?landsupsite <http://data.ordnancesurvey.co.uk/ontology/admingeo/district> ?district .
    ?landsupsite a <http://linkeddata.hants.gov.uk/def/land-supply/LandSupplySite> .
    ?landsupsite <http://www.ordnancesurvey.co.uk/ontology/BuildingsAndPlaces/v1.1/BuildingsAndPlaces.owl#hasAddress> ?siteaddress .
  }
}

Now we’re getting somewhere….

Written by Tony Hirst

March 25, 2014 at 3:25 pm

Posted in Anything you want


Experimenting With R – Point to Point Mapping With Great Circles


I’ve started doodling again… This time, around maps, looking for recipes that make life easier plotting lines to connect points on maps. The most attractive maps seem to use great circles to connect one point with another, these providing the shortest path between two points when you consider the Earth as a sphere.

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

greatcircle map demo

The app requires us to solve several problems, including:

  • how to geocode the locations
  • how to plot the lines as great circles
  • how to upload the CSV file
  • how to select the from and to columns from the CSV file
  • how to optionally select a valid numerical column for setting line thickness

Let’s start with the geocoder. For convenience, I’m going to use the Google geocoder via the geocode() function from the ggmap library.

#Locations are in two columns, *fr* and *to* in the *dummy* dataframe
#If locations are duplicated in from/to columns, dedupe so we don't geocode same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 

The locs data frame contains the unique location strings in a single place column:

                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa

The sapply(locs$place,geocode, USE.NAMES=F) function returns data that looks like:

    [,1]       [,2]     [,3]     [,4]      [,5]     [,6]      [,7]     
lon -0.1254872 0.121817 2.352222 151.207   2.352222 -74.00597 18.42406 
lat 51.50852   52.20534 48.85661 -33.86749 48.85661 40.71435  -33.92487

The transpose (t()) gives us:

     lon        lat      
[1,] -0.1254872 51.50852 
[2,] 0.121817   52.20534 
[3,] 2.352222   48.85661 
[4,] 151.207    -33.86749
[5,] 2.352222   48.85661 
[6,] -74.00597  40.71435 
[7,] 18.42406   -33.92487

The cbind() binds each location with its lat and lon value:

                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487

Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop lists containing column names, can be found here: Simple file geocoder (R/shiny app).

The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])

The code for the full application can be found as a runnable gist in RStudio from here: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out...]

Here’s the code explicitly…

The global.R file loads the necessary packages, installing them if they are missing:

#global.R

##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)}) 

The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

#ui.R
shinyUI(pageWithSidebar(
  headerPanel("Great Circle Map demo"),
  
  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")
    
  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the external variables they contain changes; to prevent the function firing every time a particular variable changes, we need to isolate that variable. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)

#server.R

shinyServer(function(input, output) {

  #Handle the file upload
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file  
  output$toCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })
  
  output$fromCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })
  
  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })
  
  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })
  
  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })
  
  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)
    
    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)      
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[input$amount],na.rm=T)
      minval=min(df[input$amount],na.rm=T)
      df$b8g43bds=10*df[input$amount]/maxval
    }
    gf=geodata()
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })
  
  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })
  
  #Plot the data on a map...
  output$geoplot<- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()
    
    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)
    
    for (j in 1:nrow(df)){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]), c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]), n=100, addStartEnd=TRUE)

      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) lines(inter, col="red", lwd=0.8)
      else {
        if (input$lineselector == 'colprop') {
          colindex <- round( (df[j,]$b8g43bds[[1]]/10) * length(colors) )
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      } 
    } 
  })

})

So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If the geocoded points all lie within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or the sum of values outgoing, or the difference between the two (maybe use green where incoming exceeds outgoing, then size by the absolute difference?)

Written by Tony Hirst

March 24, 2014 at 11:17 am

Recreational Data


Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…

In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.

unreasonableUtility

To begin with, it is worth considering what is meant by recreational mathematics.

First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)

Secondly, recreational mathematics is mathematics that is fun and used either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.

These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.

How is recreational mathematics useful?

Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …

Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …

Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.

An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.

…the two definitions described above.

So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.

Recreational data is about the joy of trying to find stories in data.

Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.

Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]”. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.

If you think you might like to do some doodle of your own with some data, why not check out the School Of Data. Or watch out on OpenLearn for some follow up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic

Written by Tony Hirst

March 21, 2014 at 9:56 am

So Is This Guerrilla Research?


A couple of days ago I delivered a workshop with Martin Weller on the topic of “Guerrilla Research”.

guerrilapdf

The session was run under the #elesig banner, and was the result of an invitation to work through the germ of an idea that was a blog post Martin had published in October 2013, The Art Of Guerrilla Research.

In that post, Martin had posted a short list of what he saw as “guerrilla research” characteristics:

  1. It can be done by one or two researchers and does not require a team
  2. It relies on existing open data, information and tools
  3. It is fairly quick to realise
  4. It is often disseminated via blogs and social media

Looking at these principles now, as in, right now, as I type (I don’t know what I’m going to write…), I don’t necessarily see any of these as defining, at least, not without clarification. Let’s reflect, and see how my fingers transcribe my inner voice…

In the first case, a source crowd or network may play a role in the activity, so maybe it’s the initiation of the activity that only requires one or two people?

Open data, information and tools helps, but I’d gear this more towards pre-existing data, information and tools, rather than necessarily open: if you work inside an organisation, you may be able to appropriate resources that are not open or available outside the organisation, and may even have limited access within the organisation; you may have to “steal” access to them, even; open resources do mean that other people can engage in the same activity using the same resources, though, which provides transparency and reproducibility; open resources also make inside, outside activities possible.

The activity may be quick to realise, sort of: I can quickly set a scraper going to collect data about X, and the analysis of the data may be quick to realise; but I may need the scraper to run for days, or weeks, or months; more qualifying, I think, is that the activity only requires a relatively small number of relatively quick bursts of activity.

Online means of dissemination are natural, because they’re “free”, immediate, have potentially wide reach; but I think an email to someone who can, or a letter to the local press, or an activity that is its own publication, such as a submission to a consultation in which the responses are all published, could also count too.

Maybe I should have looked at those principles a little more closely before the workshop…;-) And maybe I should have made reference to them in my presentation. Martin did, in his.

PS WordPress just “related” this back to me, from June, 2009: Guerrilla Education: Teaching and Learning at the Speed of News

Written by Tony Hirst

March 21, 2014 at 8:44 am

Posted in OU2.0, Thinkses


Oppia – A Learning Journey Platform From Google…


I couldn’t get to sleep last night mulling over thoughts that had surfaced after posting Time to Drop Calculators in Favour of Notebook Programming?. This sort of thing: what goes on when you get someone to add three and four?

Part of the problem is associated with converting the written problem into mathematical notation:

3 + 4

For more complex problems it may require invoking some equations, or mathematical tricks or operations (chain rule, dot product, and so on).

3 + 4

Cast the problem into numbers then try to solve it:

3 + 4 =

That equals gets me doing some mental arithmetic. In a calculator, there’s a sequence of button presses, then the equals gives the answer.

In a notebook, I type:

3 + 4

that is, I write the program in mathematicalese, hit the right sort of return, and get the answer:

7

The mechanics of finding the right hand side by executing the operations on the left hand side are handled for me.

Try this on WolframAlpha: integral of x squared times x minus three from -3 to 4

Or don’t.. do it by hand if you prefer.

I may be able to figure out the maths bit – figure out how to cast my problem into a mathematical statement – but not necessarily have the skill to solve the problem. I can get the method marks but not do the calculation and get the result. I can write the program. But running the programme, dividing 3847835 by 343, calculating the square root of 26,863 using log tables or whatever means, that’s the blocker – that could put me off trying to make use of maths, could put me off learning how to cast a problem into a mathematical form, if all that means is that I can do no more than look at the form as if it were a picture, poem, or any other piece of useless abstraction.

So why don’t we help people see that casting the problem into the mathematical form is the creative bit, the bit the machines can’t do? Because the machines can do the mechanical bit:

wolfram alpha

Maybe this is the approach that the folk over at Computer Based Math are thinking about (h/t Simon Knight/@sjgknight for the link), or maybe it isn’t… But I get the feeling I need to look over what they’re up to… I also note Conrad Wolfram is behind it; we kept crossing paths, a few years ago…I was taken by his passion, and ideas, about how we should be helping folk see that maths can be useful, and how you can use it, but there was always the commercial blocker, the need for Mathematica licenses, the TM; as in Computer-Based Math™.

Then tonight, another example of interactivity, wired in to a new “learning journey” platform that again @sjgknight informs me is released out of Google 20% time…: Oppia (Oppia: a tool for interactive learning).

Here’s an example….

oppia project euler 1

The radio button choice determines where we go next on the learning journey:

oppia euler2

Nice – interactive coding environment… 3 + 4 …

What happens if I make a mistake?

oppia euler 3

History of what I did wrong, inline, which is richer than a normal notebook style, where my repeated attempts would overwrite previous ones…

Depending how many common incorrect or systematic errors we can identify, we may be able to add richer diagnostic next step pathways…

..but then, eventually, success:

oppia euler 4

The platform is designed as a social one where users can create their own learning journeys and collaborate on their development with others. Licensing is mandated as “CC-BY-SA 4.0 with a waiver of the attribution requirement”. The code for the platform is also open.

The learning journey model is richer and potentially far more complex in graph structure terms than I remember the attempts developed for the [redacted] SocialLearn platform, but the vision appears similar. SocialLearn was also more heavily geared to narrative textual elements in the exposition; by contrast, the current editing tools in Oppia make you feel as if using too much text is not a Good Thing.

So – how are these put together… The blurb suggests it should be easy, but Google folk are clever folk (and I’m not sure how successful they’ve been getting their previous geek style learning platform attempts into education)… here’s an example learning journey – it’s a state machine:

Example learning design in oppia

Each block can be edited:

oppia state editor

When creating new blocks, the first thing you need is some content:

oppia - content - interaction

Then some interaction. For the interactions, a range of input types you might expect:

oppia interaction inputs

and some you might not. For example, these are the interactive/executable coding style blocks you can use:

oppia progr languages dialogue

There’s also a map input, though I’m not sure what dialogic information you can get from it when you use it?

After the interaction definition, you can define a set of rules that determine where the next step takes you, depending on the input received.

oppia state rules

The rule definitions allow you to trap on the answer provided by the interaction dialogue, optionally provide some feedback, and then identify the next step.

oppia - rules

The rule branches are determined by the interaction type. For radio buttons, rules are triggered on the selected answer. For text inputs, simple string distance measures:

oppia text rules

For numeric inputs, various bounds:

oppia numeric input

For the map, what looks like a point within a particular distance of a target point?

oppia map rule

The choice of programming languages currently available in the code interaction type kinda sets the tone about who might play with this…but then, maybe as I suggested to colleague Ray Corrigan yesterday, “I don’t think we’ve started to consider consequences of 2nd half of the chessboard programming languages in/for edu yet?”

All in all – I don’t know… getting the design right is what’ll make for a successful learning journey, and that’s not something you can necessarily do quickly. The interface is, as with many Google interfaces, not the friendliest I’ve seen (function over form, but bad form can sometimes get in the way of the function…).

I was interested to see they’re calling the learning journeys explorations. The Digital Worlds course I ran some time ago used that word too, in the guise of Topic Explorations, but they were a little more open ended, and used a question mechanic used to guide reading within a set of suggested resources.

Anyway, one to watch, perhaps, erm, maybe… No badges as yet, but that would be candy to offer at the end of a course, as well as a way of splashing your own brand via the badges. But before that, a state machine design mountain to climb…

Written by Tony Hirst

February 27, 2014 at 9:14 pm

Posted in OU2.0

care.data Redux


In a comment based conversation with Anne-Marie Cunningham/@amcunningham last night, it seems I’d made a few errors in the post Demographically Classed, mistakenly attributing the use of HES data by actuaries in the Extending the Critical Path report to the SIAS when it should have been a working group of (I think?!) the Institute and Faculty of Actuaries (IFoA). I’d also messed up in assuming that the HES data was annotated with ACORN and MOSAIC data by the researchers, a mistaken assumption that begged the question as to how that linkage was actually done. Anne-Marie did the journalistic thing and called the press office (seems not many hacks did…) and discovered that “The researchers did not do any data linkage. This was all done by NHSIC. They did receive it fully coded. They only received 1st half of postcode and age group. There was no information on which hospitals people had attended.” Thanks, Anne-Marie:-)

Note – that last point could be interesting: it would suggest that in the analysis the health effects were decoupled from the facility where folk were treated?

Here are a few further quick notes adding to the previous post:

- the data that will be shared by GPs will be in coded form. An example of the coding scheme is provided in this post on the A Better NHS blog – Care dot data. The actual coding scheme can be found in this spreadsheet from the HSCIC: Code set – specification for the data to be extracted from GP electronic records and described in Care Episode Statistics: Technical Specification of the GP Extract. The tech spec uses the following diagram to explain the process (p19):

HES care.data process

I’m intrigued as to what they mean by the ‘non-relational database’…?

As far as the IFoA report goes, an annotated version of this diagram to show how the geodemographic data from Experian and CACI was added, and then how personally identifiable data was stripped before the dataset was handed over to the IFoA, would have been a useful contribution to the methodology section. I think over the next year or two, folk are going to have to spend some time being clear about the methodology in terms of “transparency” around ethics, anonymisation, privacy etc, whilst the governance issues get clarified and baked into workaday processes and practice.

Getting a clearer idea of what data will flow, and how filters may actually work under various opt-out regimes and data sharing pathways, requires a little more detail. The Falkland Surgery in Newbury provides a useful summary of what data GP practices share in general, including care.data sharing. The site also usefully provides a map of the control-codes that preclude various sharing principles (As simple as I [original site publisher] can make it!):

datasharing control

Returning to the care episode statistics reporting structure, the architecture to support reuse is depicted on p21 of the tech spec as follows:

care data reuse

There also appear to be two main summary pages of resources relating to care data that may be worth exploring further as a starting point: Care.data and Technology, systems and data – Data and information. Further resources are available more generally on Information governance (NHS England).

As I mentioned in my previous post on this topic, I’m not so concerned about personal privacy/personal data leakage as I am about trying to think through the possible consequences of making bulk data releases available that can be used as the basis for N=All/large scale data modelling (which can suffer from dangerous (non)sampling errors/bias when folk start to opt-out), the results of which are then used to influence the development of, and then algorithmic implementation of, policy. This issue is touched on by blogger and IT, Intellectual Property and Media Law lecturer at the University of East Anglia Law School, Paul Bernal, in his post Care.data and the community…:

The second factor here, and one that seems to be missed (either deliberately or through naïveté) is the number of other, less obvious and potentially far less desirable uses that this kind of data can be put to. Things like raising insurance premiums or health-care costs for those with particular conditions, as demonstrated by the most recent story, are potentially deeply damaging – but they are only the start of the possibilities. Health data can also be used to establish credit ratings, by potential employers, and other related areas – and without any transparency or hope of appeal, as such things may well be calculated by algorithm, with the algorithms protected as trade secrets, and the decisions made automatically. For some particularly vulnerable groups this could be absolutely critical – people with HIV, for example, who might face all kinds of discrimination. Or, to pick a seemingly less extreme and far more numerous group, people with mental health issues. Algorithms could be set up to find anyone with any kind of history of mental health issues – prescriptions for anti-depressants, for example – and filter them out of job applicants, seeing them as potential ‘trouble’. Discriminatory? Absolutely. Illegal? Absolutely. Impossible? Absolutely not – and the experience over recent years of the use of black-lists for people connected with union activity (see for example here) shows that unscrupulous employers might well not just use but encourage the kind of filtering that would ensure that anyone seen as ‘risky’ was avoided. In a climate where there are many more applicants than places for any job, discovering that you have been discriminated against is very, very hard.

This last part is a larger privacy issue – health data is just a part of the equation, and can be added to an already potent mix of data, from the self-profiling of social networks like Facebook to the behavioural targeting of the advertising industry to search-history analytics from Google. Why, then, does care.data matter, if all the rest of it is ‘out there’? Partly because it can confirm and enrich the data gathered in other ways – as the Telegraph story seems to confirm – and partly because it makes it easy for the profilers, and that’s something we really should avoid. They already have too much power over people – we should be reducing that power, not adding to it. [my emphasis]

There are many trivial reasons why large datasets can become biased (for example, see The Hidden Biases in Big Data), but there are also deeper reasons why we need to start paying more attention to “big” data models and the algorithms that are derived from and applied to them (for example, It’s Not Privacy, and It’s Not Fair [Cynthia Dwork & Deirdre K. Mulligan] and Big Data, Predictive Algorithms and the Virtues of Transparency (Part One) [John Danaher]).

The combined HES’n’insurance report, and the care.data debacle, provide an opportunity to start to discuss some of these issues around the use of data, the ways in which data can be combined, and the undoubted rise in data brokerage services. So for example, a quick pop over to CCR Data and they’ll do some data enhancement for you (“We have access to the most accurate and validated sources of information, ensuring the best results for you. There are a host of variables available which provide effective business intelligence [including] [t]elephone number appending, [d]ate of [b]irth, MOSAIC”), [e]nhance your database with email addresses using our email append data enrichment service or wealth profiling. Lovely…

Written by Tony Hirst

February 26, 2014 at 3:26 pm

Posted in Data

