OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

A Nudge Here, A Nudge There, But With Meaning..

A handful of posts caught my attention yesterday around the whole data thang…

First up, a quote on the New Aesthetic blog: “the state-of-the-art method for shaping ideas is not to coerce overtly but to seduce covertly, from a foundation of knowledge”, referencing an article on Medium: Is the Internet good or bad? Yes. The quote includes mention of an Adweek article (this one? Marketers Should Take Note of When Women Feel Least Attractive; see also a response and the original press release) that “noted that women feel less attractive on Mondays, and that this might be the best time to advertise make-up to them.”

I took this as a cautionary tale about the way in which “big data” (qua theoryless statistical models based on the uncontrolled, if large, samples that make up “found” datasets, to pick up on a phrase used by Tim Harford in Big data: are we making a big mistake? [h/t @schmerg et al]) can be used to malevolent effect. (Thanks to @devonwalshe for highlighting that it’s not the data we should blame (“the data itself has no agency, so a little pointless to blame … Just sensitive to tech fear. Shifts blame from people to things.”) but the motivations and actions of the people who make use of the data.)

Which is to say – there’s ethics involved. As an extreme example, consider the possible “weaponisation” of data, for example in the context of PSYOP – “psychological operations” (are they still called that?) As the New Aesthetic quote, and the full Medium article itself, explain, the way in which data models allow messages to be shaped, targeted and tailored provides companies and politicians with a form of soft power that encourages us “to click, willingly, on a choice that has been engineered for us”. (This unpicks further – not only are we modelled so that the prompts are issued to us at an opportune time, but the choices we are provided with may also have been identified algorithmically.)

So that’s one thing…

Around about the same time, I also spotted a news announcement that Dunnhumby – an early bellwether of how to make the most of #midata consumer data – has bought “advertising technology” firm Sociomantic (press release): “dunnhumby will combine its extensive insights on the shopping preferences of 400 million consumers with Sociomantic’s intelligent digital-advertising technology and real-time data from more than 700 million online consumers to dramatically improve how advertising is planned, personalized and evaluated. For the first time, marketing content can be dynamically created specifically for an individual in real-time based on their interests and shopping preferences, and delivered across online media and mobile devices.” Good, oh…

A post on the Dunnhumby blog (It’s Time to Revolutionise Digital Advertising) provides further insight about what we might expect next:

We have decided to buy the company because the combination of Sociomantic’s technological capability and dunnhumby’s insight from 430m shoppers worldwide will create a new opportunity to make the online experience a lot better, because for the first time we will be able to make online content personalised for people, based on what they actually like, want and need. It is what we have been doing with loyalty programs and personalised offers for years – done with scale and speed in the digital world.

So what will we actually do to make that online experience better for customers? First, because we know our customers, what they see will be relevant and based on who they are, what they are interested in and what they shop for. It’s the same insight that powers Clubcard vouchers in the UK which are tailored to what customers shop for both online and in-store. Second, because we understand what customers actually buy online or in-store, we can tell advertisers how advertising needs to change and how they can help customers with information they value. Of course there is a clear benefit to advertisers, because they can spend their budgets only where they are talking to the right audience in the right way with the right content at the right time, measuring what works, what doesn’t and taking out a lot of guesswork. The real benefit though must be to customers whose online experience will get richer, simpler and more enjoyable. The free internet content we enjoy today is paid for by advertising, we just want to make it advertising offers and content you will enjoy too.

This needs probing further – are Dunnhumby proposing merging data about actual shopping habits in physical and online stores with user cookies so that ads can be served based on actual consumption? (See for example Centralising User Tracking on the Web. How far has this got, I wonder? Seems like it may be here on mobile devices? Google’s New ‘Advertising ID’ Is Now Live And Tracking Android Phones — This Is What It Looks Like. Here’s the Android developer docs on Advertising ID. See also GigaOm on As advertisers phase out cookies, what’s the alternative?, eg in context of “known identifiers” (like email addresses and usernames) and “stable identifiers” (persistent device or browser level identifiers).)

That’s the second thing…

For some reason, it’s all starting to make me think of supersaturated solutions…

PS FWIW, the OU/BBC co-produced Bang Goes the Theory (BBC1) had a “Big Data” episode recently – depending on when you read this, you may still be able to watch it here: Bang Goes the Theory – Series 8 – Episode 3: Big Data

Written by Tony Hirst

April 4, 2014 at 11:40 am

Posted in Thinkses

Mixing Stuff Up

Remember mashups? Five years or so ago they were all the rage. At their heart, they provided ways of combining things that already existed to do new things. This is a lazy approach, and one I favour.

One of the key inspirations for me in this idea of combinatorial tech, or tech combinatorics, is Jon Udell. His Library Lookup project blew me away in its creativity (the use of bookmarklets, the way the project encouraged you to access one IT service from another, the use of “linked data”, common/core-canonical identifiers to bridge services and leverage or enrich one from another, and so on) and was the spark that fired many of my own doodlings. (Just thinking about it again excites me now…)

As Jon wrote on his blog yesterday (Shiny old tech) (my emphasis):

What does worry me, a bit, is the recent public conversation about ageism in tech. I’m 20 years past the point at which Vinod Khosla would have me fade into the sunset. And I think differently about innovation than Silicon Valley does. I don’t think we lack new ideas. I think we lack creative recombination of proven tech, and the execution and follow-through required to surface its latent value.

Elm City is one example of that. Another is my current project, Thali, Yaron Goland’s bid to create the peer-to-peer web that I’ve long envisioned. Thali is not a new idea. It is a creative recombination of proven tech: Couchbase, mutual SSL authentication, Tor hidden services. To make Thali possible, Yaron is making solid contributions to Thali’s open source foundations. Though younger than me, he is beyond Vinod Khosla’s sell-by date. But he is innovating in a profoundly important way.

Can we draw a clearer distinction between innovation and novelty?

Creative recombination.

I often think of this in terms of appropriation (eg Appropriating Technology, Appropriating IT: innovative uses of emerging technologies or Appropriating IT: Glue Steps).

Or repurposing, a form of reuse that differs from the intended original use.

Openness helps here. Open technologies allow users to innovate without permission. Open licensing is just part of that open technology jigsaw; open standards another; open access and accessibility a third. Open interfaces accessed sideways. And so on.

Looking back over archived blog posts from five, six, seven years ago, the web used to be such fun. An open playground, full of opportunities for creative recombination. Now we have Facebook, where authenticated APIs give you access to local social neighbourhoods, but little more. Now we have Google using link redirection and link pollution at every opportunity. Services once open are closed according to economic imperatives (and maybe scaling issues; maybe some creative recombinations are too costly to support when a network scales). Maybe my memory of a time when the web was more open is a false memory?

Creative recombination, ftw.

PS just spotted this (Walking on custard), via @plymuni. If you don’t see why it’s relevant, you probably don’t get the sense of this post!

Written by Tony Hirst

April 3, 2014 at 9:21 am

Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept

A few weeks ago I came across IPythonBlocks, a Python library developed to support the teaching of Python programming. The library provides an HTML grid that can be manipulated using simple programming constructs, presenting the outcome of the operations in a visually meaningful way.

As part of a new third level OU course we’re putting together on databases and data wrangling, I’ve been getting to grips with the python pandas library. This library provides a dataframe based framework for data analysis and data-styled programming that bears a significant resemblance to R’s notion of dataframes and vectorised computing. pandas also provides a range of dataframe based operations that resemble SQL style operations – joining tables, for example, and performing grouping style summary operations.

One of the things we’re quite keen to do as a course team is identify visually appealing ways of illustrating a variety of data manipulating operations; so I wondered whether we might be able to use ipythonblocks as a basis for visualising – and debugging – pandas dataframe operations.

I’ve posted a demo IPython notebook here: ipythonblocks/pandas proof of concept [nbviewer preview]. In it, I’ve started to sketch out some simple functions for visualising pandas dataframes using ipythonblocks blocks.

For example, the following minimal function finds the size and shape of a pandas dataframe and uses it to configure a simple block:

from ipythonblocks import BlockGrid

def pBlockGrid(df):
    #df.shape is (rows, columns); BlockGrid takes (width, height)
    (y,x)=df.shape
    return BlockGrid(x,y)

We can also colour individual blocks – the following example uses colour to reveal the different datatypes of columns within a dataframe:

ipythonblocks pandas type colour
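
A colouring rule of that kind might look something like the following sketch. It is not the code from the notebook: the KIND_COLOURS table, the column_colours() and colour_grid() helpers and the colour choices are all made up for illustration, and it works on plain dtype "kind" codes (the one-letter codes numpy uses, such as i for integers and O for objects) so that it runs without pandas installed.

```python
# Hypothetical mapping from numpy dtype "kind" codes to RGB colours.
# The colours are arbitrary choices, not the ones used in the notebook.
KIND_COLOURS = {
    "i": (0, 0, 255),      # signed integer
    "u": (0, 0, 255),      # unsigned integer
    "f": (0, 160, 0),      # floating point
    "b": (255, 160, 0),    # boolean
    "M": (160, 0, 160),    # datetime
    "O": (200, 0, 0),      # object (typically strings)
}
DEFAULT_COLOUR = (128, 128, 128)  # anything we don't recognise


def column_colours(kinds):
    """Return one RGB tuple per column, given a list of dtype kind codes."""
    return [KIND_COLOURS.get(kind, DEFAULT_COLOUR) for kind in kinds]


def colour_grid(kinds, n_rows):
    """Build an n_rows x len(kinds) grid in which every cell in a column
    takes the colour associated with that column's datatype."""
    row = column_colours(kinds)
    return [list(row) for _ in range(n_rows)]


# Two rows of a table with an integer, a float and a string column
grid = colour_grid(["i", "f", "O"], n_rows=2)
```

With a real dataframe, the kind codes could be collected with something like [dt.kind for dt in df.dtypes], and each colour then written into the matching cell of a BlockGrid.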

A more elaborate function attempts to visualise the outcome of merging two data frames:

ipythonblocks pandas demo

The green colour identifies key columns, the red and blue cells data elements from the left and right joined dataframes respectively, and the black cells NA/NaN cells.
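
The colouring scheme itself can be sketched without pandas or ipythonblocks. The following is only an illustration of the rule just described, not the function from the notebook: merge_colours(), is_missing() and the way the merged rows are passed in (as a list of dicts) are all assumptions made for the example.

```python
import math

GREEN, RED, BLUE, BLACK = (0, 255, 0), (255, 0, 0), (0, 0, 255), (0, 0, 0)


def is_missing(value):
    """Treat None and float NaN as missing, much as pandas would."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_colours(rows, columns, keys, left_cols):
    """Return a grid of RGB colours for the rows of a merged table.

    rows      -- list of dicts, one per merged row
    columns   -- column names, in display order
    keys      -- names of the key column(s) the merge was made on
    left_cols -- names of the non-key columns that came from the left table;
                 any other non-key column is assumed to come from the right
    """
    grid = []
    for row in rows:
        colours = []
        for col in columns:
            if is_missing(row.get(col)):
                colours.append(BLACK)      # NA/NaN cell
            elif col in keys:
                colours.append(GREEN)      # key column
            elif col in left_cols:
                colours.append(RED)        # data from the left table
            else:
                colours.append(BLUE)       # data from the right table
        grid.append(colours)
    return grid


# An outer merge on "id" in which the second row has no match on the right
merged = [
    {"id": 1, "name": "a", "score": 3.5},
    {"id": 2, "name": "b", "score": float("nan")},
]
grid = merge_colours(merged, ["id", "name", "score"], keys={"id"}, left_cols={"name"})
```

Rows in this shape can be pulled out of a merged dataframe with df.to_dict('records').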

One thing I started wondering about that I have to admit quite excited me (?!;-) was whether it would be possible to extend the pandas dataframe itself with methods for producing ipythonblocks visual representations of the state of a dataframe, or the effect of dataframe based operations such as .concat() and .merge() on source dataframes.

If you have any comments on this approach, suggestions for additional or alternative ways of visualising dataframe transformations, or thoughts about how to extend pandas dataframes with ipythonblocks style visualisations of those datastructures and/or the operations that can be applied to them, please let me know via the comments:-)

PS some thoughts on a possible pandas interface:

  • DataFrame().blocks() to show the blocks
  • .cat(blocks=True) and .merge(blocks=True) to return (df, blocks)
  • DataFrame().blocks(blockProperties={}) and eg .merge(blocks=True, blockProperties={})
  • blockProperties: showNA=True|False, color_base=(), color_NA=(), color_left=(), color_right=(), color_gradient=[] (eg for a .cat() on many dataframes), colorView=structure|datatypes|missing (the colorView reveals the datatypes of the columns, the structure origins of cells returned from a .merge() or .cat(), or a view of missing data (reveal NA/NaN etc over a base color)), colorTypes={} (to set the colors for different datatypes)

Written by Tony Hirst

March 26, 2014 at 11:37 pm

First Signs (For Me) of Linked Data Being Properly Linked…?!

As anyone who’s followed this blog for some time will know, my relationship with Linked Data has been an off and on again one over the years. At the current time, it’s largely off – all my OpenRefine installs seem to have given up the ghost as far as reconciliation and linking services go, and I have no idea where the problem lies (whether with the plugins, the installs, with Java, with the endpoints, with the reconciliations or linkages I’m trying to establish).

My dabblings with pulling data in from Wikipedia/DBpedia to Gephi (eg as described in Visualising Related Entries in Wikipedia Using Gephi and the various associated follow-on posts) continue to be hit and miss due to the vagaries of DBpedia and the huge gaps in infobox structured data across Wikipedia itself.

With OpenRefine not doing its thing for me, I haven’t been able to use that app as the glue to bind together queries made across different Linked Data services, albeit in piecemeal fashion. Because from the occasional sideline view I have of the Linked Data world, I haven’t seen any obvious way of actually linking data sets other than by pulling identifiers in to a new OpenRefine column (or wherever) from one service, then using those identifiers to pull in data from another endpoint into another column, and so on…

So all is generally not well.

However, a recent post by the Ordnance Survey’s John Goodwin (aka @gothwin) caught my eye the other day: Federating SPARQL Queries Across Government Linked Data. It seems that federated queries can now be made across several endpoints.

John gives an example using data from the Ordnance Survey SPARQL endpoint and an endpoint published by the Environment Agency:

The Environment Agency has published a number of its open data offerings as linked data … A relatively straight forward SPARQL query will get you a list of bathing waters, their name and the district they are in.

[S]uppose we just want a list of bathing water areas in South East England – how would we do that? This is where SPARQL federation comes in. The information about which European Regions districts are in is held in the Ordnance Survey linked data. If you hop over to the Ordnance Survey SPARQL endpoint explorer you can run [a] query to find all districts in South East England along with their names …

Using the SERVICE keyword we can bring these two queries together to find all bathing waters in South East England, and the districts they are in:

And here’s the query John shows, as run against the Ordnance Survey SPARQL endpoint:

SELECT ?x ?name ?districtname WHERE {
  ?x a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
  ?x <http://www.w3.org/2000/01/rdf-schema#label> ?name .
  ?x <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
  SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
    ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
    ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  }
} ORDER BY ?districtname
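
For anyone wanting to fire a query like this from a script rather than an endpoint explorer, here is a minimal Python sketch of how the request might be put together. It assumes the endpoint follows the standard SPARQL 1.1 Protocol (query passed in a query URL parameter, JSON results requested via the Accept header). The endpoint URL is simply the one from the SERVICE clause above and may not still be live, the query is just the inner Ordnance Survey part of John's example, and build_sparql_request() is a made-up helper name. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint URL copied from the SERVICE clause above; it is used here purely
# as an example and may not still be live.
ENDPOINT = "http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql"

# The Ordnance Survey part of the federated query: districts in South East England
QUERY = """
SELECT ?district ?districtname WHERE {
  ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within>
            <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
  ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
} ORDER BY ?districtname
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)

# To actually run the query (requires network access and a live endpoint):
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
#   for binding in results["results"]["bindings"]:
#       print(binding["districtname"]["value"])
```

The federated version would be sent in exactly the same way; the SERVICE clause is resolved by the endpoint that receives the query, not by the client.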

In a follow on post, John goes even further “by linking up data from Ordnance Survey, the Office of National Statistics, the Department of Communities and Local Government and Hampshire County Council”.

So that’s four endpoints – the original one against which the query is first fired, and three others…

SELECT ?districtname ?imdrank ?changeorder ?opdate ?councilwebsite ?siteaddress WHERE {
  ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000017765> .
  ?district a <http://data.ordnancesurvey.co.uk/ontology/admingeo/District> .
  ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  SERVICE <http://opendatacommunities.org/sparql> {
    ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?district .
    ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 
    ?authority <http://opendatacommunities.org/def/local-government/governs> ?district .
    ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
  }
  ?district <http://www.w3.org/2002/07/owl#sameAs> ?onsdist .
  SERVICE <http://statistics.data.gov.uk/sparql> {
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/originatingChangeOrder> ?changeorder .
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/operativedate> ?opdate .
  }
  SERVICE <http://linkeddata.hants.gov.uk/sparql> {
    ?landsupsite <http://data.ordnancesurvey.co.uk/ontology/admingeo/district> ?district .
    ?landsupsite a <http://linkeddata.hants.gov.uk/def/land-supply/LandSupplySite> .
    ?landsupsite <http://www.ordnancesurvey.co.uk/ontology/BuildingsAndPlaces/v1.1/BuildingsAndPlaces.owl#hasAddress> ?siteaddress .
  }
}

Now we’re getting somewhere….

Written by Tony Hirst

March 25, 2014 at 3:25 pm

Posted in Anything you want

Experimenting With R – Point to Point Mapping With Great Circles

I’ve started doodling again… This time, around maps, looking for recipes that make life easier plotting lines to connect points on maps. The most attractive maps seem to use great circles to connect one point with another, these providing the shortest path between two points when you consider the Earth as a sphere.

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

greatcircle map demo

The app requires us to solve several problems, including:

  • how to geocode the locations
  • how to plot the lines as great circles
  • how to upload the CSV file
  • how to select the from and to columns from the CSV file
  • how to optionally select a valid numerical column for setting line thickness

Let’s start with the geocoder. For convenience, I’m going to use the Google geocoder via the geocode() function from the ggmap library.

#Locations are in two columns, *fr* and *to* in the *dummy* dataframe
#If locations are duplicated in from/to columns, dedupe so we don't geocode same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 

The locs data frame holds a single column of locations:

                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa

The sapply(locs$place,geocode, USE.NAMES=F) function returns data that looks like:

    [,1]       [,2]     [,3]     [,4]      [,5]     [,6]      [,7]     
lon -0.1254872 0.121817 2.352222 151.207   2.352222 -74.00597 18.42406 
lat 51.50852   52.20534 48.85661 -33.86749 48.85661 40.71435  -33.92487

The transpose (t()) gives us:

     lon        lat      
[1,] -0.1254872 51.50852 
[2,] 0.121817   52.20534 
[3,] 2.352222   48.85661 
[4,] 151.207    -33.86749
[5,] 2.352222   48.85661 
[6,] -74.00597  40.71435 
[7,] 18.42406   -33.92487

The cbind() binds each location with its lat and lon value:

                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487
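
The same dedupe-then-geocode pattern can be sketched in Python. This is only an illustration under assumptions: geocode_unique() is a made-up helper, and the geocoder is passed in as a plain function (here a fake one backed by a lookup table that reuses a few of the values shown above) so that the example runs without calling any real geocoding service.

```python
def geocode_unique(from_places, to_places, geocode):
    """Geocode each distinct place name once.

    geocode is any callable that takes a place name and returns a
    (lon, lat) pair; it stands in for a real geocoding service.
    Returns a list of (place, lon, lat) rows in first-seen order.
    """
    seen = {}
    for place in list(from_places) + list(to_places):
        if place not in seen:          # skip names we have already looked up
            seen[place] = geocode(place)
    return [(place, lon, lat) for place, (lon, lat) in seen.items()]


# A stand-in geocoder backed by a small lookup table, using values
# from the output shown above, so the example runs without network access.
FAKE_COORDS = {
    "London, UK": (-0.1254872, 51.50852),
    "Cambridge,UK": (0.121817, 52.20534),
    "New York,US": (-74.00597, 40.71435),
}
calls = []


def fake_geocode(place):
    calls.append(place)                # record each lookup so we can count them
    return FAKE_COORDS[place]


rows = geocode_unique(
    ["London, UK", "London, UK", "Cambridge,UK"],
    ["Cambridge,UK", "New York,US", "London, UK"],
    fake_geocode,
)
```

Note that the dedupe is on exact strings, which is why “Paris,France” and “Paris, France” both appear, and are both geocoded, in the table above.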

Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop lists containing column names, can be found here: Simple file geocoder (R/shiny app).

The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])
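
For comparison, a rough Python analogue that works on raw CSV text rather than a data frame might look like the following. It is a sketch under assumptions: numeric_columns() is a made-up helper, and it guesses that a column is numeric if every non-empty value in it parses as a float, which is not exactly the test R applies.

```python
import csv
import io


def numeric_columns(csv_text):
    """Return names of columns whose non-empty values all parse as numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    candidates = list(reader.fieldnames or [])
    has_value = {name: False for name in candidates}
    for row in reader:
        for name in list(candidates):
            value = (row[name] or "").strip()
            if not value:
                continue               # blank cells don't disqualify a column
            try:
                float(value)
                has_value[name] = True
            except ValueError:
                candidates.remove(name)
    # Ignore columns that were entirely empty
    return [name for name in candidates if has_value[name]]


sample = "from,to,amount\nLondon,Paris,10\nParis,Sydney,2.5\n"
result = numeric_columns(sample)
```

Taking the complement of that list would give the non-numerical columns, which is what the location selectors really ought to be limited to.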

The code for the full application can be found as a runnable gist in RStudio from here: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out...]

Here’s the code explicitly…

The global.R file loads the necessary packages, installing them if they are missing:

#global.R

##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)}) 

The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

#ui.R
shinyUI(pageWithSidebar(
  headerPanel("Great Circle Map demo"),
  
  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")
    
  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the external variables they contain changes. To prevent the function firing when a variable it contains changes, we need to isolate it. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)

#server.R

shinyServer(function(input, output) {

  #Handle the file upload
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file  
  output$toCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })
  
  output$fromCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })
  
  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })
  
  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })
  
  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })
  
  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)
    
    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)      
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[input$amount],na.rm=T)
      minval=min(df[input$amount],na.rm=T)
      df$b8g43bds=10*df[input$amount]/maxval
    }
    gf=geodata()
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })
  
  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })
  
  #Plot the data on a map...
  output$geoplot<- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()
    
    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)
    
    for (j in 1:nrow(df)){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]), c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]), n=100, addStartEnd=TRUE)

      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) lines(inter, col="red", lwd=0.8)
      else {
        if (input$lineselector == 'colprop') {
          #Clamp to 1 so that small values don't round down to an invalid index of 0
          colindex <- max(1, round( (df[j,]$b8g43bds[[1]]/10) * length(colors) ))
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      } 
    } 
  })

})
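
To give a feel for what a function like gcIntermediate() has to work out, here is a small Python sketch that interpolates points along a great circle on a sphere. It is not the geosphere implementation: gc_intermediate() is a made-up name, the maths is the standard spherical interpolation formula, points are (lon, lat) pairs in degrees, and no attempt is made to handle exactly antipodal points or lines that cross the date line.

```python
import math


def gc_intermediate(start, end, n=100):
    """Return n+2 (lon, lat) points in degrees along the great circle from
    start to end, both given as (lon, lat) in degrees, endpoints included."""
    lon1, lat1 = map(math.radians, start)
    lon2, lat2 = map(math.radians, end)

    # Angular distance between the two points (haversine formula)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    d = 2 * math.asin(math.sqrt(min(1.0, h)))
    if d == 0:
        return [tuple(start)] * (n + 2)   # coincident points: nothing to interpolate

    points = []
    for i in range(n + 2):
        f = i / (n + 1)                    # fraction of the way along the path
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        # Weighted sum of the two endpoints as 3D unit vectors
        x = a * math.cos(lat1) * math.cos(lon1) + b * math.cos(lat2) * math.cos(lon2)
        y = a * math.cos(lat1) * math.sin(lon1) + b * math.cos(lat2) * math.sin(lon2)
        z = a * math.sin(lat1) + b * math.sin(lat2)
        # Convert back from 3D coordinates to longitude and latitude
        lat = math.atan2(z, math.hypot(x, y))
        lon = math.atan2(y, x)
        points.append((math.degrees(lon), math.degrees(lat)))
    return points


# A quarter of the way round the equator, with one intermediate point
path = gc_intermediate((0, 0), (90, 0), n=1)
```

Plotting the returned points as a polyline on a flat map is what gives the characteristic curved connecting lines.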

So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If geocoded points all lay within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or sum of values outgoing, or the difference between the two; (maybe use green for incoming outgoing, then size by the absolute difference?)

Written by Tony Hirst

March 24, 2014 at 11:17 am

Recreational Data

Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…

In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.

unreasonableUtility

To begin with, it is worth considering what is meant by recreational mathematics.

First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)

Secondly, recreational mathematics is mathematics that is fun and used either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.

These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.

How is recreational mathematics useful?

Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …

Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …

Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.

An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.

…the two definitions described above.

So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.

Recreational data is about the joy of trying to find stories in data.

Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.

Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]”. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.

If you think you might like to do some doodling of your own with some data, why not check out the School Of Data? Or watch out on OpenLearn for some follow up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic.

Written by Tony Hirst

March 21, 2014 at 9:56 am

So Is This Guerrilla Research?

A couple of days ago I delivered a workshop with Martin Weller on the topic of “Guerrilla Research”.

guerrilapdf

The session was run under the #elesig banner, and was the result of an invitation to work through the germ of an idea that was a blog post Martin had published in October 2013, The Art Of Guerrilla Research.

In that post, Martin had posted a short list of what he saw as “guerrilla research” characteristics:

  1. It can be done by one or two researchers and does not require a team
  2. It relies on existing open data, information and tools
  3. It is fairly quick to realise
  4. It is often disseminated via blogs and social media

Looking at these principles now, as in, right now, as I type (I don’t know what I’m going to write…), I don’t necessarily see any of these as defining, at least, not without clarification. Let’s reflect, and see how my fingers transcribe my inner voice…

In the first case, a source crowd or network may play a role in the activity, so maybe it’s the initiation of the activity that only requires one or two people?

Open data, information and tools helps, but I’d gear this more towards pre-existing data, information and tools, rather than necessarily open: if you work inside an organisation, you may be able to appropriate resources that are not open or available outside the organisation, and may even have limited access within the organisation; you may have to “steal” access to them, even; open resources do mean that other people can engage in the same activity using the same resources, though, which provides transparency and reproducibility; open resources also make inside, outside activities possible.

The activity may be quick to realise, sort of: I can quickly set a scraper going to collect data about X, and the analysis of the data may be quick to realise; but I may need the scraper to run for days, or weeks, or months; more qualifying, I think, is that the activity only requires a relatively small number of relatively quick bursts of activity.

Online means of dissemination are natural, because they’re “free”, immediate, have potentially wide reach; but I think an email to someone who can, or a letter to the local press, or an activity that is its own publication, such as a submission to a consultation in which the responses are all published, could also count.

Maybe I should have looked at those principles a little more closely before the workshop…;-) And maybe I should have made reference to them in my presentation. Martin did, in his.

PS WordPress just “related” this back to me, from June, 2009: Guerrilla Education: Teaching and Learning at the Speed of News

Written by Tony Hirst

March 21, 2014 at 8:44 am

Posted in OU2.0, Thinkses
