OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘onlinejournalismblog’ Category

Merging Data Sets Based on Partially Matched Data Elements

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes, although the recipe I’m describing here is intended to be a general purpose partial matcher. (See the comments for alternative methods for the special case of reconciling country names with country codes.)

The original data set had the form:

RANK,ECONOMY,PERCENTAGE OF INDIVIDUALS USING THE INTERNET 2011
1,Iceland,95
2,Norway,94
3,Netherlands,92.3
4,Sweden,91
5,Luxembourg,90.9
6,Denmark,90
7,Finland,89.4
8,Qatar,86.2
9,New Zealand,86
10,Switzerland,85.2
11,Liechtenstein,85
12,S. Korea,83.8
13,Germany,83
14,Canada,83
15,United Kingdom,82
16,Antigua & Barbuda,82

I’m not sure what country code listing was being used, but it probably looked something like this list of ISO Country Codes by Country:

ANDORRA;AD
ANGOLA;AO
ANGUILLA;AI
ANTARCTICA;AQ
ANTIGUA AND BARBUDA;AG
ARGENTINA;AR
ARMENIA;AM
ARUBA;AW
AUSTRALIA;AU
AUSTRIA;AT
AZERBAIJAN;AZ
BAHAMAS;BS
BAHRAIN;BH

It’s easy enough to reduce all the country names to lowercase characters so that we can try to match them exactly without worrying about any differences in capitalisation, but how do we match country names that don’t match exactly – ANTIGUA AND BARBUDA and Antigua & Barbuda, for example, or Central African Rep. and Central African Republic?

One trick is to use one of the well known partial string matching algorithms, such as the Levenshtein Distance. Here’s a recipe I hacked together that first tries to find an exact match on country names by attempting to merge the two country lists directly, and then tries to partially match any remaining unmatched names in the original list. To help out the partial matching attempt, a signature function is constructed that reduces all the words in a country name to lower case, sorts them alphabetically, and then concatenates them with no adjoining spaces.

(The signature idea was cribbed from the fingerprint that is available in Google Refine and that I employed in Merging Datasets with Common Columns in Google Refine.)
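As for the fuzzy matching step itself, R’s built-in agrep() function finds approximate matches to a pattern within a given Levenshtein distance. Here’s a minimal sketch (separate from the recipe below) of the sort of thing it does:

#agrep() returns the elements of a vector that approximately match a pattern;
#max.distance sets the allowed Levenshtein distance (here, as a fraction of the pattern length)
agrep("antigua and barbuda",
      c("antigua & barbuda", "argentina", "armenia"),
      max.distance=0.2, value=TRUE)
#Should return "antigua & barbuda"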

[@downes commented that the code wasn’t really written with clarity in mind – so I’ve added some comments…]

#Load in the data from the URLs:
PercentageUsingTheNet=read.csv(url('http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv'), encoding='MACROMAN')
ccode=read.csv(url('http://www.iso.org/iso/country_names_and_code_elements_txt'),sep=';')

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case:
### convert all characters to lower case (tolower())
### split the string into a vector (unlist()) of separate words (strsplit())
### sort the words alphabetically (sort())
### and then concatenate them with no spaces (paste(y,collapse='')).
##So for example, United Kingdom would become kingdomunited
##To extend this function, we might also remove stopwords such as 'the' and 'of', for example (not shown).
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}

#The partialMatch function takes two wordlists as vectors (x,y) and an optional distance threshold (levDist)
#The aim is to find words in the second list (y) that match or partially match words in the first (x)
partialMatch=function(x,y,levDist=0.1){
  #Create a data frame containing the signature for each word
  xx=data.frame(sig=sapply(x, signature),row.names=NULL)
  yy=data.frame(sig=sapply(y, signature),row.names=NULL)
  #Add the original words to the data frame too...
  xx$raw=x
  yy$raw=y
  #We only want words that have a signature...
  xx=subset(xx,subset=(sig!=''))

  #The first matching pass - are there any rows in the two lists that have exactly the same signature?
  xy=merge(xx,yy,by='sig',all=T)
  matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
  #?I think matched=xy[ complete.cases(raw.x,raw.y) ] might also work here?
  #Label the items with identical signatures as being 'Duplicate' matches
  matched$pass="Duplicate"

  #Grab the rows from the first list that were unmatched - that is, no matching item from the second list appears
  todo=subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  #We've grabbed the signature and original raw text from the first list that haven't been matched up yet
  #Name the columns so we know what's what
  colnames(todo)=c('sig','raw')

  #This is the partial matching magic - agrep finds items in the second list that are within a 
  ## certain Levenshtein distance of items in the first list.
  ##Note that I'm finding the distance between signatures.
  todo$partials= as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=T))

  #Bring the original text into the partial match list based on the sig key.
  todo=merge(todo,yy,by.x='partials',by.y='sig')

  #Find the items that were actually partially matched, and pull out the columns relating to signatures and raw text
  partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("sig","raw.x","raw.y"))
  #Label these rows as partial match items
  partial.matched$pass="Partial"
  #Add the set of partially matched items to the set of duplicate matched items
  matched=rbind(matched,partial.matched)
  
  #Find the rows that still haven't been matched
  un.matched=subset(todo,subset=(is.na(raw.x)),select=c("sig","raw.x","raw.y"))

  #If there are any unmatched rows, add them to the list of matched rows, but labelled as such
  if (nrow(un.matched)>0){
    un.matched$pass="Unmatched"
    matched=rbind(matched,un.matched)
  }

  #Grab the columns of raw text from x and y from the matched list, along with how they were matched/are unmatched
  matched=subset(matched,select=c("raw.x","raw.y","pass"))
  #Ideally, the length of this should be the same as the length of valid rows in the original first list (x)

  return(matched)
}

#A rogue character in @coneee's data file borked things for me, so I originally needed to do a character code conversion first
#PercentageUsingTheNet$ECONOMY=iconv(PercentageUsingTheNet$ECONOMY)
#Loading the CSV with the encoding attribute set (h/t Kent Johnson) seemed to work properly though...

#Call the partial match function using two vectors
#The aim is to find items in the second vector that partially match items in the first
#The function returns the first list annotated with partial match items from the second and a match type
matches=partialMatch(PercentageUsingTheNet$ECONOMY,ccode$Country.Name)
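
For a quick feel for how well the matching went, a couple of extra lines (not part of the original recipe) can be used to inspect the result:

#Inspect the matches data frame
head(matches)        #the raw.x/raw.y pairings and how each one was matched
table(matches$pass)  #counts of Duplicate, Partial and Unmatched items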

As ever, this code was arrived at by tinkering and searching on Stack Overflow (using search terms along the lines of “partial match R” and “R levenshtein”). If you can improve on it, please do so and paste a link to the improved code, or a code fragment itself, in the comments below:-)

UPDATE: via the comments, the following suggestion that I don’t have time to check right now…
#Bring the original text into the partial match list based on the sig key.
-todo=merge(todo,yy,by.x='partials',by.y='sig')
+todo=merge(todo,yy,by.x='partials',by.y='sig',all.x=T)

#Label these rows as partial match items
-partial.matched$pass="Partial"
+if (nrow(partial.matched) > 0) partial.matched$pass="Partial"

Thanks:-)

When we run the script and look at the contents of the matches dataframe, this is an example of what we get:

This data frame can then be merged with the originally loaded data to give us the required country code annotations:

a=PercentageUsingTheNet
b=ccode
#Merge the original data set with the ISO country code country name keys
aa=merge(a,matches,by.x='ECONOMY',by.y='raw.x',all.x=T)
#Merge in the ISO country codes
aa=merge(aa,b,by.x='raw.y',by.y='Country.Name',all.x=T)
aa=subset(aa,select=c('ECONOMY','RANK','PERCENTAGE.OF.INDIVIDUALS.USING.THE.INTERNET.2011','ISO.3166.1.alpha.2.code'))

Something like this, for example:

Unfortunately, not every country in the original data set is matched:

In particular, here are the unmatched items and what they presumably should have been matched with:

Lao P.D.R. - LAO PEOPLE'S DEMOCRATIC REPUBLIC
Syria - SYRIAN ARAB REPUBLIC
S. Korea - KOREA, REPUBLIC OF
Bolivia - BOLIVIA, PLURINATIONAL STATE OF
Russia - RUSSIAN FEDERATION
Guinea Bissau - GUINEA-BISSAU
St. Vincent & Grenadines - SAINT VINCENT AND THE GRENADINES
S. Sudan - SOUTH SUDAN
Eq. Guinea - EQUATORIAL GUINEA
Congo (Dem. Rep.) - CONGO, THE DEMOCRATIC REPUBLIC OF THE
Congo (Kinshasa) - CONGO
Slovak Republic - SLOVAKIA
Iran - IRAN, ISLAMIC REPUBLIC OF
TFYR Macedonia - MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF

We could try to increase the Levenshtein Distance within which a partial match is suggested, but then we run the risk of introducing false positives.

Removing stopwords as part of the signature function may help, for example in the case of St. Vincent & Grenadines. (Also, I’m not sure what went wrong with Guinea Bissau?) We could presumably also make something of the abbreviations (S., Dem., Rep., Eq., P. D. R.), using these as NGrams or stubs in a regular expression? So for example, convert Lao P. D. R. to a regular expression invoking searches for a three word phrase in which consecutive words start with P, D, R. (Of course, this would require matching across the full country name rather than the signature.)
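
By way of illustration, a stopword-stripping variant of the signature function might look something like the following (a quick sketch, untested against the data above):

#A signature function that also drops a few common stopwords and the ampersand
signature2=function(x){
  words=unlist(strsplit(tolower(x)," "))
  words=words[!words %in% c("the","of","and","&")]
  paste(sort(words),collapse='')
}
#With this, 'Antigua & Barbuda' and 'ANTIGUA AND BARBUDA' both reduce to 'antiguabarbuda'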

The Google Refine documentation page Clustering In Depth (Methods and theory behind the clustering functionality in Google Refine) also identifies several other fuzzy/partial matching algorithms that could be tried here. In which case, I guess I need to rewrite the above function so that it can accommodate different matching algorithms? If you manage to implement any of these other fuzzy matching approaches, please post a link in the comments.

PS If you use any other approaches for partial matching of strings, please feel free to share how you do it in the comments below…:-) For example:
  • theBioBucket: Merging Dataframes by Partly Matching String. This post also introduces adist(), an R function that returns the Levenshtein distance between two strings (quick sketch below).
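
A minimal sketch of adist() in action (it returns a matrix of edit distances between the elements of two character vectors):

#adist() computes the (generalised) Levenshtein distance between strings
adist("guinea bissau", "guinea-bissau")   #should give 1
adist(c("Syria","Russia"), c("SYRIAN ARAB REPUBLIC","RUSSIAN FEDERATION"), ignore.case=TRUE)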

Written by Tony Hirst

September 26, 2012 at 9:00 pm

Local Council Announcements via Newspapers, and Maybe Hyperlocal Blogs, Too…?

In a post on local council declarations of designated public place, I remarked on the following clause in The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007:

“5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“

and idly wondered: how is a “newspaper circulating in its area” defined?

In Statutory Instrument 2012 No. 2089, The Local Authorities (Executive Arrangements) (Meetings and Access to Information) (England) Regulations 2012, which comes into force on September 10th, 2012, there is the following note on interpretations to be used within the regulations:

“newspaper” includes—
(a) a news agency which systematically carries on the business of selling and supplying reports or information to the newspapers; and
(b) any organisation which is systematically engaged in collecting news—
(i) for sound or television broadcasts;
(ii) for inclusion in programmes to be included in any programme service within the meaning of the Broadcasting Act 1990(6) other than a sound or television broadcasting service within the meaning of Part 3 or Part 1 of that Act respectively; or
(iii) for use in electronic or any other format to provide news to the public by means of the internet; [my emphasis]

So, for the purposes of those regulations, a newspaper includes any organisation which is systematically engaged in collecting news for use in electronic or any other format to provide news to the public by means of the internet. (The 2007 regs’ interpretation section doesn’t clarify the meaning of “newspapers”.)

This presumably means, for example, that under regulation 14(2), my local hyperlocal blog, Ventnorblog, could request as a newspaper a copy of any of the documents available for public inspection following payment to the Isle of Wight Council of “postage, copying or other necessary charge for transmission”. Assuming Ventnorblog passes the test of (a) being an organisation, that (b) is systematically engaged in collecting news, and for backwards compatibility with other regulations can be shown to be (c) circulating in the local council area.

If this interpretation of newspapers applies more widely, (for example, if this interpretation is now applied across outstanding Local Government regulations), it also suggests that whereas councils would traditionally have had to place an advert in their local (print) newspaper, now they can do it via a hyperlocal blog?

So this might also include announcements about Licensing of public entertainments [1(4)], or adult shops [2(2)], tattoo shops [13(6)], temporary markets [37(1)], traffic orders [17(2a)], etc etc

PS Hmm, I wonder, is there a single list somewhere detailing all the legislation that requires local councils to “publish in a newspaper circulating in the area”…?

PPS via @robandale, this Local Government Information Unit initiative on Reforming Statutory Notices. It links to a survey for local councils relating to “how many notices are produced, how much they cost, how effective you believe they are etc.” to try to get a baseline on current practice.

A quick peek at the Isle of Wight Council Armchair Auditor (as run by Ventnorblog and updated from Adrian Short’s original version) gives us an idea of how much the Isle of Wight Council spends on “Advertising and Publicity” in areas such as “Traffic Management” (so presumably statutory notices relating to roadworks, road closures etc?) with the local rag, the Isle of Wight County Press.

Presumably a search on OpenlyLocal’s council spending dashboard would turn up similar spending categories for other councils? (It could be quite interesting to try exploring that… I’m not sure if data on the OpenlyLocal spending dashboard has been updated lately, though? I think OpenSpending take their data from OpenlyLocal, so that isn’t much additional help. And the DCLG opendatacommunities.org site only has budget related finance data at the moment?)

Hmm..thinks.. you could get an idea of how much external spending burden different bits of legislation impose on councils from their spending data, couldn’t you?

Written by Tony Hirst

August 23, 2012 at 11:26 am

Olympics Swimming Lap Charts from the New York Times

Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the widely used graphical devices used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:

Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:

(Great examples of how to read a race history chart can be found on the IntelligentF1 blog. For the general case, see The IntelligentF1 model.)

Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.

I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:

One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.

A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):

(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)

All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
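
(As a simple illustration of that last point, here’s a quick R sketch, using made-up lap times, of how end-of-lap positions might be derived from lap time data: accumulate each competitor’s race time, then rank the cumulative times within each lap.)

#Hypothetical lap time data: one row per competitor per lap, ordered by lap within competitor
laptimes = data.frame(
  driver  = rep(c("A","B"), each=3),
  lap     = rep(1:3, times=2),
  laptime = c(90.1, 89.8, 90.5, 89.9, 90.2, 90.0)
)
#Cumulative race time for each driver
laptimes$cumtime  = ave(laptimes$laptime, laptimes$driver, FUN=cumsum)
#End-of-lap position: rank the cumulative times within each lap
laptimes$position = ave(laptimes$cumtime, laptimes$lap, FUN=rank)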

A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:

Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.

[Here’s an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they’d done similar views in previous years using a Flash component, but the current iteration uses d3.js]

The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.

Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:

(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)

Of course, without the data, it can be difficult trying to generate these charts…

…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)

PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.

Written by Tony Hirst

August 8, 2012 at 12:29 pm

London Olympics 2012 Medal Tables At A Glance?

Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.

Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)

Although they’re not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.

Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.

The data I have scraped is represented using rows of the form:

Country, Event, Gold, Silver, Bronze

where Event is at the level of “Swimming”, “Cycling” etc rather than at finer levels of detail (it’s really hard finding data at even this level of detail in an easily grabbable way?)

I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:

  • MedalType, Country, Event
  • MedalType, Event, Country
  • Event, MedalType, Country
  • Event, Country, MedalType
  • Country, MedalType, Event
  • Country, Event, MedalType

Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.

First up, ordered by Medal, Country, Event:

This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:

The colouring is a bit off – the Google component is not as directly scriptable as a d3js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.

The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (ie we’re happy to take a point of view based on the award of any medal type):

The Country-Event-Medal view gives us a view of the relative strength in depth of each country in each event:

and the Country-Medal-Event view allows us to then tunnel in on the gold winning events:

I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.
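
(On the R front, one library worth a look is the treemap package; assuming it’s installed, and using a made-up long-format data frame in place of the scraped medals data, a minimal sketch might look something like this:)

#Hypothetical medal counts in long format: one row per medal type/country/event combination
library(treemap)
medals = data.frame(
  MedalType = c("Gold","Gold","Gold","Silver"),
  Country   = c("USA","USA","China","USA"),
  Event     = c("Swimming","Athletics","Diving","Swimming"),
  Count     = c(16, 9, 6, 9)
)
#The Medal-Country-Event ordering; reorder the index vector for the other five views
treemap(medals, index=c("MedalType","Country","Event"), vSize="Count")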

It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as exploring, just as quick doodles with the Google chart component, simpler treemaps that don’t reveal lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)

Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events – 100m and Long Jump, for example, rather than Track and Field – as well as the names of individual medalists).

PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances

PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship

Written by Tony Hirst

August 6, 2012 at 9:07 pm

Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

However, suppose that you need to get the data into a different shape – maybe one line per country with an additional column specifying the medal type. Something like this, for example:

How can we generate that sort of view from the original data set? Here’s one way, that works when the columns you want to split into row values are contiguous (that is, next to each other). From the first column in the set of columns you want to be transformed, select Transpose > Transpose cells across columns into rows:

We now set the original selected column headers to be the cell value within a new column – MedalType – and the original cell values the value within a Country column:

(Note that we could also just transform the data into a single column. For example, suppose we had columns relating to courses currently taken by a particular student (Course 1, Course 2, Course 3), with a course code as cell value and one, two or three columns populated per student. If we wanted one row per student per course, we could just map the three columns onto a single column – CourseCode – and assign multiple rows to each student, then filter out rows with a blank value in the CourseCode column as required.)

Ticking the Fill down in other columns checkbox ensures that the appropriate Sport and Event values are copied in to the newly created rows:

Having worked out how to do that oft-required bit of data reshaping, I thought I could probably have another go at something that has been troubling me for ages – how to generate multiple rows from a single row where one of the columns contains JSON data (maybe pulled from a web service/API) that contains multiple items. This is a “mate in three” sort of problem, so here’s how I started to try to work it back. Given that I now know how to map columns onto rows, can I work out how to map different results in a JSON response onto different columns?

For example, here’s a result from the Facebook API for a search on a particular OU course code and the word open in a Facebook group name:

{"data":[{"version":1,"name":"U101 (Open University) start date February 2012","id":"325165900838311"},{"version":1,"name":"Open university, u101- design thinking, October 2011","id":"250227311674865"},{"version":1,"name":"Feb 2011 Starters U101 Design Thinking – Open University","id":"121552081246861"},{"version":1,"name":"Open University – U101 Design Thinking, Feburary 2011","id":"167769429928476"}],"paging":{"next":…etc…}}

It returns a couple of results in the data element, in particular group name and group ID. Here’s one way I found of creating one row per group… Start off by creating a new column based on the JSON data column that parses the results in the data element into a list:

We can then iterate over the list items in this new column using the forEach grel command. The join command then joins the elements within each list item, specifically the group ID and name values in each result:

forEach(value.parseJson(),v,[v.id,v.name].join('||'))

You’ll notice that for multiple results, this produces a list of joined items, which we can also join together by extending the GREL expression:

forEach(value.parseJson(),v,[v.id,v.name].join('||')).join('::')

We now have a column that contains ‘||’ and ‘::’ separated items – :: separates individual group results from each other, || separates the id and name for each particular group.
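
Using the example Facebook response shown above, the resulting cell value would look something like this (id and name paired with ||, groups separated by ::):

325165900838311||U101 (Open University) start date February 2012::250227311674865||Open university, u101- design thinking, October 2011::…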

Given we know how to create rows from multiple columns, we could try to split this column into separate columns using Edit column > Split into separate columns. This would create one column per result, which we could then transform into rows, as we did above. Whilst I don’t recommend this route in this particular case, here’s how we could go about doing it…

A far better approach is to use the Edit cells > split multi-valued cells option to automatically create new rows based on splitting the elements in a single column:

Note, however, that this creates blanks in the other columns, so we need to Edit cells > Fill down to fill in the blanks in any other columns we want to refer to. After doing that, we end up with something like this:

We could now split the groupPairs column using the || separator to create two columns – Group ID and group name – giving us one row per group, and separate columns identifying the course, group name and group ID.

If the above route seems a little complicated, fear not…Once you apply it, it starts to make sense!

Written by Tony Hirst

July 30, 2012 at 11:50 am

Searching for a Map of Designated Public Places…

A discussion, earlier, about whether it was now illegal to drink in public…

…I thought not, think not, at least, not generally… My understanding was that local authorities can set up controlled, alcohol-free zones and create some sort of civil offence for being caught drinking alcohol there. (As it is, councils can set up regions where public consumption of alcohol may be prohibited and this prohibition may be enforced by the police.) So surely there must be an #opendata powered ‘no drinking here’ map around somewhere? The sort of thing that might result from a newspaper hack day, something that could provide a handy layer on a pub map? I couldn’t find one, though…

I did a websearch, turned up The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007, which does indeed appear to be the bit of legislation that regulates drinking alcohol in public, along with a link to a corresponding guidance note: Home Office circular 013 / 2007:

16. The provisions of the CJPA [Criminal Justice and Police Act 2001, Chapter 2 Provisions for combatting alcohol-related disorder] should not lead to a comprehensive ban on drinking in the open air.

17. It is the case that where there have been no problems of nuisance or annoyance to the public or disorder having been associated with drinking in that place, then a designation order … would not be appropriate. However, experience to date on introducing DPPOs has found that introducing an Order can lead to nuisance or annoyance to the public or disorder associated with public drinking being displaced into immediately adjacent areas that have not been designated for this purpose. … It might therefore be appropriate for a local authority to designate a public area beyond that which is experiencing the immediate problems caused by anti-social drinking if police evidence suggests that the existing problem is likely to be displaced once the DPPO was in place. In which case the designated area could include the area to which the existing problems might be displaced.

Creepy, creep, creep…

This, I thought, was interesting too, in the guidance note:

37. To ensure that the public have full access to information about designation orders made under section 13 of the Act and for monitoring arrangements, Regulation 9 requires all local authorities to send a copy of any designation order to the Secretary of State as soon as reasonably practicable after it has been made.

38. The Home Office will continue to maintain a list of all areas designated under the 2001 Act on the Home Office website: www.crimereduction.gov.uk/alcoholorders01.htm [I’m not convinced that URL works any more…?]

39. In addition, local authorities may wish to consider publicising designation orders made on their own websites, in addition to the publicity requirements of the accompanying Regulations, to help to ensure full public accessibility to this information.

So I’m thinking: this sort of thing could be a great candidate for a guidance note from the Home Office to local councils recommending ways of releasing information about the extent of designation orders as open geodata. (Related? Update from ONS on data interoperability (“Overcoming the incompatibility of statistical and geographic information systems”).)

I couldn’t immediately find a search on data.gov.uk that would turn up related datasets (though presumably the Home Office is aggregating this data, even if it’s just in a filing cabinet or mail folder somewhere*), but a quick websearch for Designated Public Places site:gov.uk intitle:council turned up a wide selection of local council websites along with their myriad ways of interpreting how to release the data. I’m not sure if any of them release the data as geodata, though? Maybe this would be an appropriate test of the scope of the Protection of Freedoms Act Part 6 regulations on the right to request data as data (I need to check them again…)?

* The Home Office did release a table of designated public places in response to an FOI request about designated public place orders, although not as data… But it got me wondering: if I scheduled a monthly FOI request to the Home Office requesting the data on a monthly basis, would they soon stop fulfilling the requests as timewasting? How about if we got a rota going?! Is there any notion of a longitudinal/persistent FOI request that just keeps on giving (could I request the list of designated public places the Home Office has been informed about over the last year, along with a monthly update of requests in the previous month (or previous month but one, or whatever is reasonable…) over the next 18 months, or two years, or for the life of the regulation, or until such a time as the data is published as open data on a regular basis)?

As for the report to government that a local authority must make on passing a designation order – 9. A copy of any order shall be sent to the Secretary of State as soon as reasonably practicable after it has been made. – it seems that how the area denoted as a public space is described is moot: “5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“. Hmmm, two things jump out there…

Firstly, a local authority shall cause to be published in a newspaper circulating in its area [my emphasis; how is a newspaper circulating in its area defined? Do all areas of England have a non-national newspaper circulating in that area? Does this implicitly denote some “official channel” responsibility on local newspapers for the communication of local government notices?]. Hmmm…..

Secondly, the area identified specifically or by description. On commencement, the order must also be made public by “identifying the place which has been identified in the order”, again “in a newspaper circulating in its area”. But I wonder – is there an opportunity there to require something along the lines of and published using an appropriate open data standard in an open public data repository, and maybe further require that this open public data copy is the one that is used as part of the submission informing the Home Office about the regulation? And if we go overboard, how about we further require that each enacted and proposed order is published as such along with a machine readable geodata description, and that single aggregate files containing all of that Local Authority’s current and planned Designated Public Spaces are also published (so one URL for all current spaces, one for all planned ones). Just by the by, does anyone know of any local councils publishing boundary data/shapefiles that mark out their Designated Public Spaces? Please let me know via the comments, if so…

A couple of other, very loosely (alcohol) related, things I found along the way:

  • Local Alcohol Profiles for England: the aim appears to have been the collation of, and a way of exploring, a “national alcohol dataset” that maps alcohol related health indicators on a PCT (Primary Care Trust) and LA (local authority) basis. What this immediately got me wondering was: did they produce any tooling, recipes or infrastructure that would make it easy, in a few clicks, to pull together a national tobacco dataset and associated website, for example? And then I found the Local Tobacco Control Profiles for England toolkit on the London Health Observatory website, along with a load of other public health observatories, and it made me remember – again – just how many data sensemaking websites there already are out there…
  • UK Alcohol Strategy – maybe some leads into other datasets/data stories?

PS I wonder if any of the London Boroughs or councils hosting regional events have recently declared any new Designated Public Spaces #becauseOfTheOlympics.

Written by Tony Hirst

July 23, 2012 at 10:07 pm

Posted in Data, onlinejournalismblog, Policy


Quick and Dirty Recipe: Merging (Concatenating) Multiple CSV files (ODA Spending)

There’s been a flurry of tweets over the last few days about LOCOG’s exemption from FOI (example LOCOG response to an FOI request), but the Olympic Delivery Authority (ODA, one of the owner stakeholders) is rather more open, and publishes its spends over £25k: ODA Finance: Transparency Reports.

CSV files containing spend on a monthly basis are available from the site, using a consistent CSV file format each time (I think…). For what it’s worth, I thought I’d share a pragmatic, though not ideal, Mac/Linux/unix tools command line recipe for generating a single file containing all this data.

  1. Right-click and download each of the CSV files on the page to the same directory (eg odaSpending) on your local machine. (There are easier ways of doing this – I tried wget on the command line, but got an Access Denied response (workaround anyone?); there are probably more than a few browser extensions/plugins that will also download all the files linked to from a page. If so, you just want to grab the csv files; if you get them all, from the command line, just copy the csv files to a new directory: eg mkdir csvfiles;cp *.csv csvfiles)
  2. On the commandline, change directory to the files directory – eg cd odaSpending/csvfiles; then join all the files together: files=*; cat $files > odaspending.csv
  3. You should now have a big file, odaspending.csv, containing all the data, although it will also contain multiple header rows (each csv file had its own header row). Open the file in a text editor (I use TextWrangler), copy from the start of the first line to the start of the second (ie copy the header row, including the end of line/carriage return), then do a Find on the header and global Replace with nothing replacing the search string. Then, depending where you started the replace, maybe paste the header (if required) back into the first row

To turn the data file into something you can explore more interactively, upload it to something like Google Fusion Tables, as I did here (data to May 2012): ODA Spending in Google Fusion Tables

Note that this recipe is a pragmatic one. Unix gurus would surely be able to work out far more efficient scripts that concatenate the files after stripping out the header in all but the first file, for example, or that maybe even check the columns are the same etc etc. But if you want something quick and dirty, this is one way of doing it… (Please feel free to add alternative recipes for achieving the same thing in the comments…)
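
By way of one such alternative, here’s a quick R sketch (untested against the ODA files themselves; it assumes the CSV files really do all share the same columns) that sidesteps the header problem, because read.csv treats each file’s first row as a header:

#Read every CSV file in the current directory and stack them into a single data frame
files = list.files(pattern="\\.csv$")
odaspending = do.call(rbind, lapply(files, read.csv, stringsAsFactors=FALSE))
write.csv(odaspending, "odaspending.csv", row.names=FALSE)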

PS here’s an example of one sort of report you can then create in Fusion Tables – ODA spend with G4S; here’s another: Seconded staff

Written by Tony Hirst

July 16, 2012 at 10:16 am
