OpenSpending – OUseful.Info, the blog…

Local Council Spending Data – Time Series Charts

In What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? I started wondering about the extent to which local spending transparency data might play a role in supporting consultation around new budgets.

As a first pass, I’ve popped up a quick application up at http://glimmer.rstudio.com/psychemedia/iwspend2013_14/ [if that’s broken, try this one] (shiny code here) that demonstrates various ways of looking at open spending data from the Isle of Wight council. You can pass form items in via the URL (except to set the Directorate – oops!), and also search using regular expressions, but at the moment still need to hit the Search button to actually run the search. NOTE – there’s a little bug – you need to hit the Search button to get it to show data; note – selecting All directorates and no filter terms can be a bit slow to display anything…

Examples:

– http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?expensesType=(oil)|(gas)|(electricity)

– http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?serviceArea=mainland

– http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?supplierName=capita

I’ve started exploring various views over the data, but these need thinking through properly (in particular with respect to finding out views that may actually be useful!)

Hmm… did the budget change directorate?!

Some more views over the suppliers tab – I started experimenting with some tabular views in the suppliers tab too…

This is all very “shiny” of course, but is it useful? From these early glimpses over the data, can you think of any ways that a look at the spending data might help support budget consultations? What views over the data, in particular, might support such an aim, and what sort of stories might we be able to tell around this sort of data?

What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations?

It seems to be that time of year when my local council is consulting about next year’s budgets, running a series of roadshows (which I missed) backed up by an online consultation (but no briefing or discussion documents?)

The online consultation opens with asking for opinions about the extent to which the council should support various services:

Adult Social Care, Asset & Estates Management, Bereavement Services, Central Support Services (Legal, Finance, ICT, Human Resources, Shared Services, Corporate & Democratic Core), Childrens Social Care, Coastal Management, Community Safety, Concessionary Fares, Coroner, Cowes Chain Ferry, Culture & Heritage, Economic Development, Education Psychology/Special Education Needs, Education Welfare Service, Elections & Land Charges, Environmental Health, Licensing, Trading Standards, Fire & Rescue, Harbours, Highways, Housing, Leisure & Recreation, Libraries, Parking Services, PFI Contract, Planning & Building Control, Public Conveniences, Public Health, Registrars, Revenues & Benefits, Schools Infrastructure/buildings, School Music Service, School Transport, Sports & Play Development, Subsidised Buses, Tourism, Waste Management, Workforce Development, Youth Service

The council also publishes a Budget Book which identifies intended savings in three directorate areas (Economy & Environment, Community Wellbeing & Social Care, Schools & Learning).

Budgets are allocated in different service areas according to the Service Reporting Code of Practice for Local Authorities (SeRCOP):

The Revenue Budget Service Analysis breaks things down further within each service area:

Adult social care

Central Services and Capital Costs

Childrens Social Care

Education

Cultural and related services

Environmental Services

Planning and development services

Fire and Rescue Services

Highways and Transport Services

Housing Services

Local spending data

Naive as I am, I started wondering about the extent to which we might be able to map spending items released through the council’s Spending and Finance (Transparency) data releases.

So for example, if we know a service area has a particular annual budget, and we also get to see certain declarations of spend within that are (maybe reconciled to contracts and other procurement information), then we might start to be able to pull together a picture about the range of activities associated with the service area based on the trace shadows those activities throw off in the form of spending items.

If a budget cut is proposed to a service area, and it applies proportionally to the spend, or to particular areas of the spend, we might start to get a feeling about what the practical consequences of the cut might actually be.

So that’s the proposition – but how far can we get when we look at the spending data itself?

Reviewing the spending data to date for the 2013-14 financial year, we see spend associated with the following directorates:

Childrens Services
Community Wellbeing & Social Care
Corporate
Economy & Environment
Resources
Chief Executive, Schools & Learning

It’s not necessarily immediately obvious how these map on to the services listed in the budget book (for example, is Childrens Social Care in Childrens Services or Community Wellbeing & Social Care?

The spending data released by the Isle of Wight Council also has two other relevant columns: the Service Area and the Expenses Type. Unfortunately, the Service Area descriptions seem to be rather ad hoc, and whilst the Expense Area terms may follow some standard vocabulary, it’s not obvious which, or how these terms are associated with particular summary spending areas that appear in the budget book.

“music”

As a very quick example of the sort of thing we can start to look for, here is an example of spends associated with the keyword music appearing in the spending Service Area:

The School Music Service is one of the areas the council consultation asked about, so if there is to be a cut to the funding of this service that impacts on spend, we might get a feel for what levels of savings are possible on transparently declared spend in that area:

So responsibility for the service changed directorate in July?

Here’s the overall cumulative spend:

So the transparently declared spend isn’t that much (the majority of spend – not declared in the spending data – is presumably on salaries?), and what there is declared is mainly on travel. As armchair auditors or budget consultees, I guess this provides us with the question: is just over £10k on staff vehicle mileage over the last 6 months a large amount?

We could possibly also get an estimate of the amount of mileage involved (Isle of Wight council mileage rates). As a lower bound on the mileage, if it was all charged at the highest rate (65p per mile) and all the Staff Vehicle mileage was on miles at that rate, it would come to 10490.62 * 100 / 65.0, or just over 16,100 miles; that is, slightly over 2,500 miles per month, or 100 miles per working day.

But again: is this a large number? When you consider that the School Music Service provides “weekly instrumental and vocal tuition” to “some 3,000 pupils a week in 49 primary, secondary and special schools … from a team of 24 qualified and experienced staff” (via about the School Music Service), it doesn’t seem so much?

(I am mindful that doing such exercises can lead to hatchet jobs around small amounts of money on particular public services, and am quite wary of giving examples as a result… What I do hope to show, though, is how this sort of data investigation does do can encourage you to go looking for other sources of data to help make sense of, and further put into context, the data you do have…)

“elderly”

In terms of adult social care, services provided to “older’ people (presumably, 65 and over) appear to be identified in the budget book headings as one coherent group. So what affect might budget cuts in this area have on spending, based on spending items that might be associated with this area?

If we search for mention of the key term elderly, here’s what we get in accumulated spend to date:

                                           Service.Area     sumTot
                   Elderly Mentally Ill Residential Care 3511701.13
                                  Elderly Frail Homecare 2084136.78
                             Elderly Residential Daycare 1725826.30
                            Elderly Frail Nursing Island 1579067.00
                          Elderly Frail Residential Care 1169120.25
                        Elderly Frail Residential Income 1116792.64
                          Elderly Frail Nursing - Island  669237.95
                     Elderly Mentally Ill Nursing Island  558575.44
                          Elderly Frail Managed Accounts  230445.28
                     Elderly Mental Ill Nursing - Island  216668.02
                           Elderly Mentally Ill Homecare  154026.30
                          Elderly Frail Nursing Mainland   33553.91
                                   Elderly Frail Daycare   23377.70
                   Elderly Mentally Ill Nursing Mainland   11761.68
                   Elderly Mental Ill Nursing - Mainland   10243.80
                   Elderly Mentally Ill Managed Accounts    9257.41
                            Elderly Mentally Ill Daycare    4549.10
                        Elderly Frail Nursing - Mainland    4515.84
                                Elderly Frail Other Care    2711.83
                         Elderly Mentally Ill Other Care    2550.16
                          Elderly Frail Personal Budgets    1900.64
                       Elderly Frail Community Equipment     924.39
                   Elderly Mentally Ill Personal Budgets     625.24
                            Elderly Frail Non-Res Income     171.15
   Elderly Frail - Day/Other care - Client Contributions      36.50
                     Elderly Mentally Ill Nursing Income   -1510.32
                            Elderly Frail Nursing Income   -2371.82

And here are the main suppliers:

                               Supplier.Name    sumTot
                       ISLAND HEALTHCARE LTD 891433.32
                      REDACTED PERSONAL DATA 836757.77
                         SCIO HEALTHCARE LTD 685499.04
                           SOMERSET CARE LTD 616713.96
               LONDON RESIDENTIAL HEALTHCARE 517033.27

We can also get a feel for what the expense types are:

                       Expenses.Type      sumTot
  Charges from Independent Providers 13504438.81
                Regular Respite Care   214573.11
           Crisis Support for Carers    17951.37
                    Fixed Telephones      924.39
                Professional Service      356.04
               Operational Equipment       36.50
                     Client Expenses       11.83
                Client Contributions    -1185.83
                    Provider Refunds  -619211.92

With many challenges facing the provision of homecare more generally, how is funding allocated in service areas containing the keyterm homecare?

                                Service.Area     sumTot
                      Elderly Frail Homecare 2084136.78
                Physical Disability Homecare  395358.15
               Elderly Mentally Ill Homecare  154026.30
                Learning Disability Homecare  123971.95
              Learning Disabilities Homecare  120884.88
                      Mental Health Homecare  117730.46
                           Dementia Homecare   28467.84
                         Homecare Reablement   20892.01
                    Homecare 18-25 year olds   12519.84
   Children Young Adults Disability Homecare    2962.08

Summary

As I mentioned, I didn’t get to go to any of the budget consultation roadshows. But I do wonder whether we can make use of spending data to help us get a feel for some of the services that budgets deliver via spend with third parties. For example, can we use such information as a view on to how cuts to budgets might play out in terms of spending cuts associated with the withdrawal or reduction of corresponding services?

In order to relate spend to budget items described in the budget book, I think there is still some way to go though. What we need is another column in the spending data that relates to headings described in the budget book. (Even better might be columns relating to SerCOP codes?)

PS Hmm, I wonder… is this the sort of topic that could make for an interesting School of Data style community based data expedition? And waht role might the Isle of Wight Armchair Auditor or OpenSpending play in such a thing?

Negative Payments in Local Spending Data

In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and providing access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced payments”, that is, multiple payments where the absolute value of a negative payment matches that of an above zero payment (so far example, -£269.72 and £269.72 would be balanced payments).

The data I’ve grabbed is Isle of Wight Council transparency (spending) data for the financial year 2012/2013. The data was pulled from the Isle of Wight Council website and cleaned using OpenRefine broadly according to the recipe described in Using OpenRefine to Clean Multiple Documents in the Same Way with a couple of additions: a new SupplierNameClustered column, originally created from the SupplierName column, but then cleaned to remove initials, (eg [SD]), direct debit prefixes (DD- etc) and then clustered using a variety of clustering algorithms; and a new unique index column, created with values '_'+row.index. You can find a copy of the data here.

To start with, let’s see how we can identify “popular” transaction amounts:

#We going to use some rearrangement routines...
library(plyr)

#I'm loading in the data from a locally downloaded copy
iw <- read.csv("~/code/Rcode/eduDemos1/caseStudies/IWspending/IWccl2012_13_TransparencyTest4.csv")

#Display most common payment values
commonAmounts=function(df,bcol='Amount',num=5){
  head(arrange(data.frame(table(df[[bcol]])),-Freq),num)
}
ca=commonAmounts(iw)

ca
#     Var1 Freq
#1 1567.72 2177
#2 1930.32 1780
#3  1622.6 1347
#4 1998.08 1253
#5  1642.2 1227

This shows that the most common payment by value is for £1567.72. An obvious question to ask here is: does this correspond to some sort of “standard payment” or tariff? And if so, can we reconcile this against the extent of the delivery of a particular level of service, perhaps as disclosed elsewhere?

We can also generate a summary report to show what transactions correspond to this amount, or the most popular amounts.

#We can then get the rows corresponding to these common payments
commonPayments=function(df,bcol,commonAmounts){
  df[abs(df[[bcol]]) %in% commonAmounts,]
}
cp.df=commonPayments(iw,"Amount",ca$Var1)

More usefully, we might generate “pivot table” style summaries of how the popular payments breakdown with respect to expenses area, supplier, or some other combination of factors. So for example, we can learn that there were 1261 payments of £1567.72 booked to the EF Residential Care services area, and SOMERSET CARE LTD [SB] received 231 payments of £1567.72. (Reporting by the clustered supplier name is often more useful…)

#R does pivot tables, sort of...
#We can also run summary statistics over those common payment rows.
zz1=aggregate(index ~ ServiceArea + Amount, data =cp.df, FUN="length")
head(arrange(zz1,-index))

#                       ServiceArea  Amount index
#1              EF Residential Care 1567.72  1261
#2             EMI Residential Care 1642.20   659
#3             EMI Residential Care 1930.32   645
#4              EF Residential Care 1930.32   561
#5              EF Residential Care 1622.60   530
#6 Elderly Frail Residential Income 1622.60   36

#Another way of achieving a similar result is using the plyr count() function:
head(arrange(count(cp.df, c('ServiceArea','ExpensesType','Amount')),-freq))
#That is:
zz1.1=count(cp.df, c('ServiceArea','ExpensesType','Amount'))

zz2=aggregate(index ~ ServiceArea +ExpensesType+ Amount, data =cp.df, FUN="length")
head(arrange(zz2,-index))
#                       ServiceArea                       ExpensesType  Amount
#1              EF Residential Care                Chgs from Ind Provs 1567.72
#2             EMI Residential Care                Chgs from Ind Provs 1930.32
#3             EMI Residential Care                Chgs from Ind Provs 1642.20
#4              EF Residential Care Charges from Independent Providers 1622.60
#...

zz3=aggregate(index ~ SupplierName+ Amount, data =cp.df, FUN="length")
head(arrange(zz3,-index))
#                SupplierName  Amount index
#1     REDACTED PERSONAL DATA 1567.72   274
#2     SOMERSET CARE LTD [SB] 1567.72   231
#3 ISLAND HEALTHCARE LTD [SB] 1930.32   214
#...

zz4=aggregate(index ~ SupplierNameClustered+ Amount, data =cp.df, FUN="length")
head(arrange(zz4,-index))

If I was a suspicious type, I suppose I might look for suppliers who received one of these popular amounts only once…

Let’s have a search for items declared as overpayments or refunds, perhaps as preliminary work in an investigation about why overpayments are being made…:

#Let's look for overpayments
overpayments=function(df,fcol,bcol='Amount',num=5){
  df=subset(df, grepl('((overpay)|(refund))', df[[fcol]],ignore.case=T))
  df[ order(df[,bcol]), ]
}
rf=overpayments(iw,'ExpensesType')

#View the largest "refund" transactions
head(subset(arrange(rf,Amount),select=c('Date','ServiceArea','ExpensesType','Amount')))
#        Date                           ServiceArea         ExpensesType    Amount
#1 2013-03-28 Elderly Mentally Ill Residential Care     Provider Refunds -25094.16
#2 2012-07-18                   MH Residential Care Refund of Overpaymts -24599.12
#3 2013-03-25 Elderly Mentally Ill Residential Care     Provider Refunds -23163.84

#We can also generate a variety of other reports on the "refund" transactions
head(arrange(aggregate(index ~ SupplierName+ ExpensesType+Amount, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType   Amount index
#1 THE ORCHARD HOUSE CARE HOME Provider Refunds  -434.84     8
#2 THE ORCHARD HOUSE CARE HOME Provider Refunds  -729.91     3
#3         SCIO HEALTHCARE LTD Provider Refunds  -434.84     3
#...
##Which is to say: The Orchard House Care home made 8 refunds of £434.84 and 3 of £729.91

head(arrange(aggregate(index ~ SupplierName+ ExpensesType, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType index
#1         SCIO HEALTHCARE LTD Provider Refunds    31
#2           SOMERSET CARE LTD Provider Refunds    31
#3 THE ORCHARD HOUSE CARE HOME Provider Refunds    22
#...

Another area of investigation might be “balanced” payments. Here’s one approach for finding those, based around first identifying negative payments. Let’s also refine the approach in this instance for looking for balanced payments involving the supplier involved in the largest number of negative payments.

##Find which suppliers were involved in most number of negative payments
#Identify negative payments
nn=subset(iw,Amount<=0)
#Count the number of each unique supplier
nnd=data.frame(table(nn$SupplierNameClustered))
#Display the suppliers with the largest count of negative payments
head(arrange(nnd,-Freq))

#                    Var1 Freq
#1 REDACTED PERSONAL DATA 2773
#2      SOUTHERN ELECTRIC  313
#3   BRITISH GAS BUSINESS  102
#4      SOMERSET CARE LTD   75
#...

Let’s look for balanced payments around SOUTHERN ELECTRIC…

#Specific supplier search
#Limit transactions to just transactions involving this supplier
se=subset(iw, SupplierNameClustered=='SOUTHERN ELECTRIC')

##We can also run partial matching searches...
#sw=subset(iw,grepl('SOUTHERN', iw$SupplierNameClustered))

#Now let's search for balanced payments
balanced.items=function(df,bcol){
  #Find the positive amounts
  positems=df[ df[[bcol]]>0, ]
  #Find the negative amounts
  negitems=df[ df[[bcol]]<=0, ]
  
  #Find the absolute unique negative amounts
  uniqabsnegitems=-unique(negitems[[bcol]])
  #Find the unique positive amounts
  uniqpositems=unique(positems[[bcol]])
  
  #Find matching positive and negative amounts
  balitems=intersect(uniqabsnegitems,uniqpositems)
  #Subset the data based on balanced positive and negative amounts
  #bals=subset(se,abs(Amount) %in% balitems)
  bals=df[abs(df[[bcol]]) %in% balitems,]
  #Group the data by sorting, largest absolute amounts first
  bals[ order(-abs(bals[,bcol])), ]
}

dse=balanced.items(se,'Amount')

head(subset(dse,select=c('Directorate','Date','ServiceArea','ExpensesType','Amount')))
#                              Directorate       Date                ServiceArea ExpensesType   Amount
#20423               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity -2770.35
#20424               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity  2770.35
#52004 Chief Executive, Schools & Learning 2013-01-30 Haylands - Playstreet Lane  Electricity  2511.20
#52008 Chief Executive, Schools & Learning 2013-01-31 Haylands - Playstreet Lane  Electricity -2511.20

We can also use graphical techniques to try to spot balanced payments.

#Crude example plot to highlight matched +/1 amounts
se1a=subset(se,ServiceArea=='Ryde Town Hall' & Amount>0)
se1b=subset(se,ServiceArea=='Ryde Town Hall' & Amount<=0)

require(ggplot2)
g=ggplot() + geom_point(data=se1a, aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

In this first image, we see payments over time – the red markers are “minus” amounts on the negative payments. Notice that in some cases balanced payments seem to appear on the same day. If the y-value of a red and a black marker are the same, they are balanced in value. The x-axis is time. Where there is a period of equally spaced marks over x with the same y-value, this may represent a regular scheduled payment.

g=ggplot(se1a)+geom_point( aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+facet_wrap(~ExpensesType)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')
g=g+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

We can also facet out the payments by expense type.

Graphical techniques can often help us spot patterns in the data that might be hard to spot just by looking at rows and columns worth of data. In many cases, it is worthwhile developing visual analysis skills to try out quick analyses “by eye” before trying to implement them as more comprehensive analysis scripts.

Using OpenRefine to Clean Multiple Documents in the Same Way

When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule.

To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis:

To get to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-)

If we want to get this data into a site such as OpenSpending.org, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us to to try to automate this process and make life a little easier.

One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that…

Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves:

The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately.

OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!).

Here’s what the data looks like once we import it:

OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?).

Here’s how we can rename the columns:

Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removeing spaces and using capital letters to make work boundaries), or replace spaces with underscores (_).

Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers:

We can preview what the non-numeric values are so we can set about tidying them up…

So commas appear to be the major issue – let’s remove them by transforming cells in that column that contain a comma by removing the comma.

We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','')

Let’s see what effect that has on the numeric facet view:

Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.)

As far as the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column:

Here’s how we define the contents of that column:

That is: value.toDate('dd.mm.yy').toString('yyyy-mm-dd')

The first part – value.toDate('dd.mm.yy') interprets the string as a date presented in a given format, and then transforms that data into the rewquired date format: .toString('yyyy-mm-dd')

We can quickly preview that this step has worked by transforming the column to a date type:

and then preview it using a timeline facet to check that all seems in order.

(If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.)

We don’t want the date formatting for out OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format:

So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset.

If we copy this history to a text file somewhere, we can then make use of it again and again.

To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply:

Paste in the transformation script we grabbed from the previous project:

Here’s the script I used – https://gist.github.com/psychemedia/6087946

When we apply the script, the data is cleaned using the same operations as previously:

That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-)

Finally, we can export the data for use as required…