Accounting for the 80% – A Quick Pareto Principle Filter for pandas

Having decided (again) to try to do something with local government transparency spending data this year, I thought I take the take of generating some simple reports that just identify the significant spending areas within a particular directorate or service area.

It’s easy enough to render dozens of charts that show some bars bigger than others, and from this suggest the significant spending areas, but this still requires folk to spend time reading those charts and runs the risk that that they don’t “read” from the chart what you wanted them to… (This is one way in which titles and captions can help…)

So how about putting the Pareto Principle, or 80/20 rule, to work, where 80% of some effect or other (such as spending) is accounted for by 20% of contributory effects (such as spending in a particular area, or to a particular set of suppliers)?

In other words, is one way in to the spending data to use it simply to see what accounts for 80%, or thereabouts, of monthly spend?

Here’s a quick function that tries to do something like that, that can be applied to a pandas Series:

```def paretoXY(s, x, y,threshold=0):
''' Return a list until you account for X% of the whole and remainders are less than Y% individually.
The threshold percentage allows you to hedge your bets and check items just past the treshold. '''
#Generate percentages, and sort, and find accumulated total
#Exclude any negative payments that can make the cumulative percentage go > 100% before we get to them
df=pd.DataFrame( s[s>0].sort_values(ascending=False) )
df['pc']= 100*s/s.sum()
df['cum_pc']=df['pc'].cumsum()
#Return ordered items that either together account at least for X% of the total,
# and/or individually account for at least Y% of the total
#The threshold provides a fudge factor on both components...
return df[ (df['cum_pc']-df['pc'] <= x+ x*threshold/100) | (df['pc'] >= y-y*threshold/100) ]
```

The resulting report simply describes just the components that either make up 80% (or whatever) of the total in each area, or that represent a significant contribution (howsoever defined), in their own right, to the corresponding total.

In the above case, the report describes the significant expense types in capital or revenue streams for each directorate for a given month’s spending data.

The resulting dataframe can also be converted to a crude text report summarising percentage contributions to specific areas easily enough:

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...