
Capital, Labour and Value… I Really Don’t Understand These Terms at All…

For most of my life, I’ve managed to avoid reading much, if anything, about political theory. I have to admit I struggle reading anything from a Marxist perspective because I don’t understand what any of the words mean (I’m not convinced I even know how to pronounce some of them…), or how the logic works that tries to play them off against each other.

The closest I get to reading political books tends to be things more related to organisational theory – Parkinson’s Law, for example… ;-)

So at a second attempt, I’ve started reading David Graeber’s “The Utopia of Rules”. Such is my level of political naivety, I can’t tell whether it’s a rant, a critique, a satire, or a nonsense.

But if nothing else it does start to introduce words in a way that gives me a jumping off point to try to make my own sense out of them. So for example, on page 37, we have a quote claimed to be from Abraham Lincoln (whether the Abraham Lincoln, or another, possibly made up one, I have no idea – I didn’t follow the footnote to check!):

Labor is prior to, and independent of, capital. Capital is only the fruit of labor, and could never have existed if labor had not first existed. Labor is the superior of capital, and deserves much the higher consideration.

This followed on from the observation that “[m]ost Americans, for instance, used to subscribe to a rough-and-ready version of the labor theory of value.” Here’s my rough and ready understanding of that, in part generated as a riffed response to the Lincoln quote, as a picture:

[Image: my rough and ready labour theory of value, sketched as a diagram]

The abstract thing created by labour is value. The abstract thing that capital is exchanged for is value. That capital (a fiction) can create more capital through loans of capital in exchange for capital+interest repayments suggests that the value capital creates – value that corresponds to interest on capital loaned – is a fiction created from a fiction. It only becomes real when the actor needing to repay the additional fiction must acquire it somehow through their own labour, though in some situations it may also be satiated through the creation of more capital-interest, that is, through the creation of other fictions.

Such is the state of my political education!

PS here are some other lines I’ve particularly liked so far. From p32: “The bureaucratisation of daily life means the imposition of impersonal rules and regulations; impersonal rules and regulations, in turn, can only operate if they are backed up by the threat of force.” Which follows from p31: “Whenever someone starts talking about the ‘free market’, it’s a good idea to look around for the man with the gun. He’s never far away.”

And on international trade (p30): “(Much of what was being called ‘international trade’ in fact consisted merely of the transfer of materials back and forth between different branches of the same corporation.)”

Yahoo Pipes Retires…

And so it seems that Yahoo Pipes, a tool I first noted here (February 08, 2007), something I created lots of recipes for (see also on the original, archived OUseful site), and ran many a workshop around (and even started exploring a simple recipe book around), is to be retired (end of life announcement)…

[Image: Yahoo Pipes – “Rewire the web”]

It’s not completely unexpected – I stopped using Pipes much at all several years ago, as sites that started making content available via RSS and Atom feeds then started locking it down behind simple authentication, and then OAuth…

I guess I also started to realise that the world I once imagined, as for example in my feed manifesto, We Ignore RSS at OUr Peril, wasn’t going to play out like that…

However, if you still believe in pipe dreams, all is not lost… Several years ago, Greg Gaughan took up the challenge of producing a Python library that could take a Yahoo Pipe JSON definition file and execute the pipe. Looking at the pipe2py project on github just now, it seems the project is still being maintained, so if you’re wondering what to do with your pipes, that may be worth a look…

By the by, the last time I thought Pipes might not be long for this world, I posted a couple of posts that explored how it might be possible to bulk export a set of pipe definitions as well as compiling and running your exported Yahoo Pipes.

Hmmm… thinks… it shouldn’t be too hard to get pipe2py running in a docker container, should it…?

PS I don’t think pipe2py has a graphical front end, but javascript toolkits like jsPlumb look like they may do much of the job. (It would be nice if the Yahoo Pipes team could release the Pipes UI code, of course…;-)

PPS if you need a simple one step feed re-router, there’s always IFTTT. If realtime feed/stream processing apps are more your thing, here are a couple of alternatives that I keep meaning to explore, but never seem to get round to… Node-RED, a node.js thing (from IBM?) for doing internet-of-things inspired stream processing (I did intend to play with it once, but I couldn’t even figure out how to stream the data I had in…); and Streamtools (about), from The New York Times R&D Lab, which I think does something similar?

Standards or Interoperability?

An interesting piece, as ever, from Tim Davies (Slow down with the standards talk: it’s interoperability & information quality we should focus on) reflecting on the question of whether we need more standards, or better interoperability, in the world of (open) data publishing. Tim also links out to Friedrich Lindenberg’s warnings about 8 things you probably believe about your data standard, which usefully mock some of the claims often casually made about standards adoption.

My own take on many standards in the area is that conventions are the best we can hope for, and that even then they will be interpreted in a variety of ways, which means you have to be forgiving when trying to read them. All manner of monstrosities have been published in the guise of being HTML or RSS, so parsers have had to do the best they can to get the mess into a consistent internal representation on the consumer side of the transaction. Publishers can help by testing that whatever they publish does appear to parse correctly with the current “industry standard” importers, ideally open code libraries. It’s then up to the application developers to decide which parser to use, or whether to write their own.

It’s all very well standardising your data interchange format, but the application developer will then want to work on that data using some other representation in a particular programming language. Even if you have a formal standard interchange format, and publishers stick to it religiously and unambiguously, you will still get different parsers generating the internal representations that application code actually works on – representations that may be very different, and may even have different semantics. [I probably need to find some examples of that to back up that claim, don’t I?! ;-)]
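By way of a quick illustration of the sort of thing I mean – a minimal sketch rather than anything definitive – here’s the same toy CSV fragment pulled into Python twice, once via the standard library csv module and once via pandas:

```python
# The same CSV parsed two ways ends up in quite different internal
# representations: lists of strings in one case, a typed dataframe with a
# NaN (and a float-promoted column) in the other.
import csv
import io

import pandas as pd

raw = "amount,paid_on\n1000,2015-01-15\n,2015-02-01\n"

# csv gives you lists of strings; the missing amount is just an empty string
rows = list(csv.reader(io.StringIO(raw)))
print(rows)

# pandas infers a numeric dtype and treats the missing amount as NaN
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)
print(df)
```

Neither is “wrong”, but application code written against lists of strings won’t necessarily behave the same way as code written against a typed dataframe with NaNs in it.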

I also look at standards from the point of view of trying to get things done with tools that are out there. I don’t really care if a geojson feed is strictly conformant with any geojson standard that’s out there, I just need to know that something claimed to be published as geojson works with whatever geojson parser the Leaflet Javascript library uses. I may get frustrated by the various horrors that are published using a CSV suffix, but if I can open it using pandas (a Python programming library), RStudio (an R programming environment) or OpenRefine (a data cleaning application), I can work with it.

At the data level, if councils published their spending data using the same columns and the same number, character and date formats for those columns, it would make life aggregating those datasets much easier. But even then, different councils use the same thing differently. Spending area codes, or directorate names, are not necessarily standardised across councils, so just having a spending area code or directorate name column (similarly identified) in each release doesn’t necessarily help.

What is important is that data publishers are consistent in what they publish so that you can start to take into account their own local customs and work around those. Of course, internal consistency is also hard to achieve. Look down any local council spending data transaction log and you’ll find the same company described in several ways (J. Smith, J. Smith Ltd, JOHN SMITH LIMITED, and so on), some of which may match the way the same company is recorded by another council, some of which won’t…

Stories are told from the Enigma codebreaking days of how the wireless listeners could identify Morse code operators by the cadence and rhythm of their transmissions, as unique to them as any other personal signature (you know that the way you walk identifies you, right?). In open data land, I think I can spot a couple of different people entering transactions into local council spending transaction logs, where the systems aren’t using controlled vocabularies and selection box or dropdown list entry methods, but instead support free text entry… Which is to say – even within a standard data format (a spending transaction schema), published using a conventional (though variously interpreted) document format (CSV), that may be variously encoded (UTF-8, ASCII, Latin-1), the stuff in the data file may be all over the place…
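As a rough sketch of the sort of workaround I end up writing (the canonical names and the matching cutoff below are illustrative assumptions rather than recommendations), the difflib module in the Python standard library is enough to cluster the more obvious variants:

```python
# Normalise free-text supplier names, then fuzzy match them against a list of
# canonical names. The canonical list and the 0.6 cutoff are illustrative
# assumptions only.
import difflib
import re

def normalise(name):
    """Uppercase, strip punctuation and common company suffixes."""
    name = re.sub(r"[^\w\s]", " ", name.upper())
    name = re.sub(r"\b(LTD|LIMITED|PLC)\b", "", name)
    return " ".join(name.split())

canonical = ["JOHN SMITH", "ACME SUPPLIES"]

for raw in ["J. Smith", "J. Smith Ltd", "JOHN SMITH LIMITED"]:
    match = difflib.get_close_matches(normalise(raw), canonical, n=1, cutoff=0.6)
    print(raw, "->", match[0] if match else "no match")
```

It’s crude, but it’s the kind of local custom handling you end up doing over and over again once you accept that the files themselves won’t be consistent.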

An approach I have been working towards for my own use over the last year or so is to adopt a working environment for data wrangling and analysis based around the Python pandas programming library. It’s then up to me how to represent things internally within that environment, and how to get the data into that representation within that environment. The first challenge is getting the data in, the second getting it into a state where I can start to work with it, the third getting it into a state where I can start to normalise it and then aggregate it and/or combine it with other data sets.

So for example, I started doodling a wrapper for nomis and looking at importers for various development data sets. I have things that call on the Food Standards Agency datasets (and when I get round to it, their API) and scrape reports from the CQC website, I download and dump Companies House data into a database, and I have various scripts for calling out to various Linked Data endpoints.

Where different publishers use the same identifier schemes, I can trivially merge, join or link the various data elements. For approxi-matching, I run ad hoc reconciliation services.

All this is to say that the world is messy and standardised things often aren’t. At the end of the day, integration occurs in your application, which is why it can be handy to be able to code a little, so you can whittle and fettle the data you’re presented with into a representation and form that you can work with. Wherever possible, I use libraries that claim to be able to parse particular standards and put the data into representations I can cope with, and then, where data is published in various formats or standards, go for the option that I know has library support.

PS I realise this post stumbles all over the stack, from document formats (eg CSV) to data formats (or schema). But it’s also worth bearing in mind that just because two publishers use the same schema, you won’t necessarily be able to sensibly aggregate the datasets across all the columns (eg in spending data again, some council transaction codes may be parseable and include dates, accession based order numbers and department codes, while others may just be jumbles of numbers). And just because two things have the same name and the same semantics, doesn’t mean the format will be the same (2015-01-15, 15/1/15, 15 Jan 2015, etc etc).
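(For the date example at least, pandas – or the dateutil library it builds on – will soak up most of the common variants, though the dayfirst setting in the sketch below is my assumption about the publisher’s convention; the data itself can’t tell you whether 1/2/15 means the 1st of February or the 2nd of January.)

```python
# A minimal sketch: the "same" date published in three different formats,
# parsed into a single internal representation. dayfirst=True is an assumption
# about the publisher's convention, not something the file can tell you.
import pandas as pd

for v in ["2015-01-15", "15/1/15", "15 Jan 2015"]:
    print(v, "->", pd.to_datetime(v, dayfirst=True).date())
```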

Problems of Data Quality

One of the advantages of working with sports data, you might have thought, is that official sports results are typically good quality data. With a recent redesign of the Formula One website, the official online (web) source of results is now the FIA website.

As well as publishing timing and classification (results) data in a PDF format intended, presumably, for consumption by the press, the FIA also publish “official” results via a web page.

But as I discovered today, using data from a scraper that scrapes results from the “official” web page rather than the official PDF documents is no guarantee that the “official” web page results bear any resemblance at all to the actual result.

[Image: the FIA’s Spanish Grand Prix 2015 qualifying official classification PDF alongside the “Session Classifications” page on the FIA website]

Yet another sign that the whole F1 circus is exactly that – an enterprise promoted by clowns…

Routine Sources, Court Reporting, the Data Beat and Metadata Journalism

In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:

Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while a later study [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that the local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], pp114-15).

As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending) data, or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.

To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources, enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
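What I have in mind is often little more than a sentence template filled in from the latest row of a dataset – the figures and the wording in the following sketch are invented purely for illustration:

```python
# A toy "data driven press release": a sentence template filled in from the
# latest figures. The numbers and phrasing are invented for illustration.
latest = {"area": "Isle of Wight", "month": "April 2015",
          "claimants": 2345, "change": -120}

direction = "down" if latest["change"] < 0 else "up"
report = ("The number of unemployment benefit claimants in {area} in {month} "
          "was {claimants}, {direction} {delta} on the previous month.").format(
              direction=direction, delta=abs(latest["change"]), **latest)

print(report)
```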

Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
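For the numerical case, the processing involved can be as simple as a rank-and-filter pass over a national table – the file and column names in this sketch are hypothetical placeholders rather than any particular dataset:

```python
# Rank a national dataset on an indicator and flag whether a particular local
# area falls in the top or bottom N. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("national_indicator.csv")   # assumed columns: area, indicator
df["rank"] = df["indicator"].rank(ascending=False, method="min")

N = 10
local = df[df["area"] == "Isle of Wight"].iloc[0]
if local["rank"] <= N:
    print("Possible story: {} is in the top {} nationally.".format(local["area"], N))
elif local["rank"] > len(df) - N:
    print("Possible story: {} is in the bottom {} nationally.".format(local["area"], N))
```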

In the context of this post, I will be considering the role that metadata about court cases, contained within court lists and court registers, might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well as to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view of the activity of the courts than the impression one might get about their behaviour simply from the balance of coverage provided by the media.


Creating Interactive Election Maps Using folium and IPython Notebooks

During the last couple of weeks of Cabinet Office Code Clubs, we’ve started to explore how we can use the python folium library to generate maps. Last week we looked at getting simple markers onto maps along with how to pull data down from a third party API (the Food Standards Agency hygiene ratings), and this week we demonstrated how to use shapefiles.
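For a flavour of the marker plotting, here’s roughly the sort of thing we were doing (the locations and labels below are made up, and recent versions of folium may differ slightly in API detail from the version we used in the session):

```python
# A minimal sketch of dropping simple markers onto a folium map and saving it
# as an HTML page. The coordinates and popup labels are invented; in the
# session the points came from the FSA hygiene ratings data.
import folium

m = folium.Map(location=[51.5, -0.1], zoom_start=12)

for lat, lon, label in [(51.501, -0.142, "Establishment A"),
                        (51.507, -0.128, "Establishment B")]:
    folium.Marker(location=[lat, lon], popup=label).add_to(m)

m.save("markers.html")
```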

As a base dataset, I used Chris Hanretty et al.’s election forecasts data as a foil for making use of Westminster parliamentary constituency shapefiles. The dataset gives a forecast of the likelihood of each party winning a particular seat, so within a party we can essentially generate a heat map of how likely that party is to win each seat. So for example, here’s a forecast map for the Labour party:

[Image: choropleth map of the forecast likelihood of a Labour win in each constituency]

Although the election data table doesn’t explicitly say which party has the highest likelihood of winning each seat, we can derive that from the data with a little bit of code that melts the original dataset into a form where each row represents a constituency and party combination (rather than a single row per constituency with columns for each party’s forecast), then groups by constituency, sorts by forecast value and picks the first (highest) value. (Ties will be ignored…)

[Image: reshaping the election forecast data in an IPython notebook]
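In pandas terms, the reshaping looks something like the following sketch – the file and column names are guesses at the shape of the forecast data rather than the actual ones:

```python
# Melt a wide table (one row per constituency, one column per party) into long
# form, then pick the party with the highest forecast in each constituency.
# File and column names are assumed, not the actual ones in the dataset.
import pandas as pd

wide = pd.read_csv("election_forecasts.csv")   # e.g. Constituency, Lab, Con, LD, ...

long = pd.melt(wide, id_vars=["Constituency"],
               var_name="Party", value_name="Forecast")

most_likely = (long.sort_values("Forecast", ascending=False)
                   .groupby("Constituency", as_index=False)
                   .first())
print(most_likely.head())
```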

We can then generate a map based on the discrete categorical values of which party has the highest forecast likelihood of taking each seat.

[Image: map of the party with the highest forecast likelihood of taking each seat]
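A sketch of the colouring step might look something like this, carrying on from the most_likely table in the previous sketch, and assuming the constituency boundaries are available as GeoJSON with each feature carrying a constituency name property (the PCON13NM property name, the file names and the colour mapping are all assumptions on my part):

```python
# Colour constituency boundaries by the party with the highest forecast
# likelihood. The GeoJSON file, the PCON13NM property name and the party
# colour mapping are assumptions made purely for illustration.
import json

import folium

party_colour = {"Lab": "#dc241f", "Con": "#0087dc", "LD": "#fdbb30"}
winner = dict(zip(most_likely["Constituency"], most_likely["Party"]))

with open("constituencies.geojson") as f:
    boundaries = json.load(f)

def style(feature):
    name = feature["properties"].get("PCON13NM")
    return {"fillColor": party_colour.get(winner.get(name), "#cccccc"),
            "color": "grey", "weight": 1, "fillOpacity": 0.7}

m = folium.Map(location=[54.5, -2.5], zoom_start=6)
folium.GeoJson(boundaries, style_function=style).add_to(m)
m.save("likelyparty.html")
```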

An IPython notebook showing how to generate the maps can be found here: how to use shapefiles.

One problem with this sort of mapping technique for the election forecast data is that the areas we see coloured are representative of geographical area, not population size. Indeed, the population of each constituency is roughly similar, so our impression that the country is significantly blue is skewed by the relative areas of the forecast blue seats compared to the forecast red ones, for example.

Ways round this are to use cartograms, or regularly sized hexagonal boundaries, such as described on Benjamin Hennig’s Views of the World website, from which the following image is republished (see also the University of Sheffield’s (old) Social and Spatial Inequalities Research Group election mapping project website):

[Image: 2010 UK general election results shown on a conventional map and a population cartogram, compared]

(A hexagonal constituency KML file, coloured by 2010 results, and corresponding to constituencies defined for that election, can be found from this post.)

From Front Running Algorithms to Bot Fraud… Or How We’ve Lost Control of the Bits…

I’ve just finished reading Michael Lewis’ Flash Boys, a cracking read about algorithmic high frequency trading and how the code and communication systems that contribute to the way stock exchanges operate can be gamed by front-running bots. (For an earlier take, see also Scott Patterson’s Dark Pools; for more “official” takes, see things like the SEC’s regulatory ideas response to the flash crash of May 6, 2010, an SEC literature review on high frequency trading, or this Congressional Research Service report on High-Frequency Trading: Background, Concerns, and Regulatory Developments).

As the book describes, some of the strategies pursued by the HFT traders were made possible because of the way the code underlying the system was constructed. As Lessig pointed out way back in Code and Other Laws of Cyberspace, and revisited in Codev2:

There is regulation of behavior on the Internet and in cyberspace, but that regulation is imposed primarily through code. The differences in the regulations effected through code distinguish different parts of the Internet and cyberspace. In some places, life is fairly free; in other places, it is more controlled. And the difference between these spaces is simply a difference in the architectures of control — that is, a difference in code.

The regulation imposed on the interconnected markets by code was gameable. Indeed, it seems that it could be argued that it was even designed to be gameable…

Another area in which the bots are gaming code structures is digital advertising. A highly amusing situation is described in the following graphic, taken from The Bot Baseline: Fraud in Digital Advertising (via http://www.ana.net/content/show/id/botfraud):

[Image: “ad laundering” diagram from The Bot Baseline: Fraud in Digital Advertising (ANA/White Ops)]

A phantom layer of “ad laundering” fake websites, whose traffic comes largely from bots, is used to generate ad-impression revenue. (Compare this with networks of bots on social media networks that connect to each other, send each other messages, and so on, to build up “authentic” profiles of themselves, at least in terms of traffic usage dynamics. Examples: MIT Technology Review on Fake Persuaders; or this preprint on The Rise of Social Bots.)

As the world becomes more connected and more and more markets become exercises simply in bit exchange between bots, I suspect we’ll be seeing more and more of these phantom layer/bot audience combinations on the one hand, and high-speed, market stealing, front running algorithms on the other.

PS Not quite related, but anyway: how you’re being auctioned in realtime whenever you visit a website that carries ads – The Curse of Our Time – Tracking, Tracking Everywhere.

PPS Interesting example of bots reading the business wires and trading on the back of them: The Wolf of Wall Tweet: A Web-reading bot made millions on the options market.