In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:
Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while later [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], p114-15).
As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending data), or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.
To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
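As a rough illustration of what a “latest figures” textualisation might look like, here is a minimal sketch. The function name, the wording template and the numbers are all made up for the example; this is not the nomis sketch itself, just the general idea of turning a pair of figures into a sentence.

```python
def describe_latest_figure(area, measure, current, previous, period):
    """Return a one-sentence 'latest figures' summary for one area."""
    change = current - previous
    if change > 0:
        direction = f"rose by {change:,} to {current:,}"
    elif change < 0:
        direction = f"fell by {abs(change):,} to {current:,}"
    else:
        direction = f"was unchanged at {current:,}"
    return f"{measure} in {area} {direction} in {period}."


# Made-up figures for a made-up area, for illustration only
print(describe_latest_figure("Exampleshire", "The claimant count", 1250, 1310, "March"))
```

A real version would need to worry about rounding, seasonal adjustment and whether a change is large enough to be worth remarking on at all.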
Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
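The “top M or bottom N” style of alert can be sketched in a few lines. This is only an illustration: the `rank_alerts` function, the area names and the indicator values are hypothetical, and it assumes that a higher indicator value ranks higher.

```python
def rank_alerts(values, top_m=3, bottom_n=3):
    """Map each area that falls in the top M or bottom N to an alert string.

    `values` is a dict of {area: indicator value}; higher values rank first.
    """
    ranked = sorted(values, key=values.get, reverse=True)
    total = len(ranked)
    alerts = {}
    for position, area in enumerate(ranked, start=1):
        if position <= top_m:
            alerts[area] = f"{area} is ranked {position} of {total} (top {top_m})"
        elif position > total - bottom_n:
            alerts[area] = f"{area} is ranked {position} of {total} (bottom {bottom_n})"
    return alerts


# Made-up indicator values for five made-up areas
indicator = {"Area A": 10, "Area B": 50, "Area C": 30, "Area D": 20, "Area E": 40}
for message in rank_alerts(indicator, top_m=1, bottom_n=1).values():
    print(message)
```

Each alert could then be routed to whichever local outlet covers the area it mentions.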
In the context of this post, I will be considering the role that metadata about court cases that is contained within court lists and court registers might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well as to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view over the activity of the courts than the impression one might get about their behaviour simply from the balance of coverage provided by the media.
During the last couple of weeks of Cabinet Office Code Clubs, we’ve started to explore how we can use the python folium library to generate maps. Last week we looked at getting simple markers onto maps along with how to pull data down from a third party API (the Food Standards Agency hygiene ratings), and this week we demonstrated how to use shapefiles.
As a base dataset, I used Chris Hanretty et al.’s election forecasts data as a foil for making use of Westminster parliamentary constituency shapefiles. The dataset gives a forecast of the likelihood of each party winning a particular seat, so within a party we can essentially generate a heat map of how likely a party is to win each seat. So for example, here’s a forecast map for the Labour party:
Although the election data table doesn’t explicitly say which party has the highest likelihood of winning each seat, we can derive that from the data with a little bit of code to melt the original dataset into a form where a row indicates a constituency and party combination (rather than a single row per constituency, with columns for each party’s forecast), then grouping by constituency, sorting by forecast value and picking the first (highest) value. (Ties will be ignored…)
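The reshaping step can be sketched in plain Python, without any libraries. The seat names and forecast values below are made up, and the helper functions are hypothetical. Rather than sorting each group and taking the first row, this version keeps a running maximum per constituency, which gives the same winner (with the first party seen winning any tie).

```python
# Wide table: one row per constituency, one column per party (made-up values)
wide = [
    {"constituency": "Seat A", "Con": 0.55, "Lab": 0.40, "LD": 0.05},
    {"constituency": "Seat B", "Con": 0.20, "Lab": 0.70, "LD": 0.10},
]


def melt(rows, id_col):
    """Reshape wide rows into long (id, party, forecast) tuples."""
    return [
        (row[id_col], party, value)
        for row in rows
        for party, value in row.items()
        if party != id_col
    ]


def most_likely_winner(long_rows):
    """Keep the highest-forecast party per constituency; first seen wins ties."""
    best = {}
    for seat, party, value in long_rows:
        if seat not in best or value > best[seat][1]:
            best[seat] = (party, value)
    return {seat: party for seat, (party, _) in best.items()}


long_rows = melt(wide, "constituency")
winners = most_likely_winner(long_rows)
print(winners)
```

In pandas, the equivalent steps would typically be `melt`, `sort_values` and a `groupby` on the constituency column.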
We can then generate a map based on the discrete categorical values of which party has the highest forecast likelihood of taking each seat.
An IPython notebook showing how to generate the maps can be found here: how to use shapefiles.
One problem with this sort of mapping technique for the election forecast data is that the areas we see coloured are representative of geographical area, not population size. Indeed, the population of each constituency is roughly similar, so our impression that the country is significantly blue is skewed by the relative areas of the forecast blue seats compared to the forecast red ones, for example.
Ways round this are to use cartograms, or regularly sized hexagonal boundaries, such as described on Benjamin Hennig’s Views of the World website, from which the following image is republished (see also the University of Sheffield’s (old) Social and Spatial Inequalities Research Group election mapping project website):
(A hexagonal constituency KML file, coloured by 2010 results, and corresponding to constituencies defined for that election, can be found from this post.)
I’ve just finished reading Michael Lewis’ Flash Boys, a cracking read about algorithmic high frequency trading and how the code and communication systems that contribute to the way stock exchanges operate can be gamed by front-running bots. (For an earlier take, see also Scott Patterson’s Dark Pools; for more “official” takes, see things like the SEC’s regulatory ideas response to the flash crash of May 6, 2010, an SEC literature review on high frequency trading, or this Congressional Research Service report on High-Frequency Trading: Background, Concerns, and Regulatory Developments).
As the book describes, some of the strategies pursued by the HFT traders were made possible because of the way the code underlying the system was constructed. As Lessig pointed out way back in Code and Other Laws of Cyberspace, and revisited in Codev2:
There is regulation of behavior on the Internet and in cyberspace, but that regulation is imposed primarily through code. The differences in the regulations effected through code distinguish different parts of the Internet and cyberspace. In some places, life is fairly free; in other places, it is more controlled. And the difference between these spaces is simply a difference in the architectures of control — that is, a difference in code.
The regulation imposed on the interconnected markets by code was gameable. Indeed, it seems that it could be argued that it was even designed to be gameable…
Another area in which the bots are gaming code structures is digital advertising. A highly amusing situation is described in the following graphic, taken from The Bot Baseline: Fraud in Digital Advertising (via http://www.ana.net/content/show/id/botfraud):
A phantom layer of “ad laundering” fake websites whose traffic comes largely from bots is used to generate ad-impression revenue. (Compare this with networks of bots on social media networks that connect to each other, send each other messages, and so on, to build up “authentic” profiles of themselves, at least in terms of traffic usage dynamics. Examples: MIT Technology Review on Fake Persuaders; or this preprint on The Rise of Social Bots.)
As the world becomes more connected and more and more markets become exercises simply in bit exchange between bots, I suspect we’ll be seeing more and more of these phantom layer/bot audience combinations on the one hand, and high-speed, market stealing, front running algorithms on the other.
PS Not quite related, but anyway: how you’re being auctioned in realtime whenever you visit a website that carries ads – The Curse of Our Time – Tracking, Tracking Everywhere.
PPS Interesting example of bots reading the business wires and trading on the back of them: The Wolf of Wall Tweet: A Web-reading bot made millions on the options market.
For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.
Here’s a quick debrief-to-self of some of the things that came to mind…
There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and contact each one that was going to be named to give them a right of reply. Contacting a couple of dozen locations takes time and diplomacy (which even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.
Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment or national, regional or neighbouring locale averages. As a data junkie, it can be easy to count things by group, perhaps overlooking a journalistic take that many of these counts could be used as the basis of a quick filler story or space-filling, info-snack glanceable breakout box in a larger story.
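The “x% of schools” sort of count is only a few lines of code once the inspection records are in hand. A minimal sketch, assuming each record is a dict with hypothetical `type` and `rating` keys (the real data may well use different field names), and using made-up records:

```python
def share_with_rating(records, category, rating):
    """Percentage of establishments of type `category` that hold `rating`.

    Returns None if there are no establishments of that type.
    """
    in_category = [r for r in records if r["type"] == category]
    if not in_category:
        return None
    hits = sum(1 for r in in_category if r["rating"] == rating)
    return 100 * hits / len(in_category)


# Made-up inspection records, for illustration only
records = [
    {"name": "School 1", "type": "School", "rating": 5},
    {"name": "School 2", "type": "School", "rating": 5},
    {"name": "School 3", "type": "School", "rating": 5},
    {"name": "School 4", "type": "School", "rating": 4},
    {"name": "Takeaway 1", "type": "Takeaway", "rating": 5},
    {"name": "Takeaway 2", "type": "Takeaway", "rating": 2},
]

for category in ("School", "Takeaway"):
    print(f"{share_with_rating(records, category, 5):.0f}% of {category} establishments are rated 5")
```

Running the same function over a neighbouring area, or the national dataset, gives the comparison figures needed to put the local number in context.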
Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic…) Related to this is the “so what?” question… (I guess for news, if you wouldn’t share it in the pub or over dinner having read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? (Hmm… is “Liking” the same as remarking on something? I get the feeling it’s less engaged…))
There’s a huge difference between the tinkering I do and production warez
I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things that can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook, but there are several problems with this:
- to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
- there isn’t much time available… so you need to think about what to offer. For example:
- the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
- there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.
Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may be to iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, first step was to get a marker displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)
This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…
The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could be something that could add to the telling of a story (each of which may require a particular set of skills), but also the whole raft of production related issues that then result (which could require a whole raft of other technical skills (which are, for example, skills I know that I don’t really have, even given my quirky skillset…)). And if the corporate IT folk take ownership of the publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.
I tend to use ggplot a lot in R for exploring datasets graphically, rather than for producing presentation graphics to support the telling of a particular story. Add to that, I’m still not totally up to speed on charting in the python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (if space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example).
Some things I didn’t consider but that now come to mind are:
- how are charts practically handed over? (As Excel charts? as image files?)
- does a sub-editor or web-developer then process the charts somehow?
- for print, are there limitations on use of colour, line thickness, font-size and style?
Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:
- data tables
If you want data tables, there are various libraries or tools for styling charts, but again the question of workflow and the actual form in which items are handed over for print or web publication needs to be considered.
Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score) and taking it as-is as a trusted fact, given it was published from a trusted authoritative source, obtained directly from that source, and not processed locally, but then asking the manager of that establishment for a comment about how that score came about or what action they have taken as a result of getting it.
I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!
One of the things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)
Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)
Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.
The buzz phrase for elements of this (I think?) is microservices or microservice architecture (“a particular way of designing software applications as suites of independently deployable services”, [ref.]) but the idea of being able to run apps anywhere (yes, really, again…!;-) seems to have been revitalised by the recent excitement around, and rapid pace of development of, docker containers.
Essentially, docker containers are isolated/independent containers that can be run in a single virtual machine. Containers can also be linked together so that they can talk to each other and yet remain isolated from other containers in the same VM. Containers can also expose services to the outside world.
In my head, this is what I think various bits and pieces of it look like…
A couple of recent announcements from docker suggest, to me at least, one direction of travel that could be interesting for delivering distance education and remote and face-to-face training. They include:
- docker compose (fig, as was) – “with Compose, you define your application’s components – their containers, their configuration, links, volumes, and so on – in a single file, then you can spin everything up with a single command that does everything that needs to be done to get your application running.”
- docker machine – “a tool that makes it really easy to go from ‘zero to Docker’. Machine creates Docker Engines on your computer, on cloud providers, and/or in your data center, and then configures the Docker client to securely talk to them.” [Like boot2docker, but supports cloud as well?]
- Kitematic UI – “Kitematic completely automates the Docker installation and setup process and provides an intuitive graphical user interface (GUI) for running Docker containers on the Mac.” [Windows version coming soon]
I don’t think there is GUI support for configuration management provided out of docker directly, but presumably if they don’t buy up something like panamax they’ll be releasing their own version of something similar at some point soon?!
(With the data course currently in meltdown, I’m tempted to add a bit more to the confusion by suggesting we drop the monolithic VM approach and instead go for a containerised approach, which feels far more elegant to me… It seems to me that with a little bit of imagination, we could come up with a whole new way of supporting software delivery to students. eg an OU docker hub with an app container for each app we make available to students, container compositions for individual courses, a ‘starter kit’ DVD (like the old OLA CD-ROM) with a local docker hub to get folk up and running without big downloads etc etc. ..) It’s unlikely to happen of course – innovation seems to be too risky nowadays, despite the rhetoric…:-(
As well as being able to run docker containers locally or in the cloud, I also wonder about ‘plug and play’ free running containers that run on a wifi enabled Raspberry Pi that you can grab off the shelf, switch on, and immediately connect to? So for example, a couple of weeks ago Wolfram and Raspberry announced the Wolfram Language and Mathematica on Raspberry Pi, for free [Wolfram’s Raspberry Pi pages]. There are also crib sheets for how to run docker on a Raspberry Pi (the downside of this being that you need ARM based images rather than x86 ones), which could be interesting?
So pushing the thought a bit further, for the mythical submariner student who isn’t allowed to install software onto their work computer, could we give them a Raspberry Pi running their OU course software as a service they could remotely connect to?!
PS by the by, at the Cabinet Office Code Club I help run for Open Knowledge last week, we had an issue with folk not being able to run OpenRefine properly on their machines. Fortunately, I’d fired up a couple of OpenRefine containers on a cloud host so we could still run the session as planned…
A few weeks ago, the UK Department for Transport (DfT) published a summary report and action plan entitled The Pathway to Driverless Cars that sets out the UK response to car manufacturers and tech companies pushing to develop, test and produce autonomous vehicles on public roads. This was followed by an announcement in the recent (March, 2015) budget where the Chancellor announced that “we are going to back our brilliant automotive industry by investing £100 million to stay ahead in the race to driverless technology” (Hansard), and a new code of practice around the testing of driverless vehicles to appear sometime this Spring (2015).
One of the claims widely made in favour of autonomous vehicles/driverless cars by the automotive lobby is that in testing they have a better safety record than human drivers. (Human error plays a role in many accidents.) Whilst the story that a human driver crashes Google’s self-driving car is regularly wheeled out to illustrate how Google’s cars are much safer than humans, we don’t tend to see so many stories about when the human test driver had to take control of the vehicles, whether to avoid an accident, or because road and/or traffic conditions were “inappropriate” for “driverless” operation.
Nor do we hear much about the technology firms making road transport safer by doing something to mitigate the role that their technology plays in causing accidents.
As Nick Carr writes in his history of automation, The Glass Cage:
It’s worth noting Silicon Valley’s concern with highway safety, though no doubt sincere, has been selective. The distractions caused by cellphones and smartphones have in recent years become a major factor in car crashes. An analysis by the National Safety Council [in the US] implicated phone use in one-fourth of all accidents on US roads in 2012. Yet Google and other top tech firms have made little or no effort to develop software to prevent people from calling, texting or using apps while driving — surely a modest undertaking compared with building a car that can drive itself.
To see what proportion of road traffic incidents involved distractions caused by mobile phones, I thought I’d check the STATS19 dataset. This openly published dataset records details of UK road accidents involving casualties reported to the police. The form used to capture the data includes information relating to up to six contributory factors, including “Driver using mobile phone”.
Unfortunately, the data collected on this part of the form is deemed to be RESTRICTED as opposed to UNCLASSIFIED (the latter classification applying to those elements released in the STATS19 dataset), which means we can’t do any stats around this from the raw data. (I think the reason the data is not released is that it may be used to help de-anonymise incident data by triangulating information contained in the dataset with information gleaned from local news reports about an incident, for example?)
The release of the 2013 annual report is supported by a set of statistical tables that break down accidents in all sorts of ways including two tables (ras50012.xls and ras50016.xls) that summarise accident counts on a local authority basis that include mobile phones as a contributory factor in reported incidents. So for example, in England in 2013 there were 384 such incidents. (It is not clear how many of the 2,669 incidents that included a “distraction in vehicle” might also have related to distractions caused by mobile phones particularly… Nor is it clear what the severity or impact of incidents with mobile phones recorded as a contributory factor actually were…)
In terms of autonomous vehicle safety, and how the lobbying groups pitch their case, it would also be interesting to know how autonomous vehicles are likely to cope in the context of other contributory factors, such as vision affected by external factors (10,272 (11%) in England in 2013), pedestrian factors only (11,877 (12%)), vehicle defects (1,757, (2%)), or road environment contributed (12,436 (13%)). For cases where there was an “impairment or distraction” in general (12,162 (13%)), it would be interesting to know what would be likely to happen in an autonomous vehicle where the vehicle tried to hand control back to a supervising human driver… (Note that percentages across contributory factors do not sum to 100% – incidents may have had several contributory factors.)
As technology continues to offer ever more “solutions” to claimed problems, I’m really mindful that we need to start being more critical of it and the claims made in pushing particular solutions. In particular, three things concern me: 1) that if we look at the causes of problems that technology claims to fix, maybe technology is contributing to the problem (and the answer is not to apply more technology to treat problems caused by other technology); 2) that we don’t tend to look at the systemic consequences of applying a particular technology; 3) that we don’t tend to recognise how adopting a particular technology can lock us in to a particular set of (inflexible) technology mediated practices, nor how we change our behaviour to suit that technological solution.
On balance, I’m probably negative on the whole tech thing, even though I guess I work within it…
Every so often there’s a flurry of hype around the “internet of things”, but in many respects it’s already here – and has been for several decades. I remember as a kid being intrigued by some technical documents describing some telemetry system or other that remote water treatment plants used to transmit status information back to base. And I vaguely remember from a Maplin magazine around the time an article or two about what equipment you needed to listen in on, and decode, the radio chatter of all manner of telemetry systems.
Perhaps the difference now is a matter of scale – it’s easier to connect to the network, comms are bidirectional (you can receive as well as transmit information), and with code you can effect change on receipt of a message. The tight linkage between software and hardware – bits controlling atoms – also means that we can start to treat more and more things as “plant” whose behaviour we can remotely monitor, and govern.
A good example of how physical, consumer devices can already be controlled – or at least, disabled – by a remote operator is described in a New York Times article that crossed my wires last week, Miss a Payment? Good Luck Moving That Car, which describes how “many subprime borrowers [… in the US] must have their car outfitted with a so-called starter interrupt device, which allows lenders to remotely disable the ignition. Using the GPS technology on the devices, the lenders can also track the cars’ location and movements.” As the loan payment due date looms, it seems that some devices also emit helpful beeps to remind you…. And if your car loan agreement stipulates you’ll only drive within a particular area, I imagine that you could find it’s been geofenced. (A geofence is a geographical boundary line that can be used to detect whether a GPS tracked device has passed into, or exited from, a particular region. When used to disable a device that leaves – or enters – a particular area, as for example drones flying into downtown Washington, we might consider it a form of “location based management” (or “geographical rights management (GRM)”?!) that can disable activity in a particular location where someone who claims to control use of that device in that space actually exerts their control. (Think: DRM for location…))
One of the major providers of “starter interrupt devices” is a company called PassTime (product list). Their products include:
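To make the geofence idea concrete, here is a minimal sketch of a circular geofence check: is a GPS fix within some radius of a centre point, and has the device entered or exited the region between two fixes? The function names, coordinates and radius are all hypothetical, and this says nothing about how any particular vendor’s device actually works; real geofences may well be polygons rather than circles.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def inside_geofence(position, centre, radius_km):
    """True if the (lat, lon) position lies within radius_km of the centre."""
    return haversine_km(*position, *centre) <= radius_km


def crossing(previous, current, centre, radius_km):
    """Return 'entered', 'exited' or None for a pair of consecutive GPS fixes."""
    was_in = inside_geofence(previous, centre, radius_km)
    is_in = inside_geofence(current, centre, radius_km)
    if was_in and not is_in:
        return "exited"
    if is_in and not was_in:
        return "entered"
    return None


# Made-up fence: 10 km around an arbitrary point
fence_centre = (54.0, -1.5)
far_away = (55.0, -1.5)  # one degree of latitude further north
print(crossing(fence_centre, far_away, fence_centre, 10))
```

An “exited” event is the sort of thing that could then be used to trigger a notification, or something rather more intrusive.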
- PassTime Plus, the core of their “automated collection technology”.
- Trax: “PassTime TRAX is the entry level GPS tracking product”. Includes: Pin point GPS location service, Up to Six (6) simultaneous Geo-fences.
- PassTime GPS: “provides asset protection at an economical price while utilizing the same hardware and software platform of PassTime’s Elite Pro line of products. GPS tracking and remote vehicle disable features offer customers tools for a swift recovery if needed.” Includes: Pin point GPS location service, Remote vehicle disable option, Tow-Detect Notification, Device Tamper Notification, Up to Six (6) simultaneous Geo-fences, 24-Hour Tracking, Automatic Location Heartbeat
- Elite-Pro: “the ultimate combination of GPS functionality and Automated Collection Technology”. Includes the PassTime GPS features but also mentions “Wireless Command Delivery”.
PassTime seem to like the idea of geofences so much they have patents in related technologies: PassTime Awarded Patent for Geo-Fence and Tamper Notice (US Patent: 8018329). You can find other related patents by looking up other patents held by the inventors (for example…).
You’ll be glad to know that PassTime have UK partners… in the form of The Car Finance Company, who are apparently “the world’s largest user and first company in the UK to start fitting Payment Reminder Technology to your new car”. Largest user?! That seems to be according to a recent [March 12, 2015] press release announcing an extension to their agreement that “will bring 70,000 payment assurance and telematics devices to the United Kingdom”.
Here’s how The Car Finance Company spin it: The Passtime system helps remind you when your repayments are due so you can ensure you stay on track with your loan and help repair and rebuild your credit. The device is only there to help you keep your repayments up to date, it doesn’t affect your car nor does it monitor the way you drive. From the recent press release, “PassTime has been supplying Payment Assurance and GPS devices to The Car Finance Company since 2009” (my emphasis). I’m not sure if that means the PassTime GPS (with the starter interrupt) or the Trax device? If I was a journalist, rather than a blogger, I’d probably phone them to try to clarify that…
In passing, whilst searching for providers of automotive GPS trackers in the UK (and there are lots of them – search on something like GPS fleet management, for example…) I came across this rather intrusive piece of technology, The TRACKER Mesh Network, which “uses vehicles fitted with TRACKER Locate and TRACKER Plant to pick up reply codes from stolen vehicles with an activated TRACKER unit making them even easier to locate and recover”. Which is to say, this company has an ad hoc, mobile, distributed network of sensors spread across the UK road network that listen out for each other and opportunistically track each other. It’s all good, though:
“The TRACKER Mesh Network will enable the police to extend the network of ‘eyes and ears’ to identify and locate stolen vehicles more effectively using advanced technology and allow us to stay one step ahead of criminals who are becoming more and more adept at stealing cars. This is a real opportunity for the motoring public to help us clamp down on car thieves and raises public confidence in our ability to recover their possessions and bring the offenders to justice.”
(By the by, previous notes on ANPR – Automatic Number Plate Recognition. Also, note the EU eCall accident alerting system that automatically calls for help if you have a car accident [about, UK DfT eCall cost/benefit analysis].)
This conflation of commercial and police surveillance is… to be expected. But the data’s being collected, and it won’t go away. The Snowden revelations showed the scope of security service data collection activities, and chunks of that data won’t be going away either. The scale of the data collection is such that it’s highly unlikely that we’re all being actively tracked or that this data will ever meaningfully contribute to the detection of conspiracies, but it can and will be used post hoc to create paranoid data driven fantasies about who could have met whom, when, discussed what, and so on.
I guess where we can practically start to get concerned is in considering the ‘trickle down’ way in which access to this data will increasingly be opened up, and/or sold, to increasing numbers of agencies and organisations, both public and private. As Ed Snowden apparently commented in a session at SXSW (Snowden at SXSW: Be very concerned about the trickle down of NSA surveillance to local police), “[t]hey’ve got everything. The question becomes, Now they’re empowered. They can leak [this stuff]. It does happen at the local level. These capabilities are created. High tech. Super secret. But they inevitably bleed over to law enforcement. When they’re brand new they’re only used in the extremes. But as that transition happens, more and more people get access, they use it in newer and more and more expansive and more abusive ways.”
(Trickle down – or over-reach – applies to legislation too. For example, from a story widely reported in April, 2008: Half of councils use anti-terror laws to spy on ‘bin crimes’, although the legality of such practices was challenged (Councils warned over unlawful spying using anti-terror legislation) and guidance introduced in November 2012 required local authorities to obtain judicial approval prior to using covert techniques. (I realise I’m in danger here of conflating things not specifically related to over-reach on laws “intended” to be limited to anti-terrorism associated activities (whatever they are) with over-reach…) Other reviews: Lords Constitution Committee – Second Report – Surveillance: Citizens and the State (Jan 2009), Big Brother Watch on How RIPA has been used by local authorities and public bodies and Cataloguing the ways in which local authorities have abused their covert surveillance powers. I’m guessing a good all round starting point would be the reports of the Independent Reviewer of Terrorism Legislation.)
When it comes to processing large amounts of data, finding meaningful, rather than spurious, connections between things can be hard… (Correlation is not causation, right?, as Spurious Correlations wittily points out…;-)
What is more manageable is dumping people onto lists and counting things… Or querying specifics. A major problem with the extended and extensive data collection activities going on at the moment is that access to the data to allow particular queries to be made will be extended. The problem is not that all your data is being collected now; the issue is that post hoc searches over it could be made by increasing numbers of people in the future. Like bad tempered council officers having a bad day, or loan company algorithms with dodgy parameters.
PS Schneier on connecting the dots: Why Mass Surveillance Can’t, Won’t, And Never Has Stopped A Terrorist