Some Notes on Churnalism and a Question About Two Sided Markets

Whilst recently pondering automated content generation from original data sources again (eg as per Data Textualisation – Making Human Readable Sense of Data, or Notes on Narrative Science and Automated Insights), along with other forms of so-called “robot journalism”, I started framing for myself some of the risks associated with that approach in the context of churnalism, “the use of unchecked PR material in news” (Daniel Jackson and Kevin Moloney, ‘Inside Churnalism: PR, journalism and power relationships in flux’, Journalism Studies, 2015), “the passive processing of material which overwhelmingly tends to be supplied for them by outsiders, particularly wire agencies and PR” (Nick Davies, ‘Flat Earth News’, Vintage Books, 2009 p73), “journalists failing to perform the simple basic functions of their profession. … journalists who are no longer out gathering news but who are reduced instead to passive processors of whatever material comes their way, churning out stories, whether real event or PR artifice, important or trivial, true or false” (ibid, p59).

Davies (ibid) goes on to suggest that: “the churnalists working on the assembly line in the news factory construct national news stories from raw material which arrives along two primary conveyor belts: the Press Association and public relations” (p74).

The quality of these sources differs in the following ways: “PA is a news agency, not a newspaper. It is not attempting, nor does it claim to be attempting, to tell people the truth about the world. … The PA reporter goes to the press conference with the intention of capturing an accurate record of what is said. Whether what is said is itself a truthful account of the world is simply not their business” (p83). If this is a fair representation of what PA does, we might then expect the journalist to do some journalistic work around the story, contextualising it, perhaps seeking comment, alternative views or challenges to it.

On the other hand, PR content is material that is “clearly inherently unreliable as a source of truth, simply because it is designed to serve an interest” (p89), “whether or not it is truthful, [because it] is designed specifically to promote and to suppress stories in order to serve the interests of political, commercial and other groups” (p91). This sort of content requires additional journalistic effort in terms of verification and some reflection on the rationale behind the press release when trying to work out the extent to which it is newsworthy or might feed into a story that is newsworthy (a press release or flurry of press releases might give you a feeling that there is a deeper story…).

The two input drivers of churnalism claimed by Davies – wire copy and PR – both play a significant role in influencing what content goes into a story. For the publisher of web-mediated news content, another filtering process may influence what content the audience sees in the form of algorithms that prioritise the placement of news stories on a website, such as the “most popular story” links. Mindful of web traffic stats, the churnalist might also be influenced by this third “reverse input” in their selection of what content is likely to do well when posting a story.

According to Jackson & Maloney (p2), “[t]he classic sociological conceptualisation of [the] process [in which “the PR practitioner trades data and opinion with journalists in exchange for favourable publicity”] is the information subsidy (Gandy 1982, Gandy, Oscar H. 1982. Beyond Agenda Setting: Information Subsidies and Public Policy)” [my emphasis]. The notion of the information subsidy was new to me, and I think is a useful one; it is explored in Gandy, Oscar H. “Information in health: Subsidised news.” Media, Culture & Society 2.2 (1980), pp103-115 as follows:

“We have suggested that the news media, traditionally seen as an independent and highly credible source for information about the environment, is in fact, dominated by purposive information supplied by PRs interested in influencing private and public decision making. We have suggested further that information subsidies of journalists and other gatekeepers operate on the basis of simple economic rules. Journalists need news, however defined, and routine sources are the easiest ways to gain that information. We have suggested further that success in providing information subsidies to one’s chosen targets is closely tied to the resources available to the subsidy giver, since considerable resources are necessary for the creation of pseudo-events dramatic enough to lure a harried reporter away from a well written press release, or from the cocktails which often accompany a special press briefing. Media visibility breeds more visibility. Appearance in the press lends status, and that status leads more quickly to a repeat appearance” p106.

Economically speaking, the media is often regarded as operating in a two sided market (for example, I’ve just popped the following onto my to-read list: Two-Sided Markets: An Overview, Jean-Charles Rochet, Jean Tirole, 2004, updated as Two-Sided Markets: A Progress Report, Jean-Charles Rochet, Jean Tirole, November 29, 2005, Rochet, Jean‐Charles, and Jean Tirole. “Platform competition in two‐sided markets.” Journal of the European Economic Association 1.4 (2003): 990-1029 and Parker, Geoffrey G., and Marshall W. Van Alstyne. “Two-sided network effects: A theory of information product design.” Management Science 51.10 (2005): 1494-1504). In the context of two sided markets in publishing, the publisher itself can be seen as operating as a platform selling content to readers (one side of the market), content which includes adverts from advertisers, and selling advertising space – and audience – to advertisers (the other side of the market).

To the extent the costs of running the platform – and hence the profitability of generating content that satisfies the audience and generates audience figures (and perhaps conversion rates) that satisfy advertisers – are reduced through the provision of ready made content by PR firms, we perhaps see how the PR players might be modelled as advertisers who, rather than paying cash to the platform for access to audience, instead subsidise the costs of producing content by providing it directly. That is, to the extent that advertisers subsidise the monetary cost of accessing content by an audience (to the level of free in many cases), PR firms also subsidise the cost of accessing content by an audience by reducing the production costs of the platform. Maybe? (I’m not an economist and haven’t really delved very far into the two sided market literature… But when I did look (briefly), I didn’t find any two sided market analyses explicitly covering PR, churnalism or information subsidies, although there are papers that do consider information subsidies in a market context, eg Patricia A. Curtin (1999) “Reevaluating Public Relations – Information Subsidies: Market-Driven Journalism and Agenda-Building Theory and Practice”, Journal of Public Relations Research, 11:1, 53-90.) So my question here is: are there any economic analyses that explore the idea of “information subsidy” in the context of two sided market models?

It seems to me that the information subsidy provided by PR or wire copy represents a direct efficiency or timesaving for the journalist. It is not hard to understand why journalists feel pressured into working this way:

“Despite operating in a highly competitive marketplace driven by new technology, conglomeration, deregulation, competition from free newspapers and declining circulations, newspapers’ managements have squared the circle by paying staff low salaries, shedding staff and cutting training, while simultaneously increasing output, including online content (Curran and Seaton, 1997; Davis 2003; Franklin, 1997; Murphy, 1998; Tunstall, 1996, Williams and Franklin, 2007). Time available for journalists to speak to contacts, nurture sources, become familiar with a ‘patch’ and uncover and follow up leads has become a ‘luxury’ (Pecke, 2004, p. 30)” p489.

“In such a pressurised and demoralised working environment it is all too easy for journalists to become dependent on the pre-fabricated, pre-packaged ‘news’ from resource-rich public relations organisations or the familiar and easily accessed routine source or re-writes of news agency copy” p489

The Passive Journalist: How sources dominate local news, Deirdre O’Neill and Catherine O’Connor, Journalism Practice, Vol. 2, No 3, 2008 pp487-500

Writing almost 40 years ago, David Murphy (David Murphy, “The silent watchdog: the press in local politics”, Constable, 1976) described the situation in local newsrooms then in ways that perhaps still ring true today:

When the local newspaper editor comes to create his edition for the week or the evening he has to assess: (a) the raw material – bits of information – in the light of how much space he has available, which is calculated on the ratio of news to advertisements, the advertisements being a controlling factor; (b) the cost of particular kinds of coverage; (c) the circulation pull of any particular coverage, bearing in mind the audience to which it will be directed; (d) the need to have something in the paper by the deadline, which is at the same time up-to-date. The reporter is aware when collecting his data of these sorts of factors, because he is acquainted with the news editor’s or editor’s previous responses to similar material.

This creates a situation in which the ideal type of story is one which involves the minimum amount of investigation – preferably a single interview – or the redrafting of a public relations handout, which can be written quickly and cut from the end backwards towards the beginning without making it senseless. It must also have the strongest possible readership pull. This is the minimum-cost, maximum-utility news story (p17).

Returning to Jackson & Moloney:

“The habitual incorporation of media releases and other PR material into the news by journalists is not a new phenomenon, but the apparent change is in the scale and regularity in which this is now happening” p3.

“From time immemorial, PR practitioners have been attempting to get their copy into the news. As discussed earlier, this is typically considered an information subsidy, where the PR practitioner acts as a sort of ‘pre-reporter’ for the journalist (Supa and Zoch 2009. Supa, Dustin W., and Lynn M. Zoch. 2009. “Maximizing Media Relations through a Better Understanding of the Public Relations–journalist Relationship: A Quantitative Analysis of Changes Over the Past 23 Years.” Public Relations Journal 3 (4)). In exchange for sending them pre-packaged information that the journalist can use to write a story, the PR hopes to gain favourable coverage of their client” p7.

They also extend the idea of an information subsidy to a more pernicious one of an editorial subsidy in which the content is so well packaged that it is ready-to-go without any additional input from the journalist, placing the journalist more squarely in the role of a gatekeeper than an interpreter or contextualiser:

“Our findings on PR practice in 2013 are quite clear: for the practitioners we spoke to, the days of the monolithic media release sent to all news desks are largely over. They are preparing page-ready content customised for each publication, which is carefully targeted. They are thinking like journalists—starting with the news hook, then working in their PR copy backwards from there. Alongside the body of work that documents the growing influence of PR material in the news, we believe that the concept of the information subsidy may need expanding in light of this. The implication of churnalism is that there is more than an information subsidy taking place. Where journalists copy-paste, there is an editorial subsidy occurring too. This is significant when we think about the agenda-building process, and its associated power dimension. An editorial subsidy implies more than just setting the agenda and providing building blocks for a news story (such as basic facts, statistics, or quotes) for the journalist to add editorial framing (see Reich 2010, Reich, Zvi. 2010. ‘Measuring the Impact of PR on Published News in Increasingly Fragmented News Environments: A Multifaceted Approach.’ Journalism Studies 11 (6): 799–816.). It means a focus on the more sacred editorial element of framing stories too, which for our participants usually meant positive coverage of their client and the delivery (in print or on air) of the key campaign messages. But for most of our participants, achieving the editorial subsidy was dependent on the (journalistic) style in which it was written, and it was this that they seemed most preoccupied with when discussing their media relations practice” p13.

(In a slightly different context, this reminds me in part of The University Expert Press Room and, to a lesser extent, Social Media Releases and the University Press Office.)

If journalists are simply acting as gatekeepers for PR copy, then we need to consider what news values, and what editorial values, they then apply to the content they are simply passing on. As Peter Bro & Filip Wallberg describe in “Gatekeeping in a Digital Era”, Journalism Practice, 9:1, pp92-105, (2015):

“When the concept of gatekeeping was originally introduced within journalism studies, it was employed to describe a process where a wire editor received telegrams from the wire services. From these telegrams the wire editor, known as Mr. Gates in David Manning White’s (1950 White, David Manning. 1964. ‘Introduction to the Gatekeeper.’ In People, Society, and Mass Communication, edited by Lewis Anthony Dexter and David Manning White, 160–161. New York: Macmillan) seminal study, selected what to publish. This capacity to select and reject content for publication has become a popular way of portraying the function of news reporters. In time, however, telegraphy has been succeeded by new technologies, and they have inspired new practices and principles when it comes to producing, publishing and distributing news stories.

This is a technological development that challenges us to revisit, reassess and rethink the process of gatekeeping in a digital era” p93.

“What White (1950, 384) described as a daily ‘avalanche’ from the wire services, such as United Press and Associated Press, has not vanished in news organizations even though the technological platform has changed. Many news media still publish news to their readers, listeners and viewers by way of a one-way linear process, where persons inside the newsrooms are charged with the function of selecting or rejecting news stories for publication” p96.

In the linear model where the journalist acts not simply as a gatekeeper but plays a more creative, journalistic role, one of the costs associated with performing “the simple basic functions of their profession”, as Davies put it, is checking the veracity of a story via a second source. In their paper on “The Passive Journalist”, which looked at how reporters operate in local and regional news, O’Neill & O’Connor “wished to know the extent to which sources were influencing the selection and production of news and rendering the role of the local journalist essentially passive or reactive, with all the subsequent implications for the quality of local reporting and the public interest” (p490).

“The findings suggest that journalists’ reliance on a single source for stories, possibly reflecting shortage of time and resources, combined with sources’ skills in presenting positive public images, is a significant contributory factor to uncritical local press reporting. … Of the 24 per cent of articles with a secondary source, most were still framed by a primary source, with a brief alternative quote included at the end of the report. What this means in practice is a formulaic style, superficially giving the appearance of ‘objective news’, but which fails to get to the heart of the issue, or misses the real story. There was little evidence of the sifting of conflicting information or contextualising that assists readers’ understanding and makes for good journalism (Williams, 2007)” p493.

“Th[e] study found that almost two-thirds (61 per cent) of local government-sourced stories (one of the main routine source categories) had no discernible secondary sources and suggests a significant unquestioning reliance on council press officers or press releases … For example, the Yorkshire Evening Post covered a story on 22 February 2007 about local authority performance league tables (‘Three-star Rating for City Council’s Good Showing’), but framed it only in terms of the report and the views of the council leader and chief executive, with no alternative or dissenting views presented, despite the fact that the authority had dropped one star in the ratings” p493.

Standards or Interoperability?

An interesting piece, as ever, from Tim Davies (Slow down with the standards talk: it’s interoperability & information quality we should focus on) reflecting on the question of whether we need more standards, or better interoperability, in the world of (open) data publishing. Tim also links out to Friedrich Lindenberg’s warnings about 8 things you probably believe about your data standard, which usefully mock some of the claims often casually made about standards adoption.

My own take on many standards in the area is that conventions are the best we can hope for, and that even then they will be interpreted in a variety of ways, which means you have to be forgiving when trying to read them. All manner of monstrosities have been published in the guise of being HTML or RSS, so the parsers had to do the best they could to get the mess into a consistent internal representation at the consumer side of the transaction. Publishers can help by testing that whatever they publish does appear to parse correctly with the current “industry standard” importers, ideally open code libraries. It’s then up to the application developers to decide which parser to use, or whether to write their own.

It’s all very well standardising your data interchange format, but the application developer will then want to work on that data using some other representation in a particular programming language. Even if you have a formal standard interchange format, and publishers stick to it religiously and unambiguously, you will still get different parsers generating internal representations for the application code to work on that are potentially very different, and may even have different semantics. [I probably need to find some examples of that to back up that claim, don’t I?!;-)]
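By way of a quick, hedged illustration (the data and column names are made up), here’s the sort of thing I mean: reading the same CSV fragment with Python’s csv module and with pandas hands the application two quite different in-memory representations, with different types and different treatments of missing values.

```python
# A minimal sketch (not from any particular publisher's data) of how two
# "standard" CSV parsers give you different internal representations of
# exactly the same file.
import csv
import io

import pandas as pd

data = "transid,amount,date\n001,1500,15/1/15\n002,,2015-01-15\n"

# The csv module gives back everything as strings, and empty cells as ''.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["transid"], type(rows[0]["amount"]))  # '001' <class 'str'>
print(repr(rows[1]["amount"]))                      # ''

# pandas guesses dtypes: the ids lose their leading zero and become integers,
# the missing amount becomes NaN (so the column is a float), dates stay strings.
df = pd.read_csv(io.StringIO(data))
print(df.dtypes)
print(df.loc[0, "transid"], df.loc[1, "amount"])    # 1  nan
```

Neither representation is “wrong”, but code written against one will not necessarily behave sensibly against the other.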

I also look at standards from the point of view of trying to get things done with tools that are out there. I don’t really care if a geojson feed is strictly conformant with any geojson standard that’s out there, I just need to know that something claimed to be published as geojson works with whatever geojson parser the Leaflet Javascript library uses. I may get frustrated by the various horrors that are published using a CSV suffix, but if I can open it using pandas (a Python programming library), RStudio (an R programming environment) or OpenRefine (a data cleaning application), I can work with it.
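In practice, that pragmatic “will the tools read it?” test looks something like the following sketch (the filename is invented, and the encoding list is just a guess at the usual suspects): try to open the claimed-CSV with pandas and fall back through a couple of likely encodings rather than fretting about strict conformance.

```python
# A rough sketch of a forgiving CSV opener: if one of the common encodings
# gets the file into a dataframe, I can work with it.
import pandas as pd

def forgiving_read_csv(path, encodings=("utf-8", "latin-1")):
    last_err = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, skipinitialspace=True)
        except (UnicodeDecodeError, pd.errors.ParserError) as e:
            last_err = e
    raise last_err

df = forgiving_read_csv("council_spending_2015_04.csv")  # hypothetical file
print(df.columns.tolist())
```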

At the data level, if councils published their spending data using the same columns and the same number, character and data formats for those columns, it would make aggregating those datasets much easier. But even then, different councils use the same thing differently. Spending area codes, or directorate names, are not necessarily standardised across councils, so just having a spending area code or directorate name column (similarly identified) in each release doesn’t necessarily help.

What is important is that data publishers are consistent in what they publish so that you can start to take into account their own local customs and work around those. Of course, internal consistency is also hard to achieve. Look down any local council spending data transaction log and you’ll find the same company described in several ways (J. Smith, J. Smith Ltd, JOHN SMITH LIMITED, and so on), some of which may match the way the same company is recorded by another council, some of which won’t…
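For what it’s worth, a quick-and-dirty sketch (my own, not any council’s actual workflow) of the sort of normalisation that gives those variant supplier names a fighting chance of grouping together:

```python
# Normalise free-text supplier names so "J. Smith", "J Smith Ltd" and
# "JOHN SMITH LIMITED" are more likely to match.
import re

def normalise_supplier(name):
    name = name.upper()
    name = re.sub(r"[.,']", " ", name)                      # strip punctuation
    name = re.sub(r"\b(LTD|LIMITED|PLC|LLP)\b", "", name)   # drop company suffixes
    return re.sub(r"\s+", " ", name).strip()

for n in ["J. Smith", "J Smith Ltd", "JOHN SMITH LIMITED"]:
    print(normalise_supplier(n))
# J SMITH / J SMITH / JOHN SMITH -- better, but it still takes fuzzy matching
# (or a reconciliation service) to catch the initials-vs-full-name cases.
```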

Stories are told from the Enigma codebreaking days of how the wireless listeners could identify Morse code operators by the cadence and rhythm of their transmissions, as unique to them as any other personal signature (you know that the way you walk identifies you, right?). In open data land, I think I can spot a couple of different people entering transactions into local council spending transaction logs, where the systems aren’t using controlled vocabularies and selection box or dropdown list entry methods, but instead support free text entry… Which is to say – even within a standard data format (a spending transaction schema) published using a conventional (though variously interpreted) document format (CSV) that may be variously encoded (UTF-8, ASCII, Latin-1), the stuff in the data file may be all over the place…

An approach I have been working towards for my own use over the last year or so is to adopt a working environment for data wrangling and analysis based around the Python pandas programming library. It’s then up to me how to represent things internally within that environment, and how to get the data into that representation within that environment. The first challenge is getting the data in, the second getting it into a state where I can start to work with it, the third getting it into a state where I can start to normalise it and then aggregate it and/or combine it with other data sets.
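Sketched out in pandas (with filenames and column names invented purely for illustration), that workflow has roughly this shape:

```python
# A very rough sketch of the workflow: get the data in, knock it into a
# workable state, then normalise and aggregate/combine.
import pandas as pd

# 1. Getting the data in
frames = [pd.read_csv(f, encoding="latin-1") for f in
          ["councilA_spend.csv", "councilB_spend.csv"]]

# 2. Getting it into a state I can work with: common column names, parsed dates
for df in frames:
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["date"] = pd.to_datetime(df["date"], dayfirst=True, errors="coerce")

# 3. Normalise and aggregate / combine
spend = pd.concat(frames, keys=["councilA", "councilB"], names=["council", None])
monthly = spend.groupby(["council", spend["date"].dt.to_period("M")])["amount"].sum()
print(monthly.head())
```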

So for example, I started doodling a wrapper for nomis and looking at importers for various development data sets. I have things that call on the Food Standards Agency datasets (and, when I get round to it, their API) and scrape reports from the CQC website, I download and dump Companies House data into a database, and I have various scripts for calling out to various Linked Data endpoints.

Where different publishers use the same identifier schemes, I can trivially merge, join or link the various data elements. For approximate matching, I run ad hoc reconciliation services.
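Here’s a minimal sketch of what “trivially merge” means in practice when two datasets share an identifier scheme – a hypothetical pair of frames keyed on company number, one from a spending log, one from a Companies House dump:

```python
# Join two datasets on a shared identifier scheme (illustrative values only).
import pandas as pd

spend = pd.DataFrame({"company_number": ["01234567", "07654321"],
                      "amount": [1500.0, 250.0]})
companies = pd.DataFrame({"company_number": ["01234567", "07654321"],
                          "name": ["JOHN SMITH LIMITED", "ACME WIDGETS LTD"],
                          "postcode": ["MK7 6AA", "LS1 4AP"]})

linked = spend.merge(companies, on="company_number", how="left")
print(linked)
# Where there is no shared key, this is exactly the point at which you fall
# back to fuzzy matching or a reconciliation service instead of a clean join.
```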

All this is to say that, at the end of the day, the world is messy and standardised things often aren’t. Integration ultimately occurs in your application, which is why it can be handy to be able to code a little, so you can whittle and fettle the data you’re presented with into a representation and form that you can work with. Wherever possible, I use libraries that claim to be able to parse particular standards and put the data into representations I can cope with, and then where data is published in various formats or standards, go for the option that I know has library support.

PS I realise this post stumbles all over the stack, from document formats (eg CSV) to data formats (or schema). But it’s also worth bearing in mind that just because two publishers use the same schema, you won’t necessarily be able to sensibly aggregate the datasets across all the columns (eg in spending data again, some council transaction codes may be parseable and include dates, accession based order numbers, department codes, while others may just be jumbles of numbers). And just because two things have the same name and the same semantics, doesn’t mean the format will be the same (2015-01-15, 15/1/15, 15 Jan 2015, etc etc).
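By way of a small worked example (the format list is my own choice, not any standard), one way of coping with those various date spellings is simply to try a list of known formats in turn and settle on the first one that parses:

```python
# Parse several spellings of the same date into a single canonical value.
from datetime import datetime

RAW = ["2015-01-15", "15/1/15", "15 Jan 2015"]
FORMATS = ["%Y-%m-%d", "%d/%m/%y", "%d %b %Y"]

def parse_date(s):
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None  # flag for manual inspection rather than guessing

print([parse_date(s) for s in RAW])  # three spellings, one date: 2015-01-15
```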

Things I Take for Granted #287 – Grabbing Stuff from Web Form Drop Down Lists

Over the years, I’ve collected lots of little hacks for tinkering with various data sets. Here’s an example…

A form on a web page with country names that map to code values:

[Screenshot: the Advanced Search form on the ROARMAP site, with a country drop down list]

If we want to generate a two column look up table from the names on the list to the values that encode them, we can look to the HTML source, grab the list of elements, then use a regular expression to extract the names and values and rewrite them as a two column, tab separated text file, with one item per line:

[Screenshot: regular expression find and replace used to extract the option names and values]

NOTE: the last character in the replace is \n (newline character). I grabbed the screenshot while the cursor was blinking, so it isn’t visible :-(
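For the record, a rough Python equivalent of that editor find-and-replace (the URL and output filename below are placeholders, not the actual ROARMAP addresses): pull the option elements out of the form’s HTML and write a two column, tab separated name/value lookup table.

```python
# Extract <option> name/value pairs from a form page into a TSV lookup table.
import re
import urllib.request

html = urllib.request.urlopen("https://example.com/advanced-search").read().decode("utf-8")

options = re.findall(r'<option value="([^"]*)"[^>]*>([^<]+)</option>', html)

with open("country_codes.tsv", "w") as f:
    for value, name in options:
        f.write("{}\t{}\n".format(name.strip(), value))
```

A proper HTML parser (BeautifulSoup, say) would be more robust than a regular expression, but for a one-off, quick-and-dirty lookup table this sort of pattern does the job.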

Kiteflying Around Containers – A Better Alternative to Course VMs?

Eighteen months or so ago, I started looking at ways in which we might use a virtual machine to bundle up a variety of interoperating software applications for a distance education course on databases and data management. (This VM would run IPython notebooks as the programming surface, PostgreSQL and MongoDB as the databases. I was also keen that OpenRefine should be made available, and as everything in the VM was being accessed via a browser, I added a browser based terminal app (tty.js) to the mix as well). The approach I started to follow was to use vagrant as a provisioner and VM manager, and puppet scripts to build the various applications. One reason for this approach is that the OU is an industrial scale educator, and (to my mind) it made sense to explore a model that would support the factory line production model we have: one that would scale vertically, maintaining VMs for a course that runs over several years, as well as horizontally across other courses with other software application requirements. You can see how my thinking evolved across the following posts: posts tagged “VM” on OUseful.info.

Since then, a lot has changed. IPython notebooks have forked into the Jupyter notebook server and IPython, and Jupyter has added a browser based terminal app to the base offerings of the notebook server. (It’s not as flexible as tty.js, which allowed for multiple terminals in the same browser window, but I guess there’s nothing to stop you loading multiple terminals into separate browser tabs.) docker has also become a thing…

To recap on some of my thinking about how we might provide software to students, I was pre-occupied at various times with the following (not necessarily exhaustive) list of considerations:

  • how could we manage the installation and configuration of different software applications on students’ self-managed, remote computers, running arbitrary versions of arbitrary operating systems on arbitrarily specced machines over networks with unknown and perhaps low bandwidth internet connections;
  • how could we make sure those applications interoperated correctly on the students’ own machines;
  • how could we make sure the students retained access to local copies of all the files they had created as part of their studies, and that those local copies would be the ones they actually worked on in the provided software applications; (so for example, IPython notebook files, and perhaps even database data directories);
  • how could we manage the build of each application in the OU production context, with OU course teams requiring access to a possibly evolving version of the machine 18 months in advance of student first use date and an anticipated ‘gold master’ freeze date on elements of the software build ~9 months prior to students’ first use;
  • how could we manage the maintenance of VMs within a single presentation of a 9 month long course and across several presentations of the course spanning 1 presentation a year over a 5 year period;
  • how could the process support the build and configuration of the same software application for several courses (for example, an OU-standard PostgreSQL build);
  • how could the same process/workflow support the development, packaging, release to students, maintenance workflow for other software applications for other courses;
  • could the same process be used to manage the deployment of application sets to students on a cloud served basis, either through a managed OU cloud, or on a self-served basis, perhaps using an arbitrary cloud service provider.

All this bearing in mind that I know nothing about managing software packaging, maintenance and deployment in any sort of environment, let alone a production one…;-) And all this bearing in mind that I don’t think anybody else really cares about any of the above…;-)

Having spent a few weeks away from the VM, I’m now thinking that we would be better served by using a more piecemeal approach based around docker containers. These still require the use of something like Virtualbox, but rather than using vagrant to provision the necessary environment, we could use more of an appstore approach to starting and stopping services. So for example, today I had a quick play with Kitematic, a recent docker acquisition, and an app that doesn’t run on Windows yet, though Windows support is slated for June, 2015 in the Kitematic roadmap on github.

So what’s involved? Install Kitematic (if Virtualbox isn’t already installed, I think it’ll grab it down for you?) and fire it up…

[Screenshot: Kitematic starting up]

It starts up a dockerised virtual machine into which you can install various containers. Next up, you’re presented with an “app dashboard”, as well as the ability to search dockerhub for additional “apps”:

[Screenshot: the Kitematic app dashboard, with a search box for finding containers on dockerhub]

Find a container you want, and select it – this will download the required components and fire up the container.

[Screenshot: downloading and firing up a selected container in Kitematic]

The port tells you where you can find any services exposed by the container. In this case, for scipyserver, it’s an IPython notebook (HTML app) running on top of a scipy stack.

[Screenshot: the container’s port settings, showing where the IPython notebook service is exposed]

By default the service runs over https with a default password; we can go into the Settings for the container, reset the Jupyter server password, force it to use http rather than https, and save to force the container to use the new settings:

[Screenshot: container Settings panel – resetting the Jupyter server password and switching from https to http]

So for example…

[Screenshot: an IPython notebook running from the scipyserver container]

In the Kitematic container homepage, if I click on the notebooks folder icon in the Edit Files panel, I can share the notebook folder across to my host machine:

[Screenshot: sharing the scipyserver container’s notebooks folder with the host machine]

I can also choose what directory on host to use as the shared folder:

[Screenshot: choosing which host directory to use as the shared folder]

I can also discover and fire up some other containers – a PostgreSQL database, for example, as well as a MongoDB database server:

[Screenshot: PostgreSQL and MongoDB containers running alongside the scipyserver container in Kitematic]

From within my notebook, I can install additional packages and libraries and then connect to the databases. So for example, I can connect to the PostgreSQL database:

[Screenshot: connecting to the PostgreSQL container from the notebook]

or to mongo:

[Screenshot: connecting to the MongoDB container from the notebook]
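For what it’s worth, a minimal sketch of the sort of thing those two notebook screenshots are doing (all the hostnames, ports and credentials below are placeholders – in practice you read them off each container’s settings page in Kitematic):

```python
# Connect to the PostgreSQL and MongoDB containers from a notebook.
import psycopg2
from pymongo import MongoClient

# PostgreSQL container, exposed on whatever host port Kitematic has mapped
pg = psycopg2.connect(host="192.168.99.100", port=32768,
                      user="postgres", password="postgres", dbname="postgres")
with pg, pg.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())

# MongoDB container, similarly exposed on a mapped port
mongo = MongoClient("192.168.99.100", 32769)
print(mongo.test.quickcheck.insert_one({"hello": "container"}).inserted_id)
```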

Looking at the container Edit Files settings, it looks like I may also be able to share across the database datafiles – though I’m not sure how this would work if I had a default database configuration to begin with? (Working out how to pre-configure and then share database contents from containerised DBMS’ is something that’s puzzled me for a bit and something I haven’t got my head round yet.)

So – how does this fit into the OU model (that doesn’t really exist yet?) for using VMs to make interoperating software collections available to students on their own machines?

First up, no Windows support at the moment, though that looks like it’s coming; secondly, the ability to mount shares with the host seems to work, though I haven’t tested what happens if you shut down and start up containers, or delete a scipyserver container and then fire up a clean replacement, for example. Nor do I know (yet?!) how to manage shares and pre-seeding for the database containers. One original argument for the VM was that interoperability between the various software applications could be hardwired and tested. Kitematic doesn’t support fig/Docker compose (yet?) but it’s not too hard to look up the addresses and paste them into a notebook. I think it does mean we can’t provide hard coded notebooks with ‘guaranteed to work’ configurations (i.e. ones prewritten with service addresses and port numbers) baked in, but it’s not too hard to do this manually. In the docker container Dockerfiles, I’m not sure if we could fix the port number mappings to initial default values?

One thing we’d originally envisioned for the VM was shipping it on a USB stick. It would be handy to be able to point Kitematic to a local dockerhub, for example, a set of prebuilt containers on a USB stick with the necessary JSON metadata file to announce what containers were available there, so that containers could be installed from the USB stick. (Kitematic currently grabs the container elements down from dockerhub and pops the layers into the VM (I assume?), so it could do the same to grab them from the USB stick?) In the longer term, I could imagine an OU branded version of Kitematic that allows containers to be installed from a USB stick or pulled down from an OU hosted dockerhub.

But then again, I also imagined an OU USB study stick and an OU desktop software updater 9 years or so ago and they never went anywhere either..;-)

Whither the Library?

As I scanned my feeds this morning, a table in a blog post (Thoughts on KOS (Part 3): Trends in knowledge organization) summarising the results from a survey reported in a paywalled academic journal article – Saumure, Kristie, and Ali Shiri. “Knowledge organization trends in library and information studies: a preliminary comparison of the pre-and post-web eras.” Journal of information science 34.5 (2008): 651-666 [pay content] – really wound me up:

[Table: knowledge organization trends in library and information studies, pre- and post-web, summarised from Saumure & Shiri (2008)]

My immediate reaction to this was: so why isn’t cataloguing about metadata? (Or indexing, for that matter?)

In passing, I note that the actual paper presented the results in a couple of totally rubbish (technical term;-) pie charts:

[Figure: the pie charts from Saumure & Shiri (2008)]

More recently (that was a report from 2008 on a lit review going back before then), JISC have just announced a job ad for a role as Head of scholarly and library futures to “provide leadership on medium and long-term trends in the digital scholarly communication process, and the digital library”. (They didn’t call… You going for it, Owen?!;-)

The brief includes “[k]eep[ing] a close watch on developments in the library and research support communities, and practices in digital scholarship, and also in digital technology, data, on-line resources and behavioural analytics” and providing:

Oversight and responsibility for practical projects and experimentation in that context in areas such as, but not limited to:

  • Digital scholarly communication and publishing
  • Digital preservation
  • Management of research data
  • Resource discovery infrastructure
  • Citation indices and other measures of impact
  • Digital library systems and services
  • Standards, protocols and techniques that allow on-line services to interface securely

So the provision of library services at a technical level, then (which presumably also covers things like intellectual property rights and tendering – making sure the libraries don’t give their data and organisation’s copyrights to the commercial publishers – but perhaps not providing a home for policy and information ethical issue considerations such as algorithmic accountability?), rather than identifying and meeting the information skills needs of upcoming generations (sensemaking, data management and all the other day to day chores that benefit from being a skilled worker with information).

It would be interesting to know what a new appointee to the role would make of the recently announced Hague Declaration on Knowledge Discovery in the Digital Age (possibly in terms of a wider “publishing data” complement to “management of research data”), which provides a call for opening up digitally represented content to the content miners.

I’d need to read it more carefully, but at the very briefest of first glances, it appears to call for some sort of de facto open licensing when it comes to making content available to machines for processing by machines:

Generally, licences and contract terms that regulate and restrict how individuals may analyse and use facts, data and ideas are unacceptable and inhibit innovation and the creation of new knowledge and, therefore, should not be adopted. Similarly, it is unacceptable that technical measures in digital rights management systems should inhibit the lawful right to perform content mining.

The declaration also seems to be quite dismissive of database rights. A well-put together database makes it easier – or harder – to ask particular sorts of question, and to a certain extent reflects the amount of creative effort involved in determining a database schema, leaving aside the physical effort involved in compiling, cleaning and normalising the data that secures the database right.

Also, if I was Google, I think I’d be loving this… As ever, the promise of open is one thing, the reality may be different, as those who are geared up to work at scale, and concentrate power further, inevitably do so…

By the by, the declaration also got me thinking: who do I go to in the library to help me get content out of APIs so that I can start analysing it? That is, who do I go to for help with “resource discovery infrastructure” and, perhaps more importantly in this context, “resource retrieval infrastructure”? The library developer (i.e. someone with programming skills who works with librarians;-)?

And that aside from the question I keep asking myself: who do I go to to ask for help in storing data, managing data, cleaning data, visualising data, making sense of data, putting data into a state where I can even start to make sense of it, etc etc… (Given those pie charts, I probably wouldn’t trust the library!;-) Though I keep thinking: that should be the place I’d go.)

The JISC Library Futures role appears silent on this (but then, JISC exists to make money from selling services and consultancy to institutions, right, not necessarily helping or representing the end academic or student user?)

But that’s a shame; because as things like the Stanford Center for Interdisciplinary Digital Research (CIDR) show, libraries can act as a hub and go to place for sharing – and developing – digital skills, which increasingly includes digital skills that extend out of the scientific and engineering disciplines, out of the social sciences, and into the (digital) humanities.

When I started going into academic libraries, the librarian was the guardian of “the databases” and the CD-ROMs. Slowly access to these information resources opened up to the end user – though librarian support was still available. Now I’m as likely to need help with textmining and making calendar maps: so which bit of the library do I go to?

Problems of Data Quality

One of the advantages of working with sports data, you might have thought, is that official sports results are typically good quality data. With a recent redesign of the Formula One website, the official online (web) source of results is now the FIA website.

As well as publishing timing and classification (results) data in a PDF format intended, presumably, for consumption by the press, the FIA also publish “official” results via a web page.

But as I discovered today, using data from a scraper that scrapes results from the “official” web page rather than the official PDF documents is no guarantee that the “official” web page results bear any resemblance at all to the actual result.

[Screenshot: 2015 Spanish Grand Prix qualifying official classification PDF (page 2 of 2) alongside the Session Classifications page on the FIA website, showing the discrepancy]

Yet another sign that the whole F1 circus is exactly that – an enterprise promoted by clowns…

Routine Sources, Court Reporting, the Data Beat and Metadata Journalism

In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:

Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] later established that the local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], p114-15).

As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending data), or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.

To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
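A toy sketch of the idea (the figures and the template below are invented, and this isn’t the actual nomis sketch linked above): take the latest row of a local indicator series and drop the numbers into a canned sentence or two.

```python
# Generate a templated "latest figures" sentence from a row of statistics.
latest = {"area": "Isle of Wight", "month": "April 2015",
          "unemployed": 2345, "rate": 3.2, "change": -110}

direction = "fell" if latest["change"] < 0 else "rose"
story = (
    "Unemployment in {area} {direction} by {delta} in {month}, "
    "leaving {unemployed:,} people ({rate}% of the working age population) "
    "claiming out-of-work benefits."
).format(direction=direction, delta=abs(latest["change"]), **latest)

print(story)
```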

Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
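For the numerical case, a rough sketch of that sort of alert (the CSV and column names are hypothetical): rank a national dataset on an indicator and flag the rows belonging to a particular patch that land in the top or bottom of the table.

```python
# Flag local establishments that sit in the national top or bottom N on a measure.
import pandas as pd

df = pd.read_csv("national_indicator.csv")   # columns: area, provider, score
df["rank"] = df["score"].rank(ascending=False, method="min").astype(int)

N = 10
local = df[df["area"] == "Leeds"]
alerts = local[(local["rank"] <= N) | (local["rank"] > len(df) - N)]

for _, row in alerts.iterrows():
    print("{provider} is ranked {rank} of {total} nationally on this measure "
          "- possible story?".format(total=len(df), **row))
```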

In the context of this post, I will be considering the role that metadata about court cases contained within court lists and court registers might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well as to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view over the activity of the courts than the impression one might get about their behaviour simply from the balance of coverage provided by the media.
