Towards the end of last week I attended a two-day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.
Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian open platform content API.) But I also started to wonder about how we could map the fan out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
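As a very rough sketch of what "automating the counting" might look like, here's a crude keyword count over a pile of article texts. The function name and the regular expression are my own assumptions for illustration, and where the article texts come from (a content API, a scrape, whatever) is left open; a story that merely mentions the word "poll" is not necessarily derived from one, so treat the number as an upper-ish bound at best.

```javascript
// Hypothetical helper: estimate the share of articles that mention a poll or survey.
// Assumption: `articles` is an array of plain-text article bodies, however obtained.
const POLL_PATTERN = /\b(polls?|surveys?)\b/i;

function shareMentioningPolls(articles) {
  if (articles.length === 0) return 0;
  // Count articles whose text contains "poll(s)" or "survey(s)" as a whole word
  const hits = articles.filter((text) => POLL_PATTERN.test(text)).length;
  return hits / articles.length;
}
```

Matching on whole words means something like "pollen" doesn't count, but it also means "polling" and "surveyed" are missed unless the pattern is widened.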
This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.
I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.
So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).
Here’s part of the setup, showing the page URL patterns to be searched over.
I added additional refinements to the tab that searches over the news organisations so as to only pull out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (e.g. in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page with an irrelevant story because a poll is linked to in the sidebar).
From way back when, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…
Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:
And an example of results from the news orgs:
For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):
The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…
To make life easier, an old bookmarklet generator I produced way back when on an Arcadia fellowship at the Cambridge University Library can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.
Give it a sensible title; then this is the URL chunk you need to add:
https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q=
Sigh.. I used to have so much fun…
PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:
javascript:(function()%7Bvar t%3Dwindow.getSelection%3Fwindow.getSelection().toString()%3Adocument.selection.createRange().text%3Bwindow.location%3D%27https%3A%2F%2Fwww.google.com%2Fcse%2Fpublicurl%3Fcx%3D016419300868826941330%3Awvfrmcn2oxc%26q%3D"%27%2Bt%2B%27"%27%3B%7D)()
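For readability, here's roughly the same idea as an ordinary function rather than a percent-encoded one-liner. This is a sketch, not the bookmarklet itself: the function name is made up, and unlike the bookmarklet above it passes the quoted phrase through encodeURIComponent, which I'm assuming is the safer thing to do when the selected text contains characters like & or #.

```javascript
// The search engine URL stem is the one given above; the function name is hypothetical.
const CSE_URL = 'https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q=';

function buildQuoteSearchUrl(selectedText) {
  // Wrap the selection in double quotes so it is searched for as an exact phrase
  const phrase = '"' + selectedText.trim() + '"';
  // Encode the phrase so spaces, quotes, ampersands etc. survive in the query string
  return CSE_URL + encodeURIComponent(phrase);
}

// In a browser, something like:
// window.location = buildQuoteSearchUrl(window.getSelection().toString());
```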
PPS I’ve started to add additional search domains to the PR search engine to include political speeches.