Archive for the ‘Anything you want’ Category
@digiphile’s been doing some digging around current popular usage of the phrase data journalism – here are my recollections…
My personal recollection of the current vogue is that “data driven journalism” was the phrase that dominated the discussions/community I was witness to around early 2009, though for some reason my blog doesn’t give any evidence for that (must take better contemporaneous notes of first noticings of evocative phrases;-). My route in was via “mashups”, mashup barcamps, and the like, where folk were experimenting with building services on newly emerging (and reverse engineered) APIs; things like crime mapping and CraigsList maps were in the air – putting stuff on maps was very popular I seem to recall! Yahoo were one of the big API providers at the time.
I noted the launch of the Guardian datablog and datastore in my personal blog/notebook here – http://blog.ouseful.info/2009/03/10/using-many-eyes-wikified-to-visualise-guardian-data-store-data-on-google-docs/ – though for some reason don’t appear to have linked to a launch post. With the arrival of the datastore it looked like there were to be “trusted” sources of data we could start to play with in a convenient way, accessed through Google docs APIs:-) Some notes on the trust thing here: http://blog.ouseful.info/2009/06/08/the-guardian-openplatform-datastore-just-a-toy-or-a-trusted-resource/
NESTA did an event on News Innovation London in July 2009, a review of which by @kevglobal mentions “discussions about data-driven journalism” (sic on the hyphen). I seem to recall that Journalism.co.uk (@JTownend) were also posting quite a few noticings around the use of data in the news at the time.
At some point, I did a lunchtime at the Guardian for their developers – there was a lot about Yahoo Pipes, I seem to remember! (I also remember pitching the Guardian Platform API to developers in the OU as a way of possibly getting fresh news content into courses. No-one got it…) I recall haranguing Simon Rogers on a regular basis about their lack of data normalisation (which I think in part led to the production of the Rosetta Stone spreadsheet) and their lack of use (at the time) of fusion tables. Twitter archives may turn something up there. Maybe Simon could go digging in the Twitter archives…?;-)
There was a session on related matters at the first(?) news:rewired event in early 2010 but I don’t recall the exact title of the session (I was in a session with Francis Irving/@frabcus from the then nascent Scraperwiki) http://blog.ouseful.info/2010/01/14/my-presentation-for-newsrewired-doing-the-data-mash/ Looking at the content of that presentation, it’s heavily dominated by notions of data flow; the data driven journalism (hence #ddj) phrase seemed to fit this well.
Later that year, in the summer, there was a roundtable event hosted by the EJC on “data driven journalism” – I recall meeting Mirko Lorenz there (who maybe had a background in business data? and since helped launch datawrapper.de) and Jonathan Gray – who then went on to help edit the Data Journalism handbook – among others.
For me the focus at the time was very much on using technology to help flow data into useable content, (eg in a similar but perhaps slightly weaker sense than the more polished content generation services that Narrative Science/Automated Insights have since come to work on, or other data driven visualisations or what I guess we might term local information services; more about data driven applications with a weak local news/specific theme or issue general news relevance, perhaps). I don’t remember where the sense of the journalist was in all this – maybe as someone who would be able to take the flowed data, or use tools that were being developed to get the stories out of data with tech support?
My “data driven journalism” phrase notebook timeline
My “data journalist” phrase notebook timeline
My first blogged use of the data journalism phrase, in quotes, as it happens, so it must have been a relatively new sounding phrase to me, was here: http://blog.ouseful.info/2009/05/20/making-it-a-little-easier-to-use-google-spreadsheets-as-a-database-hopefully/ (h/t @paulbradshaw)
Seems like my first use of the “data journalist” phrase was in picking up on a job ad – so certainly the phrase was common to me by then.
As a practice and a commonplace, things still seemed to be developing in 2011 enough for me to comment on a situation where the Guardian and Telegraph teams were co-opetitively bootstrapping each other’s ideas: http://blog.ouseful.info/2011/09/13/data-journalists-engaging-in-co-innovation/
I guess the deeper history of CAR, database journalism, precision journalism may throw off trace references, though maybe not representing situations that led to the phrase gaining traction in “popular” usage?
Certainly, now I’m wondering what the relative rise in popularity of “data journalist” versus “data journalism” was? For me, “data driven journalism” was a phrase I was familiar with way before the other two, though I do recall a sense of unease about its applicability to news stories that were perhaps “driven” by data more in the sense of being motivated or inspired by it, or whose origins lay in a data set, rather than “driven” in a live, active sense of someone using an interface that was powered by flowing data.
From 1973, Charles Bachman in his acceptance lecture for that year’s Turing Award (The Programmer as Navigator) commenting on challenges in shifting the world view of the time about database design:
The publication policies of the technical literature are also a problem. The ACM SIGBDP and SIGFIDET publications are the best available, and membership in these groups should grow. The refereeing rules and practices of Communications of the ACM result in delays of one year to 18 months between submittal and publication. Add to that the time for the author to prepare his ideas for publication and you have at least a two-year delay between the detection of significant results and their earliest possible publication.
1973. We’re now in 2014. Do, as they say, the math…
Remember mashups? Five years or so ago they were all the rage. At their heart, they provided ways of combining things that already existed to do new things. This is a lazy approach, and one I favour.
One of the key inspirations for me in this idea of combinatorial tech, or tech combinatorics, is Jon Udell. His Library Lookup project blew me away in its creativity (the use of bookmarklets, the way the project encouraged you to access one IT service from another, the use of “linked data”, common/core-canonical identifiers to bridge services and leverage or enrich one from another, and so on) and was the spark that fired many of my own doodlings. (Just thinking about it again excites me now…)
As Jon wrote on his blog yesterday (Shiny old tech) (my emphasis):
What does worry me, a bit, is the recent public conversation about ageism in tech. I’m 20 years past the point at which Vinod Khosla would have me fade into the sunset. And I think differently about innovation than Silicon Valley does. I don’t think we lack new ideas. I think we lack creative recombination of proven tech, and the execution and follow-through required to surface its latent value.
Elm City is one example of that. Another is my current project, Thali, Yaron Goland’s bid to create the peer-to-peer web that I’ve long envisioned. Thali is not a new idea. It is a creative recombination of proven tech: Couchbase, mutual SSL authentication, Tor hidden services. To make Thali possible, Yaron is making solid contributions to Thali’s open source foundations. Though younger than me, he is beyond Vinod Khosla’s sell-by date. But he is innovating in a profoundly important way.
Can we draw a clearer distinction between innovation and novelty?
I often think of this in terms of appropriation (eg Appropriating Technology, Appropriating IT: innovative uses of emerging technologies or Appropriating IT: Glue Steps).
Or repurposing, a form of reuse that differs from the intended original use.
Openness helps here. Open technologies allow users to innovate without permission. Open licensing is just part of that open technology jigsaw; open standards another; open access and accessibility a third. Open interfaces accessed sideways. And so on.
Looking back over archived blog posts from five, six, seven years ago, the web used to be such fun. An open playground, full of opportunities for creative recombination. Now we have Facebook, where authenticated APIs give you access to local social neighbourhoods, but little more. Now we have Google using link redirection and link pollution at every opportunity. Services once open are closed according to economic imperatives (and maybe scaling issues; maybe some creative recombinations are too costly to support when a network scales). Maybe my memory of a time when the web was more open is a false memory?
Creative recombination, ftw.
PS just spotted this (Walking on custard), via @plymuni. If you don’t see why it’s relevant, you probably don’t get the sense of this post!
Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…
In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.
To begin with, it is worth considering what is meant by recreational
First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)
Secondly, recreational mathematics is mathematics that is fun and used either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.
These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.
How is recreational mathematics useful?
Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …
Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …
Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.
An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.
…the two definitions described above.
So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.
Recreational data is about the joy of trying to find stories in data.
Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.
Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]“. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.
If you think you might like to do some doodling of your own with some data, why not check out the School Of Data? Or watch out on OpenLearn for some follow up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic.
Via my feeds, I noticed this the other day: Google is pushing a new content-recommendation system for publishers, in which VentureBeat quoted a Google originated email sent to them: “Our engineers are working on a content recommendation beta that will present users relevant internal articles on your site after they read a page. This is a great way to drive loyal users and more pageviews.” Hmm.. what’s taken them so long?
(FWIW, using contextual ad-servers to serve content has been one of those ideas that I keep coming round to but never really pursuing: for example, Contextual Content Server, Courtesy of Google?, Google Banner Ads as On-Campus Signage? or Contextual Content Delivery on Higher Ed Websites Using Ad Servers.)
Reflecting on this, I started thinking again about the uses to which we might be able to put adservers. It struck me that one way is actually to use them to serve… ads.
One of the things I’ve noticed about the Open Knowledge Foundation (disclaimer: I work one day a week for the Open Knowledge Foundation’s School of Data) is that it throws up a lot of websites. Digging out a couple of tricks from Pondering Mapping the Pearson Network, I spot at least these domains, for example:
An emergent social positioning map around the SchoolOfData twitter account also identifies a wealth of OKF related projects and local chapters (bottom region of the map), many of which will also run their own web presence:
One of the issues associated with such a widely dispersed and loosely coupled networked organisation relates to the running of campaigns, and promoting strong single campaign issue messages out across the various websites. So I wonder: would an internal adserver work???
… you define web sites, and for each website you then define one or more zones. A zone is a representation of a place on the web pages where the adverts must appear. For every zone, Revive Adserver generates a small piece of HTML code, which must be placed in the site at the exact spot where the ads must be displayed. …
You must also create advertisers, campaigns and advertisements …
The final step is to link the right campaigns to the right zones. This determines which ads will be displayed where. You can combine this with various forms of ‘targeting’, which means you can adjust the advertising to specific situations.
So… each website in the OKF sprawl could include a local adserver zone and display OKF ads. Such ads might be campaign related, or announcements of upcoming dates and events likely to be relevant across the OKF network (for example, international open data days, or open data census days).
Other ad blocks/zones could be defined to serve content from particular ad channels or campaigns.
Ad/content could in part be editorially controlled from the centre – for example, a campaign manager might be responsible for choosing which ads are in the pool for a particular campaign or set of campaigns. Site owners might allocate different zones that they can sign up to different ad channels that only serve ad/content on a particular theme?
Members of local groups and project teams could submit ads to the adserver relating to their projects or group activities, with associated campaign codes and topics so that content can be suitably targeted by the platform. The adserver thus also becomes a(nother) possible communications channel across the network.
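As a way of thinking through the sites/zones/campaigns arrangement described above, here is a minimal sketch of the data model in Python. It is not the Revive Adserver API: the `Campaign` and `Zone` classes, their fields and the `pick_ad` method are all hypothetical names invented for illustration, and the ad selection is a plain random choice rather than any real targeting logic.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """A named pool of ads, e.g. curated centrally by a campaign manager."""
    name: str
    ads: list = field(default_factory=list)


@dataclass
class Zone:
    """A slot on a particular website; the site owner decides which campaigns feed it."""
    site: str
    name: str
    campaigns: list = field(default_factory=list)

    def pick_ad(self, rng=random):
        """Choose one ad at random from every campaign linked to this zone."""
        pool = [ad for campaign in self.campaigns for ad in campaign.ads]
        return rng.choice(pool) if pool else None


# Hypothetical example: one network-wide campaign shown in a sidebar zone.
open_data_day = Campaign("open-data-day", ads=["Join us for Open Data Day"])
sidebar = Zone(site="schoolofdata.org", name="sidebar", campaigns=[open_data_day])
empty_zone = Zone(site="example.org", name="footer")
```

Calling `sidebar.pick_ad()` returns the single ad in the pool, while a zone with no linked campaigns returns `None`, so a site could fall back to its own content.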
Having dusted off and reversioned my Twitter emergent social positioning (ESP) code, and in advance of starting to think about what sorts of analyses I might properly start running, here’s a look back at what I was doing before in terms of charting where particular Twitter accounts sat amongst the other accounts commonly followed by the target account’s followers.
No longer having a whitelisted Twitter API key means the sample sizes I’m running are smaller than they used to be, so maybe that’s a good thing because it means I’ll have to start working properly on the methodology…
Anyway, here’s a quick snapshot of where I think hyperlocal news bloggers @onthewight might be situated on Twitter…
The view aims to map out accounts that are followed by 10 or more people from a sample of about 200 or so followers of @onthewight. The network is laid out according to a force directed layout algorithm with a dash of aesthetic tweaking; nodes are coloured based on community grouping as identified using the Gephi modularity statistic. Which has its issues, but it’s a start. The nodes are sized in the first case according to PageRank.
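The filtering step behind a map like this can be sketched in a few lines of Python. This is only a sketch under assumptions, not the actual ESP code: it assumes the friends lists of the sampled followers have already been fetched into a dictionary, and the function names and toy account names are made up for illustration. The layout, modularity and PageRank steps are left to a tool such as Gephi.

```python
from collections import Counter


def common_follows(friends_by_follower, min_followers=10):
    """Return accounts followed by at least `min_followers` of the sampled followers."""
    counts = Counter()
    for friends in friends_by_follower.values():
        counts.update(set(friends))  # count each account once per follower
    return {account: n for account, n in counts.items() if n >= min_followers}


def esp_edges(friends_by_follower, keep):
    """List (follower, account) edges, restricted to the retained accounts."""
    return [
        (follower, account)
        for follower, friends in friends_by_follower.items()
        for account in set(friends)
        if account in keep
    ]


# Toy data: hypothetical sampled followers and the accounts each one follows.
sample = {
    "follower_a": ["bbc_local", "ferry_co", "stephenfry"],
    "follower_b": ["bbc_local", "ferry_co"],
    "follower_c": ["bbc_local", "stephenfry"],
}
keep = common_follows(sample, min_followers=2)  # threshold lowered for the toy data
edges = esp_edges(sample, keep)
```

With a real sample of 200 or so followers the threshold would go back up to 10, and the edge list could be written out as a CSV file for loading into a network visualisation tool.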
The quick take home from this original sketchmap is that there are a bunch of key information providers in the middle group, local accounts on the left, and slebs on the right.
If we look more closely at the key information providers, they seem to make sense…
These folk are likely to be either competitors of @onthewight, or prospects who might be worth approaching for advertising on the basis that @onthewight’s followers also follow the target account. (Of course, you could argue that because they share followers, there’s no point also using @onthewight as a channel. Except that @onthewight also has a popular blog presence, which would be where any ads were placed; the @onthewight Twitter feed is generally news announcements and live reporting.) A better case could probably be made by looking at the follower profiles of the prospects, along with the ESP maps for the prospects, to see how well the audiences match, what additional reach could be offered, etc etc.
A broad brush view over the island community is a bit more cluttered:
If we rerun PageRank to resize the nodes (note this will no longer take into account contributions from the other communities) and tweak the layout a little, again using a force directed algorithm, we get a bit less of a mess, though the map is still hard to read. Arts to the top, perhaps, Cowes to the right?
Again, with a bit more data, or perhaps a bit more of a think about what sort of map would be useful (and hence, what sort of data to collect), this sort of map might become useful for B2B marketing purposes on the Island. (I’m not really interested in, erm, the plebs such as myself… i.e. people rather than bizs or slebs; though a pleb interest/demographic/reach analysis would probably be the one that would be most useful to take to prospects?).
If we look at the celebrity common follows, again resized and re-laid out, we see what I guess is a typical spread (it’s some time since I looked at these – not sure what the base line is, though @stephenfry still seems to feature high up in the background radiation count).
For bigger companies with their own marketing $, I guess this sort of map is the sort of place to look for potential celebrity endorsements to reinforce a message (folk following these accounts are already aware of @onthewight because they follow @onthewight) as well as potentially widen reach. But I guess the endorsement as reinforcement is more valuable as a legitimising thing?
Just got to work out what to do next, now, and how to start tightening this up and making it useful rather than just of passing interest…
PS A related chart that could be plotted using Facebook data would be to grab down all the likes of the friends of a person or company on Facebook, though I’m not sure how that would work if their account is a page as opposed to a “person”? I’m not so hot on Facebook API/permissions etc, or what sort of information page owners can get about their audience? Also, I’m not sure about the extent to which I can get likes from folk who aren’t my friends or who haven’t granted me app permissions? I used to be able to grab lists of people from groups and trawl through their likes, but I’m not sure default Facebook permissions make that as easy pickings now compared to a year or two ago? (The advantage of Twitter is that the friend/follow data is open on most accounts…)
Towards the end of last week I attended a two day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.
Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian open platform content API.) But I also started to wonder about how we could map the fan out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
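A rough first pass at the counting idea might look something like the following Python sketch. It assumes the Guardian content API search endpoint at `content.guardianapis.com/search`, with `q`, `from-date`, `to-date`, `page-size` and `api-key` parameters and a `response.total` field in the JSON result; those details should be checked against the current API documentation before relying on them. The function names and the date range are made up for illustration, and a keyword match is of course only a crude proxy for "derived from a poll".

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check against the Guardian open platform documentation.
API_BASE = "https://content.guardianapis.com/search"


def search_url(api_key, query=None, from_date="2014-01-01", to_date="2014-01-31"):
    """Build a search URL for a date range, optionally restricted by a query."""
    params = {
        "from-date": from_date,
        "to-date": to_date,
        "page-size": 1,  # only the total count is needed, not the items themselves
        "api-key": api_key,
    }
    if query:
        params["q"] = query
    return API_BASE + "?" + urlencode(params)


def total_results(url):
    """Fetch a search URL and return the reported number of matching items."""
    with urlopen(url) as resp:
        return json.load(resp)["response"]["total"]


def percentage(part, whole):
    """Share of `part` in `whole`, as a percentage (0.0 if `whole` is zero)."""
    return 100.0 * part / whole if whole else 0.0
```

Usage would be along the lines of `percentage(total_results(search_url(key, "poll OR survey")), total_results(search_url(key)))`, where `key` is your own API key.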
This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.
I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.
So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).
Here’s part of the setup, showing the page URL patterns to be searched over.
I added additional refinements to the tab that searches over the news organisations so as to only pull out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (eg in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page where an irrelevant story is mentioned because a poll is linked to in the sidebar).
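The false positive problem is easy to reproduce outside the search engine. The sketch below is not how the custom search engine works; it is a hypothetical post-filter in Python that compares matching against the whole page with matching only against text inside an `<article>` element. The assumption that the story lives in an `<article>` tag will not hold for every news site, and the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser

POLL_WORDS = re.compile(r"\b(poll|survey)s?\b", re.IGNORECASE)


class ArticleText(HTMLParser):
    """Collect only the text that appears inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def mentions_poll(html, body_only=True):
    """True if 'poll' or 'survey' appears in the story text (or anywhere on the page)."""
    if body_only:
        parser = ArticleText()
        parser.feed(html)
        text = " ".join(parser.chunks)
    else:
        text = html
    return bool(POLL_WORDS.search(text))


# A page whose sidebar links to a poll but whose story is about something else.
page = (
    '<html><body><aside><a href="/poll">Take our poll</a></aside>'
    "<article><p>Ferry fares are to rise next month.</p></article></body></html>"
)
```

Here `mentions_poll(page, body_only=False)` is a hit because of the sidebar link, whereas `mentions_poll(page)` correctly ignores the page.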
From way back when, when I took an interest in search more than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…
Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:
And an example of results from the news orgs:
For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):
The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…
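If you wanted to script that sort of exact phrase lookup, the main step is wrapping the quote in double quotes and URL-encoding it. Here is a small Python sketch; the `SEARCH_BASE` address is a placeholder rather than the real search engine URL, and `quoted_search_url` is a made-up helper name.

```python
from urllib.parse import quote_plus

# Placeholder endpoint: substitute the URL of your own custom search engine.
SEARCH_BASE = "https://example.com/search?q="


def quoted_search_url(selection, base=SEARCH_BASE):
    """Build an exact-phrase search URL from a chunk of copied text."""
    phrase = " ".join(selection.split())  # collapse line breaks and repeated spaces
    phrase = phrase.strip('"“”')          # drop any quote marks already around the text
    return base + quote_plus(f'"{phrase}"')


url = quoted_search_url("  more than half\nof respondents said ")
```

The double quotes are encoded as `%22` and spaces as `+`, so the search engine receives the selection as a single quoted phrase.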
To make life easier, an old bookmarklet generator I produced way back when on an Arcadia fellowship at the Cambridge University Library, can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.
Give it a sensible title; then this is the URL chunk you need to add:
Sigh.. I used to have so much fun…
PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:
PPS I’ve started to add additional search domains to the PR search engine to include political speeches.
During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.
… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … [sic] there is a rapidity of movement which calls for an equal speed of execution from the artist’. …
… Baudelaire passionately believed that it was incumbent upon living artists to document their time, recognizing the unique position that a talented painter or sculptor finds him or herself in: ‘Few men are gifted with the capacity of seeing; there are fewer still who possess the power of expression …’ … He challenged artists to find in modern life ‘the eternal from the transitory’. That, he thought, was the essential purpose of art – to capture the universal in the everyday, which was particular to their here and now: the present.
And the way to do that was by immersing oneself in the day-to-day of metropolitan living: watching, thinking, feeling and finally recording.
Will Gompertz, What Are You Looking At?, pp.28-29
Not content with selling off public services, is the government doing all it can to monetise us by means other than taxation by looking for ways of selling off aggregated data harvested from our interaction as users of public services?
For example, “Better information means better care” (door drop/junk mail flyer) goes the slogan that masks the notice that informs you of the right to opt out [how to opt out] of a system in which your care data may be sold on to commercial third parties, in a suitably anonymised form of course… (as per this, perhaps?).
The intention is presumably laudable – better health research? – but when you sell to one person you tend to sell to another… So when I saw this story – Data Broker Was Selling Lists Of Rape Victims, Alcoholics, and ‘Erectile Dysfunction Sufferers’ – I wondered whether care.data could end up going the same way?
Despite all the stories about the care.data release, I have no idea which bit of legislation covers it (thanks, reporters…not); so even if I could make sense of the legalese, I don’t actually know where to read what the legislation says the HSCIC (presumably) can do in relation to sale of care data, how much it can charge, any limits on what the data can be used for etc.
I did think there might be a clause or two in the Health and Social Care Act 2012, but if there is it didn’t jump out at me. (What am I supposed to do next? Ask a volunteer librarian? Ask my MP to help me find out which bit of law applies, and then how to interpret it, as well as game it a little to see how far the letter if not the spirit of the law could be pushed in commercially exploiting the data? Could the data make it as far as Experian, or Wonga, for example, and if so, how might it in principle be used there? Or how about in ad exchanges?)
A little more digging around the HSCIC Data flows transition model turned up some block diagrams showing how data used for commissioning could flow around, but I couldn’t find anything similar as far as sale of care.data to arbitrary third parties goes.
(That’s another reason to check the legislation – there may be a list of what sorts of company are allowed to access care.data for now, but the legislation may also use Henry VIII clauses or other schedule devices to define by what ministerial whim additional recipients or classes of recipient can be added to the list…)
What else? Over on the Open Knowledge Foundation blog (disclaimer: I work for the Open Knowledge Foundation’s School of Data for 1 day a week), I see a guest post from Scraperwiki’s Francis Irving/@frabcus about the UK Government Performance Platform (The best data opens itself on UK Gov’s Performance Platform). The platform reports the number of applications for tax discs over time, for example, or the claims for carer’s allowance. But these headline reports make me think: there is presumably much finer grained data below the level of these reports, presumably tied (for digital channel uptake of these services at least) to Government Gateway IDs. And to what extent is this aggregated personal data sellable? Is the release of this data any different in kind to the release of the other national statistics or personal information containing registers (such as the electoral roll) that the government publishes either freely or commercially?
Time was when putting together a jigsaw of the bits and pieces of information you could find out about a person meant doing a big jigsaw with little pieces. Are we heading towards a smaller jigsaw with much bigger pieces – Google, Facebook, your mobile operator, your broadband provider, your supermarket, your government, your health service?
PS related, in the selling off stakes? Sale of mortgage style student loan book completed. Or this ill thought out (by me) post – Confused by Government Spending, Indirectly… – around government encouraging home owners to take out shared ownership deals with UK gov so it can sell that loan book off at a later date?