Archive for the ‘Anything you want’ Category
Via my feeds, I noticed this the other day: Google is pushing a new content-recommendation system for publishers. VentureBeat quoted a Google-originated email sent to them: “Our engineers are working on a content recommendation beta that will present users relevant internal articles on your site after they read a page. This is a great way to drive loyal users and more pageviews.” Hmm… what’s taken them so long?
(FWIW, using contextual ad-servers to serve content has been one of those ideas that I keep coming round to but never really pursue: for example, Contextual Content Server, Courtesy of Google?, Google Banner Ads as On-Campus Signage? or Contextual Content Delivery on Higher Ed Websites Using Ad Servers.)
Reflecting on this, I started thinking again about the uses to which we might be able to put adservers. It struck me that one way is actually to use them to serve… ads.
One of the things I’ve noticed about the Open Knowledge Foundation (disclaimer: I work one day a week for the Open Knowledge Foundation’s School of Data) is that it throws up a lot of websites. Digging out a couple of tricks from Pondering Mapping the Pearson Network, I spot at least these domains, for example:
An emergent social positioning map around the SchoolOfData Twitter account also identifies a wealth of OKF related projects and local chapters (bottom region of the map), many of which will also run their own web presence:
One of the issues associated with such a widely dispersed and loosely coupled networked organisation relates to the running of campaigns, and promoting strong single campaign issue messages out across the various websites. So I wonder: would an internal adserver work?
… you define web sites, and for each website you then define one or more zones. A zone is a representation of a place on the web pages where the adverts must appear. For every zone, Revive Adserver generates a small piece of HTML code, which must be placed in the site at the exact spot where the ads must be displayed. …
You must also create advertisers, campaigns and advertisements …
The final step is to link the right campaigns to the right zones. This determines which ads will be displayed where. You can combine this with various forms of ‘targeting’, which means you can adjust the advertising to specific situations.
So… each website in the OKF sprawl could include a local adserver zone and display OKF ads. Such ads might be campaign related, or announcements of upcoming dates and events likely to be relevant across the OKF network (for example, international open data days, or open data census days).
Other ad blocks/zones could be defined to serve content from particular ad channels or campaigns.
Ad/content could in part be editorially controlled from the centre – for example, a campaign manager might be responsible for choosing which ads are in the pool for a particular campaign or set of campaigns. Site owners might allocate different zones on their sites and sign them up to different ad channels that only serve ad/content on a particular theme?
Members of local groups and project teams could submit ads to the adserver relating to their projects or group activities, with associated campaign codes and topics so that content can be suitably targeted by the platform. The adserver thus also becomes a(nother) possible communications channel across the network.
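By way of illustration, here’s a toy sketch of the sort of topic-based targeting logic this implies. Everything here – the ad pool, the topic tags, the serve_ad function – is made up for the purposes of the sketch; in Revive Adserver itself this is handled through the admin UI by linking campaigns to zones and setting targeting rules, not through Python.

import random

# A toy pool of ads. In a real Revive Adserver setup these would be banners
# attached to campaigns via the admin UI, not a Python structure.
ads = [
    {'campaign': 'open-data-day', 'topics': ['events', 'open-data'],
     'html': '<a href="http://example.org/odd">International Open Data Day</a>'},
    {'campaign': 'census', 'topics': ['open-data', 'census'],
     'html': '<a href="http://example.org/census">Open Data Census</a>'},
]

def serve_ad(zone_topics):
    # Serve a random ad whose topic tags overlap with the theme(s) a site
    # owner has signed a particular zone up to
    pool = [a for a in ads if set(a['topics']) & set(zone_topics)]
    return random.choice(pool)['html'] if pool else ''

print serve_ad(['events'])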
Having dusted off and reversioned my Twitter emergent social positioning (ESP) code, and in advance of starting to think about what sorts of analyses I might properly start running, here’s a look back at what I was doing before in terms of charting where particular Twitter accounts sat amongst the other accounts commonly followed by the target account’s followers.
No longer having a whitelisted Twitter API key means the sample sizes I’m running are smaller than they used to be, so maybe that’s a good thing, because it means I’ll have to start working properly on the methodology…
Anyway, here’s a quick snapshot of where I think hyperlocal news bloggers @onthewight might be situated on Twitter…
The view aims to map out accounts that are followed by 10 or more people from a sample of about 200 or so followers of @onthewight. The network is laid out according to a force directed layout algorithm with a dash of aesthetic tweaking; nodes are coloured based on community grouping as identified using the Gephi modularity statistic, which has its issues, but it’s a start. The nodes are sized in the first case according to PageRank.
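For reference, here’s a minimal sketch of the sort of data collection that sits behind a map like this, assuming the tweepy and networkx libraries; the credentials, the sample size and the “10 or more” threshold are illustrative, and in practice the Twitter API rate limits mean the friends lookups need throttling.

import tweepy
import networkx as nx

# Illustrative credentials - you need your own Twitter API keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Grab a sample of about 200 followers of the target account
followers = api.followers_ids(screen_name='onthewight')[:200]

# Count which accounts those followers themselves follow
counts = {}
edges = []
for f in followers:
    try:
        friends = api.friends_ids(user_id=f)
    except tweepy.TweepError:
        # protected accounts, rate limiting etc.
        continue
    for fr in friends:
        counts[fr] = counts.get(fr, 0) + 1
        edges.append((f, fr))

# Keep only the accounts followed by 10 or more of the sampled followers
keep = set(k for k, v in counts.items() if v >= 10)
G = nx.DiGraph()
G.add_edges_from((f, fr) for f, fr in edges if fr in keep)

# Export for Gephi, where the force directed layout, modularity-based
# colouring and PageRank-based node sizing can then be applied
nx.write_gexf(G, 'onthewight_esp.gexf')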
The quick take home from this original sketchmap is that there are a bunch of key information providers in the middle group, local accounts on the left, and slebs on the right.
If we look more closely at the key information providers, they seem to make sense…
These folk are likely to be either competitors of @onthewight, or prospects who might be worth approaching for advertising on the basis that @onthewight’s followers also follow the target account. (Of course, you could argue that because they share followers, there’s no point also using @onthewight as a channel – except that @onthewight also has a popular blog presence, which would be where any ads were placed; the @onthewight Twitter feed is generally news announcements and live reporting.) A better case could probably be made by looking at the follower profiles of the prospects, along with the ESP maps for the prospects, to see how well the audiences match, what additional reach could be offered, etc etc.
A broad brush view over the island community is a bit more cluttered:
If we rerun PageRank to resize the nodes (note this will no longer take into account contributions from the other communities) and tweak the layout, again using a force directed algorithm, we get a bit less of a mess, though the map is still hard to read. Arts to the top, perhaps, Cowes to the right?
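(In networkx terms, rerunning PageRank within a single community just means recomputing it over the community subgraph, so that contributions from the other communities drop out – a sketch, assuming the graph G from the earlier snippet with a hypothetical ‘community’ attribute standing in for Gephi’s modularity class on each node:

# Pull out the nodes assigned to one community ('community' is a made-up
# attribute name) and rerank them in isolation from the rest of the graph
nodes = [n for n, d in G.nodes(data=True) if d.get('community') == 2]
H = G.subgraph(nodes)
pr_local = nx.pagerank(H)

)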
Again, with a bit more data, or perhaps a bit more of a think about what sort of map would be useful (and hence, what sort of data to collect), this sort of map might become useful for B2B marketing purposes on the Island. (I’m not really interested in, erm, the plebs such as myself… i.e. people rather than bizs or slebs; though a pleb interest/demographic/reach analysis would probably be the one that would be most useful to take to prospects?)
If we look at the celebrity common follows, again resized and re-layed out, we see what I guess is a typical spread (it’s some time since I looked at these – not sure what the base line is, though @stephenfry still seems to feature high up in the background radiation count).
For bigger companies with their own marketing $, I guess this sort of map is the sort of place to look for potential celebrity endorsements to reinforce a message (the folk in the sample who follow these celebrity accounts already follow @onthewight) as well as potentially widen reach. But I guess the endorsement as reinforcement is more valuable as a legitimising thing?
Just got to work out what to do next, now, and how to start tightening this up and making it useful rather than just of passing interest…
PS A related chart that could be plotted using Facebook data would be to grab down all the likes of the friends of a person or company on Facebook, though I’m not sure how that would work if their account is a page as opposed to a “person”? I’m not so hot on Facebook API/permissions etc, or what sort of information page owners can get about their audience? Also, I’m not sure about the extent to which I can get likes from folk who aren’t my friends or who haven’t granted me app permissions? I used to be able to grab lists of people from groups and trawl through their likes, but I’m not sure default Facebook permissions make that as easy pickings now compared to a year or two ago? (The advantage of Twitter is that the friend/follow data is open on most accounts…)
Towards the end of last week I attended a two day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for journalism students, but it got me thinking again about the way that content gets created and shunted around the news world.
Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian open platform content API.) But I also started to wonder about how we could map the fan-out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
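On the first of those idle wonderings, here’s a minimal sketch of the sort of counting the Guardian Open Platform content API supports, using the public ‘test’ api-key; the date range and query are purely illustrative, and of course this counts articles that merely mention the words, not articles actually derived from polls.

import urllib, urllib2, json

def article_count(params):
    # Ask the Guardian content API how many articles match a query
    p = dict(params)
    p['api-key'] = 'test'  # public demo key
    url = 'http://content.guardianapis.com/search?' + urllib.urlencode(p)
    return json.load(urllib2.urlopen(url))['response']['total']

total = article_count({'from-date': '2013-01-01', 'to-date': '2013-12-31'})
polled = article_count({'q': '"poll" OR "survey"',
                        'from-date': '2013-01-01', 'to-date': '2013-12-31'})
print 'Rough poll/survey mention rate:', float(polled) / total

That only addresses the counting question, of course – mapping the fan-out proper is a different, harder problem.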
This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.
I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.
So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).
Here’s part of the setup, showing the page URL patterns to be searched over.
I added additional refinements to the tab that searches over the news organisations so as to only pull out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (eg in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page containing an irrelevant story because a poll is linked to in the sidebar).
From way back, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…
Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:
And an example of results from the news orgs:
For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):
The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…
To make life easier, an old bookmarklet generator I produced way back when, on an Arcadia fellowship at the Cambridge University Library, can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.
Give it a sensible title; then this is the URL chunk you need to add:
Sigh.. I used to have so much fun…
PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:
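Something along these lines should do the trick – a reconstruction rather than the exact installed code, with the custom search engine identifier left as a placeholder (YOUR_CSE_ID) to be replaced with your own:

javascript:(function(){var s=window.getSelection().toString();window.location='http://www.google.com/cse?cx=YOUR_CSE_ID&q='+encodeURIComponent('"'+s+'"');})()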
PPS I’ve started to add additional search domains to the PR search engine to include political speeches.
During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.
… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … [sic] there is a rapidity of movement which calls for an equal speed of execution from the artist’. …
… Baudelaire passionately believed that it was incumbent upon living artists to document their time, recognizing the unique position that a talented painter or sculptor finds him or herself in: ‘Few men are gifted with the capacity of seeing; there are fewer still who possess the power of expression …’ … He challenged artists to find in modern life ‘the eternal from the transitory’. That, he thought, was the essential purpose of art – to capture the universal in the everyday, which was particular to their here and now: the present.
And the way to do that was by immersing oneself in the day-to-day of metropolitan living: watching, thinking, feeling and finally recording.
Will Gompertz, What Are You Looking At?, pp.28-29
Not content with selling off public services, is the government doing all it can to monetise us by means other than taxation, looking for ways of selling off aggregated data harvested from our interactions as users of public services?
For example, “Better information means better care” (door drop/junk mail flyer) goes the slogan that masks the notice that informs you of the right to opt out [how to opt out] of a system in which your care data may be sold on to commercial third parties, in a suitably anonymised form of course… (as per this, perhaps?).
The intention is presumably laudable – better health research? – but when you sell to one person you tend to sell to another… So when I saw this story – Data Broker Was Selling Lists Of Rape Victims, Alcoholics, and ‘Erectile Dysfunction Sufferers’ – I wondered whether care.data could end up going the same way?
Despite all the stories about the care.data release, I have no idea which bit of legislation covers it (thanks, reporters…not); so even if I could make sense of the legalese, I don’t actually know where to read what the legislation says the HSCIC (presumably) can do in relation to sale of care data, how much it can charge, any limits on what the data can be used for etc.
I did think there might be a clause or two in the Health and Social Care Act 2012, but if there is it didn’t jump out at me. (What am I supposed to do next? Ask a volunteer librarian? Ask my MP to help me find out which bit of law applies, and then how to interpret it, as well as game it a little to see how far the letter if not the spirit of the law could be pushed in commercially exploiting the data? Could the data make it as far as Experian, or Wonga, for example, and if so, how might it in principle be used there? Or how about in ad exchanges?)
A little more digging around the HSCIC Data flows transition model turned up some block diagrams showing how data used for commissioning could flow around, but I couldn’t find anything similar as far as sale of care.data to arbitrary third parties goes.
(That’s another reason to check the legislation – there may be a list of what sorts of company is allowed to access care.data for now, but the legislation may also use Henry VIII clauses or other schedule devices to define by what ministerial whim additional recipients or classes of recipient can be added to the list…)
What else? Over on the Open Knowledge Foundation blog (disclaimer: I work for the Open Knowledge Foundation’s School of Data for 1 day a week), I see a guest post from Scraperwiki’s Francis Irving/@frabcus about the UK Government Performance Platform (The best data opens itself on UK Gov’s Performance Platform). The platform reports the number of applications for tax discs over time, for example, or the claims for carer’s allowance. But these headline reports make me think: there is presumably much finer grained data below the level of these reports, presumably tied (for digital channel uptake of these services at least) to Government Gateway IDs. And to what extent is this aggregated personal data sellable? Is the release of this data any different in kind to the release of the other national statistics or personal information containing registers (such as the electoral roll) that the government publishes either freely or commercially?
Time was when putting together a jigsaw of the bits and pieces of information you could find out about a person meant doing a big jigsaw with little pieces. Are we heading towards a smaller jigsaw with much bigger pieces – Google, Facebook, your mobile operator, your broadband provider, your supermarket, your government, your health service?
PS related, in the selling off stakes? Sale of mortgage style student loan book completed. Or this ill thought out (by me) post – Confused by Government Spending, Indirectly… – around government encouraging home owners to take out shared ownership deals with UK gov so it can sell that loan book off at a later date?
Prompted by an email request, I’ve revisited the code I used to generate emergent social positioning maps in Twitter as an iPython notebook that reuses chunks of code from, as well as the virtual machine used to support, Matthew A. Russell’s Mining the Social Web (2nd edition) [code].
As a reminder, the social positioning maps show folk commonly followed by the followers of a particular twitter user.
As far as [Duchamp] was concerned, the role in society of an artist was akin to that of a philosopher; it didn’t even matter if he or she could paint or draw. An artist’s job was not to give aesthetic pleasure – designers could do that; it was to step back from the world and attempt to make sense or comment on it through the presentation of ideas that had no functional purpose other than themselves.
– Will Gompertz, What Are You Looking At? p. 10
Every so often I do a round up of job openings in different areas. This is particularly true around year end, as I look at my dwindling salary (no more increments, ever, and no hope of promotion, …) and my overall lack of direction, and try to come up with some sort of resolution to play with during the first few weeks of the year.
The data journalism phrase has been around for some time now (was it really three and a half years ago since the Data Driven Journalism Round Table at the European Journalism Centre? FFS, what have I been doing for the last three years?!) :-( and it seems to be maturing a little. We’ve had the period of shiny, shiny web apps requiring multiskilled development teams and designers working with the hacks to produce often confusing “wtf am I supposed to be doing now?!” interactives, and things seem to be becoming a little more embedded… Perhaps…
My reading (as an outsider) is that there is now more of a move towards developing some sort of data skillbase that allows journalists to do “investigative” sorts of things with data, often using very small data sets or concise summary datasets. To complement this, there seems to be some sort of hope that visually appealing charts can be used to hook eyeballs into a story (rather than pushing eyeballs away) – Trinity Mirror’s Ampp3d (as led by Martin Belam) is a good example of this, as is the increasing(?) use of the DataWrapper library.
From working with the School of Data, as well as a couple of bits of data journalism not-really-training with some of the big news groups, I’ve come to realise there is probably some really basic, foundational work to be done in the way people think (or don’t think) about data. For example, I don’t believe that people in general read charts. I think they may glance at them, but they don’t try to read them. They have no idea what story they tell. Given a line chart that plots some figure over time, how many people ride the line to get a feel for how it really changed?
Hans Rosling famously brings data alive with his narrative commentary around animated development data charts, including bar charts…
But if you watch the video with the sound off, or just look at the final chart, do you have the feeling of being told the same story? Can you even retell yourself the story by looking at the chart? And how about if you look at another bar chart? Can you use any of Hans Rosling’s narrative or rhetorical tricks to help you read through those?
(The rhetoric of data (and the malevolent arts of persuasion) is something I want to ponder in more depth next year, along with the notion of data aesthetics and the theory of forms given a data twist.)
Another great example of narrated data storytelling comes from Kurt Vonnegut as he describes the shapes of stories:
Is that how you read a line chart when you see one?
One thing about the data narration technique is that it is based around the construction of a data trace. There is a sense of anticipation about where the line will go next, and uncertainty as to what sort of event will cause the line to move one way or another. Looking back at a completed data chart, what points do we pick from it that we want to use as events in our narration or reading of it? (The lines just connect the points – they are processional in the way they move us from one point of interest to the next, although the gradient of the line may provide us with ideas for embellishing or decorating the story a little.)
It’s important to make art because the people that get the most out of art are the ones that make it. It’s not … You know there’s this idea that you go to a wonderful art gallery and it’s good for you and it makes you a better person and it informs your soul, but actually the person who’s getting the most out of any artistic activity is the person who makes it because they’re sort of expressing themselves and enjoying it, and they’re in the zone and you know it’s a nice thing to do. [Grayson Perry, Reith Lectures 2013, Lecture 2, Q&A [transcript, PDF], audio]
In the same way, the person who gets the most out of a chart is the person who constructed it. They know what they left in and what they left out. They know why the axes are selected as they are, why elements are coloured or sized as they are. They know the question that led up to the chart and the answers it provides to those questions. They know where to look. Like an art critic who reads their way round a painting, they know how to read one or many different stories from the chart.
The interactives that appeared during the data journalism wave from a couple of years ago sought to provide a playground for people to play with data and tell their own stories with it. But they didn’t. In part because they didn’t know how to play with data, didn’t know how to use it in a constructive way as part of a narrative (even a made up, playful narrative). And in part this comes back to not knowing how to read – that is, recover stories from – a chart.
It is often said that a picture is worth a thousand words, but if the picture tells a thousand word story, how many of us try to read that thousand word story from each picture or chart? Maybe we need to use a thousand words as well as the chart? (How many words does Hans Rosling use? How many, Kurt Vonnegut?)
When producing a chart that essentially represents a summary of a conversation we have had with a dataset, it’s important to remember that for someone looking at the final chart it might not make as much sense in the absence of the narrative that was used to construct it. Edward de Bono’s constructed illustrations help us read the final image through recalling his narrative. But if we just look at a “completed” sketch from one of his talks, it will probably be meaningless.
One of the ideas that works for me when I reflect on my own playing with data is that it is a conversation. Meaning is constructed through the conversation I have with a dataset, and the things it reveals when I pose particular questions to it. In many cases, these questions are based on filtering a dataset, although the result may be displayed in many ways. The answers I get to a question inform the next question I want to ask. Questions take the form of constructing this chart as opposed to that chart, though I am free to ask the same question in many slightly different ways if the answers don’t appear to be revealing of anything.
It is in this direction – of seeing data as a source that can be interviewed and coaxed into telling stories – that I sense elements of the data journalism thang are developing. This leads naturally into seeing data journalism skills as core investigative style skills that all journalists would benefit from. (Seeing things as data allows you to ask particular sorts of question in very particular ways. Being able to cast things into a data form – as for example in Creating Data from Text – Regular Expressions in OpenRefine – so that they become amenable to data-style queries, is the next idea I think we need to get across…)
So what are the jobs that are out at the moment? Here’s a quick round-up of some that I’ve spotted…
- Data editor (Guardian): “develop and implement a clear strategy for the Data team and the use of data, numbers and statistics to generate news stories, analysis pieces, blogs and fact checks for The Guardian and The Observer.
You will take responsibility for commissioning and editing content for the Guardian and Observer data blogs as well as managing the research needs of the graphics department and home and foreign news desks. With day-to-day managerial responsibility for a team of three reporters / researchers working on the data blog, you will also be responsible for data analysis and visualisation: using a variety of specialist software and online tools, including Tableau, ARCGis, Google Fusion, Microsoft Access and Excel”
Perpetuating the “recent trad take” on data journalism, with the data journalist viewed as gonzo journalist-hacker:
- Data Journalist [Telegraph Media Group]: “[S]ource, sift and surface data to find and generate stories, assist with storytelling and to support interactive team in delivering data projects.
“The Data Journalist will sit within the Interactive Data team, and will work with a team of designers, web developers and journalists on data-led stories and in developing innovative interactive infographics, visualisations and news applications. They will need to think and work fast to tackle on-going news stories and tight deadlines.
- One of the most exciting opportunities that I can see around data-related publishing is in new workflows that minimise the gap between investigative tools and published outputs. Set against that, this next role seems to me a bit risky, in that it looks rather conservative when it comes to getting data outputs actually published?
- Designer [Trinity Mirror]: “Trinity Mirror’s pioneering data unit is looking for a first-class designer to work across online and print titles. … You will be a whizz with design software – such as Illustrator, Photoshop and InDesign – and understand the principles of designing infographics, charts and interactives for the web. You will also be able to design simple graphical templates for re-purposing around the group.
“You should have a keen interest in current affairs and sport, and be familiar with – and committed to – the role data journalism can play in a modern newsroom.”
- Coder/Developer [Trinity Mirror]: “Can you take an API feed and turn it into a compelling gadget which will get the whole country talking?
“Trinity Mirror’s pioneering data unit is looking for a coder/developer to help it take the next step in helping shape the future of journalism. …
“You will be able to create tools which automatically grab the latest data and use them to create interactive, dynamically-updated charts, maps and gadgets across a huge range of subjects – everything from crime to football. …
“The successful candidate will have a thorough knowledge of scraping techniques, ability to manage a database using SQL, and demonstrable ability in at least one programming language.”
But there is hope about the embedding of data skills as part of everyday journalistic practice:
- Culture Reporter (Guardian): “We are looking for a Culture Reporter to generate stories and cover breaking news relating to Popular Culture, Film and Music … Applicants should also have expertise with digital tools including blogging, social media, data journalism and mobile publishing.”
- Investigations Correspondent [BBC Newsnight]: “Reporting to the Editor, the Investigations Correspondent will contribute to Newsnight by producing long term investigations as well as sometimes contributing to big ongoing stories. Some investigations will take months, but there will also be times when we’ll need to dig up new lines on moving the stories in days.
“We want a first rate reporter with a proven track record of breaking big stories who can comfortably work across all subject areas from politics to sport. You will be an established investigative journalist with a wide range of contacts and sources as well as having experience with a range of different investigative approaches including data journalism, Freedom Of Information (FOI) and undercover reporting.”
- News Reporter, GP [Haymarket Medical Media]: “GP is part of Haymarket Medical Media, which also produces MIMS, Medeconomics, Inside Commissioning, and mycme.com, and delivers a wide range of medical education projects. …
“Ideally you will also have some experience of data journalism, understand how social media can be used to enhance news coverage and have some knowledge of multimedia journalism, including video and blogs.”
- Reporter, ENDS Report [Haymarket]: “We are looking for someone who has excellent reporting and writing skills, is enthusiastic and able to digest and summarise in depth documents and analysis. You will also need to be comfortable with dealing with numbers and statistics and prepared to sift through data to find the story that no one else spots.
“Ideally you will have some experience of data journalism, understand how social media can be used to enhance news coverage.”
Are there any other current ones I’m missing?
I think the biggest shift we need is to get folk treating data as a source that responds to a particular style of questioning. Learning how to make the source comfortable and get it into a state where you can start to ask it questions is one key skill. Knowing how to frame questions so that you discover the answers you need for a story is another. Choosing which bits of the conversation you use in a report (if any – maybe the conversation is akin to a background chat?) is yet another.
Treating data as a source also helps us think about how we need to take care with it – how not to ask leading questions, how not to get it to say things it doesn’t mean. (On the other hand, some folk will undoubtedly force the data to say things it never intended to say…)
“If you torture the data enough, nature will always confess” [Ronald Coase]
[Disclaimer: I started looking at some medical data for Haymarket.]
A nice observation, pointed out by Kay Bromley in a department meeting earlier this week whilst reporting on the OU’s changing model for student support: if we introduce learning analytics that trigger particular interventions (for example, email prompts), we should expect a certain percentage of those interventions to result in additional calls for support…
Which is to say, a consequence of analytics driven interventions may be a need to provide additional levels of support.
Another factor to be taken into account is the extent to which Associate Lecturers (that is, personal module tutors) need to be informed when an intervention occurs, because there is a good chance that the AL will be the person a student contacts following an automated intervention or alert…
A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained with PDFs…).
For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:
Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.
import os, re
import urllib2
# New OU course will start using pandas, so I need to start getting familiar with it.
# In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

# Load in the data scraped from Shell
df = pd.read_csv('shell_30_11_13_ng.csv')

errors = []

# For each line item, grab the PDF link (the last column of the scraped CSV):
for url in df[df.columns[-1]]:
    try:
        print 'trying', url
        u = urllib2.urlopen(url)
        fn = url.split('/')[-1]
        # Grab a local copy of the downloaded picture-containing PDF
        localFile = open(fn, 'wb')
        localFile.write(u.read())
        localFile.close()
    except:
        print 'error with', url
        errors.append(url)
        continue
    # If we look at the filenames/urls, the filenames tend to start with the JIV id
    # ...so we can try to extract this and use it as a key
    id = re.split(r'[_-]', fn)[0]
    # Move the PDFs and the associated images stripped from them into separate folders
    fo = 'data/' + id
    os.system(' '.join(['mkdir', fo]))
    idp = '/'.join([fo, id])
    # Try to cope with crappy filenames containing punctuation chars
    fn = re.sub(r'([()&])', r'\\\1', fn)
    # THIS IS THE LINE THAT PULLS OUT THE IMAGES
    # pdfimages is available via poppler-utils
    # See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    # Note: the '; mv' bit moves the PDF file into the new JIV report directory
    cmd = ' '.join(['pdfimages -j', fn, idp, '; mv', fn, fo])
    os.system(cmd)

# Still a couple of errors on filenames - just as quick to catch by hand,
# by inspecting files that don't get moved properly
print 'Errors', errors
Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng
The important line of code in the above is:
pdfimages -j FILENAME OUTPUT_STUB
FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
pdfimages can be downloaded as part of poppler (I think?!)
See also this Stack Exchange question/answer: Extracting images from a PDF
PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.
http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:
- generate post containing images and body text made up from data in the associated line from the CSV file.
Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
Comments: Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf
So we can pull this out for the body post. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
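As a sketch of what that post generation might look like using the python-wordpress-xmlrpc library – the site URL and credentials are placeholders, and the tag construction just follows the ideas above:

import re
import pandas as pd
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

# Placeholder connection details for the target WordPress blog
client = Client('http://example.com/xmlrpc.php', 'username', 'password')

df = pd.read_csv('shell_30_11_13_ng.csv')
for ix, row in df.iterrows():
    post = WordPressPost()
    post.title = str(row['Incident Site'])
    # Body text: one line per column from the scraped CSV
    post.content = '<br/>'.join('%s: %s' % (k, row[k]) for k in df.columns)

    # Tag by cause and terrain; also pull out a leading pipe diameter (eg 10)
    tags = [str(row['Cause']), str(row['Terrain'])]
    d = re.match(r'\d+', str(row['Incident Site']))
    if d:
        tags.append(d.group(0) + 'in')
    post.terms_names = {'post_tag': tags, 'category': ['oil spill']}

    # Images pulled from the PDFs could be attached via
    # wordpress_xmlrpc.methods.media.UploadFile
    client.call(NewPost(post))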
There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:
Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
If those phrases are common/templated refrains, we can parse the corresponding dates out?
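For example, a quick sketch of what that parsing might look like, assuming the phrasing really is that consistent:

import re

s = ("Recovery of spilled volume commenced on 6th January 2013 and was "
     "completed on 22nd January 2013. Cleanup of residual impacted area "
     "was completed on 9th May 2013.")

# Capture the action word and the day/month/year of the date that follows it
pattern = r'(commenced|completed) on (\d{1,2})(?:st|nd|rd|th) (\w+) (\d{4})'
for action, day, month, year in re.findall(pattern, s):
    print action, '->', day, month, year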
I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?