A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained within PDFs…).
For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:
Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.
import os, re
import urllib2
#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df = pd.read_csv('shell_30_11_13_ng.csv')
errors = []

#For each line item, grab the PDF link (the 'Photo' column is the last one in the scraped file)
for url in df[df.columns[-1]]:
    try:
        print 'trying', url
        u = urllib2.urlopen(url)
        fn = url.split('/')[-1]
        #Grab a local copy of the downloaded picture-containing PDF
        localFile = open(fn, 'wb')
        localFile.write(u.read())
        localFile.close()
    except:
        print 'error with', url
        errors.append(url)
        continue
    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    id = re.split(r'[_-]', fn)[0]
    #I'm going to move the PDFs and the associated images stripped from them into separate folders
    fo = 'data/' + id
    os.system(' '.join(['mkdir', fo]))
    idp = '/'.join([fo, id])
    #Try to cope with crappy filenames containing punctuation chars
    fn = re.sub(r'([()&])', r'\\\1', fn)
    #THIS IS THE LINE THAT PULLS OUT THE IMAGES
    #Available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' etc bit moves the PDF file into the new JIV report directory
    cmd = ' '.join(['pdfimages -j', fn, idp, '; mv', fn, fo])
    os.system(cmd)

#Still a couple of errors on filenames
#just as quick to catch by hand/inspection of files that don't get moved properly
print 'Errors', errors
Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng
The important line of code in the above is:
pdfimages -j FILENAME OUTPUT_STUB
FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
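(If you’d rather not shell out via os.system, something like the following fragment should do much the same job using subprocess – a minimal sketch, with a made-up filename and output stub rather than the ones pulled from the Shell data:)

import subprocess

#Minimal sketch: extract images from one PDF using pdfimages (poppler-utils)
#'example.pdf' and the 'out/img' output stub are made-up names for illustration
subprocess.call(['mkdir', '-p', 'out'])
subprocess.call(['pdfimages', '-j', 'example.pdf', 'out/img'])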
pdfimages can be downloaded as part of poppler (I think?!)
See also this Stack Exchange question/answer: Extracting images from a PDF
PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.
http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:
- generate a post containing images and body text made up from data in the associated line from the CSV file.
| Date Reported | Incident Site | JIV Date | Terrain | Cause | Estimated Spill Volume (bbl) | Clean-up Status | Comments | Photo |
|---|---|---|---|---|---|---|---|---|
| 02-Jan-13 | 10″ Diebu Creek – Nun River Pipeline at Onyoma | 05-Jan-13 | Swamp | Sabotage/Theft | 65 | Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. | Site Certification was completed on 28th June 2013. | http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf |
So we can pull this out for the post body. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on, for example, the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:
Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
If those phrases are common/templated refrains, we can parse the corresponding dates out?
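(As a rough sketch of what that might look like – assuming the status text really does stick to the templated phrasing quoted above – a regex-based fragment along these lines could pull the dates out:)

import re

#Hypothetical example: the clean-up status text quoted above
status = ("Recovery of spilled volume commenced on 6th January 2013 and was completed "
          "on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.")

#Assumes dates always appear in the form '6th January 2013'
date_pattern = r'(\d{1,2})(?:st|nd|rd|th) (\w+) (\d{4})'

recovery = re.search('Recovery of spilled volume commenced on ' + date_pattern, status)
cleanup = re.search('Cleanup of residual impacted area was completed on ' + date_pattern, status)

if recovery:
    print 'Recovery commenced:', ' '.join(recovery.groups())
if cleanup:
    print 'Cleanup completed:', ' '.join(cleanup.groups())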
I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?
Via @wilm, I notice that it’s time again for someone (this time at the Wall Street Journal) to have written about the scariness that is your Google personal web history (the sort of thing you probably have to opt out of if you sign up for a new Google account, if other recent opt-in by defaults are to go by…)
It may not sound like much, but if you do have a Google account, and your web history collection is not disabled, you may find your emotional response to seeing months or years of your web/search history archived in one place surprising… Your Google web history.
Not mentioned in the WSJ article were some of the games that the Chrome browser gets up to. @tim_hunt tipped me off to a nice (if technically detailed, in places) review by Ilya Grigorik of some of the design features of the Chrome browser, and some of the tools built into it: High Performance Networking in Chrome. I’ve got various pre-fetching tools switched off in my version of Chrome (tools that allow Chrome to pre-emptively look up web addresses and even download pages pre-emptively*) so those tools didn’t work for me… but looking at chrome://predictors/ was interesting to see which keystrokes I type are good predictors of web pages I visit…
* By the by, I started to wonder whether webstats get messed up to any significant effect by Chrome pre-emptively prefetching pages that folk never actually look at…?
In further relation to the tracking of traffic we generate from our browsing habits, as we access more and more web/internet services through satellite TV boxes, smart TVs, and catchup TV boxes such as Roku or NowTV, have you ever wondered about how that activity is tracked? LG Smart TVs logging USB filenames and viewing info to LG servers describes how LG TVs appear to log not only the things you view, but also the personal media files you might view, and in principle can phone that information home (because the home for your data is a database run by whatever service you happen to be using – your data is midata is their data).
there is an option in the system settings called “Collection of watching info:” which is set ON by default. This setting requires the user to scroll down to see it and, unlike most other settings, contains no “balloon help” to describe what it does.
At this point, I decided to do some traffic analysis to see what was being sent. It turns out that viewing information appears to be being sent regardless of whether this option is set to On or Off.
you can clearly see that a unique device ID is transmitted, along with the Channel name … and a unique device ID.
This information appears to be sent back unencrypted and in the clear to LG every time you change channel, even if you have gone to the trouble of changing the setting above to switch collection of viewing information off.
It was at this point, I made an even more disturbing find within the packet data dumps. I noticed filenames were being posted to LG’s servers and that these filenames were ones stored on my external USB hard drive.
Hmmm… maybe it’s time I switched out my BT homehub for a proper hardware firewalled router with a good set of logging tools…?
PS FWIW, I can’t really get my head round how evil on the one hand, or damp squib on the other, the whole midata thing is turning out to be in the short term, and what sorts of involvement – and data – the partners have with the project. I did notice that a midata innovation lab report has just become available, though to you and me it’ll cost 1500 squiddly diddlies so I haven’t read it: The midata Innovation Opportunity. Note to self: has anyone got any good stories to tell about TSB supporting innovation in micro-businesses…?
PPS And finally, something else from the Ilya Grigorik article:
The HTTP Archive project tracks how the web is built, and it can help us answer this question. Instead of crawling the web for the content, it periodically crawls the most popular sites to record and aggregate analytics on the number of used resources, content types, headers, and other metadata for each individual destination. The stats, as of January 2013, may surprise you. An average page, amongst the top 300,000 destinations on the web is:
- 1280 KB in size
- composed of 88 resources
- connects to 15+ distinct hosts
Is it any wonder that pages take so long to load on a mobile phone over the 3G network, and that you can soon eat up your monthly bandwidth allowance?!
A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)
An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.
Here’s a quick example of the sort of thing I mean – the generation of this piece of text:
The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
from a data table that can be sliced like this:
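(The proof of concept behind that sentence was actually written in R – see below – but as a flavour of the sort of mechanical mapping involved, here’s a minimal Python sketch; the figures are the ones in the sentence above and the sentence template is my own guess at it:)

#Minimal sketch of 'data textualisation': turn a few numbers into a sentence
#The figures are taken from the JSA example above; the template wording is illustrative
area = "the Isle of Wight"
month, year = "October", 2013
current, previous_month, previous_year = 2781, 2687, 3158

def change_phrase(now, then):
    #Describe a difference as 'up N from X' or 'down N from X'
    diff = now - then
    direction = 'up' if diff >= 0 else 'down'
    return '{0} {1} from {2}'.format(direction, abs(diff), then)

sentence = ("The total number of people claiming Job Seeker's Allowance (JSA) "
            "on {0} in {1} was {2}, {3} in September, {4}, "
            "and {5} in {1}, {6}.").format(area, month, current,
                                           change_phrase(current, previous_month), year,
                                           change_phrase(current, previous_year), year - 1)

print sentence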
In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or of rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.
The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.
There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.
There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements (“up 94” is a representation (both in the sense of rep-resentation and re-presentation) of a month on month change of +94; “down 377” is generated mechanically from a year on year comparison).
Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.
The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis jobs figures.)
PS why “data textualisation”, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is a more general term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting, but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…
Picking up on Political Representation on BBC Political Q&A Programmes – Stub, the quickest of hacks…
In OpenRefine, create a new project by importing data from a couple of URLs – data from the BBC detailing episode IDs for Any Questions and Question Time:
Import the data as XML, highlighting a single programme code row as the import element.
The data we get looks like this – /programmes/b007ck8s#programme – so we can add a column by fetching URLs around 'http://www.bbc.co.uk'+value.split('#')[0]+'.json' to get JSON data back for each programme.
Parse the JSON that comes back using something like value.parseJson()['programme']['medium_synopsis'] to create a new column containing the medium synopsis information.
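(Outside OpenRefine, the same lookup is easy enough to sketch in Python – a minimal fragment, using the example programme ID from above; the 'programme'/'medium_synopsis' path is the one used in the GREL expression:)

import urllib2, json

#Minimal sketch: fetch the JSON view of a programme page and pull out the medium synopsis
pid = 'b007ck8s'
data = json.load(urllib2.urlopen('http://www.bbc.co.uk/programmes/' + pid + '.json'))
print data['programme']['medium_synopsis']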
The medium synopsis elements typically look like “Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone.” Which is to say, they often contain the names of the panellists.
We can try to extract the names contained within each synopsis using the Zemanta API (key required) accessed via the Named-Entity Recognition extension for Google Refine / OpenRefine.
These seem to come back in reconciliation API form with the name set to a name and the id to a URL. We can get a concatenated list of the URLs that are returned by creating a column around something like this: forEach(cell.recon.candidates,v,v.id).sort().join('||') but I’m not sure that’s useful.
We can create a column based just around the matched name using cell.recon.match.name.
Let’s use the row view and fill down on programme IDs, then have a look at a duplicate facet and view only rows that are duplicated (that is, where an extracted named entity appears more than once). We can also use a text facet to see which names appear in multiple episodes of Question Time and/or Any Questions.
Selecting a single name allows us to see the programmes that person appeared on. If we pull out the time of first broadcast (value.parseJson()['programme']['first_broadcast_date']) and Edit Cells-Common Transforms-To date, we can also use a date facet to select out programmes first broadcast within a particular date range.
We can also run a text filter to limit records to episodes including a particular person and then use the Date facet to highlight the episodes in which they appeared on the timeline:
What this suggests is that we can use OpenRefine as some sort of ‘application shell’ for creating information tools around a particular dataset without actually having to build UI components ourselves?
If we custom export a table using programme IDs and matched names, and then rename the columns Source and Target, we can visualise them in something like Gephi (you can use the recipe described in the second part of this walkthrough: An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine).
The directed graph we load into Gephi connects entities (participant names, location names) with programme IDs. There is a handy tool – Multimode Networks Projection – that can collapse the graph so that entities are connected to other entities that they shared a programme ID with.
(If you forget to remove the programme nodes, a degree range filter to select only nodes with degree greater than 2 tidies the graph up.)
If we run PageRank on the graph (now as an undirected graph), layout using ForceAtlas2 and size nodes according to PageRank, we can look into the heart of the UK political establishment as evidenced by appearances on Question Time and Any Questions.
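(If you’d rather stay in code than fire up Gephi, the projection-plus-PageRank idea can also be sketched with networkx – a rough fragment, assuming an edge list file edges.csv with the Source (programme ID) and Target (entity name) columns exported above:)

import csv
import networkx as nx
from networkx.algorithms import bipartite

#Rough sketch: load the programme-entity edge list exported from OpenRefine
#('edges.csv' with Source/Target columns is an assumed filename/layout)
B = nx.Graph()
with open('edges.csv') as f:
    for row in csv.DictReader(f):
        B.add_node(row['Source'], kind='programme')
        B.add_node(row['Target'], kind='entity')
        B.add_edge(row['Source'], row['Target'])

#Project onto the entity nodes: entities are linked if they appeared on the same programme
entities = [n for n, d in B.nodes(data=True) if d['kind'] == 'entity']
G = bipartite.projected_graph(B, entities)

#Rank entities by PageRank over the projected (undirected) graph
pr = nx.pagerank(G)
for name, score in sorted(pr.items(), key=lambda x: -x[1])[:10]:
    print name, score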
The next step would probably be to try to pull info about each recognised entity from dbPedia (forEach(cell.recon.candidates,v,v.id).sort() seems to pull out dbpedia URIs) but grabbing data from dbPedia seems to be borked in my version of OpenRefine atm:-(
Anyway – a quick hack that took longer to write up than it did to do…
OpenRefine project file here.
It’s too nice a day to be inside hacking around with Parliament data as a remote participant in today’s Parliamentary hack weekend (resource list), but if it had been a wet weekend I may have toyed with one of the following:
- revisiting this cleaning script for Analysing UK Lobbying Data Using OpenRefine (actually, a look at who funds/offers support for All Party Groups. The idea was to get a dataset of people who provide secretariat and funds to APPGs, as well as who works for them, and then do something with that dataset…)
- tinkering with data from Question Time and Any Questions…
On that last one:
These give us generatable URLs for programme listings by month, with URLs of the form http://www.bbc.co.uk/programmes/b006t1q9/broadcasts/2013/01 – but how do we get a JSON version of that?! Adding .json on the end doesn’t work?!:-( UPDATE – this could be a start, via @nevali – use the pattern /programmes/PID.rdf, such as http://www.bbc.co.uk/programmes/b006qgvj.rdf
We can get bits of information about panellists (albeit in semi-structured form) from programme URL hacks like this: http://www.bbc.co.uk/programmes/b007m3c1.json
Note that some older programmes don’t list all the panellists in the data? So a visit to Wikipedia – http://en.wikipedia.org/wiki/List_of_Question_Time_episodes#2007 – may be in order for Question Time (there isn’t a similar page for Any Questions?)
Given panellists (the BBC could be more helpful here in the way it structures its data…), see if we can identify parliamentarians (MP suffix? Lord/Lady title?) and look them up using the new-to-me, not-yet-played-with-it UK Parliament – Members’ Names Data Platform API. Not sure if reconciliation works on parliamentarian lookup (indeed, not sure if there is a reconciliation service anywhere for looking up MPs, members of the House of Lords, etc?)
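(A crude first pass at the “identify parliamentarians” step might just look for title/suffix patterns in the extracted names before doing any API lookup – a minimal, illustrative sketch; the name list is made up:)

import re

#Made-up example names of the sort pulled out of the programme synopses
names = ['Peter Hain MP', 'Sir Menzies Campbell', 'Lord Ashdown',
         'Baroness Williams', 'Beverley Knight', 'Cristina Odone']

#Crude heuristic: an 'MP' suffix or a peerage-style title suggests a parliamentarian
parl_pattern = re.compile(r'\bMP$|^(Lord|Lady|Baroness|Baron)\b')

for name in names:
    if parl_pattern.search(name):
        print name, '- looks like a parliamentarian; try a Members Names lookup'
    else:
        print name, '- no obvious parliamentary title'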
From Members’ Names API, we can get things like gender, constituency, whether or not they were holding a (shadow) cabinet post, maybe whether they were on a particular committee at the time etc. From programme pages, we may be able to get the location of the programme recording. So this opens up possibility of mapping geo-coverage of Question Time/Any Questions, both in terms of where the programmes visit as well as which constituencies are represented on them.
If we were feeling playful, we could also have a stab at which APPGs have representation on those programmes!
It also suggests a simpler hack – of just providing a briefing around the representatives appearing on a particular episode in terms of their current (or at the time) parliamentary status (committees, cabinet positions, APPGs etc etc)?
I’ve been doodling around local spending data again recently, noticing as ever that one of the obvious things to do is pull out payments to a particular company (notwithstanding the very many issues associated with actually identifying a particular company or entities within the same corporate group), and I started wondering about certain classes of public payment that may or may not get classed as spend but that do get spent with particular companies.
One example might be winter fuel payments. I don’t know if these are granted in such a way that they have to be used to cover energy bills (for example, by virtue of being delivered in the form of vouchers that can be redeemed against energy bills), or whether the money is just cash that the recipient can choose to spend howsoever; but if they are so restricted in terms of how they can be used, they represent a way for government to make a payment to an energy company using a level of indirection that means we can’t at first glance see how government makes that payment to the energy company. The “choice” of who receives the payment is up to the consumer, presumably, but it seems to me that it is the government that is essentially making the payment to the energy company as a subsidy for a particular class of customer (as defined by criteria for determining winter fuel payment eligibility).
By not regulating profits made by energy companies more harshly, government presumably supports pricing that requires government to subsidise a significant number of customers. By not regulating prices more harshly, government seems keen to keep giving the energy companies a bung by proxy? I guess the rationale for making the payments this way is that the government can argue that it is acting progressively. If government just gave the energy companies a bung directly, people would get upset: either that the energy companies were being given a chunk of cash for free, or that they were being given a chunk of cash to subsidise the prices they set which would mean that people who could afford the higher price were also benefiting from the deal. How would we feel if, rather than government giving winter payments to those eligible, it just gave the cash straight to the utilities in a transparent way we could track, and required them to identify eligible customers and give them a reduced tariff? Of course, if the winter fuel payments are actually hypothecated, doing it this way would mean that folk currently in receipt of the payments wouldn’t be able to use the money in other ways?
Another area of “spend” that confuses me is the new “Help to Buy” home equity loan scheme, in which the government “provides an equity loan (also known as shared equity) of up to 20% of the value of the home you are buying. … the buyer needs only a 5% deposit, and a 75% mortgage to make up the rest.”
The Help to Buy Equity Loan is interest-free for 5 years. After that, the purchaser pays an annual fee of 1.75% on the amount of the outstanding loan. The fee will increase each year by inflation (Retail Price Index (RPI)) plus 1%.
The purchaser can start repaying the equity loan after they’ve owned the home for a year, but they’ll need to be able to pay a minimum of 10% of the property value at the time of repayment.
When they want to sell their home, they’ll need to repay the percentage equity loan that is still outstanding. So, for example, if they originally bought 80% of the property and they hadn’t repaid any of the equity loan, their repayment on selling would be 20% of the market value at the time when they sell.
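(To make the arithmetic concrete with some made-up numbers: buy a £200,000 home with a 20% equity loan of £40,000; if the home later sells for £250,000 with none of the loan repaid, the amount due back is 20% of £250,000, i.e. £50,000 – the repayment tracks the market value at the time of sale rather than the cash originally advanced.)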
One reading of this might be that folk spend as much as they can afford on a house (and maybe even a little bit more); now some of them may be tempted to spend that much and up to 20% more…? That is, might they see the deal as if they were getting a 20% discount on the house (conveniently forgetting that interest payments will kick in after 5 years?), allowing them to offer more and hence inflate prices more?
What I also wonder about this is: is this the government trying to kickstart a more fluid market in shared ownership on the equity side? I’m guessing that at some point the plan is for the government to flog off the loan book (and presumably then allow interest payments on the loans to float a little more…)? But might there also be an intention to allow individual investors to buy the title to individual equity loans? So rather than investing in a buy to let, individuals would be encouraged to invest in shared ownership schemes from the equity, rather than resident partial owner, side, as an investment?
PS I don’t know about regulatory capture, but policy capture seems like even more of a win for the utilities?! Gas industry employee seconded to draft UK’s energy policy
All I am nowadays is confused… about everything. Take MOOCs (What Are MOOCs (Good For)? I Don’t Really Know…) – folk seem to think that something (I don’t know what) about MOOCs makes sense, but I don’t understand what it is they think is interesting or what it is they think is happening.
In the same way that I never did understand what folk were talking about when OERs (that is, open educational resources) were all the rage in ed tech circles, I really have no idea what they think they’re talking about now MOOC is the de rigueur topic of conversation.
(See for example Bits and Pieces Around OERs… or OERs: Public Service Education and Open Production. I also note that folk tend not to appreciate the value of linking. Or maybe I misunderstand it. Whatever.)
From the scraps of stats that are making it out of odds and sods of some of the online platforms (data is not generally available; data will pay the bills when the marketing spend gets cut back and until the MOOC platform providers start making money from selling analytics and course platform/VLE “solutions” to institutions or eking out affiliate and referral fees from recruiters), it’s hard to know who’s taking the courses and why, and even whether the different platforms are appealing to the same markets.
My gut feeling, in the absence of a proper review, is that folk taking courses from the US MOOCx providers are as likely to have a degree as not (eg Participation And Performance In 8.02x Electricity And Magnetism: The First Physics MOOC From MITx); I have no idea what the demographics of learners signing up for Futurelearn courses are (Futurelearn has far more of a “casual learner”/hobbyist learner (one might even say, “edutainment”…) vibe about it, though it also seems as if it could be positioned quite well as a taster site).
So here are a few of the things I particularly don’t get:
- if advanced courses are attractive to graduates, does that mean there is a gap in the market for courses for graduates? I’ve largely given up trying to convince anyone that universities should do what the banks used to do and treat the first degree as an opportunity to recruit someone for life as part of a lifelong learning package. The professional institutions have traditionally filled this role in the professions, but it’s hard to know how their membership figures are doing? Could/should the universities be signing up their recent graduates to a lifelong learning top-up package, potentially made up from MOOCs provided by their alma mater?
- if graduates like taking courses, why is the OU so keen on a) making it difficult for folk to take individual what-were-called-courses-and-are-now-called-modules? b) pricing individual courses out of the leisure-learner or professional-occasional-top-up market? c) insisting on competing with other universities on their terms rather than breaking open new markets for higher education and widening access to it? (Arguably, FutureLearn is a play at widening access.)
- if MOOCs are going to be important as part of a taster style marketing funnel, how would it be if FutureLearn MOOCs were eligible as additional/alternative courses in the International Baccalaureate (have any MOOC platforms benefitted from PR around such an end-use yet? There are possibly also potential tie-ups there around the provision of invigilated assessment centres?); or received some amount of CAT point credit equivalent that counted towards university applications? Again, something I don’t really understand is why the OU has given up on the Young Applicants in Schools scheme at just the time when it’s starting to compete for 18 year old entry?
As I said, I’m increasingly confused, increasingly don’t understand what’s going on, increasingly don’t see whatever the hell it is that everybody else seems to see as emerging from the latest eduhype.
What’s education good for anyway, when we have the web to hand? Does the web change anything, or nothing? Why did we need universities when we had libraries – and university libraries – with books in them? Why does everybody need a degree? If graduates are the only people who make it to the end of an ‘advanced’ (rather than ‘course taster’) MOOC, what the hell are the universities doing? Why do folk who have become graduates need to take courses when we’ve got the web lying around? What is going on? I just don’t understand…