Archive for the ‘Anything you want’ Category
A quick recipe for extracting images from PDFs…
For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:
Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.
import os,re
import urllib2
#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df = pd.read_csv('shell_30_11_13_ng.csv')

errors = []
#For each line item, grab the PDF link from the Photo column
#(column name assumed from the scraped CSV shown below)
for url in df['Photo']:
    try:
        print 'trying', url
        u = urllib2.urlopen(url)
        fn = url.split('/')[-1]
        #Grab a local copy of the downloaded picture-containing PDF
        localFile = open(fn, 'w')
        localFile.write(u.read())
        localFile.close()
    except:
        print 'error with', url
        errors.append(url)
        continue
    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    id = re.split(r'[_-]', fn)[0]
    #I'm going to move the PDFs and the associated images stripped from them into separate folders
    fo = 'data/' + id
    os.system(' '.join(['mkdir', fo]))
    idp = '/'.join([fo, id])
    #Try to cope with crappy filenames containing punctuation chars
    fn = re.sub(r'([()&])', r'\\\1', fn)
    #THIS IS THE LINE THAT PULLS OUT THE IMAGES
    #pdfimages is available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' bit moves the PDF file into the new JIV report directory
    cmd = ' '.join(['pdfimages -j', fn, idp, '; mv', fn, fo])
    os.system(cmd)

#Still a couple of errors on filenames -
#just as quick to catch by hand/inspection of files that don't get moved properly
print 'Errors', errors
Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng
The important line of code in the above is:
pdfimages -j FILENAME OUTPUT_STUB
FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
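For example, using the PDF linked in the comment at the top of the script, and assuming the data/967426 directory has already been created, the call would look something like this, dropping images out as data/967426/967426-000.jpg and so on:

pdfimages -j 967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf data/967426/967426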
pdfimages can be downloaded as part of poppler (I think?!)
See also this Stack Exchange question/answer: Extracting images from a PDF
PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.
http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:
- generate a post containing images and body text made up from data in the associated line of the CSV file. For example:
Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
Comments: Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf
So we can pull this out for the body post. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:
Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
If those phrases are common/templated refrains, we can parse the corresponding dates out?
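If they are, a quick regex sketch (untested against the full dataset) suggests the dates fall out easily enough:

import re

status = ("Recovery of spilled volume commenced on 6th January 2013 and was "
          "completed on 22nd January 2013. Cleanup of residual impacted area "
          "was completed on 9th May 2013.")

#Pull out each action word along with the day/month/year that follows it
for verb, day, month, year in re.findall(
        r'(commenced|completed) on (\d{1,2})(?:st|nd|rd|th) (\w+) (\d{4})', status):
    print verb, day, month, year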
I should probably also try to pull out the caption text from the image PDF and associate it with a given image? This would be useful for any generated blog post too?
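Pulling those first thoughts together, here’s an untested sketch of what a one-post-per-spill generator might look like using python-wordpress-xmlrpc (the blog URL, credentials, and the idea that we already have a list of image files per spill are all placeholders/assumptions):

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost
from wordpress_xmlrpc.methods.media import UploadFile
from wordpress_xmlrpc.compat import xmlrpc_client

client = Client('http://example.com/xmlrpc.php', 'USER', 'PASSWORD')

def post_spill(row, imagefiles):
    #imagefiles: paths of the images pdfimages pulled out of this spill's PDF
    imgtags = []
    for f in imagefiles:
        data = {'name': f.split('/')[-1], 'type': 'image/jpeg'}
        with open(f, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())
        resp = client.call(UploadFile(data))
        imgtags.append('<img src="%s"/>' % resp['url'])
    post = WordPressPost()
    post.title = '%s: %s' % (row['Date Reported'], row['Incident Site'])
    #Body text from the CSV row, with the extracted images appended
    post.content = row['Clean-up Status'] + ' ' + row['Comments'] + ' ' + ' '.join(imgtags)
    #Use cause/terrain as tags so we can generate views over particular posts
    post.terms_names = {'post_tag': [row['Cause'], row['Terrain']]}
    post.post_status = 'publish'
    client.call(NewPost(post))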
Picking up on Political Representation on BBC Political Q&A Programmes – Stub , the quickest of hacks…
In OpenRefine, create a new project by importing data from a couple of URLs – data from the BBC detailing episode IDs for Any Questions and Question Time:
Import the data as XML, highlighting a single programme code row as the import element.
The data we get looks like this – /programmes/b007ck8s#programme – so we can add a column by URL around 'http://www.bbc.co.uk'+value.split('#')[0]+'.json' to get JSON data back for each row.
Parse the JSON that comes back using something like value.parseJson()['programme']['medium_synopsis'] to create a new column containing the medium synopsis information.
The medium synopsis elements typically look like “Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone.” Which is to say, they often contain the names of the panellists.
We can try to extract the names contained within each synopsis using the Zemanta API (key required) accessed via the Named-Entity Recognition extension for Google Refine / OpenRefine.
These seem to come back in reconciliation API form with the name set to a name and the id to a URL. We can get a concatenated list of the URLs that are returned by creating a column around something like this: forEach(cell.recon.candidates,v,v.id).sort().join('||') but I’m not sure that’s useful.
We can create a column based just around the matched name using cell.recon.match.name.
Let’s use the row view and fill down on programme IDs, then have a look at a duplicate facet and view only rows that are duplicated (that is, where an extracted named entity appears more than once). We can also use a text facet to see which names appear in multiple episodes of Question Time and/or Any Questions.
Selecting a single name allows us to see the programmes that person appeared on. If we pull out the time of first broadcast (value.parseJson()['programme']['first_broadcast_date']) and apply Edit Cells – Common Transforms – To date, we can also use a date facet to select out programmes first broadcast within a particular date range.
We can also run a text filter to limit records to episodes including a particular person and then use the Date facet to highlight the episodes in which they appeared on the timeline:
What this suggests is that we can use OpenRefine as some sort of ‘application shell’ for creating information tools around a particular dataset without actually having to build UI components ourselves?
If we custom export a table using programme IDs and matched names, and then rename the columns Source and Target, we can visualise them in something like Gephi (you can use the recipe described in the second part of this walkthrough: An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine).
The directed graph we load into Gephi connects entities (participant names, location names) with programme IDs. There is a handy tool – Multimode Networks Projection – that can collapse the graph so that entities are connected to other entities with which they shared a programme ID.
(If you forget to remove the programme nodes, a degree range filter to select only nodes with degree greater than 2 tidies the graph up.)
If we run PageRank on the graph (now as an undirected graph), layout using ForceAtlas2 and size nodes according to PageRank, we can look into the heart of the UK political establishment as evidenced by appearances on Question Time and Any Questions.
The next step would probably be to try to pull info about each recognised entity from dbPedia (forEach(cell.recon.candidates,v,v.id).sort() seems to pull out dbpedia URIs) but grabbing data from dbPedia seems to be borked in my version of OpenRefine atm:-(
Anyway – a quick hack that took longer to write up than it did to do…
OpenRefine project file here.
It’s too nice a day to be inside hacking around with Parliament data as a remote participant in today’s Parliamentary hack weekend (resource list), but if it had been a wet weekend I may have toyed with one of the following:
- revisiting this cleaning script for Analysing UK Lobbying Data Using OpenRefine (actually, a look at who funds/offers support for All Party Groups. The idea was to get a dataset of people who provide secretariat and funds to APPGs, as well as who works for them, and then do something with that dataset…)
- tinkering with data from Question Time and Any Questions…
On that last one:
This gives us generatable URLs for programmes by month, with URLs of the form http://www.bbc.co.uk/programmes/b006t1q9/broadcasts/2013/01 – but how do we get a JSON version of that?! Adding .json on the end doesn’t work?!:-( UPDATE – this could be a start, via @nevali – use the pattern /programmes/PID.rdf, such as http://www.bbc.co.uk/programmes/b006qgvj.rdf
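At least the month pages can be generated programmatically – a trivial sketch:

#Generate the by-month broadcast listing URLs for a programme
pid = 'b006t1q9'  #Any Questions?
for year in range(2010, 2014):
    for month in range(1, 13):
        print 'http://www.bbc.co.uk/programmes/%s/broadcasts/%d/%02d' % (pid, year, month)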
We can get bits of information (albeit in semi-structured form) about panellists in data form from programme URL hacks like this: http://www.bbc.co.uk/programmes/b007m3c1.json
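Mirroring the parseJson() recipe from the OpenRefine hack above (same keys), a quick Python check of one of those pages:

import urllib2, json

data = json.load(urllib2.urlopen('http://www.bbc.co.uk/programmes/b007m3c1.json'))
print data['programme']['medium_synopsis']
print data['programme']['first_broadcast_date']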
Note that some older programmes don’t list all the panellists in the data? So a visit to Wikipedia – http://en.wikipedia.org/wiki/List_of_Question_Time_episodes#2007 – may be in order for Question Time (there isn’t a similar page for Any Questions?)
Given panellists (the BBC could be more helpful here in the way it structures its data…), see if we can identify parliamentarians (MP suffix? Lord/Lady title?) and look them up using the new-to-me, not-yet-played-with-it UK Parliament – Members’ Names Data Platform API. Not sure if reconciliation works on parliamentarian lookup (indeed, not sure if there is a reconciliation service anywhere for looking up MPs, members of the House of Lords, etc?)
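As a crude first pass at spotting parliamentarians in a panellist list, something like the following heuristics might do (and they are only heuristics – titles prove nothing, so names would still need checking against the Members’ Names API):

import re

def maybe_parliamentarian(name):
    #Crude heuristics only: an 'MP' suffix, or a Lords-style title
    name = name.strip()
    if re.search(r'\bMP$', name):
        return True
    if re.match(r'^(Lord|Lady|Baroness|Baron)\b', name):
        return True
    return False

for n in ['Peter Hain MP', 'Lord Ashdown', 'Beverley Knight']:
    print n, maybe_parliamentarian(n)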
From Members’ Names API, we can get things like gender, constituency, whether or not they were holding a (shadow) cabinet post, maybe whether they were on a particular committee at the time etc. From programme pages, we may be able to get the location of the programme recording. So this opens up possibility of mapping geo-coverage of Question Time/Any Questions, both in terms of where the programmes visit as well as which constituencies are represented on them.
If we were feeling playful, we could also have a stab at which APPGs have representation on those programmes!
It also suggests a simpler hack – of just providing a briefing around the representatives appearing on a particular episode in terms of their current (or at the time) parliamentary status (committees, cabinet positions, APPGs etc etc)?
I’ve been doodling around local spending data again recently, noticing as ever that one of the obvious things to do is pull out payments to a particular company (notwithstanding the very many issues associated with actually identifying a particular company or entities within the same corporate group), and I started wondering about certain classes of public payment that may or may not get classed as spend but that do get spent with particular companies.
One example might be winter fuel payments. I don’t know if these are granted in such a way that they have to be used to cover energy bills (for example, by virtue of being delivered in the form of vouchers that can be redeemed against energy bills), or whether the money is just cash that the recipient can choose to spend howsoever; but if they are so restricted in terms of how they can be used they represent a way for government to make a payment to an energy company using a level of indirection that means we can’t at first glance see how government makes that payment to the energy company. The “choice” of who receives the payment is up to the consumer, presumably, but it seems to me to be that it is the government that is essentially making the payment to the energy company as a subsidy for a particular class of customer (as defined by criteria for determining winter fuel payment eligibility).
By not regulating profits made by energy companies more harshly, government presumably supports pricing that requires government to subsidise a significant number of customers. By not regulating prices more harshly, government seems keen to keep giving the energy companies a bung by proxy? I guess the rationale for making the payments this way is that the government can argue that it is acting progressively. If government just gave the energy companies a bung directly, people would get upset: either that the energy companies were being given a chunk of cash for free, or that they were being given a chunk of cash to subsidise the prices they set which would mean that people who could afford the higher price were also benefiting from the deal. How would we feel if, rather than government giving winter payments to those eligible, it just gave the cash straight to the utilities in a transparent way we could track, and required them to identify eligible customers and give them a reduced tariff? Of course, if the winter fuel payments are actually hypothecated, doing it this way would mean that folk currently in receipt of the payments wouldn’t be able to use the money in other ways?
Another area of “spend” that confuses me is the new “Help to Buy” home equity loan scheme, in which the government “provides an equity loan (also known as shared equity) of up to 20% of the value of the home you are buying. … the buyer needs only a 5% deposit, and a 75% mortgage to make up the rest.”
The Help to Buy Equity Loan is interest-free for 5 years. After that, the purchaser pays an annual fee of 1.75% on the amount of the outstanding loan. The fee will increase each year by inflation (Retail Price Index (RPI) + 1%).
The purchaser can start repaying the equity loan after they’ve owned the home for a year, but they’ll need to be able to pay a minimum of 10% of the property value at the time of repayment.
When they want to sell their home, they’ll need to repay the percentage equity loan that is still outstanding. So, for example, if they originally bought 80% of the property and they hadn’t repaid any of the equity loan, their repayment on selling would be 20% of the market value at the time when they sell.
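To make the arithmetic concrete (hypothetical numbers):

#Hypothetical worked example of the equity loan rules described above
purchase_price = 200000.0
equity_loan = 0.20 * purchase_price  #government stake: 40,000

#After year 5, an annual fee of 1.75% of the outstanding loan kicks in
year6_fee = 0.0175 * equity_loan     #700 in the first fee-paying year

#Repayment on sale is the outstanding *percentage* of the sale price,
#not the original cash amount, so the repayment rises with the house price
sale_price = 250000.0
repayment = 0.20 * sale_price        #50,000
print year6_fee, repayment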
One reading of this might be that folk spend as much as they can afford on a house (and maybe even a little bit more); now some of them may be tempted to spend that much and up to 20% more…? That is, might they see the deal as if they were getting a 20% discount on the house (conveniently forgetting that interest payments will kick in after 5 years?), allowing them to offer more and hence inflate prices more?
What I also wonder about this is: is this the government trying to kickstart a more fluid market in shared ownership on the equity side? I’m guessing that at some point the plan is for the government to flog off the loan book (and presumably then allow interest payments on the loans to float a little more…)? But might there also be an intention to allow individual investors to buy the title to individual equity loans? So rather than investing in a buy to let, individuals would be encouraged to invest in shared ownership schemes from the equity, rather than resident partial owner, side, as an investment?
PS I don’t know about regulatory capture, but policy capture seems like even more of a win for the utilities?! Gas industry employee seconded to draft UK’s energy policy
If you’re a Google account holder, you may have noticed an announcement recently that Google has changed its terms and conditions, in part to allow it to use your +1s and comments as “shared endorsements” in ads published through Google ad services.
So it seems as if there are now at least two ways Google uses you, me, us, to generate revenue in an advertising context. Firstly, we’re sold as “audience” within a particular segment: “35-50 males into tech”, for example, an audience that advertisers can buy access to. This may even get to the level of individual targeting (for example, Centralising User Tracking on the Web – Let Google Track Everyone For You). Now, secondly, as personal endorsers of a particular company, service or product.
The ‘recent changes’ announcement URL looks like a general “change notice” URL – https://www.google.co.uk/intl/en/policies/terms/changes/ – so I’ll repost key elements from the announcement here….
“Because many of [us] are allergic to legalese”, the announcement goes, “here’s a plain English summary for [our] convenience.”
We’ve made three changes:
Firstly, clarifying how your Profile name and photo might appear in Google products (including in reviews, advertising and other commercial contexts).
You can control whether your image and name appear in ads via the Shared Endorsements setting.
Secondly, a reminder to use your mobile devices safely.
Thirdly, details on the importance of keeping your password confidential.
The first change – how my Profile name and photo might appear in Google products – is the one I’m interested in.
How your Profile name and photo may appear (including in reviews and advertising)
We want to give you, and your friends and connections, the most useful information. Recommendations from people that you know can really help. So your friends, family and others may see your Profile name and photo, and content like the reviews that you share or the ads that you +1’d. This only happens when you take an action (things like +1’ing, commenting or following) – and the only people who see it are the people that you’ve chosen to share that content with. On Google, you’re in control of what you share. This update to our Terms of Service doesn’t change in any way who you’ve shared things with in the past or your ability to control who you want to share things with in the future.
Feedback from people you know can save you time and improve results for you and your friends across all Google services, including Search, Maps, Play and in advertising. For example, your friends might see that you rated an album 4 stars on the band’s Google Play page. And the +1 you gave your favourite local bakery could be included in an ad that the bakery runs through Google. We call these recommendations shared endorsements and you can learn more about them here.
When it comes to shared endorsements in ads, you can control the use of your Profile name and photo via the Shared Endorsements setting.
Here’s a direct link to the setting… [if you have a Google+ account, I suggest you go there, uncheck the box, and hit "Save"]. I never knowingly checked this – so presumably the default is set to checked (that is, with me opted in to the “service”?)
If you turn the setting to “off,” …
you’ll get hassled:
or to put it another way,
…your Profile name and photo will not show up on that ad for your favourite bakery or any other ads. This setting only applies to use in ads, and doesn’t change whether your Profile name or photo may be used in other places such as Google Play.
I have no idea what the context of Google Play might mean. I do have a Google Android phone, and it is tied to a Google account. It is largely a mystery to me, particularly when it comes to knowing who has access to – or has taken copies of – my contacts. I have no idea what Google Play services I have or have not been opted in to.
If you previously told Google that you did not want your +1’s to appear in ads, then of course we’ll continue to respect that choice as a part of this updated setting.
I’m not sure what that means? If I’ve checked “do not want my +1’s to appear in ads” box, will the current setting be set to unchecked (opt out of shared endorsements)? Does the original setting still exist somewhere, or has it been replaced by the new setting? Or is there another level of privacy setting somewhere, and if so how do the various levels interact?
This is on my current Google+ settings page:
and I can’t see anything about +1 ad opt-outs, so presumably the setting has changed? I’d have thought I’d have opted out of allowing +1s to appear in ads (had I known: a) that +1s may have been used in ads; and b) that such a setting existed), but presumably that fact passed me by (more on this later in the post…) Or I had opted out and the opt-out wasn’t respected? But surely not that…?
For users under 18, their actions won’t appear in shared endorsements in ads and certain other contexts.
Which is to say, ‘if you lied about your age in order to access particular services, we’re gonna sell the ability for advertisers to use you to endorse their products to your friends’.
So that’s the “helpful” explanation of the terms… what do the actual terms say?
When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide licence to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes that we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights that you grant in this licence are for the limited purpose of operating, promoting and improving our Services, and to develop new ones. This licence continues even if you stop using our Services (for example, for a business listing that you have added to Google Maps). Some Services may offer you ways to access and remove content that has been provided to that Service. Also, in some of our Services, there are terms or settings that narrow the scope of our use of the content submitted in those Services. Make sure that you have the necessary rights to grant us this licence for any content you submit to our Services. [This para, or one very much like it, is in the current terms.]
If you have a Google Account, we may display your Profile name, Profile photo and actions you take on Google or on third-party applications connected to your Google Account (such as +1’s, reviews you write and comments you post) in our Services, including displaying in ads and other commercial contexts. We will respect the choices you make to limit sharing or visibility settings in your Google Account. For example, you can choose your settings so that your name and photo do not appear in an ad.
Hmmm.. so maybe the settings do – or will – have a finer level of control (and complexity…) associated with them? I wonder also whether those two paragraphs can work together? If I comment on a Google+ page, or maybe tag a brand or product in an image I have uploaded, could Google create a derivative work as part of a shared endorsement by me?
Looking Around Some Other Google+ Settings
Finding myself on my Google+ settings page, I had a look at some of the other settings…
Hmm… this could be an issue, if checked? If things are shared to people in my circles, and folk get automatically added to my circles if I just search for them, then, erm, I could maybe unwittingly opt a page in to my circles?
But if I do search for someone and they’re added to my circles on my behalf, what circle are they added to?
Not being paranoid or anything, but I can now also imagine something like the following setting appearing on my main Google account insofar as it relates to search, for example:
Google Search Pages
_ Automatically add a Google+ Author to my circles if I click through on a search result marked with a Google+ Author tag.
So what other settings are there that may be of interest?
Several to do with automatically tampering with my content (as if false memory syndromes aren’t bad enough!)
I seem to remember these being announced, but didn’t think to check that I would automatically be opted in.
Note to self: When Google announces a new Google+ service, or service related to Google accounts, assume I get automatically opted in.
Any others? Ah, ha… a little something that invisibly enmeshes me a little deeper in the Google knowledge web:
Here’s the blurb, rather bluntly entitled Find My Face: “Find my face makes finding pictures and videos of you easy and more social. Find my face offers name tag suggestions so you, or people that you know, can quickly tag photos. Any time someone tags you in a photo or video, you’ll be able to accept or reject name tags created by people you know.”
So I’m guessing if I opt in to this, if Google recognises that I’m in a photo, and someone I know views that photo, they’ll be prompted to tag me in it. I wonder if Google actually has a belief graph and a knowledge graph? In the first case, the belief graph would associate me with photos Google’s algorithms think I’m in. In the second case, the knowledge graph, Google would associate me with photos where someone confirms that I am in the photo. If you want to get geeky, this knowledge vs. belief distinction, where knowledge means “justified true belief”, has a basis in things like epistemic logic (which I came across in the context of agent logics) – I’d never really thought about Google’s graph in this way… Hmmm…
Here’s how it works, apparently:
After you turn on Find my Face, Google+ uses the photos or videos you’re tagged in to create a model of your face. The model updates as tags of you are added or removed and you can delete the entire face model at any time by turning off Find my Face.
If you turn on Find my Face, we can use your face model to make it easier to find photos or videos of you. For example, we’ll show a suggestion to tag you when you or someone you know looks at a photo or video that matches your face model. Name tag suggestions by themselves do not change the sharing setting of photos or albums or videos. However, when someone approves the suggestion to add a name tag, the photo and relevant album or video are shared with the person tagged.
So can Google sell that face model of me to other parties? Or just sell recognition of my face in photos and videos as a service, or as part of an audience construction process?
I guess at least I get to approve any photo tags though… Or do I?
So if I search for someone on Google+, they’re added to my circles, which means that if they tag me in a photo when prompted by Google+ to do so, their tag is automatically accepted by me by virtue of this proxy setting I seem to have been automatically opted in to? Or am I reading these settings all wrong?
Ho hum, I guess it’s not even the legalese I’m allergic to… it’s understanding the emergent complexity and consequences that arise from different combinations of settings on personal account pages…
[An old post, rescued from the list of previously unpublished posts...]
Although I use OpenRefine from time to time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.
The setting actually comes from OpenOil’s Johnny West:
1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions
Let’s see how far we can get…
To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:
- a company ID lookup can be constructed around
http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post; the correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t Larssen in the comments.]
- http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.
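(For anyone who’d rather script it than Refine it, a rough Python sketch of that lookup chain, using the URL patterns above – treat it as illustrative rather than robust:)

import urllib, urllib2, re

name = 'Bayerngas Norge AS'
#Search the Norwegian register for the company name (URL pattern as above)
params = urllib.urlencode({'navn': name, 'orgform': '0', 'fylke': '0',
                           'kommune': '0', 'barebedr': 'false'})
html = urllib2.urlopen(
    'http://w2.brreg.no/enhet/sok/treffliste.jsp?' + params).read()
#The result links carry the company number in a detalj.jsp?orgnr= fragment
m = re.search(r'detalj\.jsp\?orgnr=(\d+)', html)
if m:
    orgnr = m.group(1)
    print name, '->', orgnr
    #...which plugs straight into purehelp
    print 'http://www.purehelp.no/company/details/' + orgnr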
Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:
So that’s step 1…
We can run step 2 as follows* – create a new column from the company column:
* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…
Here’s how we construct the URL:
The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require: specifically, the fragment detalj.jsp?orgnr= in the href attributes of the result links.
Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:
[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() ].join('::')
The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.
The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() – pulls out the company name from the first search result.
We place these two parts into an array and then join them with two colons: .join('::')
This keeps things tidy and allows us to check by eye that sensible company names have been found from the original search strings.
We can now split the name/ID pair column into two separate columns:
And the result:
The next step, step 3, requires looking up the company IDs on purehelp. We’ve already seen how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:
(We could probably reduce the throttle time by an order of magnitude!)
The next step, step 4, is to pull out shareholders and their percentages.
The first step is to grab the shareholder table and each of the rows, which in the original looked like this:
The following hack seems to get us the rows we require:
BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(
So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(
Need to search into the table where “Shareholders” is specified in the top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was using, because the HTML of the page is really horrible. Bah…. #ffs:-(
Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):
<div id="companyOwnership" class="box">
  <div class="boxHeader">Shareholders:</div>
  <div class="boxContent">
    <div class="row rowHeading">
      <label class="fl" style="width: 70%;">Company name:</label>
      <label class="fl" style="width: 30%;">Percentage share (%)</label>
      <div class="cb"></div>
    </div>
    <div class="row odd">
      <label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
      <div class="fr" style="width: 30%;">100.00%</div>
      <div class="cb"></div>
    </div>
  </div>
</div>
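(If we were scripting this rather than GREL’ing it, picking the right box is straightforward enough – an untested BeautifulSoup sketch, assuming the HTML structure shown above:)

from bs4 import BeautifulSoup

def shareholders(html):
    soup = BeautifulSoup(html)
    rows = []
    #A page may have more than one companyOwnership box - keep only the one
    #whose header actually says 'Shareholders'
    for box in soup.find_all('div', id='companyOwnership'):
        header = box.find('div', class_='boxHeader')
        if header is None or 'Shareholders' not in header.get_text():
            continue
        for row in box.find_all('div', class_='row'):
            if 'rowHeading' in row.get('class', []):
                continue  #skip the title row
            label = row.find('label')
            pct = row.find('div', class_='fr')
            if label and pct:
                rows.append((label.get_text(strip=True), pct.get_text(strip=True)))
    return rows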
For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:
We can then generate one row for each shareholder in OpenRefine:
(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given its own row, which we’ll have to cope with at some point. But again, no need right now.)
For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…
value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')
Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):
We then need to tidy the two columns:
Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw away the info that links to the company record where it is there…
value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()
We now need to do a bit more tidying – fill down on the empty cells in the shareholder company column, and also in the original company name and ID columns [actually - this is not right, as we can see below for the line Altinex Oil Norway AS...? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet, then select out blank and Fimanavn).
This is what we get:
We can then export this file, before considering steps 5 and 6, using the custom exporter:
Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…
And then select the export format:
Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). Looking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.
Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments... The original link will be the one used in the OpenRefine project...]
Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? Also note that “Andre” is Norwegian for “Other”… maybe replace this too?
In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.
Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…
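A quick check via the OpenCorporates API seems to confirm it (the response layout here is an assumption from memory of the API docs, so treat as a sketch):

import urllib2, json

#Look up the Norwegian company number directly on OpenCorporates
data = json.load(urllib2.urlopen(
    'http://api.opencorporates.com/companies/no/989490168'))
print data['results']['company']['name']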
This means we possibly have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?
A search on the DWP website for “sharing data local authority” ended up on a search results page with the URL http://dwp.gov.uk.master.com/texis/master/search/?q=sharing+data+local+authority&s=SS:
(The results appear to be broken – on the first link at least, the redirect from the results page goes to a largely irrelevant link on the new gov.uk site.)
I note a pre-announced intention from the Justice Data Lab that they will publish “[t]ailored reports pertaining to the re-offending outcomes of services or interventions delivered by organisations who have requested information through the Justice Data Lab. Each report will be an Official Statistic.”
If you haven’t been keeping up, the Justice Data Lab is a currently free, one-year pilot scheme (started April 2013) in which “a small team from Analytical Services within the Ministry of Justice (the Justice Data Lab team) will support organisations that provide offender services by allowing them easy access to aggregate re-offending data specific to the group of people they have worked with” [User Journey].
Here’s how the user journey doc describes the process…
…and the methodology:
which is also described in the pre-announcement doc as follows:
Participating organisations supply the Justice Data Lab with details of the offenders who they have worked with, and information about the services they have provided. As standard the Justice Data Lab will supply aggregate one year proven re-offending rates for that group, and that of a matched control group of similar offenders. The re-offending rates for the organisation’s group and the matched control group will be compared using statistical testing to assess the impact of the organisation’s work on reducing re-offending. The results will then be returned to the organisation in a clear and easy to understand format, with explanations of the key metrics, and any caveats and limitations necessary for interpretation of the results.
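The methodology note doesn’t say which statistical test is used, but purely as an illustration of the “compare rates against a matched control group” step, a two-proportion z-test with made-up numbers might look something like this:

from math import sqrt, erf

def two_prop_ztest(x1, n1, x2, n2):
    #Compare reoffending proportions x1/n1 (programme group) and x2/n2 (controls)
    p1, p2 = x1 / float(n1), x2 / float(n2)
    p = (x1 + x2) / float(n1 + n2)  #pooled proportion
    z = (p1 - p2) / sqrt(p * (1 - p) * (1.0/n1 + 1.0/n2))
    #Two-sided p-value from the normal CDF
    pval = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, pval

#e.g. 30/120 reoffended in the programme group vs 45/120 matched controls
print two_prop_ztest(30, 120, 45, 120)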
The pre-announcement suggests that participating organisations will not only receive a copy of the report, but so will the public… The rationale:
The Justice Data Lab pilot is free at the point of service, paid for through the Ministry of Justice budget. The Ministry of Justice therefore has a duty to act transparently and openly about the outcomes of this initiative. It is anticipated that by making this information available in the public domain, organisations that work with offenders will have a greater evidence base about what works to rehabilitate offenders, and ultimately cut crime.
(Nice to see the MoJ believes in transparency. Shame that doesn’t go as far as timely spending data transparency, but I guess we can’t have it all…)
I think it’s worth taking notice of this pre-announcement for a few reasons:
- are such data release mechanisms the result of lobbying pressure? Other government departments have datalabs, such as the HMRC datalab. HMRC recently ran a consultation on the release of VAT registration information as opendata, although concerns have been raised that this may just be a shortcut way of releasing company VAT registration data to credit rating agencies and their ilk…?, so it seems as if they are looking at what data they may be able to open up, and how, maybe in response to lobbying requests from corporate players who don’t want to have to (pay to) collect the data themselves…? Who might have lobbied the MoJ for the results of MoJ datalab requests to be opened up as public data, I wonder?
- are the results gameable, or might they be used as a tool to “attack” a group that is the basis of a research request? For example, can third parties request that the MoJ datalab runs an analysis on the effectiveness of a programme carried out by another party, such as, I dunno, G4S?
- the ESRC is in the process of a multi-stage funding round that will establish a range of research data centres. The first round, to establish a series of Administrative Data Research Centres has now closed (who won?!) and the second – for Business and Local Government Data Research Centres – is currently open. (Phase three will focus on “Third Sector data and social media data”…wtf?!) To what extent might any of the funded research data centres require that summaries of analyses run using datasets they control access to are released as public open data?
Just by the by, I note here the RCUK Common Principles on Data Policy:
Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property.
Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long-term value should be preserved and remain accessible and usable for future research.
To enable research data to be discoverable and effectively re-used by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data.
RCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process.
To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake Research Council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils.
In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.
It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds.
You probably can’t help but have noticed (in the EU at least) that website operators seem keen to gain your permission to pop “cookies” into your browser. Cookies are tiny computer files that a website can use to store information about you on your own computer. To prevent nasty people doing nasty things, the security policies operated by your browser try to ensure that only the website that wrote a cookie can read it back.
If enough people adopt a particular third party service, that service may be able to pick up quite a good idea about the range of sites you visit. Google’s various ad’n’analytics services in principle allow it to track you across a wide range of sites, because those services are so widely used, though the extent to which Google does or does not fuse data from the cookies associated with those various services may be moot…
One thing I hadn’t realised (or maybe, hadn’t really thought about) before was brought to my attention when something else that was new to me crossed my radar the other day: real time bidding on web adverts, the architecture for which is broadly described by Shuai Yuan, Jun Wang and Xiaoxue Zhao in their paper Real-time Bidding for Online Advertising: Measurement and Analysis as follows:
The model is crudely this: when you visit a web site, the publisher alerts the advertisers that someone has landed on the webpage. Through various cookie machinations, the publisher (and/or the advertiser) may be able to identify you, or certain things about you, from the various cookies on your machine. The advertisers decide what you’re worth and bid to place the advert. The publisher accepts a bid from one of the advertisers and pops the ad into the page you’re visiting. Sort of. (The publisher in this case is more likely to be an ad marketplace/broker, rather than the webpage publisher.)
So that was new to me – realtime bidding. The world’s gone mad. Anyway. As a result of that, I suddenly appreciated the creepy bit in the image above, in step 4: “advertisers choose to buy 3rd party data optionally”. That is, advertisers – in real time – may buy cookie mediated information about people who are in the process of loading a particular web page – in order to work out a bid price for placing an ad within that page to present to that person. Personal advertising, in real time. If data from other (non-web) sources can be added into the mix, perhaps because someone has been uniquely identified, then presumably all the better… for example.
To help create a better picture of the person who is actually opening up a web page, and to piece together all those fractional bits of information that separate web domains can place into your browser through cookies that they – and only they – can read and write, “cookie matching” services, such as the one run by Google’s DoubleClick Ad Exchange, provide a means by which various parties can pool together, or sell between each other, what they know about an individual from the cookies they have independently set in that individual’s browser. (For a description of one recipe for matching cookies, see SSP to DSP Cookie-Synching Explained.)
I guess I knew this happened anyway (it’s part of the basis for ad retargeting – aka ads that follow you round the web), but I hadn’t realised quite how sharey-sharey, selly-selly and real time it all was.
So we’re being tracked and info about us being sold in real-time as we traipse around the web. But we know that anyway, and we don’t seem to let it bother us.
How about real world tracking, though? Are we as happy being tracked as we walk around in physical space too? It seems so – and the technology appears to be getting so mundane… Through my feeds yesterday I was led to MFlow Journey, a product of a company not so subtly called Human Recognition Systems, that uses video surveillance to capture and follow “anonymous” faces to track human traffic flows through airports. Human tracking is nothing new of course – your mobile operator can track your phone as easy as peasy can be, and if you have wifi enabled on any of the devices you’re carrying around with you, anyone who cares to can track you too. With the click of a button, apparently (review of a typical Euclid analytics dashboard). (See also: New York Times, Attention, Shoppers: Store Is Tracking Your Cell). Apple seems to be doing its bit to make retail centre tracking easier too.
So that’s faces and phones… number plates can be trivially tracked too, of course. Here’s a recent (January 2013) ACPO report on The police use of Automatic Number Plate Recognition (the vignettes at the start of the report are illustrative of just what the millions of rows of data in the database allow you to pick out about a particular individual; other operational examples are described in this IPCC Independent Investigation into the use of ANPR in Durham, Cleveland and North Yorkshire from 23 – 26 October 2009 (summarising press release)); see also the ACPO 2009 Practice Advice on the Management and Use of Automatic Number Plate Recognition, the National ACPO ANPR Standards (NAAS) v.4.12 Nov 2011 and this Memorandum of Understanding to Support Access to ANPR Data, v.2 Feb 2011. A recent bollocking from the ICO (Police use of ‘Ring of Steel’ is disproportionate and must be reviewed) suggests that popping ANPR cameras on all roads in and out of a town is just not on, but I guess this is limited to police deployed cameras, and doesn’t necessarily address the mosaic pictures that you can build up from piecing together ANPR hits wherever you can pick them up from…
…because as well as ANPR systems operated by the police, ANPR is widely used by private companies (though I’m not sure about the extent to which they do, or may be obliged to, share their logs or data collection facilities with the police?) For an idea of what sorts of ANPR “solutions” are available, here’s a list of approved car parking operators with some handy metadata that shows whether they use ANPR or not.
Camera surveillance is not just limited to ANPR systems, of course, as any precinct bench loitering yoofs will be able to tell you. Just what is and isn’t deemed acceptable generally is described by the recent (August 2013) surveillance camera code of practice (press release).
Hey ho – it’s got me wondering now what other pieces of the panopticon are already in place?
A couple of weeks ago I spotted a BBC news article announcing that a University launches online course with TV show:
In what is being claimed as the biggest ever experiment in “edutainment”, a US television company is forming a partnership with a top-ranking Californian university to produce online courses linked to a hit TV show.
This blurring of the digital boundaries between academic study and the entertainment industry will see a course being launched next month based on the post-apocalypse drama series the Walking Dead.
Television shows might have spin-off video games or merchandising, but this drama about surviving disaster and social collapse is now going to have its own university course.
The OU has supplemented courses with material from TV broadcasts for several decades, and has also wrapped factual programming with OU courses. We’ve even commissioned drama pieces that have been woven into OU courses. But something about wrapping Hollywood hype also seemed familiar… and then I remembered Hollywood Science. But it’s not available on iPlayer, unfortunately, and I don’t think it went to DVD either…which makes this all something of a non-post!