OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Anything you want’ Category

The Artist’s Role…

As far as [Duchamp] was concerned, the role in society of an artist was akin to that of a philosopher; it didn’t even matter if he or she could paint or draw. An artist’s job was not to give aesthetic pleasure – designers could do that; it was to step back from the world and attempt to make sense or comment on it through the presentation of ideas that had no functional purpose other than themselves.

– Will Gompertz, What Are You Looking At? p. 10

Written by Tony Hirst

January 13, 2014 at 10:12 pm

Posted in Anything you want

So You Want to be a Data Journalist? Current Opportunities

Every so often I do a round up of job openings in different areas. This is particularly true around year end, as I look at my dwindling salary (no more increments, ever, and no hope of promotion, …) and my overall lack of direction, and try to come up with some sort of resolution to play with during the first few weeks of the year.

The data journalism phrase has been around for some time now (was it really three and a half years ago since the Data Driven Journalism Round Table at the European Journalism Centre? FFS, what have I been doing for the last three years?! :-( ) and it seems to be maturing a little. We’ve had the period of shiny, shiny web apps requiring multiskilled development teams and designers working with the hacks to produce often confusing, “wtf am I supposed to be doing now?!” interactives, and things seem to be becoming a little more embedded… Perhaps…

My reading (as an outsider) is that there is now more of a move towards developing some sort of data skillbase that allows journalists to do “investigative” sorts of things with data, often using very small data sets or concise summary datasets. To complement this, there seems to be some sort of hope that visually appealing charts can be used to hook eyeballs into a story (rather than pushing eyeballs away) – Trinity Mirror’s Ampp3d (as led by Martin Belam) is a good example of this, as is the increasing(?) use of the DataWrapper library.

From working with the School of Data, as well as a couple of bits of data journalism not-really-training with some of the big news groups, I’ve come to realise there is probably some really basic, foundational work to be done in the way people think (or don’t think) about data. For example, I don’t believe that people in general read charts. I think they may glance at them, but they don’t try to read them. They have no idea what story they tell. Given a line chart that plots some figure over time, how many people ride the line to get a feel for how it really changed?

Hans Rosling famously brings data alive with his narrative commentary around animated development data charts, including bar charts…

But if you watch the video with the sound off, or just look at the final chart, do you have the feeling of being told the same story? Can you even retell yourself the story by looking at the chart? And how about if you look at another bar chart? Can you use any of Hans Rosling’s narrative or rhetorical tricks to help you read through those?

(The rhetoric of data (and the malevolent arts of persuasion) is something I want to ponder in more depth next year, along with the notion of data aesthetics and the theory of forms given a data twist.)

Another great example of narrated data storytelling comes from Kurt Vonnegut as he describes the shapes of stories:

Is that how you read a line chart when you see one?

One thing about the data narration technique is that it is based around the construction of a data trace. There is a sense of anticipation about where the line will go next, and uncertainty as to what sort of event will cause the line to move one way or another. Looking back at a completed data chart, what points do we pick from it that we want to use as events in our narration or reading of it? (The lines just connect the points – they are processional in the way they move us from one point of interest to the next, although the gradient of the line may provide us with ideas for embellishing or decorating the story a little.)

It’s important to make art because the people that get the most out of art are the ones that make it. It’s not … You know there’s this idea that you go to a wonderful art gallery and it’s good for you and it makes you a better person and it informs your soul, but actually the person who’s getting the most out of any artistic activity is the person who makes it because they’re sort of expressing themselves and enjoying it, and they’re in the zone and you know it’s a nice thing to do. [Grayson Perry, Reith Lectures 2013, Lecture 2, Q&A [transcript, PDF], audio]

In the same way, the person who gets the most out of a chart is the person who constructed it. They know what they left in and what they left out. They know why the axes are selected as they are, why elements are coloured or sized as they are. They know the question that led up to the chart and the answers it provides to those questions. They know where to look. Like an art critic who reads their way round a painting, they know how to read one or many different stories from the chart.

The interactives that appeared during the data journalism wave from a couple of years ago sought to provide a playground for people to play with data and tell their own stories with it. But they didn’t. In part because they didn’t know how to play with data, didn’t know how to use it in a constructive way as part of a narrative (even a made up, playful narrative). And in part this comes back to not knowing how to read – that is, recover stories from – a chart.

It is often said that a picture saves a thousand words, but if the picture tells a thousand word story, how many of us try to read that thousand word story from each picture or chart? Maybe we need to use a thousand words as well as the chart? (How many words does Hans Rosling use? How many, Kurt Vonnegut?)

When producing a chart that essentially represents a summary of a conversation we have had with a dataset, it’s important to remember that for someone looking at the final chart it might not make as much sense in the absence of the narrative that was used to construct it. Edward de Bono’s constructed illustrations help us read the final image through recalling his narrative. But if we just look at a “completed” sketch from one of his talks, it will probably be meaningless.

One of the ideas that works for me when I reflect on my own playing with data is that it is a conversation. Meaning is constructed through the conversation I have with a dataset, and the things it reveals when I pose particular questions to it. In many cases, these questions are based on filtering a dataset, although the result may be displayed in many ways. The answers I get to a question inform the next question I want to ask. Questions take the form of constructing this chart as opposed to that chart, though I am free to ask the same question in many slightly different ways if the answers don’t appear to be revealing of anything.

It is in this direction – of seeing data as a source that can be interviewed and coaxed into telling stories – that I sense elements of the data journalism thang are developing. This leads naturally into seeing data journalism skills as core investigative style skills that all journalists would benefit from. (Seeing things as data allows you to ask particular sorts of question in very particular ways.) Being able to cast things into a data form (as, for example, in Creating Data from Text – Regular Expressions in OpenRefine), so that they become amenable to data-style queries, is the next idea I think we need to get across…

So what are the jobs that are out at the moment? Here’s a quick round-up of some that I’ve spotted…

  • Data editor (Guardian): “develop and implement a clear strategy for the Data team and the use of data, numbers and statistics to generate news stories, analysis pieces, blogs and fact checks for The Guardian and The Observer.
    You will take responsibility for commissioning and editing content for the Guardian and Observer data blogs as well as managing the research needs of the graphics department and home and foreign news desks. With day-to-day managerial responsibility for a team of three reporters / researchers working on the data blog, you will also be responsible for data analysis and visualisation: using a variety of specialist software and online tools, including Tableau, ARCGis, Google Fusion, Microsoft Access and Excel”

Perpetuating the “recent trad take” on data journalism, viewed as gonzo journalist hacker:

  • Data Journalist [Telegraph Media Group]: “[S]ource, sift and surface data to find and generate stories, assist with storytelling and to support interactive team in delivering data projects.
    “The Data Journalist will sit within the Interactive Data team, and will work with a team of designers, web developers and journalists on data-led stories and in developing innovative interactive infographics, visualisations and news applications. They will need to think and work fast to tackle on-going news stories and tight deadlines.
    “Applicants should have a portfolio of relevant work and bylines on data-led stories and/or interactive graphics. The role will include mentoring and training opportunities, but candidates should feel confident working with HTML/CSS, Javascript, PHP and MySQL, even if they are not writing code themselves. Experience of writing scrapers and using statistical software (e.g. R) is desired, but not essential.”

One of the most exciting opportunities that I can see around data related publishing is in new workflows and minimising the gap between investigative tools and published outputs. This seems to me a bit risky in that it seems so conservative when it comes to getting data outputs actually published?

  • Designer [Trinity Mirror]: “Trinity Mirror’s pioneering data unit is looking for a first-class designer to work across online and print titles. … You will be a whizz with design software – such as Illustrator, Photoshop and InDesign – and understand the principles of designing infographics, charts and interactives for the web. You will also be able to design simple graphical templates for re-purposing around the group.
    “You should have a keen interest in current affairs and sport, and be familiar with – and committed to – the role data journalism can play in a modern newsroom.”
  • [Trinity Mirror]: Can you take an API feed and turn it into a compelling gadget which will get the whole country talking?
    “Trinity Mirror’s pioneering data unit is looking for a coder/developer to help it take the next step in helping shape the future of journalism. …
    “You will be able to create tools which automatically grab the latest data and use them to create interactive, dynamically-updated charts, maps and gadgets across a huge range of subjects – everything from crime to football. …
    “The successful candidate will have a thorough knowledge of scraping techniques, ability to manage a database using SQL, and demonstrable ability in at least one programming language.”

But there is hope about the embedding of data skills as part of everyday journalistic practice:

  • Culture Reporter (Guardian): “We are looking for a Culture Reporter to generate stories and cover breaking news relating to Popular Culture, Film and Music … Applicants should also have expertise with digital tools including blogging, social media, data journalism and mobile publishing.”
  • Investigations Correspondent [BBC Newsnight]: “Reporting to the Editor, the Investigations Correspondent will contribute to Newsnight by producing long term investigations as well as sometimes contributing to big ongoing stories. Some investigations will take months, but there will also be times when we’ll need to dig up new lines on moving the stories in days.
    “We want a first rate reporter with a proven track record of breaking big stories who can comfortably work across all subject areas from politics to sport. You will be an established investigative journalist with a wide range of contacts and sources as well as having experience with a range of different investigative approaches including data journalism, Freedom Of Information (FOI) and undercover reporting.”
  • News Reporter, GP [Haymarket Medical Media]: “GP is part of Haymarket Medical Media, which also produces MIMS, Medeconomics, Inside Commissioning, and mycme.com, and delivers a wide range of medical education projects. …
    “Ideally you will also have some experience of data journalism, understand how social media can be used to enhance news coverage and have some knowledge of multimedia journalism, including video and blogs.”
  • Reporter, ENDS Report [Haymarket]: “We are looking for someone who has excellent reporting and writing skills, is enthusiastic and able to digest and summarise in depth documents and analysis. You will also need to be comfortable with dealing with numbers and statistics and prepared to sift through data to find the story that no one else spots.
    “Ideally you will have some experience of data journalism, understand how social media can be used to enhance news coverage.”

Are there any other current ones I’m missing?

I think the biggest shift we need is to get folk treating data as a source that responds to a particular style of questioning. Learning how to make the source comfortable and get it into a state where you can start to ask it questions is one key skill. Knowing how to frame questions so that you discover the answers you need for a story is another. Choosing which bits of the conversation you use in a report (if any – maybe the conversation is akin to a background chat?) yet another.

Treating data as a source also helps us think about how we need to take care with it – how not to ask leading questions, how not to get it to say things it doesn’t mean. (On the other hand, some folk will undoubtedly force the data to say things it never intended to say…)

“If you torture the data enough, nature will always confess” [Ronald Coase]

[Disclaimer: I started looking at some medical data for Haymarket.]

Written by Tony Hirst

December 19, 2013 at 7:54 pm

Learning Analytics Interventions

A nice observation, pointed out by Kay Bromley in a department meeting earlier this week whilst reporting on the OU’s changing model for student support: if we introduce learning analytics that trigger particular interventions, (for example, email prompts), we should expect a certain percentage of those interventions to result in additional calls for support…

Which is to say, a consequence of analytics driven interventions may be a need to provide additional levels of support.

Another factor to be taken into account is the extent to which Associate Lecturers (that is, personal module tutors) need to be informed when an intervention occurs, because there is a good chance that the AL will be the person a student contacts following an automated intervention or alert…

Written by Tony Hirst

December 13, 2013 at 10:34 am

Posted in Anything you want

Extracting Images from PDFs

A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained within PDFs…).

For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:

[Image: Shell Nigeria oil spill listing]

Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.

import os, re
from urllib.request import urlopen

#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd


#Load in the data scraped from Shell
df = pd.read_csv('shell_30_11_13_ng.csv')

#Keep a note of any URLs that fail to download
errors = []

#For each line item (the PDF link is in the 16th column of the scraped data):
for url in df[df.columns[15]]:
    try:
        print('trying', url)
        u = urlopen(url)
        #Assumption: use the last part of the URL as the local filename
        fn = url.split('/')[-1]

        #Grab a local copy of the downloaded picture containing PDF
        with open(fn, 'wb') as localFile:
            localFile.write(u.read())
    except Exception:
        print('error with', url)
        errors.append(url)
        continue

    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    #Assumption: the id is whatever comes before the first _ or - in the filename
    jiv_id = re.split(r'[_-]', fn)[0]

    #I'm going to move the PDFs and the associated images stripped from them in separate folders
    #Assumption: one folder per JIV id, under data/
    fo = 'data/' + jiv_id
    os.system(' '.join(['mkdir -p', fo]))
    #Stub used by pdfimages to name the extracted image files
    idp = '/'.join([fo, jiv_id])

    #Try to cope with crappy filenames containing punctuation chars
    fn = re.sub(r'([()&])', r'\\\1', fn)

    #Available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' etc etc bit moves the PDF file into the new JIV report directory
    cmd = ' '.join(['pdfimages -j', fn, idp, '; mv', fn, fo])
    os.system(cmd)
    #Still a couple of errors on filenames
    #just as quick to catch by hand/inspection of files that don't get moved properly

print('Errors', errors)

Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng

The important line of code in the above is:

pdfimages -j FILENAME OUTPUT_STUB

FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
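A slightly more defensive way of making the same call, sketched here rather than taken from the script above, is to hand the arguments to subprocess as a list so that brackets and ampersands in filenames don't need escaping by hand. The function names are made up for illustration, and the sketch assumes the pdfimages binary is on the path.

```python
import shutil
import subprocess


def pdfimages_cmd(pdf_path, output_stub):
    """Build the argument list for: pdfimages -j FILENAME OUTPUT_STUB."""
    # -j asks pdfimages to save JPEG-encoded images as .jpg files
    return ["pdfimages", "-j", pdf_path, output_stub]


def extract_images(pdf_path, output_stub):
    """Run pdfimages if it is installed; return True if the call succeeded."""
    if shutil.which("pdfimages") is None:
        # Assumption: the binary comes from a poppler-utils (or similar) install
        return False
    result = subprocess.run(pdfimages_cmd(pdf_path, output_stub))
    return result.returncode == 0
```

Because no shell is involved, a filename such as "report (1).pdf" is passed through untouched.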

pdfimages can be downloaded as part of poppler (I think?!)

See also this Stack Exchange question/answer: Extracting images from a PDF

PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.

http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:

- generate post containing images and body text made up from data in the associated line from the CSV file.

Example data:

Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
Comments: Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf

So we can pull this out for the body post. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).

A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
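As a rough sketch of how those two bits of parsing might go (the function names are mine, and the patterns are guesses based only on the single example row above, so expect them to miss messier entries):

```python
import re


def jiv_id_from_url(url):
    """Return the leading run of digits in the PDF filename, or None."""
    filename = url.rsplit("/", 1)[-1]
    match = re.match(r"\d+", filename)
    return match.group(0) if match else None


def pipe_diameter_inches(incident_site):
    """Return a leading pipe diameter in inches, or None if there isn't one."""
    # Accept a double prime, a straight quote or a curly quote as the inch mark
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s*[″"”]', incident_site)
    return float(match.group(1)) if match else None


photo_url = ("http://s06.static-shell.com/content/dam/shell-new/local/country/nga/"
             "downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf")
print(jiv_id_from_url(photo_url))
print(pipe_diameter_inches("10″ Diebu Creek – Nun River Pipeline at Onyoma"))
```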

We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
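A minimal sketch of turning a row into tags might look like the following. The column names are the ones in the example data; the volume bands are entirely hypothetical and only there to show the idea of tagging by range.

```python
def volume_band(bbl):
    """Bucket an estimated spill volume (in barrels) into a coarse range label."""
    # Hypothetical band edges, chosen purely for illustration
    for upper, label in [(10, "under 10 bbl"), (100, "10-99 bbl"), (1000, "100-999 bbl")]:
        if bbl < upper:
            return label
    return "1000+ bbl"


def tags_for_spill(row):
    """Build a list of tags from one row of the spill data (a dict keyed by column name)."""
    tags = []
    # Split combined causes such as 'Sabotage/Theft' into separate tags
    tags.extend(part.strip().lower() for part in row["Cause"].split("/"))
    tags.append(row["Terrain"].strip().lower())
    tags.append(volume_band(float(row["Estimated Spill Volume (bbl)"])))
    return tags


example_row = {"Cause": "Sabotage/Theft", "Terrain": "Swamp", "Estimated Spill Volume (bbl)": "65"}
print(tags_for_spill(example_row))
```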

There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or where we can scrape it, look for structure in the clean-up status field. For example:

Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.

If those phrases are common/templated refrains, we can parse the corresponding dates out?
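If they are, something along these lines might do as a first pass. It is a sketch that assumes dates always appear in the "6th January 2013" style seen in the example; the function name is made up.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Day with an ordinal suffix, a full month name, then a four digit year
DATE_RE = re.compile(r"(\d{1,2})(?:st|nd|rd|th)\s+(%s)\s+(\d{4})" % "|".join(MONTHS))


def dates_in_status(text):
    """Return every date mentioned in a clean-up status string, in order of appearance."""
    return [date(int(year), MONTHS[month], int(day))
            for day, month, year in DATE_RE.findall(text)]


status = ("Recovery of spilled volume commenced on 6th January 2013 and was completed "
          "on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.")
found = dates_in_status(status)
# Days between recovery starting and recovery completing
recovery_days = (found[1] - found[0]).days
```

Matching the month names explicitly, rather than leaning on strptime, keeps the parsing independent of the machine's locale.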

I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?

Written by Tony Hirst

December 1, 2013 at 5:14 pm

Posted in Anything you want, School_Of_Data

Peeking at Representation on Any Question(s) Time…

Picking up on Political Representation on BBC Political Q&A Programmes – Stub , the quickest of hacks…

In OpenRefine, create a new project by importing data from a couple of URLs – data from the BBC detailing episode IDs for Any Questions and Question Time:

- http://www.bbc.co.uk/programmes/b006t1q9.rdf
- http://www.bbc.co.uk/programmes/b006qgvj.rdf

Import the data as XML, highlighting a single programme code row as the import element.

The data we get looks like this – /programmes/b007ck8s#programme – so we can add a column by fetching URLs based on 'http://www.bbc.co.uk'+value.split('#')[0]+'.json' to get JSON data back for each row.

Parse the JSON that comes back using something like value.parseJson()['programme']['medium_synopsis'] to create a new column containing the medium synopsis information.
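For anyone who would rather do those two steps outside OpenRefine, here is a rough Python equivalent. The sample JSON is a cut-down, made-up stand-in for what the BBC returns, keeping only the two keys used above.

```python
import json


def programme_json_url(value):
    """Turn '/programmes/b007ck8s#programme' into the matching .json URL."""
    return "http://www.bbc.co.uk" + value.split("#")[0] + ".json"


def medium_synopsis(json_text):
    """Pull the medium synopsis out of a programme JSON document."""
    return json.loads(json_text)["programme"]["medium_synopsis"]


# Hypothetical, heavily trimmed response body for illustration only
sample_response = json.dumps(
    {"programme": {"medium_synopsis": "Topical debate from Colchester, with David Dimbleby."}})

print(programme_json_url("/programmes/b007ck8s#programme"))
print(medium_synopsis(sample_response))
```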

The medium synopsis elements typically look like Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone. Which is to say they often contain the names of the panellists.

We can try to extract the names contained within each synopsis using the Zemanta API (key required) accessed via the Named-Entity Recognition extension for Google Refine / OpenRefine.
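If you don't have a Zemanta key, a much cruder alternative (not what I did here, and nowhere near as robust as proper entity extraction) is to lean on the "On the panel are …" phrasing with a regular expression. This sketch assumes that phrasing is present and leaves descriptors such as "singer" attached to the names.

```python
import re


def panellists(synopsis):
    """Split the 'On the panel are ...' sentence of a synopsis into a list of names."""
    match = re.search(r"On the panel are (.+?)\.(?:\s|$)", synopsis)
    if not match:
        return []
    # Names are separated by commas, with 'and' before the last one
    parts = re.split(r",\s*|\s+and\s+", match.group(1))
    return [part.strip() for part in parts if part.strip()]


synopsis = ("Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, "
            "Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone.")
names = panellists(synopsis)
```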

These seem to come back in reconciliation API form with the name set to a name and the id to a URL. We can get a concatenated list of the URLs that are returned by creating a column around something like this: forEach(cell.recon.candidates,v,v.id).sort().join('||') but I’m not sure that’s useful.

We can create a column based just around the matched name using cell.recon.match.name.

Let’s use the row view and fill down on programme IDs, then have a look at a duplicate facet and view only rows that are duplicated (that is, where an extracted named entity appears more than once). We can also use a text facet to see which names appear in multiple episodes of Question Time and/or Any Questions.

[Image: OpenRefine – Any Questions]

Selecting a single name allows us to see the programmes that person appeared on. If we pull out the time of first broadcast (value.parseJson()['programme']['first_broadcast_date']) and Edit Cells-Common Transforms-To date, we can also use a date facet to select out programmes first broadcast within a particular date range.

[Image: Question Time date facet]

We can also run a text filter to limit records to episodes including a particular person and then use the Date facet to highlight the episodes in which they appeared on the timeline:

[Image: name and date filter – questions MP]

What this suggests is that we can use OpenRefine as some sort of ‘application shell’ for creating information tools around a particular dataset without actually having to build UI components ourselves?

If we custom export a table using programme IDs and matched names, and then rename the columns Source and Target, we can visualise them in something like Gephi (you can use the recipe described in the second part of this walkthrough: An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine).

The directed graph we load into Gephi connects entities (participant names, location names) with programme IDs. There is a handy tool – Multimode Networks Projection – that can collapse the graph so that entities are connected to other entities that they shared a programme ID with.
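The idea behind that projection can be sketched in a few lines of plain Python. This is not how the Gephi plugin is implemented, just the same collapse done by hand, with the edge weight counting how many programmes two entities shared; the programme IDs and names in the example are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def project_entities(edges):
    """Collapse (programme_id, entity) pairs into weighted entity-to-entity edges."""
    by_programme = defaultdict(set)
    for programme_id, entity in edges:
        by_programme[programme_id].add(entity)

    weights = defaultdict(int)
    for entities in by_programme.values():
        # Every pair of entities on the same programme gets (or strengthens) an edge
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Placeholder Source/Target rows of the kind exported from OpenRefine
edges = [("p1", "A"), ("p1", "B"), ("p2", "A"), ("p2", "B"), ("p2", "C")]
projected = project_entities(edges)
```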

[Image: Graph colouring in Gephi]

(If you forget to remove the programme nodes, a degree range filter to select only nodes with degree greater than 2 tidies the graph up.)

If we run PageRank on the graph (now as an undirected graph), layout using ForceAtlas2 and size nodes according to PageRank, we can look into the heart of the UK political establishment as evidenced by appearances on Question Time and Any Questions.

[Image: The heart of the establishment]

The next step would probably be to try to pull info about each recognised entity from dbPedia (forEach(cell.recon.candidates,v,v.id).sort()[0] seems to pull out dbpedia URIs) but grabbing data from dbPedia seems to be borked in my version of OpenRefine atm:-(

Anyway – a quick hack that took longer to write up than it did to do…

OpenRefine project file here.

Written by Tony Hirst

November 16, 2013 at 6:34 pm

Posted in Anything you want

Political Representation on BBC Political Q&A Programmes – Stub

It’s too nice a day to be inside hacking around with Parliament data as a remote participant in today’s Parliamentary hack weekend (resource list), but if it had been a wet weekend I may have toyed with one of the following:

- revisiting this cleaning script for Analysing UK Lobbying Data Using OpenRefine (actually, a look at who funds/offers support for All Party Groups. The idea was to get a dataset of people who provide secretariat and funds to APPGs, as well as who works for them, and then do something with that dataset…)

- tinkering with data from Question Time and Any Questions…

On that last one:

- we have data from the BBC about historical episodes of Question Time and historical episodes of Any Questions. (Click a year/month link to get the listing.)

These give us generatable URLs for programmes by month with URLs of form http://www.bbc.co.uk/programmes/b006t1q9/broadcasts/2013/01 but how do we get a JSON version of that?! Adding .json on the end doesn’t work?!:-( UPDATE – this could be a start, via @nevali – use pattern /programmes/PID.rdf , such as http://www.bbc.co.uk/programmes/b006qgvj.rdf
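Generating those URLs is simple enough; a quick sketch, with function names of my own choosing and only the two URL patterns mentioned above:

```python
def broadcasts_url(pid, year, month):
    """Monthly broadcast listing URL for a programme brand PID."""
    return "http://www.bbc.co.uk/programmes/%s/broadcasts/%04d/%02d" % (pid, year, month)


def rdf_url(pid):
    """RDF description URL for a programme PID."""
    return "http://www.bbc.co.uk/programmes/%s.rdf" % pid


# One listing URL per month of 2013 for the PID used in the example above
listing_urls = [broadcasts_url("b006t1q9", 2013, month) for month in range(1, 13)]
```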

We can get bits of information (albeit in semi-structured form) about panellists in data form from programme URL hacks like this: http://www.bbc.co.uk/programmes/b007m3c1.json

Note that some older programmes don’t list all the panellists in the data? So a visit to Wikipedia – http://en.wikipedia.org/wiki/List_of_Question_Time_episodes#2007 – may be in order for Question Time (there isn’t a similar page for Any Questions?)

Given panellists (the BBC could be more helpful here in the way it structures its data…), see if we can identify parliamentarians (MP suffix? Lord/Lady title?) and look them up using the new-to-me, not-yet-played-with-it UK Parliament – Members’ Names Data Platform API. Not sure if reconciliation works on parliamentarian lookup (indeed, not sure if there is a reconciliation service anywhere for looking up MPs, members of the House of Lords, etc?)
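The suffix/title check could start out as crude as this. It is a sketch only: the list of titles is my guess at a starting point, and it will obviously miss parliamentarians listed without any marker at all.

```python
import re

# Assumed markers: a peerage-style title at the start, or an 'MP' suffix at the end
TITLE_RE = re.compile(r"^(?:Lord|Lady|Baroness)\s")
MP_SUFFIX_RE = re.compile(r"\sMP$")


def looks_like_parliamentarian(name):
    """Very rough guess at whether a panellist name marks them as a parliamentarian."""
    name = name.strip()
    return bool(TITLE_RE.search(name) or MP_SUFFIX_RE.search(name))


candidates = ["Peter Hain MP", "Lord Example", "Sir Menzies Campbell", "Beverley Knight"]
flagged = [name for name in candidates if looks_like_parliamentarian(name)]
```

Note that "Sir Menzies Campbell" is not flagged, which is exactly the sort of miss a lookup against the Members’ Names data would be needed to catch.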

From the Members’ Names API, we can get things like gender, constituency, whether or not they were holding a (shadow) cabinet post, maybe whether they were on a particular committee at the time etc. From programme pages, we may be able to get the location of the programme recording. So this opens up the possibility of mapping geo-coverage of Question Time/Any Questions, both in terms of where the programmes visit as well as which constituencies are represented on them.

If we were feeling playful, we could also have a stab at which APPGs have representation on those programmes!

It also suggests a simpler hack – of just providing a briefing around the representatives appearing on a particular episode in terms of their current (or at the time) parliamentary status (committees, cabinet positions, APPGs etc etc)?

Written by Tony Hirst

November 16, 2013 at 12:31 pm

Confused by Government Spending, Indirectly…

I’ve been doodling around local spending data again recently, noticing as ever that one of the obvious things to do is pull out payments to a particular company (notwithstanding the very many issues associated with actually identifying a particular company or entities within the same corporate group), and I started wondering about certain classes of public payment that may or may not get classed as spend but that do get spent with particular companies.

One example might be winter fuel payments. I don’t know if these are granted in such a way that they have to be used to cover energy bills (for example, by virtue of being delivered in the form of vouchers that can be redeemed against energy bills), or whether the money is just cash that the recipient can choose to spend howsoever; but if they are so restricted in terms of how they can be used they represent a way for government to make a payment to an energy company using a level of indirection that means we can’t at first glance see how government makes that payment to the energy company. The “choice” of who receives the payment is up to the consumer, presumably, but it seems to me that it is the government that is essentially making the payment to the energy company as a subsidy for a particular class of customer (as defined by criteria for determining winter fuel payment eligibility).

By not regulating profits made by energy companies more harshly, government presumably supports pricing that requires government to subsidise a significant number of customers. By not regulating prices more harshly, government seems keen to keep giving the energy companies a bung by proxy? I guess the rationale for making the payments this way is that the government can argue that it is acting progressively. If government just gave the energy companies a bung directly, people would get upset: either that the energy companies were being given a chunk of cash for free, or that they were being given a chunk of cash to subsidise the prices they set which would mean that people who could afford the higher price were also benefiting from the deal. How would we feel if, rather than government giving winter payments to those eligible, it just gave the cash straight to the utilities in a transparent way we could track, and required them to identify eligible customers and give them a reduced tariff? Of course, if the winter fuel payments are actually hypothecated, doing it this way would mean that folk currently in receipt of the payments wouldn’t be able to use the money in other ways?

Another area of “spend” that confuses me is the new “Help to Buy” home equity loan scheme, in which the government “provides an equity loan (also known as shared equity) of up to 20% of the value of the home you are buying. … the buyer needs only a 5% deposit, and a 75% mortgage to make up the rest.”

The Help to Buy Equity Loan is interest-free for 5 years. After that, the purchaser pays an annual fee of 1.75% on the amount of the outstanding loan. The fee will increase each year by inflation (Retail Price Index (RPI)) + 1%.

The purchaser can start repaying the equity loan after they’ve owned the home for a year, but they’ll need to be able to pay a minimum of 10% of the property value at the time of repayment.

When they want to sell their home, they’ll need to repay the percentage equity loan that is still outstanding. So, for example, if they originally bought 80% of the property and they hadn’t repaid any of the equity loan, their repayment on selling would be 20% of the market value at the time when they sell.
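To get a feel for the numbers, here is a back-of-the-envelope sketch. The £200,000 purchase price, the 3% RPI figure and the £250,000 sale price are all made up, and I am assuming the fee is charged on the original loan amount and that "increase by RPI + 1%" means the fee itself is uplifted by that percentage each year.

```python
def help_to_buy_split(price, loan_share=0.20, deposit_share=0.05):
    """Split a purchase price into equity loan, deposit and mortgage."""
    loan = price * loan_share
    deposit = price * deposit_share
    mortgage = price - loan - deposit
    return loan, deposit, mortgage


def annual_fee(loan, year, rpi=0.03, rate=0.0175):
    """Fee due in a given year: nothing for years 1-5, then 1.75%, uplifted by RPI + 1% a year."""
    if year <= 5:
        return 0.0
    return loan * rate * (1 + rpi + 0.01) ** (year - 6)


def repayment_on_sale(sale_price, outstanding_share=0.20):
    """Amount owed on sale: the outstanding share of the market value at that time."""
    return sale_price * outstanding_share


loan, deposit, mortgage = help_to_buy_split(200000)
year_six_fee = annual_fee(loan, 6)
owed_on_sale = repayment_on_sale(250000)
```

On these made-up figures the buyer borrows £40,000 of equity and owes £50,000 if the house later sells for £250,000, so the government shares in any rise in price as well as any fall.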

One reading of this might be that, where folk used to spend as much as they can afford on a house (and maybe even a little bit more), now some of them may be tempted to spend that much and up to 20% more…? That is, might they see the deal as if they were getting a 20% discount on the house (conveniently forgetting that interest payments will kick in after 5 years?), allowing them to offer more and hence inflate prices more?

What I also wonder about this is: is this the government trying to kickstart a more fluid market in shared ownership on the equity side? I’m guessing that at some point the plan is for the government to flog off the loan book (and presumably then allow interest payments on the loans to float a little more…)? But might there also be an intention to allow individual investors to buy the title to individual equity loans? So rather than investing in a buy to let, individuals would be encouraged to invest in shared ownership schemes from the equity, rather than resident partial owner, side, as an investment?

PS I don’t know about regulatory capture, but policy capture seems like even more of a win for the utilities?! Gas industry employee seconded to draft UK’s energy policy

Written by Tony Hirst

November 11, 2013 at 10:48 am

Posted in Anything you want

Google’s New Terms Mean You Could Soon Be Acting as a Product Endorser

If you’re a Google account holder, you may have noticed an announcement recently that Google has changed its terms and conditions, in part to allow it to use your +1s and comments as “shared endorsements” in ads published through Google ad services.


So it seems as if there are now at least two ways Google uses you, me, us, to generate revenue in an advertising context. Firstly, we’re sold as “audience” within a particular segment: “35-50 males into tech”, for example, an audience that advertisers can buy access to. This may even get to the level of individual targeting (for example, Centralising User Tracking on the Web – Let Google Track Everyone For You). Now, secondly, as personal endorsers of a particular company, service or product.

The ‘recent changes’ announcement URL looks like a general “change notice” URL – https://www.google.co.uk/intl/en/policies/terms/changes/ – so I’ll repost key elements from the announcement here….

“Because many of [us] are allergic to legalese”, the announcement goes, “here’s a plain English summary for [our] convenience.”

We’ve made three changes:

Firstly, clarifying how your Profile name and photo might appear in Google products (including in reviews, advertising and other commercial contexts).

You can control whether your image and name appear in ads via the Shared Endorsements setting.

Secondly, a reminder to use your mobile devices safely.
Thirdly, details on the importance of keeping your password confidential.

The first change – how my Profile name and photo might appear in Google products – is the one I’m interested in.

How your Profile name and photo may appear (including in reviews and advertising)

We want to give you, and your friends and connections, the most useful information. Recommendations from people that you know can really help. So your friends, family and others may see your Profile name and photo, and content like the reviews that you share or the ads that you +1’d. This only happens when you take an action (things like +1’ing, commenting or following) – and the only people who see it are the people that you’ve chosen to share that content with. On Google, you’re in control of what you share. This update to our Terms of Service doesn’t change in any way who you’ve shared things with in the past or your ability to control who you want to share things with in the future.

Feedback from people you know can save you time and improve results for you and your friends across all Google services, including Search, Maps, Play and in advertising. For example, your friends might see that you rated an album 4 stars on the band’s Google Play page. And the +1 you gave your favourite local bakery could be included in an ad that the bakery runs through Google. We call these recommendations shared endorsements and you can learn more about them here.

When it comes to shared endorsements in ads, you can control the use of your Profile name and photo via the Shared Endorsements setting.

Here’s a direct link to the setting… [if you have a Google+ account, I suggest you go there, uncheck the box, and hit "Save"]. I never knowingly checked this – so presumably the default is set to checked (that is, with me opted in to the “service”)?

If you turn the setting to “off,” …

you’ll get hassled:

F**k you, google...

or to put it another way,

…your Profile name and photo will not show up on that ad for your favourite bakery or any other ads. This setting only applies to use in ads, and doesn’t change whether your Profile name or photo may be used in other places such as Google Play.

I have no idea what the context of Google Play might mean. I do have a Google Android phone, and it is tied to a Google account. It is largely a mystery to me, particularly when it comes to knowing who has access to – or has taken copies of – my contacts. I have no idea what Google Play services I have or have not been opted in to.

If you previously told Google that you did not want your +1’s to appear in ads, then of course we’ll continue to respect that choice as a part of this updated setting.

I’m not sure what that means? If I’ve checked the “do not want my +1’s to appear in ads” box, will the current setting be set to unchecked (opt out of shared endorsements)? Does the original setting still exist somewhere, or has it been replaced by the new setting? Or is there another level of privacy setting somewhere, and if so how do the various levels interact?

This is on my current Google+ settings page:

shared endorsements

and I can’t see anything about +1 ad opt outs, so presumably the setting has changed? I’d have thought I’d have opted out of allowing +1s to appear in ads (had I known: a) that +1s may have been used in ads; and b) that such a setting existed), but presumably that fact passed me by (more on this later in the post…) Or I had opted out and the opt-out wasn’t respected? But surely not that…?

For users under 18, their actions won’t appear in shared endorsements in ads and certain other contexts.

Which is to say, ‘if you lied about your age in order to get access to particular services, we’re gonna sell the ability for advertisers to use you to endorse their products to your friends’.

So that’s the “helpful” explanation of the terms.. what do the actual terms say?

When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide licence to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes that we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights that you grant in this licence are for the limited purpose of operating, promoting and improving our Services, and to develop new ones. This licence continues even if you stop using our Services (for example, for a business listing that you have added to Google Maps). Some Services may offer you ways to access and remove content that has been provided to that Service. Also, in some of our Services, there are terms or settings that narrow the scope of our use of the content submitted in those Services. Make sure that you have the necessary rights to grant us this licence for any content you submit to our Services. [This para, or one very much like it, is in the current terms.]

If you have a Google Account, we may display your Profile name, Profile photo and actions you take on Google or on third-party applications connected to your Google Account (such as +1’s, reviews you write and comments you post) in our Services, including displaying in ads and other commercial contexts. We will respect the choices you make to limit sharing or visibility settings in your Google Account. For example, you can choose your settings so that your name and photo do not appear in an ad.

Hmmm.. so maybe the settings do – or will – have a finer level of control (and complexity…) associated with them? I wonder also whether those two paragraphs can work together? If I comment on a Google+ page, or maybe tag a brand or product in an image I have uploaded, could Google create a derivative work as part of a shared endorsement by me?

Looking Around Some Other Google+ Settings

Finding myself on my Google+ settings page, I had a look at some of the other settings…

Be wary of implicit reveals?

Hmm… this could be an issue, if checked? If things are shared to people in my circles, and folk get automatically added to my circles if I just search for them, then, erm, I could maybe unwittingly opt a page in to my circles?

circle shares

But if I do search for someone and they’re added to my circles on my behalf, what circle are they added to?

so which do they get added to?

Not being paranoid or anything, but I can now also imagine something like the following setting appearing on my main Google account insofar as it relates to search, for example:

Google Search Pages
_ Automatically add a Google+ Author to my circles if I click through on a search result marked with a Google+ Author tag.

So what other settings are there that may be of interest?

Several to do with automatically tampering with my content (as if false memory syndromes aren’t bad enough!)

mess with my stuff...

do stuff to my stuff

I seem to remember these being announced, but didn’t think to check that I would automatically be opted in.

Note to self: When Google announces a new Google+ service, or service related to Google accounts, assume I get automatically opted in.

Any others? Ah, ha… a little something that invisibly enmeshes me a little deeper in the Google knowledge web:

link me in to the Google knowledge graph

Here’s the blurb, rather bluntly entitled Find My Face: “Find my face makes finding pictures and videos of you easy and more social. Find my face offers name tag suggestions so you, or people that you know, can quickly tag photos. Any time someone tags you in a photo or video, you’ll be able to accept or reject name tags created by people you know.”

So I’m guessing if I opt in to this, if Google recognises that I’m in a photo, and someone I know views that photo, they’ll be prompted to tag me in it. I wonder if Google actually has a belief graph and a knowledge graph? In the first case, the belief graph would associate me with photos Google’s algorithms think I’m in. In the second case, the knowledge graph, Google would associate me with photos where someone confirms that I am in the photo. If you want to get geeky, this knowledge vs. belief distinction, where knowledge means “justified true belief”, has a basis in things like epistemic logic (which I came across in the context of agent logics) – I’d never really thought about Google’s graph in this way… Hmmm…

Here’s how it works, apparently:

After you turn on Find my Face, Google+ uses the photos or videos you’re tagged in to create a model of your face. The model updates as tags of you are added or removed and you can delete the entire face model at any time by turning off Find my Face.

If you turn on Find my Face, we can use your face model to make it easier to find photos or videos of you. For example, we’ll show a suggestion to tag you when you or someone you know looks at a photo or video that matches your face model. Name tag suggestions by themselves do not change the sharing setting of photos or albums or videos. However, when someone approves the suggestion to add a name tag, the photo and relevant album or video are shared with the person tagged.

So can Google sell that face model of me to other parties? Or just sell recognition of my face in photos and videos as a service, or as part of an audience construction process?

I guess at least I get to approve any photo tags though… Or do I?

Act on my behalf

So if I search for someone on Google+, they’re added to my circles, which means that if they tag me in a photo when prompted by Google+ to do so, their tag is automatically accepted by me by virtue of this proxy setting I seem to have been automatically opted in to? Or am I reading these settings all wrong?

Ho hum, I guess it’s not even the legalese I’m allergic to… it’s understanding the emergent complexity and consequences that arise from different combinations of settings on personal account pages…

Written by Tony Hirst

October 12, 2013 at 3:22 pm

ScreenScraping HTML Web Pages With OpenRefine – Norwegian Oil Company Data

[An old post, rescued from the list of previously unpublished posts...]

Although I use OpenRefine from time to time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.

The setting actually comes from OpenOil’s Johnny West:

1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions

Let’s see how far we can get…

To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:

- the company ID lookup URL can be constructed around http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post, and the correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t Larssen in the comments.]

- http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.
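For anyone who would rather script this part, the two URL patterns above can be sketched in Python as follows. This is a minimal sketch, not the OpenRefine recipe itself: the function names are made up for illustration, and the query parameters are simply copied from the updated brreg link quoted above.

```python
from urllib.parse import urlencode

BRREG_SEARCH = "http://w2.brreg.no/enhet/sok/treffliste.jsp"
PUREHELP_DETAILS = "http://www.purehelp.no/company/details/"


def brreg_search_url(company_name):
    """Build the company register search URL for a company name."""
    params = {
        "navn": company_name,  # urlencode turns spaces into '+'
        "orgform": "0",
        "fylke": "0",
        "kommune": "0",
        "barebedr": "false",
    }
    return BRREG_SEARCH + "?" + urlencode(params)


def purehelp_url(company_id):
    """Build the purehelp details URL from a company number alone."""
    return PUREHELP_DETAILS + str(company_id)


print(brreg_search_url("Bayerngas Norge AS"))
print(purehelp_url(989490168))
```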

Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:

openRefine xls import

So that’s step 1…

We can run step 2 as follows* – create a new column from the company column:

* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…

openRefine add new col

Here’s how we construct the URL:

OpenRefine - get new col by URL

The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.

table to scrape - crib

Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:

[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() ].join('::')

The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.

The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() – pulls out the company name from the first search result.

We place these two parts into an array and then join them with two colons: [].join('::')

This keeps things tidy and allows us to check by eye that sensible company names have been found from the original search strings.
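The same name/ID extraction can be sketched in Python using only the standard library. This mirrors the GREL expression above rather than replacing it: it takes the first link whose href starts with detalj.jsp?orgnr=, strips that prefix to get the ID, and joins the ID and link text with two colons. The class and function names, and the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser

PREFIX = "detalj.jsp?orgnr="


class FirstResultParser(HTMLParser):
    """Collect the ID and link text of the first matching result link."""

    def __init__(self):
        super().__init__()
        self.company_id = None
        self.name_parts = []
        self._in_link = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self._done or self._in_link:
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(PREFIX):
            self.company_id = href[len(PREFIX):]
            self._in_link = True

    def handle_data(self, data):
        if self._in_link:
            self.name_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self._done = True  # ignore every later result


def first_result_pair(html):
    """Return 'id::name' for the first result, or None if there is none."""
    parser = FirstResultParser()
    parser.feed(html)
    if parser.company_id is None:
        return None
    return "::".join([parser.company_id, "".join(parser.name_parts).strip()])


# Hypothetical search results snippet, not copied from the real page
sample = (
    '<table><tr><td><a href="detalj.jsp?orgnr=989490168">BAYERNGAS NORGE AS</a></td></tr>'
    '<tr><td><a href="detalj.jsp?orgnr=123456789">SOME OTHER AS</a></td></tr></table>'
)
print(first_result_pair(sample))
```

Returning None when nothing matches makes the failed lookups easy to spot, which the GREL version would instead report as an error in the cell.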

open refine - compare names

We can now split the name/ID pair column into two separate columns:

openRefine split column into cols

And the result:

openrefine cols now split

The next step, step 3, requires looking up the company IDs on purehelp. We’ve already seen how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:

openrefine add another col by URL

(We could probably reduce the throttle time by an order of magnitude!)

The next step, step 4, is to pull out shareholders and their percentages.

The first step is to grab the shareholder table and each of the rows, which in the original looked like this:

shareholders table

The following hack seems to get us the rows we require:


BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(


So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(

Need to search into table where “Shareholders” is specified in top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was taking because the HTML of the page is really horrible. Bah…. #ffs:-(

Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):

<div id="companyOwnership" class="box">
	<div class="boxHeader">Shareholders:</div>
	<div class="boxContent">
		<div class="row rowHeading">
			<label class="fl" style="width: 70%;">Company name:</label>
			<label class="fl" style="width: 30%;">Percentage share (%)</label>
			<div class="cb"></div>
		<div class="row odd">
			<label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
			<div class="fr" style="width: 30%;">100.00%</div>
			<div class="cb"></div>
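Outside GREL, one way of picking out the right box is to split the page on the repeated companyOwnership ID, keep the chunk whose boxHeader reads "Shareholders:", and then pull the name and percentage out of each odd/even row (which also skips the heading row). The following Python sketch is based only on the fragment shown above, so treat the page structure as an assumption; in particular, it assumes every box has its own boxHeader, and the "Ownership in other companies:" header in the test data is invented.

```python
import re

BOX_SPLIT = '<div id="companyOwnership"'
HEADER_RE = re.compile(r'<div class="boxHeader">(.*?)</div>', re.S)
ROW_RE = re.compile(r'<div class="row (?:odd|even)">(.*?)<div class="cb">', re.S)
LABEL_RE = re.compile(r'<label[^>]*>(.*?)</label>', re.S)
SHARE_RE = re.compile(r'<div class="fr"[^>]*>(.*?)</div>', re.S)
TAG_RE = re.compile(r'<[^>]+>')


def text_of(fragment):
    """Drop any nested tags and collapse whitespace."""
    return " ".join(TAG_RE.sub("", fragment).split())


def shareholder_rows(page_html):
    """Return (name, percentage) pairs from the box headed 'Shareholders:'."""
    for chunk in page_html.split(BOX_SPLIT)[1:]:
        header = HEADER_RE.search(chunk)
        if header is None or text_of(header.group(1)) != "Shareholders:":
            continue
        # Only look as far as the next box header, if there is one
        body = chunk[header.end():].split('class="boxHeader"')[0]
        rows = []
        for row in ROW_RE.finditer(body):
            label = LABEL_RE.search(row.group(1))
            share = SHARE_RE.search(row.group(1))
            if label and share:
                rows.append((text_of(label.group(1)), text_of(share.group(1))))
        return rows
    return []


# Test page: an invented first box followed by the fragment from the post
page = '''
<div id="companyOwnership" class="box">
  <div class="boxHeader">Ownership in other companies:</div>
  <div class="boxContent">
    <div class="row odd">
      <label class="fl" style="width: 70%;">Some Subsidiary AS</label>
      <div class="fr" style="width: 30%;">50.00%</div>
      <div class="cb"></div>
    </div>
  </div>
</div>
<div id="companyOwnership" class="box">
  <div class="boxHeader">Shareholders:</div>
  <div class="boxContent">
    <div class="row rowHeading">
      <label class="fl" style="width: 70%;">Company name:</label>
      <label class="fl" style="width: 30%;">Percentage share (%)</label>
      <div class="cb"></div>
    </div>
    <div class="row odd">
      <label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
      <div class="fr" style="width: 30%;">100.00%</div>
      <div class="cb"></div>
    </div>
  </div>
</div>
'''
print(shareholder_rows(page))
```

An empty list back means no Shareholders box was found, which also answers the "are there always shareholders?" question on a per-company basis.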

For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:


openrefine last company ownership

We can then generate one row for each shareholder in OpenRefine:

open refine - split

(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given its own row, which we’ll have to cope with at some point. But again, no need right now.)

For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…

value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')

horrible hack openrefine

Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):

openrefine column split

We then need to tidy the two columns:


Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw the info away that links the company if it’s there…

value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()

We now need to do a bit more tidying – fill down on the empty cells in the shareholder company column and also in the original company name and ID columns [actually - this is not right, as we can see below for the line Altinex Oil Norway AS...? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Firmanavn).

This is what we get:

Company ownership

We can then export this file, before considering steps 5 and 6, using the custom exporter:

open refine exporter

Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…

column export

And then select the export format:

column export format

Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). Looking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.

Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments... The original link will be the one used in the OpenRefine project...]

Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? Also note that Andre is “Other”… maybe replace this too?
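Those tidying steps could also be applied to the exported rows in a few lines of Python. This is a sketch under assumptions: the dictionary keys (company, shareholder, percentage) are hypothetical stand-ins for the exported column names, and the sample rows are invented. It fills down the company name only, drops rows with no shareholder, strips whitespace, removes the percentage sign and swaps Andre for Other.

```python
def tidy_rows(rows):
    """Tidy exported shareholder rows (hypothetical column names)."""
    tidied = []
    last_company = ""
    for row in rows:
        # Fill down the company name where the cell was left blank
        company = (row.get("company") or "").strip() or last_company
        last_company = company
        shareholder = (row.get("shareholder") or "").strip()
        if not shareholder:
            continue  # no shareholder found for this company
        if shareholder == "Andre":
            shareholder = "Other"
        pct_text = (row.get("percentage") or "").replace("%", "").strip()
        percentage = float(pct_text) if pct_text else None
        tidied.append(
            {"company": company, "shareholder": shareholder, "percentage": percentage}
        )
    return tidied


# Invented rows in the shape the export might take
rows = [
    {"company": "A AS", "shareholder": " Shell Holdings ", "percentage": "100.00%"},
    {"company": "B AS", "shareholder": "X ASA", "percentage": "60.00%"},
    {"company": "", "shareholder": "Andre", "percentage": "40.00%"},
    {"company": "C AS", "shareholder": "", "percentage": ""},
]
for tidy in tidy_rows(rows):
    print(tidy)
```

Filling down only the company column, and never the shareholder column, avoids a company with no shareholders inheriting the previous company's owner.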

In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.

Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…

This means we possibly have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?
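The direct lookup suggested above amounts to slotting the company number into the OpenCorporates URL pattern under the Norwegian (no) jurisdiction code. A minimal sketch, with a made-up function name, and assuming the pattern holds for other companies as it appears to for Bayerngas Norge AS:

```python
def opencorporates_url(company_number, jurisdiction="no"):
    """Build an OpenCorporates company URL from a register number."""
    number = str(company_number).strip()
    return "http://opencorporates.com/companies/{}/{}".format(jurisdiction, number)


print(opencorporates_url(989490168))
```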

Written by Tony Hirst

October 10, 2013 at 11:14 pm

www.gov.uk – Replacing the master (dot com) of Previous Government Departmental Web Domains?

By the by, I ran a search on the deprecated dwp.gov.uk website earlier today (Google took me there originally, I think, rather than to the new site reached via http://www.gov.uk/dwp):

search on dwp.gov.uk

and ended up on a search results page with the URL http://dwp.gov.uk.master.com/texis/master/search/?q=sharing+data+local+authority&s=SS:

dwp search

(The results appear to be broken – on the first link at least, the redirect from the results page goes to a largely irrelevant link on the new gov.uk site.)

Hmm…, master.com?

master.com homepage

Ooh – slick… ‘ere, gov, wanna buy a new motorwebservice?

Written by Tony Hirst

September 24, 2013 at 9:45 am

