Link Sharing in Classrooms, Workshops and Conferences

Every so often, I get reminded of old OUseful experiments by recent news, some of which still seem to work, and some of which don’t (the latter usually as a result of link rot or third party service closures).

So for example, a few weeks ago, the Google Education blog announced a new Chrome extension – Share to Classroom (“Get your students on the same (web)page, instantly”).

First, it seems as if the extension lets someone at the front of the room share pages to class members’ screens inside the room. Second, folk in the class are also free to browse to other pages of their own finding and push suggestions of new pages to the person at the front, who can review them and then share them with everyone.

Anyway, in part it reminded me of one of my old hacks – the FeedShow Link Presenter. This was a much cruder affair, creating a frame based page that accepted an RSS feed of links referred to by the main presenter that audience members could click forwards and back through, but that also seems to have started exploring a “back with me” feature to sync everyone’s pages.

Presenters could go off piste (splashing page links not contained in the feed), but it looks as if these couldn’t be synched. (I’m not sure if I addressed that in a later revision.) Nor could audience members suggest links back to the main presenter.

The FeedShow link presenter had a Yahoo Pipes backend, and still seems to work; but with Pipes due to close on September 30th, it looks as if this experiment’s time will have come to an end…

[Screenshot: Yahoo Pipes – “Pipes: Rewire the web”]

Ho hum… (I guess I’ve stopped doing link based presentations anyway, but it was nice to be reminded of them:-)

Poking Around the VW Thing (Algorithmic Cheating in the Automobile Industry)

The news in recent days that VW had installed a software device into some of its diesel engines that identifies when the engine is being tested for emissions on a dynamometer (“dyno”) provides a nice demonstration of how “intelligent” software control systems can be used to identify particular operating environments and switch into an appropriate – or inappropriate – operational mode.

As I keep finding, press coverage of the events seems to offer less explanation and context than the original document that seems to have kicked off the recent publicity, specifically a letter from the US environmental protection agency to the Volkswagen Group of America:

[Screenshot: EPA notice of violation letter – www3.epa.gov/otaq/cert/documents/vw-nov-caa-09-18-15.pdf]

(The document is a PDF of a scanned document; I had hoped to extract the text using a variant of this recipe, running Tika on my own computer via Kitematic, Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika, but it doesn’t seem to extract text from the inlined image(s). Instead, I uploaded it to Google Drive, then opened it in Google Docs – you see pages from the original doc as images, with the extracted text below each one.)

Here are some of the bits that jumped out at me:

A defeat device is an AECD [Auxiliary Emission Control Device] “that reduces the effectiveness of the emission control system under conditions which may reasonably be expected to be encountered in normal vehicle operation and procedure use, unless: (1) Such conditions are substantially included in the Federal emission test procedure; (2) The need for the AECD is justified in terms of protecting the vehicle against damage or accident; (3) The AECD does not go beyond the requirements of engine starting; or (4) The AECD applies only for emergency vehicles …” 40 C.F.R. § 86.1803-01.

Motor vehicles equipped with defeat devices, … cannot be certified.

The CAA makes it a violation “for any person to manufacture or sell, or offer to sell, or install, any part or component intended for use with, or as part of any motor vehicle or motor vehicle engine, where a principal effect of the part or component is to bypass, defeat, or render inoperative any device or element of design installed on or in a motor vehicle or motor vehicle engine in compliance with regulations under this subchapter, and where the person knows or should know that such part or component is being offered for sale or installed for such use or put to such use.” CAA § 203(a)(3)(B), 42 U.S.C. § 7522(a)(3)(B); 40 C.F.R. § 86.1854-12(a)(3)(ii).

Each VW vehicle identified … has AECDs that were not described in the application for the COC that purportedly covers the vehicle. Specifically, VW manufactured and installed software in the electronic control module (ECM) of these vehicles that sensed when the vehicle was being tested for compliance with EPA emission standards. For ease of reference, the EPA is calling this the “switch.” The “switch” senses whether the vehicle is being tested or not based on various inputs including the position of the steering wheel, vehicle speed, the duration of the engine’s operation, and barometric pressure. These inputs precisely track the parameters of the federal test procedure used for emission testing for EPA certification purposes. During EPA emission testing, the vehicles ECM ran software which produced compliant emission results under an ECM calibration that VW referred to as the “dyno calibration” (referring to the equipment used in emissions testing, called a dynamometer). At all other times during normal vehicle operation, the “switch” was activated and the vehicle ECM software ran a separate “road calibration” which reduced the effectiveness of the emission control system (specifically the selective catalytic reduction or the lean NOx trap). As a result, emissions of NOx increased by a factor of 10 to 40 times above the EPA compliant levels, depending on the type of drive cycle (e.g., city, highway). … Over the course of the year following the publication of the WVU study [TH: see link below], VW continued to assert to CARB and the EPA that the increased emissions from these vehicles could be attributed to various technical issues and unexpected in-use conditions. VW issued a voluntary recall in December 2014 to address the issue. … When the testing showed only a limited benefit to the recall, CARB broadened the testing to pinpoint the exact technical nature of the vehicles’ poor performance, and to investigate why the vehicles’ onboard diagnostic system was not detecting the increased emissions. None of the potential technical issues suggested by VW explained the higher test results consistently confirmed during CARB’s testing. It became clear that CARB and the EPA would not approve certificates of conformity for VW’s 2016 model year diesel vehicles until VW could adequately explain the anomalous emissions and ensure the agencies that the 2016 model year vehicles would not have similar issues. Only then did VW admit it had designed and installed a defeat device in these vehicles in the form of a sophisticated software algorithm that detected when a vehicle was undergoing emissions testing.

VW knew or should have known that its “road calibration” and “switch” together bypass, defeat, or render inoperative elements of the vehicle design related to compliance with the CAA emission standards. This is apparent given the design of these defeat devices. As described above, the software was designed to track the parameters of the federal test procedure and cause emission control systems to underperform when the software determined that the vehicle was not undergoing the federal test procedure.

VW’s “road calibration” and “switch” are AECDs’ that were neither described nor justified in the applicable COC applications, and are illegal defeat devices.

The news also reminded me of another tech journalism brouhaha from earlier this year around tractor manufacturer John Deere arguing that farmers don’t own their tractors, but instead purchase “an implied license for the life of the vehicle to operate the vehicle” (Wired, We Can’t Let John Deere Destroy the Very Idea of Ownership).

I didn’t really follow that story properly at the time, but it seems the news arose out of a response to a consultation by the US Copyright Office around the Digital Millennium Copyright Act (DMCA), and in particular a “Proposed Class 21: Vehicle software – diagnosis, repair, or modification” category (first round comments, second round comments) for DMCA Section 1201: Exemptions to Prohibition Against Circumvention of Technological Measures Protecting Copyrighted Works.

[UPDATE: Decisions have now been made on what exemptions are allowable: Exemption to Prohibition on Circumvention of Copyright Protection Systems for Access Control Technologies (PDF)]

Here’s how the class was defined:

21. Proposed Class 21: Vehicle software – diagnosis, repair, or modification
This proposed class would allow circumvention of TPMs [technological protection measures] protecting computer programs that control the functioning of a motorized land vehicle, including personal automobiles, commercial motor vehicles, and agricultural machinery, for purposes of lawful diagnosis and repair, or aftermarket personalization, modification, or other improvement. Under the exemption as proposed, circumvention would be allowed when undertaken by or on behalf of the lawful owner of the vehicle.

Note the phrase “for purposes of lawful diagnosis and repair”…

I also note a related class:

22. Proposed Class 22: Vehicle software – security and safety research
This proposed class would allow circumvention of TPMs protecting computer programs that control the functioning of a motorized land vehicle for the purpose of researching the security or safety of such vehicles. Under the exemption as proposed, circumvention would be allowed when undertaken by or on behalf of the lawful owner of the vehicle.

(and in passing note Proposed Class 27: Software – networked medical devices…).

Looking at some of the supporting documents, it’s interesting to see how the lobby moved. For example, from the Senior Director of Environmental Affairs for the Alliance of Automobile Manufacturers:

The proponents state that an exemption is needed for three activities related to vehicles – diagnosis, repair, and modification. In my limited time, I will explain why, for the first two activities – diagnosis and repair – there is no need to circumvent access controls on Electronic Control Units (ECUs). Then, I will address why tampering with ECUs to “modify” vehicle performance undermines national regulatory goals for clean air, fuel efficiency, and auto safety, and why the Copyright Office should care about that.

1. Diagnosis/repair
The arguments put forward by the proponents of this exemption are unfounded. State and federal regulations, combined with the Right to Repair MOU and the 2002 “Dorgan letter,” guarantee all independent repair shops and individual consumers access to all the information and tools needed to diagnose and repair Model Year 1996 or newer cars. This information and these tools are already accessible online, through a thriving and competitive aftermarket. Every piece of information and every tool used to diagnose and repair vehicles at franchised dealers is available to every consumer and every independent repair shop in America. This has been the case for the past 12 years. Moreover, all of these regulations and agreements require automakers to provide the information and tools at a “fair and reasonable price.” No one in the last 12 years has disputed this fact, in any of the various avenues for review provided, including U.S. EPA, the California Air Resources Board, and joint manufacturer-aftermarket organizations.

There is absolutely no need to hack through technological protection measures and copy ECU software to diagnose and repair vehicles.

2. Modification
The regulations and agreements discussed above do not apply to information needed to “modify” engine and vehicle software. We strongly support a competitive marketplace in the tools and information people need so their cars continue to perform as designed, in compliance with all regulatory requirements. But helping people take their cars out of compliance with those requirements is something we certainly do not want to encourage. That, in essence, is what proponents of exemption #21 are calling for, in asserting a right to hack into vehicle software for purposes of “modification.” In the design and operation of ECUs in today’s automobiles, manufacturers must achieve a delicate balance among many competing regulatory demands, notably emissions (air pollution); fuel economy; and of course, vehicle safety. If the calibrations are out of balance, the car may be taken out of compliance. This is so likely to occur with many of the modifications that the proponents want to make that you could almost say that noncompliance is their goal, or at least an inevitable side effect.

Manufacturer John Deere suggested that:

1. The purpose and character of the use frustrate compliance with federal public safety and environmental regulations

The first fair use factor weighs against a finding of fair use because the purpose and character of the use will encourage non-compliance with environmental regulations and will interfere with the ability of manufacturers to identify and resolve software problems, conduct recalls, review warranty claims, and provide software upgrade versions.

And General Motors seem to take a similar line:

TPMs also ensure that vehicles meet federally mandated safety and emissions standards. For example, circumvention of certain emissions-oriented TPMs, such as seed/key access control mechanisms, could be a violation of federal law. Notably, the Clean Air Act (“CAA”) prohibits “tampering” with vehicles or vehicle engines once they have been certified in a certain configuration by the Environmental Protection Agency (“EPA”) for introduction into U.S. commerce. “Tampering” includes “rendering inoperative” integrated design elements to modify vehicle and/or engine performance without complying with emissions regulations. In addition, the Motor Vehicle Safety Act (“MVSA”) prohibits the introduction into U.S. commerce of vehicles that do not comply with the Federal Motor Vehicle Safety Standards, and prohibits manufacturers, dealers, distributors, or motor vehicle repair businesses from knowingly making inoperative any part of a device or element of design installed on or in a motor vehicle in compliance with an applicable motor vehicle standard.14

Further, tampering with these systems would not be obvious to a subsequent owner or driver of a vehicle that has been tampered with. If a vehicle’s airbag systems, including any malfunction indicator lights, have been disabled (whether deliberately or inadvertently), a subsequent vehicle owner’s safety will be in jeopardy without warning. Further, if a vehicle’s emissions systems have been tampered with, a subsequent owner would have no way of knowing this has occurred. For tampering that the subsequent owner eventually discovers, manufacturer warranties do not cover the repair of damage caused by the tampering, placing the repair cost on the subsequent owner. For good cause, federal environmental and safety regulations regarding motor vehicles establish a well-recognized overall policy against allowing tampering with in-vehicle electronic systems designed for safety and emissions control.

While so-called “tinkerers” and enthusiasts may wish to modify their vehicle software for personal needs, granting greater access to vehicle software for purposes of modification fails to consider the overall concerns surrounding regulatory compliance and safety and the overall impact on safety and the environment. … Thus, the current prohibition ensures the distribution of safe and secure vehicle software within an overall vehicle security strategy implemented by car manufacturers that does not restrict vehicle owners’ ability to diagnose, modify or repair their cars.

The arguments from the auto lobby therefore go along the lines of “folk can’t mess with the code because they’ll try to break the law” – as opposed to, say, the manufacturers systematically breaking the law themselves, or folk trying to find out why a car performs nothing like the apparently declared figures. And I’m sure there are no elements of the industry wanting to prevent folk from looking at the code lest they find that it has “test circumvention” code baked into it by the actual manufacturers…

What the VW case throws up, perhaps, is the need for a clear route by which investigators are allowed to check on the compliance behaviour of various algorithms, not just in formal tests but also in everyday road tests that are unannounced to the engine management system.

And that doesn’t necessarily require on-the-road tests in a real vehicle. If the controller is a piece of software acting on digitised sensor inputs to produce a particular set of control algorithm outputs, the controller can be tested on a digital testbench or test harness against various test inputs covering a variety of input conditions captured from real world data logging. This is something I think I need to read up more about… this could be a quick way in to the very basics: National Instruments: Building Flexible, Cost-Effective ECU Test Systems White Paper. Something like this could also be relevant: Gehring, J. and Schütte, H., “A Hardware-in-the-Loop Test Bench for the Validation of Complex ECU Networks”, SAE Technical Paper 2002-01-0801, 2002, doi:10.4271/2002-01-0801 (though the OU Library fails to get me immediate access to this resource…:-(.
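
By way of illustration, here’s a minimal sketch of that replay idea – with entirely made-up function and signal names, since I don’t know what a real ECU test rig exposes – feeding logged drive-cycle sensor frames through the controller code and tallying which calibration it selects:

# Minimal software-in-the-loop sketch: replay logged, real-world sensor inputs
# through the controller under test and count which calibration mode it selects.
# All names here (SensorFrame, ecu_step, the CSV columns) are hypothetical.
import csv
from collections import Counter
from dataclasses import dataclass

@dataclass
class SensorFrame:
    speed_kmh: float
    steering_angle_deg: float
    engine_runtime_s: float
    barometric_pressure_hpa: float

def ecu_step(frame):
    """Stand-in for the controller code or model under test; returns the
    calibration it would select for this frame (eg 'dyno' or 'road')."""
    raise NotImplementedError("wrap the real controller code/model here")

def replay(logfile):
    """Replay a CSV log of real-world driving data through the controller."""
    modes = Counter()
    with open(logfile) as f:
        for row in csv.DictReader(f):
            frame = SensorFrame(
                speed_kmh=float(row["speed_kmh"]),
                steering_angle_deg=float(row["steering_angle_deg"]),
                engine_runtime_s=float(row["engine_runtime_s"]),
                barometric_pressure_hpa=float(row["baro_hpa"]),
            )
            modes[ecu_step(frame)] += 1
    return modes

# A result like Counter({'road': 1403}) on a real-world log, versus
# Counter({'dyno': 118}) on a replayed federal test cycle, would be telling.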

PS In passing, I just spotted this: Auto Parts Distributor Pleads Guilty to Manufacturing and Selling Pirated Mercedes-Benz Software – it seems that Mercedes-Benz distribute “a portable tablet-type computer that contains proprietary software created by [Mercedes-Benz] to diagnose and repair its automobiles and that requires a code or ‘license key’ to access [it]” and that a company had admitted to obtaining “without authorization, … [the] Mercedes-Benz SDS software and updates, modified and duplicated the software, and installed the software on laptop computers (which served as the SDS units)”. So a simple act of software copyright/license infringement, perhaps, relating to offboard testing and diagnostic tools. But another piece in the jigsaw, for example, when it comes to engineering software that can perform diagnostics.

PPS via @mhawksey, a link to the relevant West Virginia University test report – In-use emissions testing of light-duty diesel vehicles in the U.S. – and noting Martin’s observation that there are several references to Volkswagen’s new 2.0 l TDI engine for the most stringent emission standards – Part 2 (reference [31] in the paper: Hadler, J., Rudolph, F., Dorenkamp, R., Kosters, M., Mannigel, D., and Veldten, B., “Volkswagen’s New 2.0l TDI Engine for the Most Stringent Emission Standards – Part 2,” MTZ Worldwide, Vol. 69, June 2008 – which the OU Library at least doesn’t subscribe to:-(…)

“Interestingly” the report concluded:

In summary, real-world NOx emissions were found to exceed the US-EPA Tier2-Bin5 standard (at full useful life) by a factor of 15 to 35 for the LNT equipped Vehicle A, by a factor of 5 to 20 for the urea-SCR fitted Vehicle B (same engine as Vehicle A) and at or below the standard for Vehicle C with exception of rural-up/downhill driving conditions, over five predefined test routes. Generally, distance-specific NOx emissions were observed to be highest for rural-up/downhill and lowest for high-speed highway driving conditions with relatively flat terrain. Interestingly, NOx emissions factors for Vehicles A and B were below the US-EPA Tier2-Bin5 standard for the weighted average over the FTP-75 cycle during chassis dynamometer testing at CARB’s El Monte facility, with 0.022g/km ±0.006g/km (±1σ, 2 repeats) and 0.016g/km ±0.002g/km (±1σ, 3 repeats), respectively.

It also seems that the researchers spotted what might be happening to explain the apparently anomalous results they were getting with help from reference 31: “The probability of this explanation is additionally supported by a detailed description of the after-treatment control strategy for Vehicle A presented elsewhere [31]”.

PPPS I guess one way of following the case might be track the lawsuits, such as this class action complaint against VW filed in California (/via @funnymonkey) or this one, also in San Francisco, or Tennessee, or Georgia, or another district in Georgia, or Santa Barbara, or Illinois, or Virginia, and presumably the list goes on… (I wonder if any of those complaints are actually informative/well-researched in terms of their content? And whether they are all just variations on a similar theme? In which case, they could be an interesting basis for a comparative text analysis?)

In passing, I also note that suits have previously – but still recently – been filed against VW, amongst others, regarding misleading fuel consumption claims, for example in the EU.

Tinkering With Data Consuming WordPress Plugins

Now that I’ve got my own webspace on Reclaim Hosting and set up a handful of my own self-hosted WordPress blogs, I started having a tinker with some custom plugins. I’ve started off with something simple, a couple of shortcode plugins that I can use to embed a simple leaflet map into a post or a page.

[Screenshot: datawire test blog – “Just seeing what the bots can do…”]

So far, I’ve tried a couple of different approaches…

The first one has been pared down to just a default accepting shortcode:

[IWPlanningLeafletMap]

The shortcode calls a bit of code that makes a cached (at most, three times a day) request to a morph.io scraper that returns a list of planning applications to the Isle of Wight Council that are currently open for consultation. (That said, I can pass in a few shortcode parameters relating to the size of the map (which is a fixed size, unfortunately – I haven’t worked out how to make it flexible…) and the default location and zoom.) The scraper itself runs once daily.
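
Sketched in Python rather than the plugin’s PHP (and with a made-up scraper name), the data request the shortcode wraps is little more than a call to the morph.io API – the endpoint pattern here is from memory, so worth checking against the morph.io docs:

# Sketch of the data request behind the shortcode, in Python rather than PHP.
# The scraper path is a made-up placeholder; the morph.io API pattern is from
# memory and may need checking.
import requests

MORPH_API_KEY = "YOUR_MORPH_IO_API_KEY"
SCRAPER = "someuser/iw_planning_applications"  # hypothetical scraper path

def get_open_applications():
    url = "https://api.morph.io/{}/data.json".format(SCRAPER)
    params = {"key": MORPH_API_KEY, "query": "select * from 'data'"}
    r = requests.get(url, params=params)
    r.raise_for_status()
    return r.json()  # one dict per currently open planning application

# The WordPress plugin caches the equivalent response (eg via the Transients
# API) so the scraper endpoint is hit at most three times a day.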

[Screenshot: scraper plugin]

My feeling is that this sort of code would work best in the context of a WordPress page, acting as a destination that allows folk to just check on currently open applications.

The second plugin embeds a map that displays markers for recent house sales (as recorded by the Land Registry prices paid dataset). This dataset is published as a monthly set of data a month or two after the fact and is downloaded to my local desktop. A python script then reads in the data, creates a new WordPress post containing the shortcode with the data baked in, and uploads the post to WordPress (where it appears in the draft queue).


In this shortcode, the marker data is currently encoded using the PHP serialise format (via the python phpserialize dumps method) and embedded in the post as the value of a shortcode attribute.

[MultiMarkerLeafletMap zoom=11 lat=50.675 lon=-1.32 width=800 height=500 markers='a:112:{i:0;a:5:{s:3:"lat";d:50.699382843;s:4:"date";s:10:"2015-07-10";s:3:"lon";d:-1.29297620442;s:8:"location";s:22:"SAVOY COURT, TOWN...]
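
The Python side of that encoding step is little more than a call to phpserialize; something along the following lines, with the marker fields as per the (truncated) example above:

# Sketch: encode marker data into the PHP serialise format used in the
# markers attribute of the shortcode. Field names follow the example above.
from phpserialize import dumps

markers = [
    {"lat": 50.699382843, "lon": -1.29297620442,
     "date": "2015-07-10", "location": "SAVOY COURT, TOWN..."},
]

# phpserialize expects a dict keyed by index to produce a PHP array of arrays
serialised = dumps(dict(enumerate(markers))).decode("utf-8")

shortcode = ("[MultiMarkerLeafletMap zoom=11 lat=50.675 lon=-1.32 "
             "width=800 height=500 markers='{}']").format(serialised)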

In this case, with the marker data baked into the shortcode, there’s a good argument for rendering the map within a timestamped post as a fixed map (at least, ‘fixed’ in the sense that the data is unchanging).

The PHP un/serialize route isn’t ideal because I think it raises security issues? I originally tried to pass the data as serialised JSON, but the data is in the form of a list and it seems the ] breaks things. I guess what I should really do is see if I can pass the data in as serialised JSON between the shortcode tags rather than pass it in as an attribute?

Another approach might be to define just a simple map embed shortcode, and then support additional shortcodes that add markers to the map?

My PHP coding is also a bit scrappy (and I keep forgetting the ;’s…:-(). I think one thing I do need to do is pop the simple plugin code functions into a class to keep them safe, and also patch in a hack or two (as seems to be required?) so that the leaflet map libraries are only loaded into post and page headers for posts/pages that actually contain one of the map shortcodes.

I’m also thinking now I need to find a way to support boundary lines and shape colouring?

[Screenshot: map with boundary lines]

Hmm…

DON’T PANIC – The Latest OU/Hans Rosling Co-Pro Has (Almost) Arrived…

To tie in with the UN’s Sustainable Development Goals summit later this week, the OU teamed up with production company Wingspan Productions and data raconteur Professor Hans Rosling to produce a second “Don’t Panic” lecture performance that airs on BBC2 at 8pm tonight: Don’t Panic – How to End Poverty in 15 Years.

Here’s a trailer…

…and here’s the complementary OpenLearn site: Don’t Panic – How to End Poverty in 15 Years (OpenLearn).

If you saw the previous outing – DON’T PANIC — The Facts About Population – it takes a similar format, once again using the Musion projection system to render “holographic” data visualisations that Hans tells his stories around.

(I did try to suggest that a spinning 3d chart could be quite compelling up on the big screen to illustrate different country trajectories over time, but was told it’d probably be too complicated a graphic for the audience to understand!;-)

Off the back of the previous co-production, the OU commissioned a series of video shorts featuring Hans Rosling that review several ways in which we can make sense of global development using data, and statistics:

One idea for making use of these videos was to incorporate them into a full open course on FutureLearn, but for a variety of (internal?) reasons, that idea was canned. However, some of the material I’d started sketching for that possibility has finally seen the light of day, appearing as a series of OpenLearn lessons relating to the first three short films listed above, with the videos cut into bite size fragments and interspersed throughout a narrative text and embedded interactive data charts:

You might also pick up on some of the activity possibilities that are included too…

Note that those lessons are not quite presented as originally handed over… I was half hoping OpenLearn might have a go at displaying them as “scrollytelling” immersive stories as something of an experiment, but that appears not to be the case (maybe I should have actually published the stories first?! Hmmm…!). Anyway, here’s what I originally drafted, using Storybuilder:

If you have any comments on the charts, or feedback on the immersive story presentation (did it work for you, or did you find it irritating?), please let me know via the comments below.

PS if you are interested in doing a FutureLearn course with a development data feel, at least in part, check out OUr forthcoming FutureLearn MOOC, Learn to Code for Data Analysis. Using interactive IPython notebooks, you’ll learn how to start wrangling and visualising open datasets (including weather data, World Bank indicators data, and UN Comtrade import and export data) using the Python programming language and the pandas data wrangling python package.

Even Though RSS Never Went Away, Could It Be Coming Back as a Facebook Sinker?

Long time readers will know I was – am – a huge fan of RSS and Atom, simple feed based protocols for syndicating content and attachment links, even going so far as to write a manifesto of a sort at one point (We Ignore RSS at OUr Peril).

This blog, and the earlier archived version of it, are full of reports and recipes around various RSS experiments and doodles, although in more recent years I haven’t really been using RSS as a creative medium that much, if at all.

But today I noticed this on the official Facebook developer blog: Publishing Instant Articles Directly From Your Content Management System [Instant Article docs]. Or more specifically, this:

When publishers get started with Instant Articles, they provide an RSS feed of their articles to Facebook, a format that most Content Management Systems already support. Once this RSS feed is set up, Instant Articles automatically loads new stories as soon as they are published to the publisher’s website and apps. Updates and corrections are also automatically captured via the RSS feed so that breaking news remains up to date.

So… Facebook will use RSS to synch content into Facebook from publishers’ CMSs.

Depending on the agreement Facebook has with the publishers, it may require that those feeds are private, rather than public, feeds that sink the content directly into Facebook.

But I wonder, will it also start sinking content from other independent publishers into the Facebook platform via those open feeds, providing even less reason for Facebook users to go elsewhere as it drops bits of content from the open web into closed, personal Facebook News Feeds? Hmmm…

There seems to be another sort of a grab for attention going on too:

Each Instant Article is associated with the URL where the web version is hosted on the publisher’s website. This means that Instant Articles are open and compatible with all of the ways that people share links around the web today:

  • When a friend or page you follow shares a link in your News Feed, we check to see if there is an Instant Article associated with that URL. If so, you will see it as an Instant Article. If not, it will open on the web browser.
  • When you share an Instant Article on Facebook or using email, SMS, or Twitter, you are sharing the link to the publisher website so anyone can open the article no matter what platform they use.

Associating each Instant Article with a URL makes it easy for publishers to adopt Instant Articles without changing their publishing workflows and means that people can read and share articles without thinking about the platform or technology behind the scenes.

Something like this maybe?

[Diagram: fbInstantRSS]

Which is to say, this?

[Diagram: fbInstantRSS2]

Or maybe not. Maybe there is some enlightened self interest in this, and perhaps Facebook will see a reason to start letting its content out via open syndication formats, like RSS.

Or maybe RSS will end up sinking the Facebook platform, by allowing Facebook users to go off the platform but still accept content from it?

Whatever the case, as Facebook becomes a set of social platform companies rather than a single platform company, I wonder: will it have an open standard, feed based syndication bus to help content flow within and around those companies? Even if that content is locked inside the confines of a Facebook-parent-company-as-web attention wall?

PS So the ‘related content’ feature on my WordPress blog associates this post with an earlier one: Is Facebook Stifling the Free Flow of Information?, which it seems was lamenting an earlier decision by Facebook to disable the import of content into Facebook using RSS…?! What goes around, comes around, it seems?!

First Steps in a Conversational Slackbot interface to CQC Inspection Data

A few months ago, I started to have a play with ratings data from the CQC – the Care Quality Commission. I don’t seem to have posted the scraper tools I ended up with anywhere, but I have started playing with them again over the last couple of weeks in the context of my slackbot explorations.

In particular, I’ve started working on a conversational workflow for keeping track of inspections in a particular local area. At the current time, the CQC website only seems to support alerts relating to new reports at the level of a particular location, although it is possible to get faceted search results relating to reports published over the course of the last week or last month. For the local journalist wanting to keep tabs on reports associated with local providers, this means setting up a data beat that includes checking the CQC website on a regular basis, firstly to see whether there are any new reports, and secondly to see what sort of reports they are.

And as a report from OnTheWight today shows (Beacon Health Centre rated ‘Good’ by CQC), this can include good news stories as well as bad news ones.

[Screenshot: OnTheWight – “Beacon Health Centre rated ‘Good’ by CQC”]

So what’s the thing I’ve been exploring in the slack context? A time saver, hopefully. In the first case, it provides a quick way of checking up on reports from the local area released over the previous week or previous month:

To begin with, we can ask for a summary report of recent inspections:

[Screenshot: Slack – summary of recent CQC inspection reports]

The report does a bit of counting – to provide some element of context – and provides a highlights statement regarding the overall result from each of the reports. (At the moment, I don’t sort the order in which reports are presented. There are opportunities here for prioritising which results to show. I also need to think about whether results should be provided as a single slackbot response, as is currently the case, or using a separate (and hence, “star-able”) response for each report.)
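
Under the hood there really isn’t much to it; the summary is essentially a count plus a line per report, along the lines of this sketch (the report fields are placeholders for whatever the scraper actually returns):

# Sketch: build the Slack summary from a list of scraped CQC report records.
# The dict keys are placeholders for whatever the scraper actually returns.
from collections import Counter

def summarise(reports, period="the last week"):
    counts = Counter(r["overall_rating"] for r in reports)
    lines = ["{} inspection report(s) published over {}:".format(len(reports), period)]
    lines += ["- {} ({}): rated {}".format(r["location_name"], r["location_id"],
                                           r["overall_rating"]) for r in reports]
    lines.append("Overall: " + ", ".join("{} {}".format(n, k)
                                         for k, n in counts.items()))
    return "\n".join(lines)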

After briefly skimming over the recent results, we can tunnel down into a slightly more detailed summary of the report by passing in a location ID:

[Screenshot: Slack – more detailed summary of a report for a single location]

As part of the report, this returns a link to the CQC website so we can inspect the full result at source. I’ve also got a scraper that pulls the full reports from the CQC site, but at the moment it’s not returning the result to slack (I think there’s a message size limit which I’m overflowing, so I need to find what that limit is and split up the response to get round it). That said, slack is perhaps not the best place to return long reports? Maybe a better way would be to submit quoted report components into a draft WordPress blog post?
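
(The splitting fix itself should be straightforward enough; a rough sketch, with the actual Slack message limit still to be checked:)

# Sketch: split a long report into chunks below Slack's message size limit,
# posting each chunk as a separate bot message. The 3000 character limit used
# here is an assumption that needs checking.
def chunk_message(text, limit=3000):
    chunks, current = [], ""
    for para in text.split("\n"):
        if current and len(current) + len(para) + 1 > limit:
            chunks.append(current)
            current = para
        else:
            current = para if not current else current + "\n" + para
    if current:
        chunks.append(current)
    return chunks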

We can also pull back administrative information regarding a location from its location ID.

[Screenshot: Slack – administrative information for a location]

This report also includes information about other locations operated by the same provider. (I think I do have a report somewhere that summarises the report ratings over all the locations associated with a given provider, so we can look to see how well other establishments operated by the same provider are rated, but I haven’t wired that into the Slack bot yet.)

There are several other ways we can develop this conversation…

Company number and charity number information is available for some providers, which means it should be trivial to return company registration information and company director information from Companies House or OpenCorporates, and perhaps even data from the Charities Commission.
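
For example, given a company number, a lookup against the OpenCorporates companies endpoint might look something like the following sketch – the endpoint pattern and response shape are quoted from memory, so treat them as assumptions to check against the OpenCorporates API docs:

# Sketch: look up basic company registration details from a company number via
# the OpenCorporates API. The endpoint pattern and response keys are quoted
# from memory and should be checked; an API key may be needed for heavier use.
import requests

def company_lookup(company_number, jurisdiction="gb"):
    url = "https://api.opencorporates.com/companies/{}/{}".format(
        jurisdiction, company_number)
    r = requests.get(url)
    r.raise_for_status()
    company = r.json()["results"]["company"]
    return {"name": company["name"],
            "status": company["current_status"],
            "incorporated": company["incorporation_date"]}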

Rather more scruffily, perhaps, we could use location name and postcode to try a search on the Food Standards Agency website to see if we can find food ratings data for establishments of interest.

There might also be opportunities for linking in items from local spending data to try to capture local authority spend with a particular location or provider. This would be simplified if council payments to CQC rated establishments or providers included the CQC location or provider ID, but that’s perhaps too much to ask for.

If nothing else, however, this demonstrates a casual conversational way in which a local journalist might be able to use slack as part of a local data beat to run regular, periodic checks over recent reports published by the CQC relating to local care establishments.

Some Idle Thoughts on Managing Temporal Posts in WordPress

Now that I’ve got a couple of my own WordPress blogs running off the back of my Reclaim Hosting account, I’ve started to look again at possible ways of tinkering with WordPress.

The first thing I had a look at was posting a draft WordPress post from a script.

Using a WordPress role editor plugin (e.g. along the lines of this User Role Editor) it’s easy enough to create a new role with edit and upload permissions only [WordPress roles and capabilities], and create a new ‘autoposter’ user with that role. Code like the following then makes it easy enough to upload an image to WordPress, grab the URL, insert it into a post, and then submit the post – where it will, by default, appear as a draft post:

#Ish Via: http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
from wordpress_xmlrpc.methods.posts import NewPost

wp = Client('http://blog.example.org/xmlrpc.php', ACCOUNTNAME, ACCOUNT_PASSWORD)

def wp_simplePost(client,title='ping',content='pong, <em>pong</em>'):
    post = WordPressPost()
    post.title = title
    post.content = content
    response = client.call(NewPost(post))
    return response

def wp_uploadImageFile(client,filename):

    #mimemap
    mimes={'png':'image/png', 'jpg':'image/jpeg'}
    mimetype=mimes[filename.split('.')[-1]]
    
    # prepare metadata
    data = {
            'name': filename,
            'type': mimetype,  # mimetype
    }

    # read the binary file and let the XMLRPC library encode it into base64
    with open(filename, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())

    response = client.call(media.UploadFile(data))
    return response

def quickTest():
    txt = "Hello World"
    txt=txt+'<img src="{}"/><br/>'.format(wp_uploadImageFile(wp,'hello2world.png')['url'])
    #Submit the post - by default it should land in the draft queue
    return wp_simplePost(wp, title='Autopost test', content=txt)

quickTest()

Dabbling with this then got me thinking about the different sorts of things that WordPress allows you to publish in general. It seems to me that there are essentially three main types of thing you can publish:

  1. posts: the timestamped elements that appear in a reverse chronological order in a WordPress blog. Posts can also be tagged and categorised and viewed via a tag or category page. Posts can be ‘persisted’ at the top of the posts page by setting them as a “sticky” post.
  2. pages: static content pages typically used to contain persistent, unchanging content. For example, an “About” page. Pages can also be organised hierarchically, with child subpages defined relative to a specified ‘parent’ page.
  3. sidebar elements and widgets: these can contain static or dynamic content.

(By the by, a range of third party plugins appear to support the conversion of posts to pages, for example Post Type Switcher [untested] or the bulk converter Convert Post Types [untested].)

Within a page or a post, we can also include a shortcode element that can be used to include a small piece of templated text or generated from the execution of some custom code (which it seems could be python: running a python script from a WordPress shortcode). Shortcodes run each time a page is loaded, although you can use the WordPress Transients database API to implement a simple cache for them to improve performance (eg as described here and here).

Within a post, page or widget, we can also embed dynamic content. For example, we could embed a map that displays dynamically created markers that are essentially out of the control of the page or post publisher. Note that by default WordPress strips iframes from content (and it also seems reluctant to allow the upload of html files to the media gallery, at least by default). The preferred way to include custom embedded content seems to be to define a shortcode to embed the required content, although there are plugins around that allow you to embed iframes. (I didn’t spot one that let you inline the content of the iframe using srcdoc though?)

When we put together the Isle of Wight planning applications : Mapped page, one of the issues related to how updates to the map should be posted over time.

[Screenshot: Isle of Wight planning applications: Mapped]

That is, should the map be uploaded to a fixed page and show only the most recent data, should it be posted as a timestamped post, to provide archival copies of the page, or should it be posted to a page and support a timeslider/history function?

Thinking about this again, the distinction seems to rely on what sort of (re)discovery we want to encourage or support. For example, if the page is a destination page, then we should probably use a page with a fixed URL for the most recent map. Older maps could be accessed via archive links, or perhaps subpages, if a time-filter wasn’t available on a single map view. Alternatively, we might want to alert readers to the map, in which case it might make more sense to use a timestamped post. (We could of course use a post to announce an update to the page, perhaps including a screenshot of the latest map in the post.)

It also strikes me that we need to consider publication schedules by a news outlet compared to the publication schedules associated with a particular dataset.

For example, Land Registry House Prices Paid data is published on a monthly basis, a few weeks after the end of the month the data was collected for. In this case, it probably makes sense to publish on a monthly basis.

But what about care home or food outlet inspection data? The CQC publish data as it becomes available, although searches support the retrieval of data for a particular area published over the last week or last month relative to the time the search is made. The Food Standards Agency produce updates to data download files on a daily basis, but the file for any particular area is only updated when it contains new data. (So on any given day, you don’t know which, if any, area files will be updated.)

In this case, it may well be that a news outlet may want to do a couple of things:

  • publish summaries of reports over the last week or last month, on a weekly or monthly schedule – “The CQC published reports for N care homes in the region over the last month, of which X were positive and Y were negative”, etc.
  • engage in a more immediate or responsive publication of stories around particular reports as they are published by the responsible agency. In this case, the journalist needs to find a way of discovering stories in a timely fashion, either through signing up to alerts or inspecting the agency site on a regular basis.

Again, it might be that we can use posts and pages in a complementary way: pages that act as fixed destination sites with a fixed URL, and perhaps links off to archived historical sub-pages, as well as related news stories, that contain the latest summary; and posts that announce timely reports as well as ‘page updated’ announcements when the slower-changing page is updated.

More abstractly, it probably makes sense to consider the relative frequencies with which data is originally published (also considering whether the data is published according to a fixed schedule, or in a more responsive way as and when data becomes available), the frequency with which journalists check the data site, and the frequency with which journalists actually publish data related stories.

Tweetable Bullet Points

Reading through a Pew Research blog post announcing the highlights of a new report just now (Libraries at the Crossroads), I spotted various Twitter icons around the document, not unlike the sort of speech bubble comment icons used in paragraph level commenting systems.

[Screenshot: “Share a link on Twitter” prompts on the pewinternet.org “Libraries at the Crossroads” post]

Nice – built in opportunities for social amplification of particular bullet points:-)

PS thinks: hmm, maybe I could build something similar into auto-generated copy for contexts like this?

PPS via @ned_potter, the suggestion that maybe they’re making use of this WordPress plugin: ClickToTweet. Hmmm… And via @mhawksey, this plugin: Inline Tweet Sharer.

Robot Journalists or Robot Press Secretaries? Why Automated Reporting Doesn’t Need to be That Clever

Automated content generators, aka robot journalists, are turning up everywhere at the moment, it seems: the latest to cross my radar being a mention of “Dreamwriter” from Chinese publisher Tencent (End of the road for journalists? Tencent’s Robot reporter ‘Dreamwriter’ churns out perfect 1,000-word news story – in 60 seconds) to add to the other named narrative language generating bots I’m aware of, Automated Insights’ Wordsmith and Narrative Science’s Quill, for example.

Although I’m not sure of the detail, I assume that all of these platforms make use of quite sophisticated NLG (natural language generation) algorithms, to construct phrases, sentences, paragraphs and stories from atomic facts, identified story points, journalistic tropes and encoded linguistic theories.

One way of trying to unpick the algorithms is to critique, or even try to reverse engineer, stories known to be generated by the automatic content generators, looking for clues as to how they’re put together. See for example this recent BBC News story on Robo-journalism: How a computer describes a sports match.

Chatting to media academic Konstantin Dörr/@kndoerr in advance of the Future of Journalism conference in Cardiff last week (I didn’t attend the conference, just took the opportunity to grab a chat with Konstantin a couple of hours before his presentation on the ethical challenges of algorithmic journalism) I kept coming back to thoughts raised by my presentation at the Community Journalism event the day before [unannotated slides] about the differences between what I’m trying to explore and these rather more hyped-up initiatives.

In the first place, as part of the process, I’m trying to stay true to posting relatively simple – and complete – recipes that describe the work I’m doing so that others can play along. Secondly, in terms of the output, I’m not trying to do the NLG thing. Rather, I’m taking a template based approach – not much more than a form letter mail merge approach – to putting data into a textual form. Thirdly, the audience for the output is not the ultimate reader of a journalistic piece; rather, the intended audience is an intermediary, a journalist or researcher who needs an on-ramp providing them with useable access to data relevant to them that they can then use as the possible basis for a story.

In other words, the space I’m exploring is in-part supporting end-user development / end user programming (for journalist end-users, for example), in part automated or robotic press secretaries (not even robot reporters; see for example Data Reporting, not Data Journalism?) – engines that produce customised press releases from a national dataset at a local level that report a set of facts in a human readable way, perhaps along with supporting assets such as simple charts and very basic observational analysis (this month’s figures were more than last month’s figures, for example).
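
To give a flavour of just how simple that can be, here’s a minimal sketch of the form letter idea – a fixed sentence template filled from a handful of data values, with the only “analysis” being a this-month-versus-last-month comparison (the figures below are invented):

# Sketch of the form-letter / mail-merge approach: a fixed sentence template
# filled from a handful of data values. The figures below are invented.
TEMPLATE = ("The number of people claiming Jobseeker's Allowance in {area} "
            "{direction} by {change} in {month}, from {previous} to {current}.")

def jsa_sentence(area, month, previous, current):
    diff = current - previous
    if diff == 0:
        return ("The number of people claiming Jobseeker's Allowance in {} "
                "was unchanged in {}, at {}.").format(area, month, current)
    return TEMPLATE.format(area=area, month=month, previous=previous,
                           current=current, change=abs(diff),
                           direction="rose" if diff > 0 else "fell")

print(jsa_sentence("the Isle of Wight", "August", 1510, 1465))
# The number of people claiming Jobseeker's Allowance in the Isle of Wight
# fell by 45 in August, from 1510 to 1465.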

This model is one that supports a simple templated approach for a variety of reasons:

  1. each localised report has the same form as any other localised report (eg a report on jobseeker’s allowance figures for the Isle of Wight can take the same form as a report for Milton Keynes);
  2. it doesn’t matter so much if the report reads a little strangely, as long as the facts and claims are correct, because the output is not intended for final publication, as is, to the public – rather, it could be argued that it’s more like a badly written, fact based press statement that at least needs to go through a copy editor! In other words, we can start out scruffy…
  3. the similarity in form of one report to another is not likely to be a distraction to the journalist in the way that it would be to a general public reader presented with several such stories and expecting an interesting – and distinct – narrative in each one. Indeed, the consistent presentation might well aid the journalist in quickly spotting the facts and deciding on a storyline and what contextualisation may be required to add further interpretative value to it.
  4. targeting intermediary users rather than end user: the intermediary users all get to add their own value or style to the piece before the wider publication of the material, or use the data in support of other stories. That is, the final published form is not decided by the operator of the automatic content generator; rather, the automatically generated content is there to be customised, augmented, or used as supporting material, by an intermediary, or simply act as a “conversational” representation of a particular set of data provided to an intermediary.

[Diagram: robot press secretary – automatically generated press releases as an intermediate step between data and journalist]

The generation of the local datasets from the national dataset is trivial – having written code to slice out one dataset (by postcode or local authority, for example), we can slice out any other. The generation of the press releases from the local datasets can make use of the same template. This can be applied locally (a hyperlocal using its own template, for example) or centrally created and managed as part of a datawire service.
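
The slicing step really is as dull as that sounds; assuming a national CSV with a local authority column (the filename and column name here are assumptions), something like:

# Sketch: slice a national dataset into per-local-authority datasets, each of
# which can then be passed through the same report template. The filename and
# column name are assumptions about the shape of the national data file.
import pandas as pd

national = pd.read_csv("national_jsa_figures.csv")

for area, group in national.groupby("local_authority"):
    group.to_csv("jsa_{}.csv".format(area.replace(" ", "_").lower()),
                 index=False)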

At the moment, the couple of automatically generated stories published with OnTheWight have been simple fact reporting, albeit via a human editor, rather than acting as the starting point for a more elaborate, contextualised, narrative report. But how might we extend this approach?

In the case of Jobseeker’s Allowance figures, contextualising paragraphs such as the recent closure of a local business, or the opening of another, as possible contributory factors to any month on month changes to the figures, could add colour or contextualisation to a monthly report.

Or we might invert the use of the figures, adding them as context to workforce, employment or labour related stories. For example, in the event of a company closure, we might contextualise the scale of the job losses relative to the local unemployment figures. (This fact augmented reporting is more likely to happen if the figures are readily available/to hand, as they are via autoresponder channels such as a Slackbot Data Wire.)

But I guess we have to start somewhere! And that somewhere is the simple (automatically produced, human copy edited) reporting of the facts.

PS in passing, I note via Full Fact that the Department of Health “will provide press officers [with an internal ‘data document’] with links to sources for each factual claim made in a speech, as well as contact details for the official or analyst who provided the information”, Department of Health to speed up responses to media and Full Fact. Which gets me thinking: what form might a press office publishing “data supported press releases” take, cf. a University Expert Press Room or Social Media Releases and the University Press Office, for example?

Creating a Google Custom Search Engine Over Hyperlocal Sites

Years ago, I used to spend quite a bit of time playing with Google Custom Search Engines, which allow you to run searches over a specified list of sites, trying to encourage librarians and educators to think about ways in which we might make use of them. I was reminded of this technology yesterday at the Community Journalism conference, so thought it might be worth posting a quick how to about how to set up a CSE, in particular one that searches over the websites of hyperlocals listed on LocalWebList.net. (If you don’t want to see how it’s done, but do want to try it out, here’s my half-hour hack LocalWebList UK hyperlocal CSE.)

One way of creating a CSE is to manually enter the URLs of the sites you want to search over. Another is to use an annotations file that contains the URLs of sites you want to search over. These files can be hosted on your own site, or uploaded to Google (in the latter case, there is a (small) limit on the size of the file you can upload – 30KB).

[Screenshot: “Annotations: Defining Sites to Search” – Custom Search documentation, Google Developers]

The simplest annotations file is a two column (URL and Label) tab separated value file containing one row per site you want to include. Typically, sites are included using a URL pattern – onthewight.com/* for example, to say “index over all the pages on the onthewight.com domain”.
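
So an annotations file ends up looking something along these lines, one tab separated row per site, where the second column is the label that ties the rows to your CSE (the placeholder value here stands in for the label you copy from your own CSE, as described below):

onthewight.com/*	YOUR_CSE_LABEL
example-hyperlocal.org/*	YOUR_CSE_LABEL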

The data file published by the LocalWebList includes a column containing the URL for the homepage for each hyperlocal site listed. We can download the datafile and then open it in the powerful data cleaning tool OpenRefine to inspect it:

[Screenshots: the LocalWebList data file opened in OpenRefine]

If you skim through the URLs, you might notice that several sites have simple URLs (example.com), others are a bit more cluttered (example.com/index2.html), others point to sites like facebook. I’m going to make an arbitrary decision to ignore facebook sites and define patterns based on all the pages in a single domain.

To do that, I’m going to create a new column (url2) in OpenRefine from the URL column, that defines just such a pattern based on the original URL.

[Screenshot: OpenRefine – adding a new column based on the URL column]

The following expression:

value.replace(/https?:\/\/([^\/]*)/,'$1').split('/')[0]+'/*'

uses a regular expression to manage just such a transformation.
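
(If you’d rather do that step in a script than in OpenRefine, the equivalent transformation in Python is just a grab of the domain part of the URL; a rough sketch:)

# Sketch: derive a domain/* CSE pattern from a homepage URL, equivalent to the
# OpenRefine expression above.
from urllib.parse import urlparse

def url_pattern(url):
    netloc = urlparse(url).netloc or url.split("/")[0]
    return netloc + "/*"

url_pattern("http://onthewight.com/about")  # 'onthewight.com/*'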

[Screenshot: OpenRefine – the url2 transformation applied to the URL column]

I can inspect the unique values generated by this transformation by looking at a text facet applied to the new url2 column:

[Screenshot: OpenRefine – text facet on the url2 column]

If you sort by count in the text facet, you will see several of the hyperlocal sites have websites hosted on aboutmyarea, or facebook. (Click on one of those links in the text facet to show the sites associated with those domains.) I am going to discount those links from my CSE, so hover over the link and click on the “include” setting to toggle it to “exclude”. Then click on the “invert” option to show all the sites that aren’t the ones you’ve selected as excluded.

[Screenshot: OpenRefine – excluding the aboutmyarea and facebook domains in the text facet]

This leaves us with sites that are more likely unique:

[Screenshot: OpenRefine – the filtered list of remaining sites]

Having got a filtered list of sites, we can generate an annotations file containing the URL patterns we want to search over and the CSE label. The label identifies to Google which CSE the URLs in the annotations file apply to. We get that label by creating a CSE.

When creating a new CSE, along with giving it a name, you’ll also have to seed it with at least one URL. Simply enter a pattern for a URL you know you want to include in the search engine.

[Screenshot: Google Custom Search – creating a new CSE]

Hit create, and you’ll have a new CSE…

[Screenshot: Google Custom Search – CSE created]

From the “Advanced” tab, go to the CSE annotations area and find the code for your CSE:

[Screenshot: Google Custom Search – CSE annotations area on the Advanced tab]

Now we’re in a position to add the CSE code to our annotations file – so copy the CSE label for your CSE… We can create the annotations file in OpenRefine from the “Export” menu,  where we select “Templating”:

[Screenshot: OpenRefine – Export menu, Templating option]

The templating option allows us to define a custom export template. The template is built up from a header, a row separator, a footer, and a row pattern that describes how to write out each row. I define a simple template as follows, and then export the file.

[Screenshot: OpenRefine – templating export dialog with the annotations file template]

(Note  – there are other ways I could have done this (indeed, there are often “other ways”!). For example, I could have created a new column containing just the CSE label value, and then done a custom table export, selecting the url2 column and label column, along with the TSV output format.)

Export the annotations file and then import it into the CSE – hit the “Add” button in the CSE annotations area.

[Screenshot: Google Custom Search – uploading the annotations file]

Once uploaded (and remember, there is a 30KB file size limit on this route), go back to the Basics tab: you should find that your custom search engine now lists the sites you included in your annotations file as the sites to be searched over, and that you’re also provided with a link to your CSE.

[Screenshot: Google Custom Search – Basics tab listing the sites to be searched]

You can tweak some of the styling for the CSE from the “Look and Feel” menu option in the CSE admin pages sidebar.

[Screenshot: Google Custom Search – Look and Feel settings]

If you now click on your CSE URL you should find you have a minimal Google Custom Search engine that searches over several hundred UK hyperlocal websites.

[Screenshot: the LocalWebList UK hyperlocal custom search engine]

To add in some of the sites we originally excluded, eg the ones on the aboutmyarea domain, we could add specific URL patterns in explicitly via the CSE control panel.

Google Custom search engines can be really quick to set up in a minimal form, but can also be customised further – for example, with tweaks to the ranking algorithm or with custom annotations (see for example Search Engine Powered Courses).

You can also generate lists of URLs from things like homepage links in Twitter bios grabbed from a Twitter list (eg Using Twitter Lists to Define Custom Search Engines – that code appears to have rotted slightly, but I have a fix…Let me know via the comments if you’re interested in generating CSEs from Twitter lists etc).

As I mentioned at the start, it’s been some years since I played with Google Custom Search Engines – I was really hopeful for them at one point, but Google never really seems to give them any love (not necessarily a bad thing – perhaps they are just enough over and under the radar for Google to cut them?), and I couldn’t seem to persuade anyone else (in the OU at least) that they were worth spending any time on.

I think a few librarians did pick up on them though! And if there is interest in the hyperlocal community for seeing what we might do with them, I’d be happy to put my thinking cap back on, work up some more tutorials or use cases, and run training workshops etc etc.