Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most president roles: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
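
The same steps can also be scripted. Here's a minimal sketch using Python's built-in sqlite3 and csv modules, assuming the CSV is UTF-8 encoded and has a header row followed by the ten columns named in the CREATE TABLE statement above. The function names (load_panama, load_panama_csv, top_officers) are made up for illustration; unlike the quick'n'dirty .import route, this version skips the header row.

```python
import csv
import sqlite3

# Column names taken from the CREATE TABLE statement above
COLUMNS = ["company_url", "company_name", "officer_position_es",
           "officer_position_en", "officer_name", "inc_date",
           "dissolved_date", "updated_date", "company_type", "mf_link"]

def load_panama(conn, rows):
    """Create the panama table (if needed) and insert rows of 10 values each."""
    cols = ", ".join("{} TEXT".format(c) for c in COLUMNS)
    conn.execute("CREATE TABLE IF NOT EXISTS panama({})".format(cols))
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany("INSERT INTO panama VALUES ({})".format(placeholders), rows)
    conn.commit()

def load_panama_csv(conn, csv_path):
    """Load the downloaded CSV file, skipping its header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:  # encoding is an assumption
        reader = csv.reader(f)
        next(reader)  # skip the header row
        load_panama(conn, reader)

def top_officers(conn, limit=10):
    """Return (officer_name, count) pairs, most frequently named first."""
    sql = ("SELECT officer_name, COUNT(*) AS c FROM panama "
           "GROUP BY officer_name ORDER BY c DESC LIMIT ?")
    return conn.execute(sql, (limit,)).fetchall()
```

To try it, open a connection with sqlite3.connect('sundayTimesPanamaPapers.sqlite'), call load_panama_csv(conn, 'sunday_times_panama_data.csv') once, and then call top_officers(conn) as often as you like.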

Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

More Recognition/Identification Service APIs – Microsoft Cognitive Services

A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.

Now it seems that Microsoft Cognitive Services (formerly Project Oxford, in part) brings Microsoft’s tools to the party with a range of free tier and paid/metered services:

[Screenshot: Microsoft Cognitive Services]

So what’s on offer?

Vision

  • Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
  • Emotion API: extract emotion features from a photo of a person; 30,000 free transactions per month;
  • Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
  • Video API: 300 free transactions per month per feature.

Speech

Language

  • Bing Spell Check API: 5,000 free transactions per month
  • Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
  • Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, part-of-speech tagging, etc.) It’s dog slow and, from the times I got it to sort of work, this seems to be about the limit of what it can cope with (and even then it takes forever):
    [Screenshot: Linguistic Analysis API example]
    5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
  • Text Analytics API: sentiment analysis, topic detection and key phrase detection, language detection; 5,000 free transactions;
  • Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.

Knowledge

Search

There’s also a gallery of demo apps built around the APIs.

It seems then that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they will be developed and continue to improve will be the proof of just how useful they will be as utility services.

As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired in to the experiment flows in the Azure Machine Learning studio?)

I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)

PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!

PPS Nov 2016 for photo-tagging, see also Amazon Rekognition.

A New Role for the Library – Gonzo Librarian Informationista

At the OU’s Future of Academic Libraries a couple of weeks ago, Sheila Corrall introduced a term and newly(?!) emerging role I hadn’t heard before coming out of the medical/health library area: informationist (bleurghh..).

According to a recent job ad (h/t Lorcan Dempsey):

The Nursing Informationist cultivates partnerships between the Biomedical Library and UCLA Nursing community by providing a broad range of information services, including in-depth reference and consultation service, instruction, collection development, and outreach.

Hmm… sounds just like a librarian to me?

Writing in the Journal of the Medical Library Association, The librarian as research informationist: a case study (101(4): 298–302, October 2013), Lisa Federer described the role in the following terms:

“The term “informationist” was first coined in 2000 to describe what the authors considered a new health sciences profession that combined expertise in library and information studies with subject matter expertise… Though a single model of informationist services has not been clearly defined, most descriptions of the informationist role assume that (1) informationists are “embedded” at the site where patrons conduct their work or need access to information, such as in a hospital, clinic, or research laboratory; and (2) informationists have academic training or specialized knowledge of their patrons’ fields of practice or research.”

Federer started to tighten up the definition in relation to research in particular:

Whereas traditional library services have generally focused on the “last mile” or finished product of the research process—the peer-reviewed literature—librarians have expertise that can help researchers create better research output in the form of more useful data. … The need for better research data management has given rise to a new role for librarians: the “research informationist.” Research informationists work with research teams at each step of the research process, from project inception and grant seeking to final publication, providing expert guidance on data management and preservation, bibliometric analysis, expert searching, compliance with grant funder policies regarding data management and open access, and other information-related areas.

This view is perhaps shared in a presentation on The Informationist: Pushing the Boundaries by Director of Library Services, Elaine Martin, dated on Slideshare as October 2013:

[Slide from The Informationist: Pushing the Boundaries]

Associated with the role are some competencies you might not normally expect from a library staffer:

[Slide from The Informationist: Pushing the Boundaries]

So – maybe here is the inkling of the idea that there could be a role for librarians skilled in working with information technologies in a more techie way than you might normally expect. (You’d normally expect a librarian to be able to use Boolean search, search limits and advanced search forms. You might not expect them to write their own custom SQL queries, or even build and populate their own databases that they can then query? But perhaps you’d expect a really techie informationist to?) And maybe also the idea that the informationist is a participant in a teaching or research activity?

The embedded nature of the informationist also makes me think of gonzo journalism, a participatory style of narrative journalism written from a first person perspective, often including the reporter as part of the story. Hunter S. Thompson is often held up as some sort of benchmark character for this style of writing, and I’d probably class Louis Theroux as a latter-day exemplar. The reporter as naif participant in which the journalist acts as a proxy for everyman’s – which is to say, our own – direct experience of the reported situation, is also in the gonzo style (see for example Feats of gonzo journalism have lost their lustre since George Plimpton’s pioneering days as a universal amateur).

So I’m wondering: isn’t the informationist actually a gonzo librarian, joining in with some activity and bringing the skills of a librarian, or wider information scientist (or information technologist/technician) to the party…?

Another term introduced by Sheila Corrall and again, new to me, was “blended librarian”. According to Steven J. Bell and John Shank writing on The blended librarian in College and Research Libraries News, July/August 2004, pp 372-375:

We define the “blended librarian” as an academic librarian who combines the traditional skill set of librarianship with the information technologist’s hardware/software skills, and the instructional or educational designer’s ability to apply technology appropriately in the teaching-learning process.

The focus of that paper was in part on defining a new role “in which the skills and knowledge of instructional design are wedded to our existing library and information technology skills”, but that doesn’t quite hit the spot for me. The paper also described six principles of blended librarianship, which are repeated on the LIS Wiki:

  1. Taking a leadership position as campus innovators and change agents is critical to the success of delivering library services in today’s “information society”.
  2. Committing to developing campus-wide information literacy initiatives on our campuses in order to facilitate our ongoing involvement in the teaching and learning process.
  3. Designing instructional and educational programs and classes to assist patrons in using library services and learning information literacy that is absolutely essential to gaining the necessary skills (trade) and knowledge (profession) for lifelong success.
  4. Collaborating and engaging in dialogue with instructional technologists and designers which is vital to the development of programs, services and resources needed to facilitate the instructional mission of academic libraries.
  5. Implementing adaptive, creative, proactive, and innovative change in library instruction can be enhanced by communicating and collaborating with newly created Instructional Technology/Design librarians and existing instructional designers and technologists.
  6. Transforming our relationship with faculty to emphasize our ability to assist them with integrating information technology and library resources into courses, but adding to that traditional role a new capacity to collaborate on enhancing student learning and outcome assessment in the area of information access, retrieval and integration.

Again, the emphasis on being able to work with current forms of instructional technology falls short of the mark for me.

But there is perhaps a glimmer of light in the principle associated with “assist[ing faculty] with integrating information technology and library resources into courses“, if we broaden that principle to include researchers as well as teachers, and then add in the idea that the informationist should also be helping explore, evaluate, advocate and teach on how to use emerging information technologies (including technologies associated with information and data processing, analysis and communication (that is, presentation; so things like data visualisation)).

So I propose a new take on the informationist, adopting the term proposed in a second take tweet from Lorcan Dempsey: the informationista (which is far more playful, if nothing else, than informationist).

The informationista is someone like me, who tries to share contemporary information skills (such as these), through participatory as well as teaching activities, blending techie skills with a library attitude. The informationista is also a hopeful and enthusiastic amateur (in the professional sense…) who explores ways in which new and emerging skills and technologies may be applied to the current situation.

At last, I have found my calling!;-)

See also: Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library (this has dated a bit but there is still quite a bit that can be retrieved from that sort of take, I think…)

PS see also notes on embedded librarians in the comments below.

Anatomy for First Year Computing and IT Students

For the two new first year computing and IT courses in production (due out October 2017), I’ve been given the newly created slacker role of “course visionary” (or something like that?!). My original hope for this was that I might be able to chip in some ideas about current trends and possibilities for developing our technology enhanced learning that would have some legs when the courses start in October 2017, and remain viable for the several years of course presentation, but I suspect the reality will be something different…

However it turns out, I thought that one of the things I’d use fragments of the time for would be to explore different possible warp threads through the courses. For example, one thread might be to take a “View Source” stance towards various technologies that would show students something of the anatomy of the computing related stuff that populates our daily lives. This is very much in the spirit of the Relevant Knowledge short courses we used to run, where one of the motivating ideas was to help learners make sense of the technological world around them. (Relevant Knowledge courses typically also tried to explore the social, political and economic context of the technology under consideration.)

So as a quick starter for ten, here are some of the things that could be explored in a tech anatomy strand.

The Anatomy of a URL

Learning to read a URL is a really handy skill to have for several reasons. First, it lets you hack the URL directly to find resources, rather than having to navigate or search the website through its HTML UI. Second, it can make you a better web searcher: some understanding of URL structure allows you to make more effective use of advanced search limits (such as site:, inurl:, filetype:, and so on). Third, it can give you clues as to how the backend works, or what backend is in place: if you can recognise a WordPress installation as such, you can use knowledge about how the URLs are put together to interact with the installation more knowledgeably. For example, add ?feed=rss2&withoutcomments=1 to the end of a WordPress blog URL (such as this one) and you’ll get a single item RSS version of the page content.
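
To make the anatomy concrete, here's a small sketch using Python's standard urllib.parse module to pull a URL apart and to bolt extra query parameters onto it. The helper names url_anatomy and add_query are my own inventions for illustration, and example.com in the note below is just a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def url_anatomy(url):
    """Split a URL into its named parts."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,           # e.g. https
        "host": parts.netloc,             # the domain (and port, if any)
        "path": parts.path,               # where on the site
        "query": parse_qs(parts.query),   # ?key=value&... as a dict of lists
        "fragment": parts.fragment,       # the bit after the #
    }

def add_query(url, **params):
    """Return the URL with extra query parameters set."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    for key, value in params.items():
        query[key] = [str(value)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

For example, add_query('https://example.com/2016/03/a-post/', feed='rss2', withoutcomments=1) builds the sort of WordPress feed URL described above.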

The Anatomy of a Web Page

The web was built on the principle of View Source, with built-in support still respected by today’s desktop web browsers at least, that lets you view the HTML, Javascript and CSS source that makes a web page what it is. Browsers also tend to come replete with developer tools that let you explore how the page works in even more detail. For example, I use Chrome developer tools frequently to look up particular elements in a web page when I’m building a scraper:

[Screenshot: Chrome developer tools inspecting an element on the Isle of Wight Council planning application search page]

(If you click the mobile phone icon, you can see what the page looks like on a selected class of mobile device.)

I also often look at the resources that have been loaded into the page:

[Screenshot: Chrome developer tools listing the resources loaded into the same page]

Again, additional tools allow you to set the bandwidth rate (so you can feel how the page loads on a slower network connection) as well as recording a series of screenshots that show what the page looks like at various stages of its loading.

The Anatomy of a Tweet

As well as looking at how something like a tweet is rendered in a webpage, it can also be instructive to see how a tweet is represented in machine terms by looking at what gets returned if you request the resource from the Twitter API. So for example, below is just part of what comes back when I ask the Twitter API for a single tweet:

[Screenshot: part of the Twitter API response for a single tweet]

You’ll see there’s quite a lot more information in there than just the tweet, including sender information.

The Anatomy of an Email Message

How does an email message get from the sender to the receiver? One thing you can do is to View Source on the header:

[Screenshot: message source view of an email header in Outlook]

Again, part of the reason for looking at the actual email “data” is so you can see what your email client is revealing to you, and what it’s hiding…
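
As a rough sketch of how you might read those headers programmatically, Python's standard email package will parse a raw message for you. The SAMPLE message below is entirely made up (the example.com and example.org hosts are placeholders), and trace_route is a hypothetical helper name. Each mail server that handles a message adds its own Received header at the top, so reading them in reverse gives the route in travel order.

```python
from email import message_from_string

# Made-up sample message for illustration only
SAMPLE = """\
Received: from mx.example.org by inbox.example.org; Wed, 16 Mar 2016 09:04:04 +0000
Received: from laptop.example.com by mx.example.org; Wed, 16 Mar 2016 09:04:02 +0000
From: sender@example.com
To: reader@example.org
Subject: Hello

Body text.
"""

def trace_route(raw_message):
    """Return the Received headers in the order the message travelled."""
    msg = message_from_string(raw_message)
    hops = msg.get_all("Received") or []
    # Relays prepend their header, so the last one listed is the first hop
    return [" ".join(hop.split()) for hop in reversed(hops)]
```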

The Anatomy of a Powerpoint File

Filetypes like .xlsx (Microsoft Excel file), .docx (Microsoft Word file) and .pptx (Microsoft Powerpoint file) are actually compressed zip files. Change the suffix (eg from pptx to zip) and you can unzip it:

[Screenshot: contents of an unzipped pptx file]

Once you’re inside, you have access to individual image files, or other media resources, that are included in the document, as well as the rest of the “source” material for the document.
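
You don't even need to rename the file if you're happy to use a little code: Python's standard zipfile module will open a .pptx, .docx or .xlsx file directly. A minimal sketch (list_members is a name I've made up; in a pptx file the embedded media typically live under ppt/media/):

```python
import zipfile

def list_members(office_file, prefix=""):
    """List the files packed inside an Office document.

    office_file can be a filename or an open binary file object;
    prefix optionally limits the listing to one internal folder.
    """
    with zipfile.ZipFile(office_file) as z:
        return sorted(name for name in z.namelist() if name.startswith(prefix))
```

So list_members('slides.pptx', 'ppt/media/') would list the images and other media in a (hypothetical) slides.pptx file.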

The Anatomy of an Image File

Image files are packed with metadata, as this peek inside a photo on John Naughton’s blog shows:

[Screenshot: EXIF metadata from the photo]

We can also poke around with the actual image data, filtering the image in a variety of ways, changing the compression rate, and so on. We can even edit the image data directly…

Summary

Showing people how to poke around inside a resource has several benefits: it gives you a strategy for exploring your curiosity about what makes a particular resource work (and perhaps also demonstrates that you can be curious about such things); it shows you how to start looking inside a resource (how to go about dissecting it by doing the “View Source” thing); and it shows you how to start reading the entrails of the thing.

In so doing, it helps foster a sense of curiosity about how stuff works, as well as helping develop some of the skills that allow you to actually take things apart (and maybe put them back together again!) The detail also hooks you into the wider systemic considerations – why does a file need to record this or that field, for example, and how does the rest of the system make use of that information? (As MPs have recently been debating the Investigatory Powers Bill, I wonder how many of them have any clue about what sort of information can be gleaned from communications (meta)data, let alone what it looks like and how communications systems generate, collect and use it.)

PS Hmmm, thinks.. this could perhaps make sense as a series of OpenLearn posts?

Tagging Twitter Users From Their Twitter Bios Using OpenCalais and Alchemy API

Prompted by a request at the end of last week for some Twitter data I no longer have to hand, I revisited an old notebook script to try to tidy up some of my Twitter data grabbin’n’mungin’ scripts and have a quick play with some new toys, such as the pyLDAvis [demo] tool for analysing topic models in a notebook setting. (I generated my test models using gensimPy3, a Python 3 port of gensim, which all seemed to work okay…More on that in a later post, maybe…)

I also plugged in some entity extracting API calls for IBM’s Alchemy API and Thomson Reuters’ OpenCalais API. Both of these services provide a JSON response – and both are rate limited – so I’m caching responses for a while in a couple of MongoDB collections (one collection per service).

Here’s an example of the sort of response we can get from a call to Open Calais:

[{'_id': 213808950,
  'description': 'Daily Telegraph sub-editor, Newcastle United follower and future England cricketer',
  'reuters_entities': [{'text': 'follower and future England cricketer',
    'type': 'Position'},
   {'text': 'Daily Telegraph', 'type': 'Company'},
   {'text': 'sub-editor', 'type': 'Position'},
   {'text': 'Daily Telegraph', 'type': 'PublishedMedium'}],
  'screen_name': 'laurieallsopp'}]

Looking at that data, which I retrieved from my MongoDB (the reuters_entities are the bits I got back from the OpenCalais API), I wondered how I could query the database to pull back just the Position info, or just bios associated with a PublishedMedium.

It turns out that the $elemMatch operator was the one I needed to allow what is essentially a wildcarded search over the elements of an array (it can be nested if you need to search deeper…):

load_from_mongo('twitter','calaisdata',
                criteria={'reuters_entities':{'$elemMatch':{'type':'PublishedMedium'}}},
                projection={'screen_name':1, 'description':1,'reuters_entities.text':1,
                            'reuters_entities':{'$elemMatch':{'type':'PublishedMedium'}}})

In that example, the criteria definition limits returned records to those containing at least one entity of type PublishedMedium, and the projection is used to return the first such matching element.
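
If you don't have a MongoDB instance to hand, the following plain Python sketch imitates what that sort of query and projection does, using the example record shown earlier. It's only an emulation for getting a feel for the behaviour, not how MongoDB implements it, and elem_match is a made-up helper name.

```python
def elem_match(docs, array_field, **conditions):
    """Imitate an $elemMatch query plus $elemMatch projection on a list of dicts."""
    results = []
    for doc in docs:
        # Elements of the array field that satisfy every condition
        matches = [el for el in doc.get(array_field, [])
                   if all(el.get(k) == v for k, v in conditions.items())]
        if matches:
            # Keep the other fields, but only the first matching array element
            projected = {k: v for k, v in doc.items() if k != array_field}
            projected[array_field] = matches[:1]
            results.append(projected)
    return results

# The example record from the OpenCalais response above
record = {'_id': 213808950,
          'description': 'Daily Telegraph sub-editor, Newcastle United follower and future England cricketer',
          'reuters_entities': [{'text': 'follower and future England cricketer', 'type': 'Position'},
                               {'text': 'Daily Telegraph', 'type': 'Company'},
                               {'text': 'sub-editor', 'type': 'Position'},
                               {'text': 'Daily Telegraph', 'type': 'PublishedMedium'}],
          'screen_name': 'laurieallsopp'}
```

Calling elem_match([record], 'reuters_entities', type='PublishedMedium') returns the record with just the single PublishedMedium entity attached, whereas asking for type='Position' returns only the first of the two Position entities.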

I can also run queries on job titles, or roles, as this example using grabbed AlchemyAPI data shows:

load_from_mongo('twitter','alchemydata',
                criteria={'ibm_entities':{'$elemMatch':{'type':'JobTitle'}}},
                projection={'screen_name':1, 'description':1,'ibm_entities.text':1,
                            'ibm_entities':{'$elemMatch':{'type':'JobTitle'}}})

And so it follows that I could try to search for folk tagged as an editor (or variant thereof: editorial director, for example), modifying the query to additionally perform a case-insensitive search (I’m using pymongo to query the database):

criteria={'ibm_entities':{'$elemMatch':{'type':'JobTitle',
                                        'text':{ '$regex' : 'editor', '$options':'i' }}}}

For a case insensitive but otherwise exact search, use an expression of the form "^editor$" to force the search on the tag to match at the start (^) and end ($) of the string…
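
The difference between the two sorts of pattern is easy to check with Python's re module. (MongoDB uses a different regular expression engine, but for simple patterns like these I'd expect the behaviour to be the same; matches_title is just an illustrative helper name.)

```python
import re

def matches_title(title, pattern):
    """Case-insensitive regular expression search over a job title."""
    return re.search(pattern, title, re.IGNORECASE) is not None

# 'editor' matches anywhere in the string...
print(matches_title('Editorial Director', 'editor'))
# ...whereas '^editor$' only matches if the whole string is 'editor' (in any case)
print(matches_title('Editorial Director', '^editor$'))
print(matches_title('EDITOR', '^editor$'))
```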

I’m not sure if such use of the entity data complies with the license terms though!

And of course, it would probably be much easier to just look up whether the description contains a particular word or phrase!

Using Google to Look Up Where You Live via the Physical Location of Your Wifi Router

During a course team meeting today, I idly mentioned that we should be able to run a simple browser based activity involving the geolocation of a student’s computer based on Google knowing the location of their wifi router. I was challenged about the possibility of this, so I did a quick bit of searching to see if there was an easy way of looking up the MAC addresses (BSSID) of wifi access points that were in range, but not connected to:

[Screenshot: Google search for “show wifi access point mac address chrome os x”]

which turned up:

The airport command with '-s' or '-I' options is useful: /System/Library/PrivateFrameworks/Apple80211.framework/Resources/airport

(On Windows, the equivalent is maybe something like netsh wlan show network mode=bssid ??? And then call it via python.)

The second part of the jigsaw was to try to find a way of looking up a location from a wifi access point MAC address – it seems that the Google geolocation API does that out of the can:

[Screenshot: Google Maps Geolocation API documentation]

An example of how to make a call is also provided, as long as you have an API key… So I got a key and gave it a go:

[Screenshot: the response to a test call to the geolocation API]

:-)

Looking at the structure of the example Google calls, you can enter several wifi MAC addresses, along with signal strength, and the API will presumably triangulate based on that information to give a more precise location.

The geolocation API also finds locations from cell tower IDs.

So back to the idea of a simple student activity to sniff out the MAC addresses of wifi routers their computer can see from the workplace or home, and then look up the location using the Google geolocation API and pop it on a map.

Which is actually the sort of thing your browser will do when you turn on geolocation services:

[Screenshot: Mozilla’s “Geolocation in Firefox” page]

But maybe when you run the commands yourself, it feels a little bit more creepy?

PS sort of very loosely related, eg in terms of trying to map spaces from signals in the surrounding aether, a technique for trying to map the insides of a room based on its audio signature in response to a click of the fingers: http://www.r-bloggers.com/intro-to-sound-analysis-with-r/

PPS here’s a start to a Python script to grab the MAC addresses and do the geolocation calls

import re
import subprocess
import sys

import requests

print(sys.platform)

#Path to the airport utility on a Mac
AIRPORT="/System/Library/PrivateFrameworks/Apple80211.framework/Resources/airport"

#Each row of airport -s output looks like: SSID BSSID RSSI CHANNEL ...
#The SSID may contain spaces, so anchor on the MAC address pattern rather than splitting on spaces
ROW=re.compile(r'^(.*?)\s*((?:[0-9A-Fa-f]{1,2}:){5}[0-9A-Fa-f]{1,2})\s+(-?\d+)')

def getWifiMacAddresses():
    #autodetect platform and then report based on this?
    results = subprocess.check_output([AIRPORT, "-s"])

    #win?
    #results = subprocess.check_output(["netsh", "wlan", "show", "network"])

    #linux?
    #! apt-get -y install wireless-tools
    #results = subprocess.check_output(["iwlist","scanning"])

    results = results.decode("utf-8") # needed in python 3
    ls = results.split("\n")
    ls = ls[1:] #drop the header row
    macAddr={}
    for l in [x.strip() for x in ls if x.strip()!='']:
        m=ROW.match(l)
        if m:
            #key on the SSID; the value is (BSSID, RSSI)
            macAddr[m.group(1)]=(m.group(2), m.group(3))
    return macAddr


#For Mac:
postjson={'wifiAccessPoints':[]}
hotspots=getWifiMacAddresses()
for h in hotspots:
    addr,db=hotspots[h]
    postjson['wifiAccessPoints'].append({'macAddress':addr, 'signalStrength':int(db)})

#Put your own Google API key here
googleMapsAPIkey='YOUR_API_KEY'
url='https://www.googleapis.com/geolocation/v1/geolocate?key={}'.format(googleMapsAPIkey)

r = requests.post(url, json=postjson)
r.json()

Grabbing Screenshots of folium Produced Choropleth Leaflet Maps from Python Code Using Selenium

I had a quick play with the latest updates to the folium python package today, generating a few choropleth maps around some of today’s Gov.UK data releases.

The problem I had was that folium generates an interactive Leaflet map as an HTML5 document (eg something like an interactive Google map), but I wanted a static image of it – a png file. So here’s a quick recipe showing how I did that, using a python function to automatically capture a screengrab of the map…

First up, a simple function to get a rough centre for the extent of some boundaries in a geoJSON boundaryfile containing the boundaries for LSOAs in the Isle of Wight:

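#GeoJSON from https://github.com/martinjc/UK-GeoJson
import fiona

#Path to the LSOA boundary file for the Isle of Wight
GEO_JSON='../IWgeodata/lsoa_by_lad/E06000046.json'
fi=fiona.open(GEO_JSON)
#bounds is (min lon, min lat, max lon, max lat)
bounds=fi.bounds

centre_lon, centre_lat=((bounds[0]+bounds[2])/2,(bounds[1]+bounds[3])/2)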
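
If fiona isn't to hand, the same rough centre can be found with nothing but the standard library. The following is a sketch that assumes the boundary file is a GeoJSON FeatureCollection whose geometries all have a coordinates member (Polygons, MultiPolygons and so on, with longitude first as GeoJSON specifies); the function names are made up for illustration.

```python
def iter_points(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinate lists."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [lon, lat] (possibly followed by an elevation)
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_points(part)

def geojson_centre(geojson):
    """Return (centre_lon, centre_lat) of the bounding box of all the features."""
    lons, lats = [], []
    for feature in geojson["features"]:
        for lon, lat in iter_points(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return (min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2
```

Load the boundary file with json.load() and pass the resulting dict to geojson_centre() to get the centre_lon, centre_lat pair.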

Now we can get some data – I’m going to use the average travel time to a GP from today’s Journey times to key services by lower super output area data release and limit it to the Isle of Wight data.

import pandas as pd

gp='https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/485260/jts0505.xls'
xl=pd.ExcelFile(gp)
#xl.sheet_names

tname='JTS0505'
dbd=xl.parse(tname,skiprows=6)

iw=dbd[dbd['LA_Code']=='E06000046']

The next thing to do is generate the map – folium makes this quite easy to do: all I need to do is point to the geoJSON file (geo_path), declare where to find the labels I’m using to identify each shape in that file (key_on), include my pandas dataframe (data), and state which columns include the shape/area identifiers and the values I want to visualise (columns=[ID_COL, VAL_COL]).

import folium
m = folium.Map([centre_lat,centre_lon], zoom_start=11)

#NB more recent folium releases do this with folium.Choropleth(geo_data=...).add_to(m) instead
m.choropleth(
    geo_path='../IWgeodata/lsoa_by_lad/E06000046.json',
    data=iw,
    columns=['LSOA_code', 'GPPTt'],
    key_on='feature.properties.LSOA11CD',
    fill_color='PuBuGn', fill_opacity=1.0
    )
m

The map object is included in the variable m. If I save the map file, I can then use the selenium testing package to open a browser window that displays the map, generate a screen grab of it and save the image, and then close the browser. Note that I found I had to add in a slight delay because the map tiles occasionally took some time to load.

#!pip install selenium

import os
import time
from selenium import webdriver

delay=5

#Save the map as an HTML file
fn='testmap.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
m.save(fn)

#Open a browser window...
browser = webdriver.Firefox()
#..that displays the map...
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
#Grab the screenshot
browser.save_screenshot('map.png')
#Close the browser
browser.quit()

Here’s the image of the map that was captured:

[Image: the captured screenshot of the choropleth map]

I can now upload the image to WordPress and include it in an automatically produced blog post:-)

PS before I hit on the Selenium route, I dabbled with a less useful, but perhaps still handy library for taking screenshots: pyscreenshot.

#!pip install pyscreenshot
import pyscreenshot as ImageGrab

im=ImageGrab.grab(bbox=(157,200,1154,800)) # X1,Y1,X2,Y2
#To grab the whole screen, omit the bbox parameter

#im.show()
im.save('screenAreaGrab.png',format='png')

The downside was I had to find the co-ordinates of the area of the screen I wanted to grab by hand, which I couldn’t find a way of automating… Still, could be handy…

Finding Common Phrases or Sentences Across Different Documents

As mentioned in the previous post, I picked up on a nice little challenge from my colleague Ray Corrigan a couple of days ago to find common sentences across different documents.

My first, rather naive, thought was to segment each of the docs into sentences and then compare sentences using a variety of fuzzy matching techniques, retaining the ones that sort-of matched. That approach was a bit ropey (I’ll describe it in another post), but whilst pondering it over a dog walk a much neater idea suggested itself – compare n-grams of various lengths over the two documents. At its heart, all we need to do is find the intersection of the ngrams that occur in each document.

So here’s a recipe to do that…

First, we need to get documents into a text form. I started off with PDF docs, but it was easy enough to extract the text using textract.

!pip install textract

import textract
txt = textract.process('ukpga_19840012_en.pdf')

The next step is to compare docs for a particular size n-gram – the following bit of code finds the common ngrams of a particular size and returns them as a list:

import nltk
from nltk.util import ngrams as nltk_ngrams

def common_ngram_txt(tokens1,tokens2,size=15):
    print('Checking ngram length {}'.format(size))
    ng1=set(nltk_ngrams(tokens1, size))
    ng2=set(nltk_ngrams(tokens2, size))

    match=set.intersection(ng1,ng2)
    print('..found {}'.format(len(match)))

    return match
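
As a quick check of the idea, here's a minimal sketch that runs without NLTK. The ngrams() helper is a hypothetical stand-in for nltk.util.ngrams, and the two toy sentences are made up for illustration:

```python
def ngrams(tokens, size):
    """Return every run of `size` consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def common_ngrams(tokens1, tokens2, size):
    """N-grams of the given size that appear in both token lists."""
    return set(ngrams(tokens1, size)) & set(ngrams(tokens2, size))

# Made-up example documents, tokenised naively on whitespace
doc1 = 'the quick brown fox jumps over the lazy dog'.split()
doc2 = 'a quick brown fox sat beside the lazy dog'.split()

print(sorted(common_ngrams(doc1, doc2, 3)))
print(common_ngrams(doc1, doc2, 4))
```

The two docs share the 3-grams "quick brown fox" and "the lazy dog", but no 4-grams, so the second call returns an empty set.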

I want to be able to find common ngrams of various lengths, so I started to put together the first fumblings of an n-gram sweeper.

The core idea was really simple – starting with the largest common n-gram, detect increasingly smaller n-grams; then do a concordance report on each of the common ngrams to show how that ngram appeared in the context of each document. (See n-gram / Multi-Word / Phrase Based Concordances in NLTK.)

Rather than generate lots of redundant reports – if I detected the common 3gram “quick brown fox”, I would also find the common ngrams “quick brown” and “brown fox” – I started off with the following heuristic: if a common n-gram is part of a longer common n-gram, ignore it. But this immediately turned up a problem. Consider the following case:

Document 1: the quick brown fox
Document 2: the quick brown fox and the quick brown cat and the quick brown dog

Here, there is a common 4-tuple: the quick brown fox. There is also a common 3-tuple: the quick brown, which a concordance plot would reveal as being found in the context of a cat and a dog as well as a fox. What I really need to do is keep the locations of a common n-gram in the second document that are not contained within a longer common n-gram already found there, but drop the locations where it is subsumed by a longer one.

Indexing on token number within the second doc, I need to return something like this:

([('the', 'quick', 'brown', 'fox'),
  ('the', 'quick', 'brown'),
  ('the', 'quick', 'brown')],
 [[0, 3], [10, 12], [5, 7]])

which shows the shorter common ngram only in the places where it is not part of the longer common ngram.

In the following, n_concordance_offset() finds the location of a phrase token list within a document token list. The ngram_sweep_txt() function scans down a range of ngram lengths, starting with the longest, trying to identify locations that are not contained within an already discovered longer ngram.

def n_concordance_offset(text,phraseList):
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    #via http://stackoverflow.com/a/3852792/454773
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])

    return intersects

def ngram_sweep_txt(txt1,txt2,ngram_min=8,ngram_max=50):
    tokens1 = nltk.word_tokenize(txt1)
    tokens2 = nltk.word_tokenize(txt2)

    text1 = nltk.Text( tokens1 )
    text2 = nltk.Text( tokens2 )

    ngrams=[]
    strings=[]
    ranges=[]
    for i in range(ngram_max,ngram_min-1,-1):
        #Find long ngrams first
        newsweep=common_ngram_txt(tokens1,tokens2,size=i)
        for m in newsweep:
            localoffsets=n_concordance_offset(text2,m)

            #We need to avoid the problem of masking shorter ngrams by already found longer ones
            #eg if there is a common 3gram in a doc2 4gram, but the 4gram is not in doc1
            #so we need to see if the current ngram is contained within the doc index of longer ones already found

            for o in localoffsets:
                fnd=False
                for r in ranges:
                    if o>=r[0] and o<=r[1]:
                        fnd=True
                if not fnd:
                    ranges.append([o,o+i-1])
                    ngrams.append(m)
    return ngrams,ranges,txt1,txt2

def ngram_sweep(fn1,fn2,ngram_min=8,ngram_max=50):
    txt1 = textract.process(fn1).decode('utf8')
    txt2 = textract.process(fn2).decode('utf8')
    ngrams,ranges,txt1,txt2=ngram_sweep_txt(txt1,txt2,ngram_min=ngram_min,ngram_max=ngram_max)
    return ngrams,ranges,txt1,txt2
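
To see the sweep working on the toy fox/cat/dog documents from earlier, here's a dependency-free sketch of the same logic. The ngram_positions() and sweep() names are hypothetical, tokenising is a plain whitespace split rather than nltk.word_tokenize, and the n-grams are sorted just to make the output order repeatable:

```python
def ngram_positions(tokens, size):
    """Map each n-gram of the given size to the list of offsets where it starts."""
    positions = {}
    for i in range(len(tokens) - size + 1):
        positions.setdefault(tuple(tokens[i:i + size]), []).append(i)
    return positions

def sweep(tokens1, tokens2, ngram_min, ngram_max):
    """Longest-first sweep, reporting doc2 locations not already covered."""
    found, ranges = [], []
    for size in range(ngram_max, ngram_min - 1, -1):
        in_doc1 = set(ngram_positions(tokens1, size))
        in_doc2 = ngram_positions(tokens2, size)
        for gram in sorted(in_doc1 & set(in_doc2)):
            for start in in_doc2[gram]:
                # As in the code above, only the start offset is tested
                # against the ranges that have already been recorded
                if not any(lo <= start <= hi for lo, hi in ranges):
                    ranges.append([start, start + size - 1])
                    found.append(gram)
    return found, ranges

doc1 = 'the quick brown fox'.split()
doc2 = 'the quick brown fox and the quick brown cat and the quick brown dog'.split()
print(sweep(doc1, doc2, ngram_min=3, ngram_max=4))
```

This reports the 4-gram at [0, 3] and the 3-gram "the quick brown" at [5, 7] and [10, 12], which matches the result sketched earlier, up to ordering.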

What I really need to do is automatically detect the largest n-gram and work back from there, perhaps using a binary search starting with an n-gram the size of the number of tokens in the shortest doc… But that's for another day…
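
A binary search should work here because any common n-gram contains a common (n-1)-gram, so "is there a common n-gram of size n?" flips from yes to no exactly once as n grows. Here's a rough sketch of that idea; the function names are hypothetical and it works on plain token lists rather than the NLTK objects used above:

```python
def has_common_ngram(tokens1, tokens2, size):
    """True if at least one n-gram of this size occurs in both token lists."""
    grams1 = {tuple(tokens1[i:i + size]) for i in range(len(tokens1) - size + 1)}
    return any(tuple(tokens2[i:i + size]) in grams1
               for i in range(len(tokens2) - size + 1))

def longest_common_ngram_size(tokens1, tokens2):
    """Binary search for the largest size with a common n-gram (0 if none)."""
    lo, hi = 0, min(len(tokens1), len(tokens2))
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if has_common_ngram(tokens1, tokens2, mid):
            lo = mid       # a common n-gram of this size exists, so try longer
        else:
            hi = mid - 1   # nothing this long in common, so try shorter
    return lo

doc1 = 'the quick brown fox jumps'.split()
doc2 = 'a quick brown fox sat'.split()
print(longest_common_ngram_size(doc1, doc2))
```

The value returned could then be used as ngram_max in the sweep, rather than guessing an upper limit.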

Having discovered common phrases, we need to report them. The following n_concordance() function (based on this) does just that; the concordance_reporter() function manages the outputs.

import textract

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    #via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    phraseList=nltk.word_tokenize(phrase)

    intersects= n_concordance_offset(text,phraseList)

    #Clip the left-hand edge of each window at the start of the document
    #(max() replaces a map()[0] lookup that only works in Python 2)
    concordance_txt = [text.tokens[max(offset-left_margin,0):offset+len(phraseList)+right_margin]
                        for offset in intersects]

    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def concordance_reporter(fn1='Draft_Equipment_Interference_Code_of_Practice.pdf',
                         fn2='ukpga_19940013_en.pdf',fo='test.txt',ngram_min=10,ngram_max=15,
                         left_margin=5,right_margin=5,n=5):

    fo=fn2.replace('.pdf','_ngram_rep{}.txt'.format(n))

    f=open(fo, 'w+')
    f.close()

    print('Handling {}'.format(fo))
    ngrams,strings, txt1,txt2=ngram_sweep(fn1,fn2,ngram_min,ngram_max)
    #Remove any redundancy in the ngrams...
    ngrams=set(ngrams)
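    #Open the report as utf8 text so strings can be written directly (Python 3)
    with open(fo, 'a', encoding='utf8') as outfile:
        outfile.write('REPORT FOR ({} and {}\n\n'.format(fn1,fn2))
        print('found {} ngrams in that range...'.format(len(ngrams)))
        for m in ngrams:
            mt=' '.join(m)
            outfile.write('\n\\-------\n{}\n\n'.format(mt))
            #How the phrase appears in the first doc...
            for c in n_concordance(txt1,mt,left_margin,right_margin):
                outfile.write('>>>>>{}\n\n'.format(c))
            #...and how it appears in the second doc
            for c in n_concordance(txt2,mt,left_margin,right_margin):
                outfile.write('<<<<<{}\n\n'.format(c))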
    return
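
The offset rebasing trick in n_concordance_offset(), and the clipping of the left margin in n_concordance(), can be seen in isolation in the following sketch. It uses plain lists and exact (case-sensitive) token matches rather than nltk.ConcordanceIndex, and the function names are hypothetical:

```python
def phrase_offsets(tokens, phrase_tokens):
    """Start offsets of a token phrase, found by rebasing per-token offsets."""
    rebased = []
    for i, word in enumerate(phrase_tokens):
        # Find where this word occurs, then shift each hit back by the
        # word's position in the phrase
        rebased.append({pos - i for pos, tok in enumerate(tokens) if tok == word})
    # The phrase starts wherever every word agrees on the same rebased offset
    return sorted(set.intersection(*rebased)) if rebased else []

def context(tokens, start, length, left_margin=2, right_margin=2):
    """The phrase plus surrounding tokens, clipped at the start of the document."""
    return ' '.join(tokens[max(start - left_margin, 0):start + length + right_margin])

tokens = 'the quick brown fox and the quick brown cat'.split()
phrase = ['quick', 'brown']
starts = phrase_offsets(tokens, phrase)
print(starts)
print([context(tokens, s, len(phrase)) for s in starts])
```

Here "quick" is found at offsets 1 and 6 and "brown" at 2 and 7; rebasing "brown" by one position gives 1 and 6 again, so the phrase starts at 1 and 6.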

Finally, the following loop makes it easier to compare a document of interest with several other documents:

for f in ['Draft_Investigatory_Powers_Bill.pdf','ukpga_19840012_en.pdf',
          'ukpga_19940013_en.pdf','ukpga_19970050_en.pdf','ukpga_20000023_en.pdf']:
    concordance_reporter(fn2=f,ngram_min=10,ngram_max=40,left_margin=15,right_margin=15)

Here’s an example of the sort of report it produces:

REPORT FOR (Draft_Equipment_Interference_Code_of_Practice.pdf and ukpga_19970050_en.pdf

\-------
concerning an individual ( whether living or dead ) who can be identified from it

>>>>>personal information is information held in confidence concerning an individual ( whether living or dead ) who can be identified from it , and the material in question relates 

<<<<<[…]

>>>>>must cancel a warrant if he is satisfied that the action authorised by it is no longer necessary . 4.13 The person who made the application 

<<<<<an authorisation given in his absence if satisfied that the action authorised by it is no longer necessary . ( 6 ) If the authorising officer 

<<<<<[…]

>>>>>one or more offences and : It involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose ; or a person aged twenty-one or 

<<<<<[…]

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or is subject to a restriction on 

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or it is subject to a restriction 

<<<<<[…]

>>>>>a person aged twenty-one or over with no previous convictions could reasonably be expected to be sentenced to three years’ imprisonment or more . 4.5 

<<<<<[…]

>>>>>to have effect the Secretary of State considers it necessary for the authorisation to continue to have effect for the purpose for which it was given , the Secretary of State may 

<<<<<in whose absence it was given , considers it necessary for the authorisation to continue to have effect for the purpose for which it was issued , he may , in writing  

The first line in a block is the common phrase, the >>> elements are how the phrase appears in the first doc, the <<< elements are how it appears in the second doc. The width of the right and left margins of the contextual / concordance plot are parameterised and can be easily increased.

This seems such a basic problem – finding common phrases in different documents – I'd have expected there to be a standard solution to this? But in the quick search I tried, I couldn't find one? It was quite a fun puzzle to play with though, and offers lots of scope for improvement (I suspect it's a bit ropey when it comes to punctuation, for example). But it's a start…:-)

There's lots could be done on a UI front, too. For example, it'd be nice to be able to link documents, so you can click through from the first to show where the phrase came from in the second. But to do that requires annotating the original text, which in turn means being able to accurately identify where in a doc a token sequence appears. But building UIs is hard and time consuming… it'd be so much easier if folk could learn to use a code line UI!;-)

If you know of any “standard” solutions or packages for dealing with this sort of problem, please let me know via the comments:-)

PS The code could also do with some optimisation – eg if we know we’re repeatedly comparing against a base doc, it’s foolish to keep opening and tokenising the base doc…

PPS see also cltk, which has a range of text reuse functions (https://github.com/cltk/cltk/blob/master/cltk/text_reuse/comparison.py), for example:

from cltk.text_reuse.comparison import long_substring
#Find longest common substring
long_substring(str1, str2)
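
Python's standard library also gets part of the way there. As a rough sketch (not a replacement for the n-gram sweep above), difflib.SequenceMatcher can find the longest run of tokens shared by two token lists; the longest_common_run() wrapper and the toy documents are made up for illustration:

```python
from difflib import SequenceMatcher

def longest_common_run(tokens1, tokens2):
    """Longest run of tokens shared by both lists, plus its start offsets."""
    # autojunk=False stops difflib discounting very frequent tokens in long docs
    sm = SequenceMatcher(None, tokens1, tokens2, autojunk=False)
    m = sm.find_longest_match(0, len(tokens1), 0, len(tokens2))
    return tokens1[m.a:m.a + m.size], m.a, m.b

doc1 = 'the quick brown fox'.split()
doc2 = 'a cat and the quick brown dog'.split()
print(longest_common_run(doc1, doc2))
```

This finds "the quick brown", starting at token 0 in the first doc and token 3 in the second. Because it works on token lists rather than raw strings, the match always falls on word boundaries.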

There may also be some useful leads here: Identifying similar text (and plagiarism).

Slackbot Data Wire, Initial Sketch

Via a round-up post from Matt Jukes/@jukesie (Interesting elsewhere: Bring on the Bots), I was prompted to look again at Slack. OnTheWight’s Simon Perry originally tried to hook me in to Slack, but I didn’t need another place to go to check messages. Simon had also mentioned, in passing, how it would be nice to be able to get data alerts into Slack, but I’d not really followed it through, until the weekend, when I read again @jukesie’s comment that “what I love most about it [Slack] is the way it makes building simple, but useful (or at least funny), bots a breeze.”

After a couple of aborted attempts, I found a couple of python libraries to wrap the Slack API: pyslack and python-rtmbot (the latter also requires python-slackclient).

Using pyslack to send a message to Slack was pretty much a one-liner:

#Create API token at https://api.slack.com/web
token='xoxp-????????'

#!pip install pyslack
import slack
import slack.chat
slack.api_token = token
slack.chat.post_message('#general', 'Hello world', username='testbot')

general___OUseful_Slack

I was quite keen to see how easy it would be to reuse one or more of my data2text sketches as the basis for an autoresponder that could accept a local data request from a Slack user and provide a localised data response using data from a national dataset.

I opted for a JSA (Jobseekers Allowance) textualiser (as used by OnTheWight and reported here: Isle of Wight innovates in a new area of Journalism and also in this journalism.co.uk piece: How On The Wight is experimenting with automation in news) that I seem to have bundled up into a small module, which would let me request JSA figures for a council based on a council identifier. My JSA textualiser module has a couple of demos hardwired into it (one for the Isle of Wight, one for the UK) so I could easily call on those.

To put together an autoresponder, I used the python-rtmbot, putting the botcode folder into a plugins folder in the python-rtmbot code directory.

The code for the bot is simple enough:

from nomis import *
import nomis_textualiser as nt
import pandas as pd

nomis=NOMIS_CONFIG()

import time
crontable = []
outputs = []

def process_message(data):

	text = data["text"]
	if text.startswith("JSA report"):
		if 'IW' in text: outputs.append([data['channel'], nt.otw_rep1(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.otw_rep1(nt.ukCode)])
	if text.startswith("JSA rate"):
		if 'IW' in text: outputs.append([data['channel'], nt.rateGetter(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.rateGetter(nt.ukCode)])

general___OUseful_Slack2
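
As the number of commands grows, the if/elif chain could be swapped for a lookup table. The following is a minimal sketch of that pattern, not the bot as deployed: report(), rate() and the area codes are hypothetical stand-ins for the nomis_textualiser functions, so that it runs on its own:

```python
outputs = []  # python-rtmbot style: [channel, message] pairs queued for sending

# Hypothetical stand-ins for the textualiser's area codes and functions
AREA_CODES = {'IW': 'iw-code', 'UK': 'uk-code'}

def report(code):
    return 'JSA report for {}'.format(code)

def rate(code):
    return 'JSA rate for {}'.format(code)

COMMANDS = {'JSA report': report, 'JSA rate': rate}

def process_message(data):
    """Queue a reply if the message matches a known command and area."""
    text = data.get('text', '')  # cope with events that carry no text field
    for prefix, handler in COMMANDS.items():
        if text.startswith(prefix):
            for label, code in AREA_CODES.items():
                if label in text:
                    outputs.append([data['channel'], handler(code)])
                    return

process_message({'channel': 'C123', 'text': 'JSA rate IW'})
process_message({'channel': 'C123', 'text': 'hello'})
print(outputs)
```

Only the first message matches a command, so a single reply is queued for channel C123.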

Rooting around, I also found a demo I’d put together for automatically looking up a council code from a Johnston Press newspaper title using a lookup table I’d put together at some point (I don’t remember how!).

Which meant that by using just a tiny dab of glue I could extend the bot further to include a lookup of JSA figures for a particular council based on the local JP rag covering that council. And the glue is this, added to the process_message() function definition:

	def getCodeForTitle(title):
		code=jj_titles[jj_titles['name']==title]['code_admin_district'].iloc[0]
		return code

	if text.startswith("JSA JP"):
		jj_titles=pd.read_csv("titles.csv")
		title=text.split('JSA JP')[1].strip()
		code=getCodeForTitle(title)

		outputs.append([data['channel'], nt.otw_rep1(code)])
		outputs.append([data['channel'], nt.rateGetter(code)])

general___OUseful_Slack3
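
One thing the glue doesn't handle is a title that isn't in the lookup table, in which case .iloc[0] raises an IndexError. Here's a hedged sketch of a more forgiving lookup: the rows are made up (only the column names come from the snippet above) and the get_code_for_title() and parse_title() helpers are hypothetical:

```python
import io
import pandas as pd

# Made-up rows; only the column names come from the snippet above
TITLES_CSV = io.StringIO(
    'name,code_admin_district\n'
    'Example Gazette,X00000001\n'
    'Sample Herald,X00000002\n'
)
jj_titles = pd.read_csv(TITLES_CSV)

def get_code_for_title(titles, title):
    """Return the council code for a newspaper title, or None if unknown."""
    wanted = title.strip().lower()
    matches = titles[titles['name'].str.lower() == wanted]
    return None if matches.empty else matches['code_admin_district'].iloc[0]

def parse_title(text, prefix='JSA JP'):
    """Pull the newspaper title out of a message such as 'JSA JP <title>'."""
    return text[len(prefix):].strip() if text.startswith(prefix) else None

print(get_code_for_title(jj_titles, parse_title('JSA JP example gazette')))
print(get_code_for_title(jj_titles, 'Nowhere News'))
```

The bot could then reply with a "title not recognised" message when the lookup returns None, rather than falling over.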

This is quite an attractive route, I think, for national newsgroups: anyone in the group can create a bot to generate press release style copy at a local level from a national dataset, and then make it available to reporters from other titles in the group – who can simply key in by news title.

But it could work equally well for a community network of hyperlocals, or councils – organisations that are locally based and individually do the same work over and over again on national datasets.

The general flow is something a bit like this:

Tony_Hirst_-_Cardiff_-_community_journalism_-_data_wire_pptx

which has a couple of very obvious pain points:

Tony_Hirst_-2_Cardiff_-_community_journalism_-_data_wire_pptx

Firstly, finding the local data from the national data, cleaning the data, etc etc. Secondly, making some sort of sense of the data, and then doing some proper journalistic work writing a story on the interesting bits, putting them into context and explaining them, rather than just relaying the figures.

What the automation route does is to remove some of the pain, and allow the journalist to work up the story from the facts, presented informatively.

Tony_Hirst_-3_Cardiff_-_community_journalism_-_data_wire_pptx

This is a model I’m currently trying to work up with OnTheWight and one I’ll be talking about briefly at the What next for community journalism? event in Cardiff on Wednesday [slides].

PS Hmm.. this just in, The Future of the BBC 2015 [PDF] [announcement].

Local Accountability Reporting Service

Under this proposal, the BBC would allocate licence fee funding to invest in a service that reports on councils, courts and public services in towns and cities across the UK. The aim is to put in place a network of 100 public service reporters across the country.

Reporting would be available to the BBC but also, critically, to all reputable news organisations. In addition, while it would have to be impartial and would be run by the BBC, any news organisation — news agency, independent news provider, local paper as well as the BBC itself—could compete to win the contract to provide the reporting team for each area.

A shared data journalism centre

Recent years have seen an explosion in data journalism. New stories are being found daily in government data, corporate data, data obtained under the Freedom of Information Act and increasing volumes of aggregated personalised data. This data offers new means of sourcing stories and of holding public services, politicians and powerful organisations to account.

We propose to create a new hub for data journalism, which serves both the BBC and makes available data analysis for news organisations across the UK. It will look to partner a university in the UK, as the BBC seeks to build a world-class data journalism facility that informs local, national and global news coverage.

A News Bank to syndicate content

The BBC will make available its regional video and local audio pieces for immediate use on the internet services of local and regional news organisations across the UK.

Video can be time-consuming and resource-intensive to produce. The News Bank would make available all pieces of BBC video content produced by the BBC’s regional and local news teams to other media providers. Subject to rights and further discussion with the industry we would also look to share longer versions of content not broadcast, such as sports interviews and press conferences.

Content would be easily searchable by other news organisations, making relevant material available to be downloaded or delivered by the outlets themselves, or for them to simply embed within their own websites. Sharing of content would ensure licence fee payers get maximum value from their investment in local journalism, but it would also provide additional content to allow news organisations to strengthen their offer to audiences without additional costs. We would also continue to enhance linking out from BBC Online, building on the work of Local Live.

Hmm… Share content – or share “pre-content”. Use BBC expertise to open up the data to more palatable forms, forms that the BBC’s own journalists can work with, but also share those intermediate forms with the regionals, locals and hyperlocals?

Data Literacy – Do We Need Data Scientists, Or Data Technicians?

One of the many things I vaguely remember studying from my school maths days is the set of geometric transformations – rotations, translations and reflections – as applied particularly to 2D shapes. To a certain extent, knowledge of these operations helps me use the limited Insert Shape options in PowerPoint, as I pick shapes and arrows from the limited palette available and then rotate and reflect them to get the orientation I require.

But of more pressing concern to me on a daily basis is the need to engage in data transformations, whether as summary statistic transformations (finding the median or mean values within several groups of the same dataset, for example, or calculating percentage differences from within-group means across group members for multiple groups) or shape transformations (reshaping a dataset from a wide to a long format, for example, by melting a subset of columns, or recasting a molten dataset into a wider format). (If that means nothing to you, I’m not surprised. But if you’ve ever worked with a dataset and copied and pasted data from multiple columns into multiple rows to get it to look right/into the shape you want, you’ve suffered by not knowing how to reshape your dataset!)
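
To make that a little more concrete, here's a small sketch using pandas (my assumption as the tool of choice; the data and column names are made up). It melts a wide dataset into a long one, recasts it back to wide, and then works out percentage differences from within-group means:

```python
import pandas as pd

# Made-up wide-format data: one row per area, one column per year
wide = pd.DataFrame({
    'area': ['A', 'B'],
    'group': ['north', 'north'],
    '2014': [10, 30],
    '2015': [20, 40],
})

# Wide to long ("melting"): one row per area/year observation
tidy = wide.melt(id_vars=['area', 'group'], var_name='year', value_name='count')

# Long back to wide ("casting"): the years become columns again
recast = tidy.pivot(index='area', columns='year', values='count')

# Summary statistic transformation: percentage difference from the group mean
group_mean = tidy.groupby(['group', 'year'])['count'].transform('mean')
tidy['pct_from_group_mean'] = 100 * (tidy['count'] - group_mean) / group_mean

print(tidy)
print(recast)
```

The long form is the copy-and-paste-columns-into-rows shape, produced in one line; the within-group calculation is then a single groupby on that long form.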

Even though I tinker with data most days, I tend to avoid all but the simplest statistics. I know enough to know I don’t understand most statistical arcana, but I suspect there are folk who do know how to do that stuff properly. But what I do know from my own tinkering is that before I can run even the simplest stats, I often have to do a lot of work getting original datasets into a state where I can actually start to work with them.

The same stumbling blocks presumably present themselves to the data scientists and statisticians who not only know how to drive arcane statistical tests but also understand how to interpret and caveat them. Which is where tools like Open Refine come in…

Further down the pipeline are the policy makers and decision makers who use data to inform their policies and decisions. I don’t see why these people should need to be able to write a regexp, clean a dirty dataset, denormalise a table, write a SQL query, run a weird form of multivariate analysis, or reshape a dataset and then create a novel data visualisation from it based on a good understanding of the principles of The Grammar of Graphics; but I do think they should be able to pick up on the stories contained within the data and critique the way it is presented, as well as how the data was sourced and the operations applied to it during analysis, in addition to knowing how to sensibly make use of the data as part of the decision making or policy making process.

A recent Nesta report (July 2015) on Analytic Britain: Securing the right skills for the data-driven economy [PDF] gave a shiny “analytics this, analytics that” hype view of something or other (I got distracted by the analytics everything overtone), and was thankfully complemented by a more interesting Universities UK report (July 2015) on Making the most of data: Data skills training in English universities [PDF].

In its opening summary, the UUK report found that “[t]he data skills shortage is not simply characterised by a lack of recruits with the right technical skills, but rather by a lack of recruits with the right combination of skills”, and also claimed that “[m]any undergraduate degree programmes teach the basic technical skills needed to understand and analyse data”. Undergrads may learn basic stats, but I wonder how many of them are comfortable with the hand tools of data wrangling that you need to be familiar with if you ever want to turn real data into something you can actually work with? That said, the report does give a useful review of data skills developed across a range of university subject areas.

(Both reports championed the OU-led urban data school, though I have to admit I can’t find any resources associated with that project? Perhaps the OU’s Smart Cities MOOC on FutureLearn is related to it? As far as I know, OUr Learn to Code for Data Analysis MOOC isn’t?)

From my perspective, I think it’d be a start if folk learned:

  • how to read simple charts;
  • how to identify meaningful stories in charts;
  • how to use data stories to inform decision making.

I also worry about the day-to-day practicalities of working with data in a hands on fashion and the roles associated with various data related tasks that fall along any portrayal of the data pipeline. For example, off the top of my head I think we can distinguish between things like:

  • data technician roles – for example, reshaping and cleaning datasets;
  • data engineering roles – managing storage, building and indexing databases, for example;
  • data analyst/science and data storyteller roles – that is, statisticians who can work with clean and well organised datasets to pull out structures, trends and patterns from within them;
  • data graphics/visualisation practitioners – who have the eye and the skills for developing visual ways of uncovering and relating the stories, trends, patterns and structures hidden in datasets, perhaps in support of the analyst, perhaps in support of the decision-making end-user;
  • and data policymakers and data driven decision makers, who can phrase questions in such a way that makes it possible to use data to inform the decision or policymaking process, even if they don’t have the skills to wrangle or analyse the data that they can then use.

I think there is also a role for data questionmasters who can phrase and implement useful and interesting queries that can be applied to datasets, which might also fall to the data technician. I also see a role for data technologists, who are perhaps strong as a data technician, but with an appreciation of the engineering, science, visualisation and decision/policy making elements, though not necessarily strong as a practitioner in any of those camps.

(Data carpentry as a term is also useful, describing a role that covers many of the practical skills requirements I’d associate with a data technician, but that additionally supports the notion of “data craftsmanship”? A lot of data wrangling does come down to being a craft, I think, not least because the person working at the raw data end of the lifecycle may often develop specialist, hand crafted tools for working with the data that an analyst would not be able to justify spending the development time on.)

Here’s another carving of the data practitioner roles space, this time from Liz Lyon & Aaron Brenner (Bridging the Data Talent Gap: Positioning the iSchool as an Agent for Change, International Journal of Digital Curation, 10:1 (2015)):

Bridging_the_Data_Talent_Gap__Positioning_the_iSchool_as_an_Agent_for_Change___Lyon___International_Journal_of_Digital_Curation

The Royal Statistical Society Data Manifesto [PDF] (September 2014) argues for giving “[p]oliticians, policymakers and other professionals working in public services (such as regulators, teachers, doctors, etc.) … basic training in data handling and statistics to ensure they avoid making poor decisions which adversely affect citizens” and suggests that we need to “prepare for the data economy” by “skill[ing] up the nation”:

We need to train teachers from primary school through to university lecturers to encourage data literacy in young people from an early age. Basic data handling and quantitative skills should be an integral part of the taught curriculum across most A level subjects. … In particular, we should ensure that all students learn to handle and interpret real data using technology.

I like the sentiment of the RSS manifesto, but fear the Nesta buzzword hype chasing and the conservatism of the universities (even if the UUK report is relatively open minded).

On the one hand, we often denigrate the role of the technician, but I think technical difficulties associated with working with real data are often a real blocker; which means we either skill up ourselves, or recognise the need for skilled data technicians. On the other, I think there is a danger of hyping “analytics this” and “data science that” – even if only as part of debunking it – because it leads us away from the more substantive point that analytics this, data science that is actually about getting numbers into a form that tells stories that we can use to inform decisions and policies. And that’s more about understanding patterns and structures, as well as critiquing data collection and analysis methods, than it is about being a data technician, engineer, analyst, geek, techie or quant.

Which is to say – if we need to develop data literacy, what does that really mean for the majority?

PS Heh heh – Kin Lane captures further life at the grotty end of the data lifecycle: Being a Data Janitor and Cleaning Up Data Portability Vomit.