Category: Infoskills

When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor

Although not necessarily the best way of publishing data, data tables in PDF documents can often be extracted quite easily, particularly if the tables are regular and the cell contents reasonably spaced.

For example, official timing sheets for F1 races are published by the FIA as event and timing information in a set of PDF documents containing tabulated timing data:


In the past, I’ve written a variety of hand crafted scrapers to extract data from the timing sheets, but the regular way in which the data is presented in the documents means that they are quite amenable to scraping using a PDF table extractor such as Tabula. Tabula exists both as a server application, accessed via a web browser, and as a service using the tabula extractor Java application.

I don’t recall how I came across it, but the tabulizer R package provides a wrapper for tabula extractor (bundled within the package) that lets you access the service via its command line calls. (One dependency you do need to take care of is to have Java installed; adding Java into an RStudio docker container would be one way of taking care of this.)

Running the default extractor command on the above PDF pulls out the data of the inner table:

library(tabulizer)
extract_tables('Best Sector Times.pdf')


Where the data is spread across multiple pages, you get a data frame per page.


Note that the headings for the distinct tables are omitted. Tabula’s “table guesser” identifies the body of the table, but not the spanning column headers.

The default settings are such that tabula will try to scrape data from every page in the document.


Individual pages, or sets of pages, can be selected using the pages parameter. For example:

  • extract_tables('Lap Analysis.pdf', pages=1)
  • extract_tables('Lap Analysis.pdf', pages=c(2,3))

Specific areas to scrape can also be specified using the area parameter:

extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 230, 500)))

The area parameter originally appeared to take co-ordinates in the form top, left, width, height, but the package has since been fixed to take co-ordinates in the same form as those produced by the tabula app debugger: top, left, bottom, right.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.


The tabula console output gives co-ordinates in the form top, left, bottom, right, so with current versions of tabulizer you can pass these numbers straight to the area parameter; with earlier versions, which expected top, left, width, height, you needed to do some sums to convert them.
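Those sums are just a couple of subtractions; sketched here in Python purely for illustration, with the co-ordinate names taken from the tabula debug output:

```python
def tabula_debug_to_area(top, left, bottom, right):
    """Convert tabula app debug co-ordinates (top, left, bottom, right)
    to the older tabulizer area form (top, left, width, height)."""
    return (top, left, right - left, bottom - top)

# For example, a selection reported as top=178, left=10, bottom=230, right=500:
print(tabula_debug_to_area(178, 10, 230, 500))  # → (178, 10, 490, 52)
```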


Using a combination of “guess” to find the dominant table, and specified areas, we can extract the data we need from the PDF and combine it to provide a structured and clearly labeled dataframe.

On my to do list: add this data source recipe to the Wrangling F1 Data With R book…

Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose

Pondering the Sunday Times Panama Papers directors/companies database yesterday (Panama Papers, Quick Start in SQLite3), I thought it was about time I got my head round using a graph database to store this sort of relational information.

Getting started with the neo4j database has been on my to do list for some time, so when I was pointed to a post on the neo4j blog, Analyzing the Panama Papers with Neo4j: Data Models, Queries & More, I thought I should give it a go.

So here’s a quick start to the first part – getting a working environment up and running. A quick search turned up a set of examples of how to get started using Neo4j using Jupyter notebooks by Nicole White, so I opted for a notebook/neo4j combination. The following docker-compose.yml file fires up a notebook server and a neo4j in separate docker containers and links them together:

neo4j:
  image: kbastani/docker-neo4j:latest
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipy:
  image: jupyter/scipy-notebook
  ports:
    - "8888:8888"
  links:
    - neo4j:neo4j
  volumes:
    - .:/home/jovyan/work

Launching the docker CLI from Kitematic, I can cd into the directory containing the docker-compose.yml file and run the command docker-compose up -d to download and launch the containers.


In the browser, launch a Jupyter terminal and pull down the example notebooks by running the command:

git clone https://github.com/nicolewhite/neo4j-jupyter.git

The notebooks will be downloaded into the folder neo4j-jupyter. Still in the terminal, create a figure directory, as required by the hello-world.ipynb notebook:

mkdir -p neo4j-jupyter/figure

You’ll also need to install the py2neo python package. By default, this will be installed into the Python 3 path:

pip install py2neo

The example notebooks run in a Python 2 kernel, so we need to install the package into that environment too:

source activate python2
pip install py2neo


Now you should be able to run the example notebooks. One thing to note though – you will need to change the connection details for the neo4j database slightly. In the appropriate notebook code cells, change the default graph = Graph() connection to:

graph = Graph("http://neo4j:7474/db/data/")


I’ve run out of time to do any more just now. In the next post on this topic, I’ll see if I can work out how to get the Sunday Times data into neo4j.

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most president/director roles: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
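The same queries can of course be scripted; here’s a minimal sketch using Python’s standard sqlite3 module, with a couple of made-up rows standing in for the Sunday Times data:

```python
import sqlite3

# Use sundayTimesPanamaPapers.sqlite here instead to query the real data
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE panama(company_url TEXT, company_name TEXT,
    officer_position_es TEXT, officer_position_en TEXT, officer_name TEXT,
    inc_date TEXT, dissolved_date TEXT, updated_date TEXT,
    company_type TEXT, mf_link TEXT)""")

# Dummy rows, purely for illustration
rows = [("u1", "Acme Ltd", "", "Director", "J. Smith", "", "", "", "", ""),
        ("u2", "Blue SA", "", "Director", "J. Smith", "", "", "", "", ""),
        ("u3", "Cove Inc", "", "President", "A. Jones", "", "", "", "", "")]
cur.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# Which officers are named most often?
top_officers = cur.execute("""SELECT officer_name, COUNT(*) AS c FROM panama
    GROUP BY officer_name ORDER BY c DESC LIMIT 10""").fetchall()
print(top_officers)  # → [('J. Smith', 2), ('A. Jones', 1)]
```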



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

More Recognition/Identification Service APIs – Microsoft Cognitive Services

A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.

Now it seems that Microsoft Cognitive Services (formerly Project Oxford, in part) brings Microsoft’s tools to the party with a range of free tier and paid/metered services:


So what’s on offer?


  • Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
  • Emotion API: extract emotion features from a photo of a person; photos – 30,000 free transactions per month;
  • Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
  • Video API: 300 free transactions per month per feature.



  • Bing Spell Check API: 5,000 free transactions per month
  • Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
  • Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, parts of speech tagging, etc.). It’s dog slow and, from the times I got it to sort of work, this seems to be about the limit of what it can cope with (and even then it takes forever): 5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
  • Text Analytics API: sentiment analysis, topic detection and key phrase detection, language extraction; 5,000 free transactions;
  • Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.



There’s also a gallery of demo apps built around the APIs.

It seems, then, that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they are developed and continue to improve will be the proof of just how useful they will be as utility services.

As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired in to the experiment flows in the Azure Machine Learning studio?)

I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)

PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!

PPS Nov 2016 for photo-tagging, see also Amazon Rekognition.

A New Role for the Library – Gonzo Librarian Informationista

At the OU’s Future of Academic Libraries a couple of weeks ago, Sheila Corrall introduced a term and newly(?!) emerging role I hadn’t heard before coming out of the medical/health library area: informationist (bleurghh..).

According to a recent job ad (h/t Lorcan Dempsey):

The Nursing Informationist cultivates partnerships between the Biomedical Library and UCLA Nursing community by providing a broad range of information services, including in-depth reference and consultation service, instruction, collection development, and outreach.

Hmm… sounds just like a librarian to me?

Writing in the Journal of the Medical Library Association, The librarian as research informationist: a case study (101(4): 298–302, October 2013), Lisa Federer described the role in the following terms:

“The term “informationist” was first coined in 2000 to describe what the authors considered a new health sciences profession that combined expertise in library and information studies with subject matter expertise… Though a single model of informationist services has not been clearly defined, most descriptions of the informationist role assume that (1) informationists are “embedded” at the site where patrons conduct their work or need access to information, such as in a hospital, clinic, or research laboratory; and (2) informationists have academic training or specialized knowledge of their patrons’ fields of practice or research.”

Federer started to tighten up the definition in relation to research in particular:

Whereas traditional library services have generally focused on the “last mile” or finished product of the research process—the peer-reviewed literature—librarians have expertise that can help researchers create better research output in the form of more useful data. … The need for better research data management has given rise to a new role for librarians: the “research informationist.” Research informationists work with research teams at each step of the research process, from project inception and grant seeking to final publication, providing expert guidance on data management and preservation, bibliometric analysis, expert searching, compliance with grant funder policies regarding data management and open access, and other information-related areas.

This view is perhaps shared in a presentation on The Informationist: Pushing the Boundaries by Director of Library Services, Elaine Martin, in a presentation dated on Slideshare as October 2013:


Associated with the role are some competencies you might not normally expect from a library staffer:


So – maybe here is the inkling of the idea that there could be a role for librarians skilled in working with information technologies in a more techie way than you might normally expect. (You’d normally expect a librarian to be able to use Boolean search, search limits and advanced search forms. You might not expect them to write their own custom SQL queries, or even build and populate their own databases that they can then query? But perhaps you’d expect a really techie informationist to?) And maybe also the idea that the informationist is a participant in a teaching or research activity?

The embedded nature of the informationist also makes me think of gonzo journalism, a participatory style of narrative journalism written from a first person perspective, often including the reporter as part of the story. Hunter S. Thompson is often held up as some sort of benchmark character for this style of writing, and I’d probably class Louis Theroux as a latter-day exemplar. The reporter as naif participant in which the journalist acts as a proxy for everyman’s – which is to say, our own – direct experience of the reported situation, is also in the gonzo style (see for example Feats of gonzo journalism have lost their lustre since George Plimpton’s pioneering days as a universal amateur).

So I’m wondering: isn’t the informationist actually a gonzo librarian, joining in with some activity and bringing the skills of a librarian, or wider information scientist (or information technologist/technician), to the party…?

Another term introduced by Sheila Corrall and, again, new to me, was “blended librarian”. According to Steven J. Bell and John Shank, writing on The blended librarian in College and Research Libraries News, July/August 2004, pp. 372-375:

We define the “blended librarian” as an academic librarian who combines the traditional skill set of librarianship with the information technologist’s hardware/software skills, and the instructional or educational designer’s ability to apply technology appropriately in the teaching-learning process.

The focus of that paper was in part on defining a new role in which “the skills and knowledge of instructional design are wedded to our existing library and information technology skills”, but that doesn’t quite hit the spot for me. The paper also described six principles of blended librarianship, which are repeated on the LIS Wiki:

  1. Taking a leadership position as campus innovators and change agents is critical to the success of delivering library services in today’s “information society”.
  2. Committing to developing campus-wide information literacy initiatives on our campuses in order to facilitate our ongoing involvement in the teaching and learning process.
  3. Designing instructional and educational programs and classes to assist patrons in using library services and learning information literacy that is absolutely essential to gaining the necessary skills (trade) and knowledge (profession) for lifelong success.
  4. Collaborating and engaging in dialogue with instructional technologists and designers which is vital to the development of programs, services and resources needed to facilitate the instructional mission of academic libraries.
  5. Implementing adaptive, creative, proactive, and innovative change in library instruction can be enhanced by communicating and collaborating with newly created Instructional Technology/Design librarians and existing instructional designers and technologists.
  6. Transforming our relationship with faculty to emphasize our ability to assist them with integrating information technology and library resources into courses, but adding to that traditional role a new capacity to collaborate on enhancing student learning and outcome assessment in the area of information access, retrieval and integration.

Again, the emphasis on being able to work with current forms of instructional technology falls short of the mark for me.

But there is perhaps a glimmer of light in the principle associated with “assist[ing faculty] with integrating information technology and library resources into courses”, if we broaden that principle to include researchers as well as teachers, and then add in the idea that the informationist should also be helping explore, evaluate, advocate and teach on how to use emerging information technologies (including technologies associated with information and data processing, analysis and communication – that is, presentation: things like data visualisation).

So I propose a new take on the informationist, adopting the term proposed in a second take tweet from Lorcan Dempsey: the informationista (which is far more playful, if nothing else, than informationist).

The informationista is someone like me, who tries to share contemporary information skills (such as these) through participatory as well as teaching activities, blending techie skills with a library attitude. The informationista is also a hopeful and enthusiastic amateur (in the professional sense…) who explores ways in which new and emerging skills and technologies may be applied to the current situation.

At last, I have found my calling!;-)

See also: Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library (this has dated a bit but there is still quite a bit that can be retrieved from that sort of take, I think…)

PS see also notes on embedded librarians in the comments below.

Anatomy for First Year Computing and IT Students

For the two new first year computing and IT courses in production (due out October 2017), I’ve been given the newly created slacker role of “course visionary” (or something like that?!). My original hope for this was that I might be able to chip in some ideas about current trends and possibilities for developing our technology enhanced learning that would have some legs when the courses start in October 2017, and remain viable for the several years of course presentation, but I suspect the reality will be something different…

However it turns out, I thought that one of the things I’d use fragments of the time for would be to explore different possible warp threads through the courses. For example, one thread might be to take a “View Source” stance towards various technologies that would show students something of the anatomy of the computing related stuff that populates our daily lives. This is very much in the spirit of the Relevant Knowledge short courses we used to run, where one of the motivating ideas was to help learners make sense of the technological world around them. (Relevant Knowledge courses typically also tried to explore the social, political and economic context of the technology under consideration.)

So as a quick starter for ten, here are some of the things that could be explored in a tech anatomy strand.

The Anatomy of a URL

Learning to read a URL is a really handy skill to have for several reasons. In the first place, it lets you hack the URL directly to find resources, rather than having to navigate or search the website through its HTML UI. In the second, it can make you a better web searcher: some understanding of URL structure allows you to make more effective use of advanced search limits (such as site:, inurl:, filetype:, and so on). Third, it can give you clues as to how the backend works, or what backend is in place: if you can recognise a WordPress installation as such, you can use knowledge about how the URLs are put together to interact with the installation more knowledgeably. For example, add ?feed=rss2&withoutcomments=1 to the end of a WordPress blog URL (such as this one) and you’ll get a single item RSS version of the page content.
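The structure is easy to pick apart programmatically too; this Python sketch (the URL is a made-up WordPress-style one) splits a URL into its components:

```python
from urllib.parse import urlsplit, parse_qs

# A made-up WordPress-style URL, for illustration
url = "https://example.wordpress.com/2016/04/some-post/?feed=rss2&withoutcomments=1"
parts = urlsplit(url)
print(parts.netloc)           # → example.wordpress.com
print(parts.path)             # → /2016/04/some-post/
print(parse_qs(parts.query))  # → {'feed': ['rss2'], 'withoutcomments': ['1']}
```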

The Anatomy of a Web Page

The web was built on the principle of View Source, with built-in support – still respected by today’s desktop web browsers, at least – that lets you view the HTML, Javascript and CSS source that makes a web page what it is. Browsers also tend to come replete with developer tools that let you explore how the page works in even more detail. For example, I use Chrome developer tools frequently to look up particular elements in a web page when I’m building a scraper:


(If you click the mobile phone icon, you can see what the page looks like on a selected class of mobile device.)

I also often look at the resources that have been loaded into the page:


Again, additional tools allow you to set the bandwidth rate (so you can feel how the page loads on a slower network connection) as well as recording a series of screenshots that show what the page looks like at various stages of its loading.

The Anatomy of a Tweet

As well as looking at how something like a tweet is rendered in a webpage, it can also be instructive to see how a tweet is represented in machine terms by looking at what gets returned if you request the resource from the Twitter API. So for example, below is just part of what comes back when I ask the Twitter API for a single tweet:

anatomy tweet

You’ll see there’s quite a lot more information in there than just the tweet, including sender information.
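By way of illustration (the values here are invented, and real API responses carry many more fields), a trimmed-down tweet record can be unpacked like any other JSON document:

```python
import json

# A heavily trimmed, made-up example of the sort of record the Twitter API returns
raw = '''{"id": 123456789, "text": "Example tweet",
          "user": {"screen_name": "example_user", "followers_count": 42},
          "entities": {"hashtags": [{"text": "f1"}]}}'''
tweet = json.loads(raw)
print(tweet["text"])                 # the tweet itself...
print(tweet["user"]["screen_name"])  # ...plus sender information
```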

The Anatomy of an Email Message

How does an email message get from the sender to the receiver? One thing you can do is to View Source on the header:


Again, part of the reason for looking at the actual email “data” is so you can see what your email client is revealing to you, and what it’s hiding…
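Python’s standard email module will parse a raw message, headers and all; a sketch with a made-up message (real ones typically carry several Received: headers, one per mail server the message passed through):

```python
from email import message_from_string

# A made-up raw message, for illustration
raw = """Received: from mail.example.org by mx.example.com; Mon, 2 May 2016 10:00:00 +0000
From: sender@example.org
To: receiver@example.com
Subject: Hello

Just a test."""
msg = message_from_string(raw)
print(msg["Subject"])           # → Hello
print(msg.get_all("Received"))  # the delivery path the headers reveal
```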

The Anatomy of a Powerpoint File

Filetypes like .xlsx (Microsoft Excel file), .docx (Microsoft Word file) and .pptx (Microsoft Powerpoint file) are actually compressed zip files. Change the suffix (e.g. .pptx to .zip) and you can unzip it:

anatomy pptx

Once you’re inside, you can have access to individual image files, or other media resources, that are included in the document, as well as the rest of the “source” material for the document.
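Python’s zipfile module will open these packages directly, no renaming required; this sketch builds a tiny synthetic stand-in package in memory and lists its members (for a real file you’d just open zipfile.ZipFile('slides.pptx')):

```python
import io
import zipfile

# Build a tiny stand-in "pptx" in memory, then list its members
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("ppt/slides/slide1.xml", "<p:sld/>")
    zf.writestr("ppt/media/image1.png", "png bytes would live here")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
print(names)  # the "source" material packed inside the document
```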

The Anatomy of an Image File

Image files are packed with metadata, as this peek inside a photo on John Naughton’s blog shows:


We can also poke around with the actual image data, filtering the image in a variety of ways, changing the compression rate, and so on. We can even edit the image data directly…
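As a small taste of what that byte-level poking around looks like, this sketch reads the width and height straight out of a PNG’s IHDR chunk (the sample bytes are built by hand, purely for illustration):

```python
import struct

def png_size(data):
    """Read width and height from a PNG's IHDR chunk."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    # Bytes 8-15 hold the first chunk's length and its b"IHDR" type;
    # width and height follow as big-endian unsigned ints.
    return struct.unpack(">II", data[16:24])

# Hand-built PNG header: signature, IHDR chunk length, type, then 640x480
header = (b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">II", 640, 480))
print(png_size(header))  # → (640, 480)
```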


Showing people how to poke around inside a resource has several benefits: it gives you a strategy for exploring your curiosity about what makes a particular resource work (and perhaps also demonstrates that you can be curious about such things); it shows you how to start looking inside a resource (how to go about dissecting it, doing the “View Source” thing); and it shows you how to start reading the entrails of the thing.

In so doing, it helps foster a sense of curiosity about how stuff works, as well as helping develop some of the skills that allow you to actually take things apart (and maybe put them back together again!). The detail also hooks you into the wider systemic considerations – why does a file need to record this or that field, for example, and how does the rest of the system make use of that information? (As MPs have recently been debating the Investigatory Powers Bill, I wonder how many of them have any clue about what sort of information can be gleaned from communications (meta)data, let alone what it looks like and how communications systems generate, collect and use it.)

PS Hmmm, thinks.. this could perhaps make sense as a series of OpenLearn posts?

Tagging Twitter Users From Their Twitter Bios Using OpenCalais and Alchemy API

Prompted by a request at the end of last week for some Twitter data I no longer have to hand, I revisited an old notebook script to try to tidy up some of my Twitter data grabbin’n’mungin’ scripts and have a quick play with some new toys, such as the pyLDAvis [demo] tool for analysing topic models in a notebook setting. (I generated my test models using gensimPy3, a Python 3 port of gensim, which all seemed to work okay…More on that in a later post, maybe…)

I also plugged in some entity extracting API calls for IBM’s Alchemy API and Thomson Reuters’ OpenCalais API. Both of these services provide a JSON response – and both are rate limited – so I’m cacheing responses for a while in a couple of MongoDB collections (one collection per service).

Here’s an example of the sort of response we can get from a call to Open Calais:

[{'_id': 213808950,
  'description': 'Daily Telegraph sub-editor, Newcastle United follower and future England cricketer',
  'reuters_entities': [{'text': 'follower and future England cricketer',
    'type': 'Position'},
   {'text': 'Daily Telegraph', 'type': 'Company'},
   {'text': 'sub-editor', 'type': 'Position'},
   {'text': 'Daily Telegraph', 'type': 'PublishedMedium'}],
  'screen_name': 'laurieallsopp'}]

Looking at that data, which I retrieved from my MongoDB (the reuters_entities are the bits I got back from the OpenCalais API), I wondered how I could query the database to pull back just the Position info, or just bios associated with a PublishedMedium.

It turns out that the $elemMatch operator was the one I needed: it allows what is essentially a wildcarded search into an array of subdocuments (and it can be nested if you need to search deeper…):

# 'calais' here stands for the collection cacheing the OpenCalais responses
db.calais.find({'reuters_entities': {'$elemMatch': {'type': 'PublishedMedium'}}},
               projection={'screen_name': 1, 'description': 1,
                           'reuters_entities.text': 1, 'reuters_entities.$': 1})


In that example, the criteria definition limits returned records to those of type PublishedMedium, and the projection is used to return the first such matching element.

I can also run queries on job titles, or roles, as this example using grabbed AlchemyAPI data shows:

# 'alchemy' here stands for the collection cacheing the Alchemy API responses
db.alchemy.find({'ibm_entities': {'$elemMatch': {'type': 'JobTitle'}}},
                projection={'screen_name': 1, 'description': 1,
                            'ibm_entities.text': 1, 'ibm_entities.$': 1})


And so it follows that I could try to search for folk tagged as an editor (or variant thereof: editorial director, for example), modifying the query to additionally perform a case-insensitive search (I’m using pymongo to query the database):

db.alchemy.find({'ibm_entities': {'$elemMatch': {'type': 'JobTitle',
                 'text': {'$regex': 'editor', '$options': 'i'}}}})

For a case insensitive but otherwise exact search, use an expression of the form "^editor$" to force the search on the tag to match at the start (^) and end ($) of the string…
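The same anchoring trick is easy to check with Python’s standard re module (the tag strings here are invented examples):

```python
import re

tags = ["Editor", "editorial director", "sub-editor"]

# Loose, case-insensitive match anywhere in the string...
loose = [t for t in tags if re.search(r"editor", t, re.I)]
# ...versus an exact (but still case-insensitive) match, anchored
# at the start (^) and end ($) of the string
exact = [t for t in tags if re.search(r"^editor$", t, re.I)]

print(loose)  # → ['Editor', 'editorial director', 'sub-editor']
print(exact)  # → ['Editor']
```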

I’m not sure if such use of the entity data complies with the license terms though!

And of course, it would probably be much easier to just lookup whether the description contains a particular word or phrase!