Category: Library

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well as the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most president/director roles: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit the SQLite console by running: .quit
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
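If you’d rather script the same steps, here’s a minimal sketch using Python’s built-in sqlite3 module. The rows below are made up for illustration – in practice you’d load the Sunday Times CSV with the csv module and executemany():

```python
import sqlite3

# In-memory database for the sketch; pass a file path instead to persist it
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE panama(
    company_url TEXT, company_name TEXT, officer_position_es TEXT,
    officer_position_en TEXT, officer_name TEXT, inc_date TEXT,
    dissolved_date TEXT, updated_date TEXT, company_type TEXT, mf_link TEXT)""")

# Illustrative rows only – not real data from the download
rows = [
    ("u1", "Acme Offshore", "", "Director", "A N Other", "", "", "", "", ""),
    ("u2", "Beta Holdings", "", "Director", "A N Other", "", "", "", "", ""),
    ("u3", "Gamma Ltd", "", "President", "J Smith", "", "", "", "", ""),
]
conn.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# The "who is named most often" query from the steps above
top = conn.execute("""SELECT officer_name, COUNT(*) AS c FROM panama
                      GROUP BY officer_name ORDER BY c DESC LIMIT 10""").fetchall()
print(top)  # → [('A N Other', 2), ('J Smith', 1)]
```

Same quick’n’dirty spirit – the scripted route just makes the queries repeatable.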



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

Made-Up eBook Physics

Some reflections on reading a subscription based, “scholarly” ebook just now…

First, I can read the book online, or download it.


If I download it, I need to get myself a special pair of spectacles to read the magic ink it’s written in.

I also need to say how long I want the “loan period” to be. (I don’t know if this is metered according to multiples of 24 hours, or ends at the end of a calendar-based return day.) At the end of the loan period, I think I can keep the book but suspect that a library rep will come round to my house and either lock the book in a box to which only they have the key, or run the pages through a shredder (I’m not sure which).

Looking at the online copy of the book, there are various quotas associated with how I can use it.


A tool bar provides me with various options: the Image view is a crappy resolution view, albeit one that provides an infinite scroll through the book.


The PDF view lets me view a PDF version of the current page, though I can’t copy from it. (I do seem to be able to download it though, using the embedded PDF reader, without affecting any quotas?)


If I select the Copy option, it takes me into the PDF view and does let me highlight and copy text from that page. However, if I try to copy from too many pages, that “privilege” is removed…


As far as user experience goes, it’s pretty rubbish on first use, and many of the benefits of having the electronic version, as compared to a print version, have been defensively (aggressively?!) coded against. This doesn’t achieve anything other than introduce inconvenience. So for example, having run out of my copy quota, I manually typed out the sentence I wasn’t allowed to highlight and cmd/ctrl-C.

More Recognition/Identification Service APIs – Microsoft Cognitive Services

A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.

Now it seems that Microsoft Cognitive Services (formerly Project Oxford, in part) brings Microsoft’s tools to the party with a range of free tier and paid/metered services:


So what’s on offer?


  • Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
  • Emotion API: extract emotion features from a photo of a person; 30,000 free transactions per month;
  • Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
  • Video API: 300 free transactions per month per feature.



  • Bing Spell Check API: 5,000 free transactions per month
  • Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
  • Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, parts of speech tagging, etc.) It’s dog slow and, from the times I got it to sort of work, this seems to be about the limit of what it can cope with (and even then it takes forever): 5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
  • Text Analytics API: sentiment analysis, topic detection and key phrase detection, language detection; 5,000 free transactions;
  • Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.
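For what it’s worth, the REST APIs all seem to follow much the same calling pattern: POST a JSON payload with your subscription key in an Ocp-Apim-Subscription-Key header. Here’s a hedged sketch using Python’s stdlib urllib – the endpoint URL and image URL below are placeholders, not real service addresses, so check the current API docs before relying on any of it:

```python
import json
import urllib.request

ENDPOINT = "https://api.example.com/vision/v1.0/analyze"  # placeholder URL
SUBSCRIPTION_KEY = "YOUR_KEY_HERE"

# Image-analysis style payload: point the service at an image URL
body = json.dumps({"url": "https://example.com/image.jpg"}).encode("utf-8")
req = urllib.request.Request(
    ENDPOINT,
    data=body,
    headers={
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
    },
    method="POST",
)

# urllib.request.urlopen(req) would actually send the request; it's omitted
# here so the sketch runs without a key or network access.
print(req.get_method(), req.full_url)
```

The per-service endpoints, regions and payload shapes differ, but the key-in-a-header pattern is common to the lot.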



There’s also a gallery of demo apps built around the APIs.

It seems, then, that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they are developed and continue to improve will be the proof of just how useful they become as utility services.

As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired in to the experiment flows in the Azure Machine Learning studio?)

I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)

PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!

PPS Nov 2016 for photo-tagging, see also Amazon Rekognition.

A New Role for the Library – Gonzo Librarian Informationista

At the OU’s Future of Academic Libraries a couple of weeks ago, Sheila Corrall introduced a term and newly(?!) emerging role I hadn’t heard before coming out of the medical/health library area: informationist (bleurghh..).

According to a recent job ad (h/t Lorcan Dempsey):

The Nursing Informationist cultivates partnerships between the Biomedical Library and UCLA Nursing community by providing a broad range of information services, including in-depth reference and consultation service, instruction, collection development, and outreach.

Hmm… sounds just like a librarian to me?

Writing in the Journal of the Medical Library Association, The librarian as research informationist: a case study (101(4): 298–302, October 2013), Lisa Federer described the role in the following terms:

“The term “informationist” was first coined in 2000 to describe what the authors considered a new health sciences profession that combined expertise in library and information studies with subject matter expertise… Though a single model of informationist services has not been clearly defined, most descriptions of the informationist role assume that (1) informationists are “embedded” at the site where patrons conduct their work or need access to information, such as in a hospital, clinic, or research laboratory; and (2) informationists have academic training or specialized knowledge of their patrons’ fields of practice or research.”

Federer started to tighten up the definition in relation to research in particular:

Whereas traditional library services have generally focused on the “last mile” or finished product of the research process—the peer-reviewed literature—librarians have expertise that can help researchers create better research output in the form of more useful data. … The need for better research data management has given rise to a new role for librarians: the “research informationist.” Research informationists work with research teams at each step of the research process, from project inception and grant seeking to final publication, providing expert guidance on data management and preservation, bibliometric analysis, expert searching, compliance with grant funder policies regarding data management and open access, and other information-related areas.

This view is perhaps shared in a presentation on The Informationist: Pushing the Boundaries by Director of Library Services, Elaine Martin, in a presentation dated on Slideshare as October 2013:


Associated with the role are some competencies you might not normally expect from a library staffer:


So – maybe here is the inkling of the idea that there could be a role for librarians skilled in working with information technologies in a more techie way than you might normally expect. (You’d normally expect a librarian to be able to use Boolean search, search limits and advanced search forms. You might not expect them to write their own custom SQL queries, or even build and populate their own databases that they can then query? But perhaps you’d expect a really techie informationist to?) And maybe also the idea that the informationist is a participant in a teaching or research activity?

The embedded nature of the informationist also makes me think of gonzo journalism, a participatory style of narrative journalism written from a first person perspective, often including the reporter as part of the story. Hunter S. Thompson is often held up as some sort of benchmark character for this style of writing, and I’d probably class Louis Theroux as a latter-day exemplar. The reporter as naive participant, in which the journalist acts as a proxy for everyman’s – which is to say, our own – direct experience of the reported situation, is also in the gonzo style (see for example Feats of gonzo journalism have lost their lustre since George Plimpton’s pioneering days as a universal amateur).

So I’m wondering: isn’t the informationist actually a gonzo librarian, joining in with some activity and bringing the skills of a librarian, or wider information scientist (or information technologist/technician), to the party…?

Another term introduced by Sheila Corrall, and again new to me, was “blended librarian”. According to Steven J. Bell and John Shank, writing on The blended librarian in College and Research Libraries News, July/August 2004, pp 372-375:

We define the “blended librarian” as an academic librarian who combines the traditional skill set of librarianship with the information technologist’s hardware/software skills, and the instructional or educational designer’s ability to apply technology appropriately in the teaching-learning process.

The focus of that paper was in part on defining a new role in which “the skills and knowledge of instructional design are wedded to our existing library and information technology skills”, but that doesn’t quite hit the spot for me. The paper also described six principles of blended librarianship, which are repeated on the LIS Wiki:

  1. Taking a leadership position as campus innovators and change agents is critical to the success of delivering library services in today’s “information society”.
  2. Committing to developing campus-wide information literacy initiatives on our campuses in order to facilitate our ongoing involvement in the teaching and learning process.
  3. Designing instructional and educational programs and classes to assist patrons in using library services and learning information literacy that is absolutely essential to gaining the necessary skills (trade) and knowledge (profession) for lifelong success.
  4. Collaborating and engaging in dialogue with instructional technologists and designers which is vital to the development of programs, services and resources needed to facilitate the instructional mission of academic libraries.
  5. Implementing adaptive, creative, proactive, and innovative change in library instruction can be enhanced by communicating and collaborating with newly created Instructional Technology/Design librarians and existing instructional designers and technologists.
  6. Transforming our relationship with faculty to emphasize our ability to assist them with integrating information technology and library resources into courses, but adding to that traditional role a new capacity to collaborate on enhancing student learning and outcome assessment in the area of information access, retrieval and integration.

Again, the emphasis on being able to work with current forms of instructional technology falls short of the mark for me.

But there is perhaps a glimmer of light in the principle associated with “assist[ing faculty] with integrating information technology and library resources into courses”, if we broaden that principle to include researchers as well as teachers, and then add in the idea that the informationist should also be helping explore, evaluate, advocate and teach on how to use emerging information technologies (including technologies associated with information and data processing, analysis and communication – that is, presentation, so things like data visualisation).

So I propose a new take on the informationist, adopting the term proposed in a second take tweet from Lorcan Dempsey: the informationista (which is far more playful, if nothing else, than informationist).

The informationista is someone like me, who tries to share contemporary information skills (such as these) through participatory as well as teaching activities, blending techie skills with a library attitude. The informationista is also a hopeful and enthusiastic amateur (in the professional sense…) who explores ways in which new and emerging skills and technologies may be applied to the current situation.

At last, I have found my calling!;-)

See also: Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library (this has dated a bit but there is still quite a bit that can be retrieved from that sort of take, I think…)

PS see also notes on embedded librarians in the comments below.

Full Text Is [Not] Available…

Whenever I go to a library conference, I come away re-motivated. At The Future of Academic Libraries Symposium held at the OU yesterday, which also hosted the official launch of the OU Digital Archive and a celebration of the career of ever-welcoming Librarian Nicky Whitsed as she heads off to pastures new, I noticed again that I’m happiest when thinking about the role of the Library and information professional, and what it means in a world where discovery of, access to, and processing of information is being expanded every day (and whether what’s newly possible is part of the Library remit).

I’ll post more thoughts on the day later, but for now, a bit of library baiting…! [Hmm, thinks… maybe this is when I’m happiest?!;-)]

The OU Library recently opted in to a new discovery system. Aside from the fact that the authentication doesn’t always seem to work seamlessly, there seems to be a signalling issue with the search results:


When is available not available? When does green mean go? If it said “Full text available” but had a red indicator, I might get the idea that the thing exists in a full text online version, but that I don’t have access to it. But with the green light? That’s like saying the book is on-shelf, when really it’s on a shelf in a bookshop adjacent to the library.

Here’s another example, from the OU repository, where the formally published intellectual academic research outputs of members of the University are published:


As you can see, this particular publication is not available via the repository, due to copyright restrictions and the publishing model of the particular journal involved, but neither does the Library subscribe to the journal. (Which got me wondering – if we did an audit of just the records in the repository and looked up the journal/conference publication details for each one, how many of those items would the OU Library have a subscription to?)

One of the ways I think Libraries have arguably let down their host institutions is in allowing the relationship with the publishers to get into the state it currently is. Time was when the departmental library would have copies of preprints or offprints of articles that had been published in journals (though I don’t recall them also being delivered to the central library?) As it is, we can still make a direct request of the author for a copy of a paper. But the Library – whilst supporting discovery of outputs from the OU academic community – is not able to deliver those actual outputs directly? Which seems odd to me…

Enjoy your retirement, Nicky!:-)

Unthinkable Thinking Made Real?

A few days ago was one of the highlights of my conference year, Internet Librarian International, which started on Monday when I joined up with Brian Kelly once again for a second (updated) outing of our Preparing for the Future workshop (Brian’s resources; my reference slides (unannotated for now, will be updated at some point)).

I hope to post some reflections on that over the next few days, but for now would like to mention one of the presentations on Tuesday – Thinking the unthinkable: a library without a catalogue by Johan Tilstra (@johantilstra) from Utrecht University Library. This project seems to have been in progress for some time, the main idea being that discovery happens elsewhere: the university library should not be focussing on providing discovery services, but instead should be servicing the delivery of content surfaced or discovered elsewhere. To support this, Utrecht are developing a browser extension – UU Easy Access – that will provide full text access to remote resources. As the blurb puts it, [the extension] detects when you are on a website which the Utrecht University Library has a subscription to. This makes it easy for you to get access through the Library Proxy.

This reminded me of an old experiment back from the days I hassled the library regularly, the OU Library Traveller extension (actually, a Greasemonkey script; remember Greasemonkey?;-)

It seems I only posted fragmentary posts about this doodle (OU Library Traveller – Title Lookup and OU Library Traveller – eBook Greedy, for example) but for those without long memories, here’s a brief recap: a long time ago, Jon Udell published a library lookup bookmarklet that would scan the URL of a web page you were on to see if it contained an ISBN (a universal book number), and if so, it would try to open up the page corresponding to that book on your local library catalogue.
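By way of illustration, the guts of that idea fit in a few lines. This sketch is mine, not Udell’s: the regex is a rough ISBN-spotter and the catalogue URL is a made-up placeholder, not a real OPAC address:

```python
import re

# Crude pattern for a 13-digit (978/979-prefixed) or 10-digit ISBN in a URL
ISBN_RE = re.compile(r"\b(97[89]\d{10}|\d{9}[\dXx])\b")

def catalogue_link(page_url, opac="https://catalogue.example.ac.uk/search?isbn="):
    """Return a local catalogue lookup URL if the page URL contains an ISBN."""
    m = ISBN_RE.search(page_url)
    return opac + m.group(1) if m else None

print(catalogue_link("https://www.amazon.co.uk/dp/0596007973"))
```

Wrap that logic in a bookmarklet or extension pointed at your own library’s catalogue and you have the essence of the library lookup trick.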

I forget the various iterations involved, or related projects in the area (such as scripts that looked for ISBNs or DOIs in a webpage and rewrote them as links to a search on that resource via a library catalogue, or dereferencing of the DOI through a DOI lookup and libezproxy service), but at some point I ended up with a Greasemonkey script that would pop up a floating panel on a page that contained an ISBN to show whether that book was in the OU Library, or available as a full text e-book. (Traffic light colour coded links also showed if the resource was available, owned by the library but currently unavailable, or not available.) I also had – still have, still use regularly – a bookmarklet that will rewrite the URL for subscription based content, such as an academic journal paper, so it goes via the OU library and (hopefully) provides me with full text access: OU libezproxy bookmarklet (see also Arcadia project: bookmarklets; I think some original, related “official-ish, not quite, yet, in testing” OU Library bookmarklets are still available here).

So the “Thinking the Unthinkable” presentation got me thinking that perhaps I had also been thinking along similar lines, as well as that perhaps I should revisit the code to provide an extension that would automatically enhance pages that contained somewhere about them an ISBN or DOI or web domain recognised by the OU’s libezproxy. (If any OU library devs are reading this, (Owen?!;-) it’d be really useful to have a service that could take a URL and then return a boolean flag to say whether or not the OU libezproxy service could do something useful with that URL… or provide me with a list of domains that the OU libezproxy service likes so I could locally decide whether to try to reroute a URL through it…) Hmm….
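Sketching the sort of thing I mean: EZproxy installations conventionally take a target URL via a …/login?url= prefix, so the rerouting plus the “can the proxy help with this URL?” check might look something like this. The proxy host and the domain list here are invented placeholders, not the OU’s actual configuration:

```python
from urllib.parse import urlparse, quote

EZPROXY_LOGIN = "https://libezproxy.example.ac.uk/login?url="  # placeholder host
SUBSCRIBED_DOMAINS = {"www.jstor.org", "link.springer.com"}    # made-up list

def ezproxy_can_help(url):
    """The boolean check mooted above: is this a domain the proxy handles?"""
    return urlparse(url).netloc in SUBSCRIBED_DOMAINS

def reroute(url):
    """Rewrite a subscription URL so it goes via the library proxy."""
    return EZPROXY_LOGIN + quote(url, safe="") if ezproxy_can_help(url) else url

print(reroute("https://www.jstor.org/stable/12345"))
```

Given a published list of proxied domains, that check could run locally in an extension without needing a round trip to a library service at all.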

As I dug through old blog posts, I was also reminded of a couple of other things. Firstly, another competition hack that tried to associate courses with books using a service published by Dave Patten at the University of Huddersfield. Hmm… Thinks… Related content… or alternative content, maybe… so if I’m on a journal page somewhere, maybe I could identify whether it’s OA available from a university repository..? (Which I guess is what Google Scholar often does when it links to a PDF copy of a paper?)

Secondly, I was reminded of another presentation I gave at ILI six years ago (the slides are indecipherable and without annotation) on “The Invisible Library” (which built on from a similarly titled internal OU presentation a few weeks earlier).

The original idea was that libraries could provide invisible helpdesk support through monitoring social media channels, but also included elements of providing locally mediated access to remotely discovered items in an invisible way through things like the OU Library Traveller. It also seems to refer to “contentless” libraries, (eg as picked up in this April Fool), and perhaps foreshadows the idea of an open access academic library.

So I wonder – time to revisit this properly, and try to recapture the (unthinkable?) thinking I was thinking back then?

PS I also notice that around that time I was experimenting with Google Custom search engines. This is the second time in as many months I’ve rediscovered my CSE doodles (previously with Creating a Google Custom Search Engine Over Hyperlocal Sites). Maybe it’s time I revisited them again, too…?

Idle Thoughts on “Data Literacy” in the Library…

In part for a possible OU Library workshop, in part trying to mull over possible ideas for an upcoming ILI2015 workshop with Brian Kelly, I’ve been pondering what sorts of “data literacy” skills are in-scope for a typical academic library.

As a starting point, I wonder if this slicing is useful, based on the ideas of data management, discovery, reporting and sensemaking.


It identifies four different, though interconnected, sorts of activity, or concern:

  • Data curation questions – research focus – covering the management and dissemination of research data. This is mainly about policy, but begs the question about who to go to for the technical “data engineering” issues, and assumes that the researcher can do the data analysis/data science bits.
  • Data resourcing – teaching focus – finding and perhaps helping identify processes to preserve data for use in teaching context.
  • Data reporting – internal process focus – capturing, making sense of/analysing, and communicating data relating to library related resources or activities; to what extent should each librarian be able to use and invoke data as evidence relating to day job activities? Could include giving data to course teams about resource utilisation, or to research teams to demonstrate impact in terms of tracking downloads and use of OU published resources.
  • Data sensemaking – info skills focus – PROMPT in a data context, but also begging the question about who to go to for “data computing” applications or skills support (cf academic/scientific computing support, application training); also relates to ‘visual literacy’ in the sense of interpreting data visualisations, and methods for engaging in data storytelling and academic communication.

Poking in to each of those areas a little further, here’s what comes to mind at first thought…

Data Curation

The library is often the nexus of activity around archiving and publishing research papers as part of an open access archive (in the OU, this is via ORO: Open Research Online). Increasingly, funders (and publishers) require that researchers make data available too, often under an open data license. Into this box I’m thinking of those activities related to supporting the organisation, management, archiving, and publication of data related to research. It probably makes sense to frame this in the context of a formal lifecycle of a research project and either the various touchpoints that the lifecycle might have with the library, or those areas of the lifecycle where particular data issues arise. I’m sure such things exist, but what follows is an off-the-top-of-my-head informal take on it…!

Initial questions might relate to putting together (and costing) a research data management plan (planning/bidding, data quality policies, metadata plans etc). There might also be requests for advice about sharing data across research partners (which might extend privacy or data protection issues over and above any immediate local ones). In many cases, there may be concerns about linking to other datasets (for example, in terms of licensing or permissions, or relating to linked or derived data use; mapping is often a big concern here), or other, more mundane, operational issues (how do I share large datafiles that are too big to email?). Increasingly, there are likely to be publication/dissemination issues (how/where/in what format do I publish my data so it can be reused, how should I license it?) and legacy data management issues (how/where can I archive my data? what file formats should I use?). A researcher might also need support in thinking through consequences – or requirements – of managing data in a particular way. For example, particular dissemination or archiving requirements might inform the choice of data management solution from the start: if you use an Access database, or directory full of spreadsheets, during the project with one set of indexing, search or analysis requirements, you might find a certain amount of re-engineering work needs to be done in the dissemination phase if there is a requirement that the data is published at record level on a public webpage with different search or organisational requirements.

What is probably out of scope for the library in general terms, although it may be in scope for more specialised support units working out of the library, is providing support in actual technology decisions (as opposed to raising technology specification concerns…) or operations: choice of DBMS, for example, or database schema design. That said, who does provide this support, or whom should the library suggest might be able to provide such support services?

(Note that these practical, technical issues are totally in scope for the forthcoming OU course TM351 – Data management and analysis…;-)

Data resourcing

For the reference librarian, requests are likely to come in from teaching staff, students, or researchers about where to locate or access different sources of data for a particular task. For teaching staff, this might include identifying datasets that can be used in the context of a particular course, possibly over several years. This might require continuity of access via a persistent URL to different sorts of dataset: a fixed (historical) dataset, for example, or a current, “live” dataset, reporting the most recent figures month on month or year on year. Note that there may be some overlap with data management issues, for example, ensuring that data is both persistent and provided in a format that will remain appropriate for student use over several years.

Researchers too might have third party data discovery or access requests, particularly with respect to accessing commercial or privately licensed data. Again, there may be overlaps with data management concerns, such as how to manage secondary data/third party data appropriately so it doesn’t taint the future licensing or distribution of first party or derived data, for example.

Students, like researchers, might have very specific data access requests – either for particular datasets, or for specific facts – or require more general support, such as advice in citing or referencing sources of secondary data they have accessed or used.

Data reporting

In the data reporting bin, I’m thinking of various data reporting tasks the library might be asked to perform by teaching staff or researchers, as well as data work that has to be done internally within the library, by librarians, for themselves. That is, tasks within the library that require librarians to employ their own data handling skills.

So for example, a course team might want to know which library managed resources referenced from course material are being used, when, and by how many students. Or learning analytics projects may request access to data to help build learner retention models.

A research team might be interested in the number of research paper or data downloads from the local repository, or citation analyses, or other sources of bibliometric data, such as journal metrics or altmetrics, for assessing the impact of a particular project.

And within the library, there may be a need for working with and analysing data to support the daily operations of the library – staffing requirements on the helpdesk based on an analysis of how and when students call on it, perhaps – or to feed into future planning. Looking at journal productivity, for example (how often journals are accessed, or cited, within the institution), when it comes to renewal (or subscription checking) time; or at a more technical level, building recommendation systems on top of library usage data. Monitoring the performance of particular areas of the library website through website analytics, or even linking out to other datasets and looking at the impact of library resource utilisation by individual students on their performance.

Data sensemaking

In this category, I’m lumping together a range of practical tools and skills to complement the tools and skills that a library might nurture through information skills training activities (something that’s also in scope for TM351…). So for example, one area might be providing advice about how to visualise data as part of a communication or reporting activity, both in terms of general data literacy (use a bar chart, not a pie chart, for this sort of data; switch the misleading colours off; sort the data to better communicate this rather than that, etc) as well as tool recommendations (try using this app to generate these sorts of charts, or this webservice to plot that sort of map). Another might be how to read, interpret, or critique a data visualisation (looking at crappy visualisations can help here!;-), or rate the quality of a dataset in much the same way you might rate the quality of an article.

At a more specialist level, there may be a need to service requests about what tools to use to work with a particular dataset, for example, a digital humanities researcher looking for advice on a text mining project?

I’m also not sure how far along the scale of search skills library support needs to go, or whether different levels of (specialist?) support need to be provided for undergrads, postgrads and researchers. Certainly, if your data is in a tabular format, even just as a Google spreadsheet, you become much more powerful as a user if you can frame complex data queries (pivot tables, anyone?) or start customising SQL queries. Being able to merge datasets, filter them (by row, or by column), facet them, cluster them or fuzzy join them are really powerful dataskills to have – and they can conveniently be developed within a single application such as OpenRefine!;-)
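As a toy example of the last of those – fuzzy joining – Python’s stdlib difflib can stand in for the fancier matching and clustering facilities of a tool like OpenRefine. The names here are made up purely for illustration:

```python
import difflib

# A dataset whose name column doesn't quite agree with incoming records
course_contacts = {"Smith, J.": "TM351", "Jones, A.": "T101"}

def fuzzy_lookup(name, keys, cutoff=0.6):
    """Return the closest-matching key above the similarity cutoff, if any."""
    matches = difflib.get_close_matches(name, list(keys), n=1, cutoff=cutoff)
    return matches[0] if matches else None

# "Smith, J" (no trailing dot) still matches the "Smith, J." record
print(fuzzy_lookup("Smith, J", course_contacts))
```

It’s only a rough stand-in – real record matching needs blocking, normalisation and manual review – but it shows how little code the basic idea takes.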

Note that there is likely to be some cross-over here also between the resource discovery role described above and helping folk develop their own data discovery and criticism skills. And there may also be requirements for folk in the library to work on their own data sensemaking skills in order to do the data reporting stuff…


So, is that a useful way of carving up the world of data, as the library might see it?

The four different perspectives on data related activities within the library described above cover not only data related support services offered by the library to other units, but also suggest a need for data related skills within the library to service its own operations.

What I guess I need to do is flesh out each of the topics with particular questions that exemplify the sort of question that might be asked in each context by different sorts of patron (researcher, educator, learner). If you have any suggestions/examples, please feel free to chip them in to the comments below…;-)