Category: Library

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, an application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?


If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).


(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.


The applications are behind a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?


Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:


They work, too!


On the Jupyter notebook front, you get access to Python3 and R kernels:



In passing, I notice that RStudio’s RMarkdown now demonstrates some notebook like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?) If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of the claim that “tmpnb can run any Docker container”. Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which myBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service anytime soon. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics and not practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

Application Shelves for the Digital Library – Fragments

Earlier today, I came across BioShaDock: a community driven bioinformatics shared Docker-based tools registry (BioShadock registry). This collects together a wide range of containerised applications and tools relevant to the bioinformatics community. Users can take one or more applications “off-the-shelf” and run them, without having to go through any complicated software installation process themselves, even if the installation process is a nightmare confusion of dependency hell: the tools are preinstalled and ready to run…

The container images essentially represent reference images that can be freely used by the community. The application containers come pre-installed and ready to run, exact replicas of their parent reference image. The images can be versioned with different versions or builds of the application, so you can reference the use of a particular version of an application and provide a way of sharing exactly that version with other people.

So could we imagine this as a specialist reference shelf in a Digital Library? A specialist reference application shelf, with off-the-shelf, ready-to-run tools, anywhere, any time?

Another of the nice things about containers is that you can wire them together using things like Docker Compose or Panamax templates to provide a suite of integrated applications that can work with each other. Linked containers can pass information between each other in isolation from the rest of the world. One click can provision and launch all the applications, wired together. And everything can be versioned and archived. Containerised operations can also be sequenced (eg using DRAY docker pipelines or OpenAPI).

Sometimes, you might want to bundle a set of applications together in a single, shareable package as a virtual machine. These can be versioned, and shared, so everyone has access to the same tools installed in the same way within a single virtual machine. Things like the DH Box, “a push-button Digital Humanities laboratory” (DHBox on github); or the Data Science Toolbox. These could go on to another part of the digital library applications shelf – a more “general purpose toolbox” area, perhaps?

As a complement to the “computer area” in the physical library that provides access to software on particular desktops, the digital library could have “execution rooms” that will actually let you run the applications taken off the shelf, and access them through a web browser UI, for example. So runtime environments like mybinder or tmpnb. Go to the digital library execution room (which is just a web page, though you may have to authenticate to gain access to the server that will actually run the code for you..), say which container, container collection, or reference VM you want to run, and click “start”. Or take the images home with you (that is, run them on your own computer, or on a third party host).

Some fragments relating to the above to try to help me situate this idea of runnable, packaged application shelves within the context of the library in general…

  • libraries have been, and still are, one of the places you go to access IT equipment and learn IT skills;
  • libraries used to be, and still are, a place you could go to get advice on, and training in, advanced information skills, particularly discovery, critical reading and presentation;
  • libraries used to be, and still are, a locus for collections of things that are often valuable to the community or communities associated with a particular library;
  • libraries provide access to reference works or reference materials that provide a common “axiomatic” basis for particular activities;
  • libraries are places that provide access to commercial databases;
  • libraries provide archival and preservation services;
  • libraries may be organisational units that support data and information management needs of their host organisation.

Some more fragments:

  • the creation of a particular piece of work may involve many different steps;
  • one or more specific tools may be involved in the execution of each step;
  • general purpose tools may support the actions required to perform a range of tasks to a “good enough” level of quality;
  • specialist tools may provide a more powerful environment for performing a particular task to a higher level of quality

Some questions:

  • what tools are available for performing a particular information related task or set of tasks?
  • what are the best tools for performing a particular information related task or set of tasks?
  • where can I get access to the tools required for a particular task without having to install them myself?
  • how can I effectively organise a workflow that requires the use of several different tools?
  • how can I preserve, document or reference the workflow so that I can reuse it or share it with others?

Some observations:

  • Docker containers provide a way of packaging an application or tool so that it can be “run anywhere”;
  • Docker containers may be linked together in particular compositions so that they can interoperate with each other;
  • Docker container images may be grouped together in collections within a subject specific registry: for example, BioShaDock.

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most president positions: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
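
If you’d rather script the steps above than type them into the SQLite console, here’s a rough sketch using Python’s built-in csv and sqlite3 modules. The column names are copied from the CREATE TABLE statement above; the function names are my own, and I’m assuming the CSV is UTF-8 encoded and that its first row is a header (which this version skips rather than importing).

```python
import csv
import sqlite3

# Column names copied from the CREATE TABLE statement in the steps above.
COLUMNS = [
    "company_url", "company_name", "officer_position_es", "officer_position_en",
    "officer_name", "inc_date", "dissolved_date", "updated_date",
    "company_type", "mf_link",
]


def load_panama_csv(csv_path, db_path=":memory:"):
    """Create the panama table and load the CSV into it, skipping the header row."""
    conn = sqlite3.connect(db_path)
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS panama({column_defs})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: the first row is a header, so drop it
        rows = (row for row in reader if row)  # ignore blank lines
        conn.executemany(f"INSERT INTO panama VALUES ({placeholders})", rows)
    conn.commit()
    return conn


def most_named_officers(conn, limit=10):
    """Same query as in the steps above: which officer names appear most often?"""
    query = (
        "SELECT officer_name, COUNT(*) AS c FROM panama "
        "GROUP BY officer_name ORDER BY c DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()
```

Calling load_panama_csv("sunday_times_panama_data.csv", "sundayTimesPanamaPapers.sqlite") and then most_named_officers(conn) should give you the same sort of result as the console session, though note that running the load twice against the same database file will add the rows twice.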



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

Made-Up eBook Physics

Some reflections on reading a subscription based, “scholarly” ebook just now…

First, I can read the book online, or download it.


If I download it I need to get myself a special pair of spectacles to read the magic ink it’s written in.

I also need to say how long I want the “loan period” to be. (I don’t know if this is metered according to multiples of 24 hours, or the end of the calendar based return day.) At the end of the loan period, I think I can keep the book but suspect that a library rep will come round to my house and either lock the book in a box to which only they have the key, or run the pages through a shredder (I’m not sure which).

Looking at the online copy of the book, there are various quotas associated with how I can use it.


A tool bar provides me with various options: the Image view is a crappy resolution view, albeit one that provides an infinite scroll through the book.


The PDF view lets me view a PDF version of the current page, though I can’t copy from it. (I do seem to be able to download it though, using the embedded PDF reader, without affecting any quotas?)


If I select the Copy option, it takes me into the PDF view and does let me highlight and copy text from that page. However, if I try to copy from too many pages, that “privilege” is removed…


As far as user experience goes, pretty rubbish on first use, and many of the benefits of having the electronic version, as compared to a print version, have been defensively (aggressively?!) coded against. This doesn’t achieve anything other than introduce inconvenience. So for example, having run out of my copy quota, I manually typed a copy of the sentence I wasn’t allowed to highlight and cmd/ctrl-C.

More Recognition/Identification Service APIs – Microsoft Cognitive Services

A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.

Now it seems that Microsoft Cognitive Services (formerly Project Oxford, in part) brings Microsoft’s tools to the party with a range of free tier and paid/metered services:


So what’s on offer?


  • Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
  • Emotion API: extract emotion features from a photo of a person; photos – 30,000 free transactions per month;
  • Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
  • Video API: 300 free transactions per month per feature.



  • Bing Spell Check API: 5,000 free transactions per month
  • Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
  • Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, parts of speech tagging, etc.) It’s dog slow and, from the times I got it to sort of work, this seems to be about the limit of what it can cope with (and even then it takes forever):
    5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
  • Text Analytics API: sentiment analysis, topic detection and key phrase detection, language extraction; 5,000 free transactions;
  • Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.
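
To give a feel for what a “wordsplitter” is trying to do, here’s a toy sketch in Python. It is not how the Web Language Model API works (I’d guess that uses a statistical language model to pick the most likely split); this version just looks for any split where every piece appears in a word list you hand it. The function name and the tiny vocabulary in the usage note are made up for illustration.

```python
def split_words(text, vocabulary):
    """Return one split of text into vocabulary words, or None if there isn't one."""
    n = len(text)
    # best[i] holds a list of words that exactly covers text[:i],
    # or None if no such split has been found.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break  # keep the first split found for this prefix
    return best[n]
```

So split_words("digitallibrary", {"digital", "library"}) gives ["digital", "library"], and a string that can’t be covered by the word list comes back as None. A real service also has to choose between competing splits, which is where the language model comes in.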



There’s also a gallery of demo apps built around the APIs.

It seems then that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they will be developed and continue to improve will be the proof of just how useful they will be as utility services.

As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired in to the experiment flows in the Azure Machine Learning studio?)

I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)

PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!

A New Role for the Library – Gonzo Librarian Informationista

At the OU’s Future of Academic Libraries a couple of weeks ago, Sheila Corrall introduced a term and newly(?!) emerging role I hadn’t heard before coming out of the medical/health library area: informationist (bleurghh..).

According to a recent job ad (h/t Lorcan Dempsey):

The Nursing Informationist cultivates partnerships between the Biomedical Library and UCLA Nursing community by providing a broad range of information services, including in-depth reference and consultation service, instruction, collection development, and outreach.

Hmm… sounds just like a librarian to me?

Writing in the Journal of the Medical Library Association, The librarian as research informationist: a case study (101(4): 298–302, October 2013), Lisa Federer described the role in the following terms:

“The term “informationist” was first coined in 2000 to describe what the authors considered a new health sciences profession that combined expertise in library and information studies with subject matter expertise… Though a single model of informationist services has not been clearly defined, most descriptions of the informationist role assume that (1) informationists are “embedded” at the site where patrons conduct their work or need access to information, such as in a hospital, clinic, or research laboratory; and (2) informationists have academic training or specialized knowledge of their patrons’ fields of practice or research.”

Federer started to tighten up the definition in relation to research in particular:

Whereas traditional library services have generally focused on the “last mile” or finished product of the research process—the peer-reviewed literature—librarians have expertise that can help researchers create better research output in the form of more useful data. … The need for better research data management has given rise to a new role for librarians: the “research informationist.” Research informationists work with research teams at each step of the research process, from project inception and grant seeking to final publication, providing expert guidance on data management and preservation, bibliometric analysis, expert searching, compliance with grant funder policies regarding data management and open access, and other information-related areas.

This view is perhaps shared in The Informationist: Pushing the Boundaries, a presentation by Director of Library Services, Elaine Martin, dated on Slideshare as October 2013:


Associated with the role are some competencies you might not normally expect from a library staffer:


So – maybe here is the inkling of the idea that there could be a role for librarians skilled in working with information technologies in a more techie way than you might normally expect. (You’d normally expect a librarian to be able to use Boolean search, search limits and advanced search forms. You might not expect them to write their own custom SQL queries, or even build and populate their own databases that they can then query? But perhaps you’d expect a really techie informationist to?) And maybe also the idea that the informationist is a participant in a teaching or research activity?

The embedded nature of the informationist also makes me think of gonzo journalism, a participatory style of narrative journalism written from a first person perspective, often including the reporter as part of the story. Hunter S. Thompson is often held up as some sort of benchmark character for this style of writing, and I’d probably class Louis Theroux as a latter-day exemplar. The reporter as naif participant in which the journalist acts as a proxy for everyman’s – which is to say, our own – direct experience of the reported situation, is also in the gonzo style (see for example Feats of gonzo journalism have lost their lustre since George Plimpton’s pioneering days as a universal amateur).

So I’m wondering: isn’t the informationist actually a gonzo librarian, joining in with some activity and bringing the skills of a librarian, or wider information scientist (or information technologist/technician) to the party…?

Another term introduced by Sheila Corrall and again, new to me, was “blended librarian”. According to Steven J. Bell and John Shank writing on The blended librarian in College and Research Libraries News, July/August 2004, pp 372-375:

We define the “blended librarian” as an academic librarian who combines the traditional skill set of librarianship with the information technologist’s hardware/software skills, and the instructional or educational designer’s ability to apply technology appropriately in the teaching-learning process.

The focus of that paper was in part on defining a new role in which the skills and knowledge of instructional design are wedded to our existing library and information technology skills, but that doesn’t quite hit the spot for me. The paper also described six principles of blended librarianship, which are repeated on the LIS Wiki:

  1. Taking a leadership position as campus innovators and change agents is critical to the success of delivering library services in today’s “information society”.
  2. Committing to developing campus-wide information literacy initiatives on our campuses in order to facilitate our ongoing involvement in the teaching and learning process.
  3. Designing instructional and educational programs and classes to assist patrons in using library services and learning information literacy that is absolutely essential to gaining the necessary skills (trade) and knowledge (profession) for lifelong success.
  4. Collaborating and engaging in dialogue with instructional technologists and designers which is vital to the development of programs, services and resources needed to facilitate the instructional mission of academic libraries.
  5. Implementing adaptive, creative, proactive, and innovative change in library instruction can be enhanced by communicating and collaborating with newly created Instructional Technology/Design librarians and existing instructional designers and technologists.
  6. Transforming our relationship with faculty to emphasize our ability to assist them with integrating information technology and library resources into courses, but adding to that traditional role a new capacity to collaborate on enhancing student learning and outcome assessment in the area of information access, retrieval and integration.

Again, the emphasis on being able to work with current forms of instructional technology falls short of the mark for me.

But there is perhaps a glimmer of light in the principle associated with “assist[ing faculty] with integrating information technology and library resources into courses”, if we broaden that principle to include researchers as well as teachers, and then add in the idea that the informationist should also be helping explore, evaluate, advocate and teach on how to use emerging information technologies (including technologies associated with information and data processing, analysis and communication (that is, presentation; so things like data visualisation)).

So I propose a new take on the informationist, adopting the term proposed in a second take tweet from Lorcan Dempsey: the informationista (which is far more playful, if nothing else, than informationist).

The informationista is someone like me, who tries to share contemporary information skills (such as these), through participatory as well as teaching activities, blending techie skills with a library attitude. The informationista is also a hopeful and enthusiastic amateur (in the professional sense…) who explores ways in which new and emerging skills and technologies may be applied to the current situation.

At last, I have found my calling!;-)

See also: Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library (this has dated a bit but there is still quite a bit that can be retrieved from that sort of take, I think…)

Full Text Is [Not] Available…

Whenever I go to a library conference, I come away re-motivated. At The Future of Academic Libraries Symposium held at the OU yesterday, which also hosted the official launch of the OU Digital Archive and a celebration of the career of ever-welcoming Librarian Nicky Whitsed as she heads off to pastures new, I noticed again that I’m happiest when thinking about the role of the Library and information professional, and what it means in a world where discovery of, access to, and processing of information is being expanded every day (and whether what’s newly possible is part of the Library remit).

I’ll post more thoughts on the day later, but for now, a bit of library baiting…! [Hmm, thinks.. maybe this is when I’m happiest?!;-)]

The OU Library recently opted in to a new discovery system. Aside from the fact that the authentication doesn’t always seem to work seamlessly, there seems to be a signalling issue with the search results:


When is available not available? When does green mean go? If it said “Full text available” but had a red indicator, I might get the idea that the thing exists in a full text online version, but that I don’t have access to it. But with the green light? That’s like saying the book is on-shelf but it being on a shelf in a bookshop adjunct to the library.

Here’s another example, from the OU repository, where the formally published intellectual academic research outputs of members of the University are published:


As you can see, this particular publication is not available via the repository, due to copyright restrictions and the publishing model of the particular journal involved, but neither does the Library subscribe to the journal. (Which got me wondering – if we did an audit of just the records in the repository and looked up the journal/conference publication details for each one, how many of those items would the OU Library have a subscription to?)

One of the ways I think Libraries have arguably let down their host institutions is in allowing the relationship with the publishers to get into the state it currently is. Time was when the departmental library would have copies of preprints or offprints of articles that had been published in journals (though I don’t recall them also being delivered to the central library?) As it is, we can still make a direct request of the author for a copy of a paper. But the Library – whilst supporting discovery of outputs from the OU academic community – is not able to deliver those actual outputs directly? Which seems odd to me…

Enjoy your retirement, Nicky!:-)