Category: Library

And the Library Said: “Thou Shalt Learn to DO Full References But We Will Not Allow You to Search By Them”

OU Library guidance to students on citations for journal articles reads as follows:


Using this reference I should be able to run a pretty good known item search – or not, as the case may be?


So where does the full reference – Journal, for example – help exactly? On Google, maybe… (Actually, the one-box search may match terms across different fields – the article title as well as the journal title – and generate retrieval/ranking factors based on that?)

References and search contexts are complementary – for a reference to be effective, it needs to work with your search context, which typically means the user interface of your search system – for a specific known item reference, this typically means the (hidden away) advanced search interface.

So I wonder: whilst we penalise students for not using full, formal references (even though they often provide enough of a reference to find the item on Google), the officially provided search tools don’t let you use the information in the formal reference in a structured way to retrieve and hopefully access (rather than discover – the reference is the discovery component) the desired item?

Or am I reading the above search UI incorrectly…?

PS in terms of teaching material design, and referencing the above citation example, erm….?


Because of course I’m not searching for a Journal that has something to do with Frodo Baggins – I’m searching for an article

PPS I’m also finding more and more that the subscription journal content I want to access is from journals that the OU Library doesn’t subscribe to. I’m not sure how many of the bundled journals it does subscribe to are never accessed (the data should reveal that)? So I wonder – as academics (and maybe students), should we instead be given a budget code we could use to buy the articles we want? And for articles used by students in courses, get a “site license” for some articles?

Now what was the URL of that pirated academic content site someone from the library told me about again…?

PS from the Library – don’t use the reference – just bung the title in the search onebox like you would do on a web search engine…


Hmm… but if I have a full reference, I should be able to run a search that returns just a single result, for exactly the item I want? Or maybe returns links to a few different instances (from different suppliers) of just that resource? But then – which is the preferred one (the Library search ranks different suppliers of the same work according to what algorithm?)

Or perhaps the library isn’t really about supporting known item retrieval – it’s about supporting serendipity and the serendipitous discovery of related items? (Though that raises the question of how the related item list is algorithmically generated?)

Or maybe ease of use has won out – and running a scruffy search then filtering down by facet gives a good chance of effective retrieval with an element of serendipity around similar resources?

(Desperately tries to remember all the arguments libraries used to make against one box searching…)

Libraries Are Where You Go to Help Make Sense of The World

After a break of a couple of years, I’ll be doing a couple of sessions at ILI 2016 next week; for readers with a long memory, the Internet Librarian International conference is where I used to go to berate academic librarians every year about how they weren’t keeping up with the internet, and this year will perhaps be a return to those stomping grounds for me ;-)

One of the questions I used to ask – and probably still could – was where in the university I should go to get help with visualising a data set, either to help me make sense of it, or as part of a communications exercise. Back when IT was new, libraries used to be a place you could go to get help with certain sorts of information skills and study skills (such as essay writing skills), which included bits of training and advice on how to use appropriate software applications. As the base level of digital information skills has risen, many people can now figure out how to open a spreadsheet on their own.


But has the Overton window for librarians offering IT support developed with the times? Where should students wanting to develop more refined skills – how to start cleaning a dataset, for example, or visualising one sensibly, or even just learning how to read a chart properly on the one hand, or tell a story with data on the other – actually go? And what about patrons who want to be able to make better use of automation to help them in their information related tasks (screenscraping, for example, or extracting text, images or tables from hundreds of pages of PDFs); or who want help using commodity “AI” services accessed via APIs? Or need support in writing scientific communications that sensibly embed code and its outputs, or mathematical or musical notation, particularly for a web based (i.e. potentially interactive) journal or publication? Or who just need to style a chart in a particular way?

Now it’s quite likely that, having been playing with tech for more years than I care to remember, I’m afflicted by “the curse of knowledge”, recently contextualised for libraries by Lorcan Dempsey, quoting Steven Pinker. In the above paragraph, I half-assume readers know what screenscraping is, for example, as well as why it’s blindingly obvious (?!) why you might want to be able to do it, even if you don’t know how to do it? (For librarians, there are a couple of things to note there: firstly, what it is and why you might want to do it; secondly – which might mean a referral to tools, if not training – what sorts of tool might be able to help you with it.)
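For librarians reading this and wondering, here’s a minimal sketch of what I mean by screenscraping (the URL and table structure below are made up purely for illustration, and this code wasn’t in the original post) – grab a web page and pull the rows of an HTML table out into data you can actually work with:

# Illustrative sketch only: pull the rows of an HTML table off a (hypothetical) web page.
# Assumes the third-party requests and beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/committee-members"  # placeholder URL, not a real page

html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")  # assumes the page contains at least one <table>
rows = []
if table is not None:
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)

for row in rows[:5]:  # peek at the first few rows
    print(row)

From there it’s a short step to dumping the rows into a spreadsheet or a database – which is exactly the sort of referral-to-tools conversation a library could be having.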

But the question remains – there are a lot of tech power tools out there that can help you retrieve, sort, search, analyse, organise and present information, but where do I go for help?

If not the library, where?

If not the library, why not?

The end result is often: the internet. For which, for many in the UK, read: Google.

Anyway, via the twitterz (I think…) a couple of weeks ago, I spotted this interesting looking job ad from Harvard:

Visualization Specialist
School/Unit Harvard College Library
Location USA – MA – Cambridge
Job Function Library
Time Status Full-time
Department Harvard College Library – Services for Maps, Media, Data, and Government Information

Duties & Responsibilities – Summary
Reporting to the Head, Social Sciences and Visualization in the unit for Maps, Media, Data and Government Information, the Visualization Specialist works with staff and faculty to identify hardware and software needs, and to develop scalable, sustainable practices related to data visualization services. This position designs and delivers workshops and training sessions on data visualization tools and methods, and develops a range of instructional materials to support library users with data visualization needs in the Social Sciences and Humanities.

The Visualization Specialist will coordinate responsibilities with other unit staff and may supervise student employees.

Duties and Responsibilities
– Advises, consults, instructs, and serves as technical lead with data visualization projects with library, faculty teaching, and courses where students are using data.
– Identifies, evaluates and recommends new and emerging digital research tools for the Libraries and Harvard research community.
– Develops and supports visualization services in response to current trends, teaching and learning–especially as it intersects with Library collections and programs.
– Collaborates in developing ideas and concepts effectively across diverse interdisciplinary audiences and serves as a point person for data visualization and analysis efforts in the Libraries and is attuned to both the quantitative and qualitative uses with datasets. Understands user needs for disseminating their visualizations as either static objects for print publications or interactive online objects to engage with.
– Develops relationships with campus units supporting digital research, including the Center for Government and International Studies, Institute for Quantitative Social Sciences, and Harvard Library Central Services, and academic departments engaged in data analysis and visualization.
– Develops, collects, and curates exemplar data sets from multiple fields to be used in visualization workshops and training materials

Basic Qualifications
– ALA-accredited master’s degree in Library or Information Science OR advanced degree in Social Sciences, Psychology, Design, Informatics, Statistics, or Humanities.
– Minimum of 3 years experience in working with data analysis and visualization in an academic setting.
– Demonstrated experience with data visualization tools and programming libraries.
– Proficiency with at least one programming language (such as Python and R).
– Ability to use a variety of tools to extract and manipulate data from various sources (such as relational databases, XML, web services and APIs).

Additional Qualifications

– Experience supporting data analysis and visualization in a research setting.
– Proficiency using tools and programming libraries to support text analysis.
– Familiarity with geospatial technology.
– Experience identifying and recommending new tools, technologies, and online delivery of visualizations.
– Graphic design skills and proficiency using relevant software.

Many of the requisite skills resonate with the calls (taunts?) I used to make asking library folk where I should go for support with data related questions. At which point you may be thinking – “okay, techie geeky stuff… scary…not our gig…”.

But if you focus on data visualisation, many of the questions actually relate to communication issues – representation and presentation – rather than technical calls for help. For example, what sort of chart should I use to communicate this sort of thing? How can I change the look of a chart? How can I redesign a chart to help me communicate with it better?
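(Purely by way of illustration – this example, and its data, are made up rather than taken from the post – the “change the look of a chart” question can be as small as a line or two of matplotlib:)

# Illustrative only: same bar chart, different look, via a built-in matplotlib style sheet.
import matplotlib.pyplot as plt

loans = {"Physics": 12, "Chemistry": 7, "Biology": 18}  # made-up data

plt.style.use("ggplot")  # swap for "fivethirtyeight", "bmh", etc. to restyle the whole chart
plt.bar(list(loans.keys()), list(loans.values()))
plt.title("Loans by subject (illustrative data)")
plt.ylabel("Number of loans")
plt.tight_layout()
plt.savefig("loans_by_subject.png", dpi=150)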

And it’s not just the presentation of graphical information. Part of the reason I put together the F1 Data Junkie book was that I wanted to explore the RStudio/RMarkdown (Rmd) workflow for creating (stylish) technical documents. Just the other day I noticed that in the same way charts can be themed, themes for Rmd documents are now starting to appear – such as tint (Tint Is Not Tufte); in fact, it seems there’s a whole range of output themes already defined (see also several other HTML themes for Rmd output).


What’s nice about these templates is that they are defined separately from the actual source document. If you want to change from one format to another, things like the R rticles package make it easy. But how many librarians even know such workflows exist? How many have even heard of markdown?

It seems to me that tools around document creation are in a really exciting place at the moment, made more exciting once you start to think about how they fit into wider workflows (which actually makes them harder to promote, because folk are wedded to their current crappy workflows).

So are the librarians on board with that, at least, given their earlier history as word-processor evangelists?

See also: A New Role for the Library – Gonzo Librarian Informationista, including the comment Notes on: Exploring New Roles for Librarians. This also touches on the notion of an embedded librarian.

And this, from Martin Bean, previously VC of the OU, several years ago…

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, an application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?


If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).


(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.


The applications are behind a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?


Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:


They work, too!


On the Jupyter notebook front, you get access to Python3 and R kernels:



In passing, I notice that RStudio’s RMarkdown now supports some notebook-like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?). If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of  the claim that “tmpnb can run any Docker container”.  Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which myBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service sometime soon. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics and not practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

Application Shelves for the Digital Library – Fragments

Earlier today, I came across BioShaDock: a community driven bioinformatics shared Docker-based tools registry (BioShadock registry). This collects together a wide range of containerised applications and tools relevant to the bioinformatics community. Users can take one or more applications “off-the-shelf” and run them, without having to go through any complicated software installation process themselves, even if the installation process is a nightmare confusion of dependency hell: the tools are preinstalled and ready to run…

The container images essentially represent reference images that can be freely used by the community. The application containers come pre-installed and ready to run, exact replicas of their parent reference image. The images can be versioned with different versions or builds of the application, so you can reference the use of a particular version of an application and provide a way of sharing exactly that version with other people.

So could we imagine this as a specialist reference shelf in a Digital Library? A specialist reference application shelf, with off-the-shelf, ready-to-run tools, anywhere, any time?

Another of the nice things about containers is that you can wire them together using things like Docker Compose or Panamax templates to provide a suite of integrated applications that can work with each other. Linked containers can pass information between each other in isolation from the rest of the world. One click can provision and launch all the applications, wired together. And everything can be versioned and archived. Containerised operations can also be sequenced (e.g. using DRAY docker pipelines or OpenAPI).

Sometimes, you might want to bundle a set of applications together in a single, shareable package as a virtual machine. These can be versioned, and shared, so everyone has access to the same tools installed in the same way within a single virtual machine. Things like the DH Box, “a push-button Digital Humanities laboratory” (DHBox on github); or the Data Science Toolbox. These could go on to another part of the digital library applications shelf – a more “general purpose toolbox” area, perhaps?

As a complement to the “computer area” in the physical library that provides access to software on particular desktops, the digital library could have “execution rooms” that will actually let you run the applications taken off the shelf, and access them through a web browser UI, for example. So runtime environments like mybinder or tmpnb. Go to the digital library execution room (which is just a web page, though you may have to authenticate to gain access to the server that will actually run the code for you…), say which container, container collection, or reference VM you want to run, and click “start”. Or take the images home with you (that is, run them on your own computer, or on a third party host).

Some fragments relating to the above to try to help me situate this idea of runnable, packaged application shelves within the context of the library in general…

  • libraries have been, and still are, one of the places you go access IT equipment and learn IT skills;
  • libraries used to be, and still are, a place you could go to get advice on, and training in, advanced information skills, particularly discovery, critical reading and presentation;
  • libraries used to be, and still are, a locus for collections of things that are often valuable to the community or communities associated with a particular library;
  • libraries provide access to reference works or reference materials that provide a common “axiomatic” basis for particular activities;
  • libraries are places that provide access to commercial databases;
  • libraries provide archival and preservation services;
  • libraries may be organisational units that support data and information management needs of their host organisation.

Some more fragments:

  • the creation of a particular piece of work may involve many different steps;
  • one or more specific tools may be involved in the execution of each step;
  • general purpose tools may support the actions required to perform a range of tasks to a “good enough” level of quality;
  • specialist tools may provide a more powerful environment for performing a particular task to a higher level of quality.

Some questions:

  • what tools are available for performing a particular information related task or set of tasks?
  • what are the best tools for performing a particular information related task or set of tasks?
  • where can I get access to the tools required for a particular task without having to install them myself?
  • how can I effectively organise a workflow that requires the use of several different tools?
  • how can I preserve, document or reference the workflow so that I can reuse it or share it with others?

Some observations:

  • Docker containers provide a way of packaging an application or tool so that it can be “run anywhere”;
  • Docker containers may be linked together in particular compositions so that they can interoperate with each other;
  • Docker container images may be grouped together in collections within a subject specific registry: for example, BioShaDock.

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most President or Director/President positions: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
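If you’d rather do the same poke-around from a script than from the SQLite console, here’s a rough Python equivalent using the standard library’s sqlite3 and csv modules (a sketch only – it assumes the same file and column names as above, and a UTF-8 encoded CSV – and it skips the header row rather than importing it):

# Rough scripted equivalent of the console session above (sketch, untested against the live file).
import csv
import sqlite3

conn = sqlite3.connect("sundayTimesPanamaPapers.sqlite")

conn.execute("""CREATE TABLE IF NOT EXISTS panama(
    company_url TEXT, company_name TEXT, officer_position_es TEXT,
    officer_position_en TEXT, officer_name TEXT, inc_date TEXT,
    dissolved_date TEXT, updated_date TEXT, company_type TEXT, mf_link TEXT)""")

with open("sunday_times_panama_data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row this time
    conn.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)", reader)
conn.commit()

# Which officer names appear most often?
query = "SELECT officer_name, COUNT(*) AS c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10"
for name, count in conn.execute(query):
    print(name, count)

conn.close()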



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

Made-Up eBook Physics

Some reflections on reading a subscription based, “scholarly” ebook just now…

First, I can read the book online, or download it.


If I download it I need to get myself a special pair of spectacles to read the magic ink it’s written in.

I also need to say how long I want the “loan period” to be. (I don’t know if this is metered according to multiples of 24 hours, or runs to the end of the calendar based return day.) At the end of the loan period, I think I can keep the book but suspect that a library rep will come round to my house and either lock the book in a box to which only they have the key, or run the pages through a shredder (I’m not sure which).

Looking at the online copy of the book, there are various quotas associated with how I can use it.


A tool bar provides me with various options: the Image view is a crappy resolution view, albeit one that provides an infinite scroll through the book.


The PDF view lets me view a PDF version of the current page, though I can’t copy from it. (I do seem to be able to download it though, using the embedded PDF reader, without affecting any quotas?)


If I select the Copy option, it takes me into the PDF view and does let me highlight and copy text from that page. However, if I try to copy from too many pages, that “privilege” is removed…


As far as user experience goes, pretty rubbish on first use, and many of the benefits of having the electronic version, as compared to a print version, have been defensively (aggressively?!) coded against. This doesn’t achieve anything other than introduce inconvenience. So for example, having run out of my copy quota, I manually typed a copy of the sentence I wasn’t allowed to highlight and cmd/ctrl-C.

More Recognition/Identification Service APIs – Microsoft Cognitive Services

A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.

Now it seems that Microsoft Cognitive Services (formerly Project Oxford, in part) brings Microsoft’s tools to the party with a range of free tier and paid/metered services:


So what’s on offer?


  • Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
  • Emotion API: extract emotion features from a photo of a person; 30,000 free transactions per month;
  • Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
  • Video API: 300 free transactions per month per feature.



  • Bing Spell Check API: 5,000 free transactions per month
  • Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
  • Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, parts of speech tagging, etc.) It’s dog slow and, from the times I got it to sort of work, this seems to be about the limit of what it can cope with (and even then it takes forever):
    5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
  • Text Analytics API: sentiment analysis, topic detection and key phrase detection, language extraction; 5,000 free transactions;
  • Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.
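To give a feel for how these services are actually called, here’s a hedged sketch of hitting the Computer Vision analyze endpoint from Python with the requests library – the endpoint URL shown is illustrative and may well have moved (check the current documentation), and you’d need your own subscription key, but the Ocp-Apim-Subscription-Key header is how the Cognitive Services APIs authenticate:

# Sketch only: call the Computer Vision "analyze" API on an image URL.
import requests

API_KEY = "YOUR-SUBSCRIPTION-KEY"  # placeholder - use your own key
ENDPOINT = "https://api.projectoxford.ai/vision/v1.0/analyze"  # illustrative; may have changed

params = {"visualFeatures": "Categories,Description,Tags"}
headers = {"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "application/json"}
body = {"url": "https://example.com/some-image.jpg"}  # placeholder image URL

response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
response.raise_for_status()
print(response.json())  # tags, a suggested caption, category guesses, etc., as JSON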



There’s also a gallery of demo apps built around the APIs.

It seems, then, that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they continue to be developed and improved will be the proof of just how useful they are as utility services.

As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired in to the experiment flows in the Azure Machine Learning studio?)

I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)

PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!