Name (Date) Title, Available at: URL (Accessed: DATE): So What?

Academic referencing is designed, in part, to support the retrieval of material that is being referenced, as well as recognising provenance.

The following guidance, taken from the OU Library’s Academic Referencing Guidelines, is, I imagine, typical:

That page appears in an OU hosted Moodle course (OU Harvard guide to citing references) that requires authentication. So whilst stating the provenance, it won’t necessarily support the retrieval of content from that site for most people.

Where an (n.d) — no date — citation is provided, it also becomes hard for someone checking the page in the future whether or not the content has changed, and if so, which parts.

Looking at the referencing scheme for organisational websites, there’s no suggestion that authentication is required is listed in the citation (the same is true in the guidance for citing online newspaper articles).


I also didn’t see guidance offhand for how to reference pages where the page presentation is likely customised by “an algorithm” according to personal preferences or interaction history; placement of things like ads are generally dynamic, and often personalised (personalisation may be based on multiple things, such as the cookie state of the browser with which you are looking at a page, or the history of transactions (sites visited) from the IP address you are connecting to a site from).

This doesn’t matter for static content, but it does matter if you want to reference something like a screenshot / screencapture, for example showing the results of a particular search on a web search engine. In this case, adding a date and citing the page publisher (that is, the web search engine, for example) is about as good as you can get, but it misses a huge amount of context. The fact that you got extremist results might be because your web history reveals you to be a raging fanatic, and the fact that you grabbed the screenshot from the premises of your neo-extremist clubhouse just added more juice to the search. One partial solution to disabling personalisation features might be to run a search in a “private” browser session where cookies are disabled, and cite that fact, although this still won’t stop IP address profiling and browser fingerprinting.

I’ve pondered related things before, eg when asking Could Librarians Be Influential Friends? And Who Owns Your Search Persona?, as well as in a talk given 10 years ago and picked up at the time by Martin Weller on his original blog site (Your Search Is Valuable To Us; or should that be: Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at (Accessed 26 September 2018).?).

Most of the time, however, web references are to static content, so what role does the Accessed on date play here? I can imagine discussions way back when, when this form was being agreed on (is there a history of the discussion that took place when formulating and adopting this form?) where someone said something like “what we need is to record the date the page was accessed on and capture it somewhere“, and then the second part of that phrase was lost or disregarded as being too “but how would we do that?”…

One of the issues we face in maintaining OU courses, where content starts being written 2 years before a course  start and is expected to last for 5+ years of presentation, is maintaining the integrity of weblinks. Over that period of time, you might expect pages to change in a couple of ways, even if the URL persists and the “content” part remains largely the same:

  • the page style (that is, the view as presented) may change;
  • the surrounding navigation or context (for example, sidebar content) may change.

But let’s suppose we can ignore those. Instead, let’s focus on how we can try to make sure that the a student can follow a link to the resource we intend.

One of the things I remember from years ago were conversations around keeping locally archived copies of webpages and presenting those copies to students, but I’m not sure this ever happened. (Instead, there was a short of middle ground  compromise of running link checkers, but I think that was just to spot 404 page not found errors rather than checking a hash made on the content you were interested in, which would be difficult.)

At one point, I religiously kept archived copies of pages I referenced in course materials so that if the page died, I could check back on my own copy to see what the sense of the page now lost was so I could find a sensible alternative, but a year or two off course production and that practice slipped.

Back to the (Accessed DATE) clause. So what? In Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs I mentioned a couple of Wikipedia bots that check link integrity on Wikipedia (see also: Internet Archive blog: More than 9 million broken links on Wikipedia are now rescued). These can perform actions like archiving web pages, checking links are still working, and changing broken links to point to an archived copy of the same link. I hinted that it would be useful if the VLE offered the same services. They don’t, at least, not going by reports from early starters to this year’s TM351 presentation who are already flagging up broken links (do we not run a link checker anymore? (I think I asked that in the Broken URLs post a year ago, too…?)

Which is where (Accessed DATE) comes in. If you do accede to that referencing convention, why not make sure that that an archived copy of that page, ideally made on that date. Someone chasing the reference can then see what you accessed, and perhaps if they are visiting the page somewhen in the future, see how the future page compares with the original. (This won’t help with authentication controlled content or personalised page content though.)

An easy way of archiving a page in a way that others can access it is to use the Internet Archive’s Wayback Machine (for example, If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine).

From the Wayback Machine homepage, you can simply add a link to a page you want to archive:


hit SAVE NOW (note, this is saving a different page; I forgot to save the screenshot of the previous one, even though I had grabbed it. Oops…):

and then you have access to the archived page, on the date it was accessed:

A more useful complete citation would now be Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at (Accessed 26 September 2018. Archived at

Two more things…

Firstly, my original blog was hosted on an OU blog server; when that was decommissioned, I archived the posts on a subdomain of I’d managed to grab. That subdomain was deleted a few months ago, taking with it the original blog archive. Step in the Wayback Machine. It didn’t have a full copy of the original blog site, but I did manage to retrieve quite a few of the pages using this wayback machine downloader using the command wayback_machine_downloader or for a slightly later archivewayback_machine_downloader I made the original internal URLs relative (find . -name '*.html' | xargs perl -pi -e 's/http:\/\/\/Maths\/ajh59/./g'),  (or as appropriate for, used a similar approach to remove tracking scripts from the pages, uploaded the pages to Github (psychemedia/original-ouseful-blog-archive), enabled the repo as a Github pages site, and the pages are now at It looks like the best archive is at the UK Web Archive, but I can’t see a way of getting a bulk export from that?

Secondly, bots; VLE bots… Doing some maintenance on TM351, I notice it has callouts to other OU courses, including TU100, which has been replaced by TM111 and TM112. It would be handy to be able to automatically discover references to other courses made from within a course to support maintenance. Using some OU-XML schema markup to identify such references would be sensible? The OU-XML document source structure should provide a veritable playground for OU bots to scurry around. I wonder if there are any, and if so, what do they do?


PS via Richard Nurse, reminding me that Memento is also useful when trying to track down original content and/or retrieve content for broken link pages from the Internet Archive: UK Web Archive Mementos search and time travel.

Richard also comments that “OU modules are being web archived by OU Archive – 1st, mid point and last presentation -only have 1 in OUDA currently staff login only – on list to make them more widely available but prob only to staff given 3rd party rights in OU courses“. Interesting…

PPS And via Herbert Van de Sompel, a list of archives accessed via time travel, as well as a way of decorating web links to help make them a bit more resilient: Robust Links – Link Decoration.

By the by, Richard, Kevin Ashley and @cogdog/Alan also point me to the various browser extensions that make life easier adding pages to archives or digging into their history. Examples here: Memento tools. I’m not sure what advice the OU Library gives to students about things like this; certainly my experience of interactions with students, academics and editors alike around broken links suggests that not many of them are aware of the Internet Archive / UK web Archive, Wayback Machine, etc etc? – where the lede is usually buried…

Keeping Up With What’s Possible – Daily Satellite Imagery from AWS

Via @simonw’s rebooted blog, I  spotted this – Landsat on AWS: “Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes are available from the start of imagery capture. All new Landsat 8 scenes are made available each day, often within hours of production.”

What do things like this mean for research, and teaching?

For research, I’m guessing we’ve gone from a state 20 years ago – no data [widely] available – to 10 years ago – available under license, with a delay and perhaps as periodics snapshots – to now – daily availability. How does this imapct on research, and what sorts of research are possible? And how well suited are legacy workflows and tools to supporting work that can make use of daily updated datasets?

For teaching, the potential is there to do activities around a particular dataset that is current, but this introduces all sorts of issues when trying to write and support the activity (eg we don’t know what specific features the data will turn up in the future). We struggle with this anyway trying to write activities that give students an element of free choice or open-ended exploration where we don’t specifically constrain what they do. Which is perhaps why we tend to be so controlling – there is little opportunity for us to respond to something a student discovers for themselves.

The realtime-ish ness of data means we could engage students with contemporary issues, and perhaps enthuse them about the potential of working with datasets that we can only hint at or provide a grounding for in the course materials. There are also opportunities for introducing students to datasets and workflows that they might be able to use in their workplace, and as such act as a vector for getting new ways of working out of the Academy and out of the tech hinterland that the Academy may be aware of, and into more SMEs (helping SMEs avail themselves of emerging capabilities via OUr students).

At a more practical level, I wonder, if OU academics (research or teaching related) wanted to explore the LandSat 8 data on AWS, would they know how to get started?

What sort of infrastructure, training or support do we need to make this sort of stuff accessible to folk who are interested in exploring it for the first time (other than Jupyter notebooks, RStudio, and Docker of course!;-) ?

PS Alan Levine /@cogdog picks up on the question of what’s possible now vs. then: I might also note: this is how the blogosphere used to work on a daily basis 10-15 years ago…

Library ezproxy Access to Authenticated Subscription Content

I didn’t make it to ILI this year – for the first time they rejected all my submissions, so I guess I’m not even a proper shambrarian now :-( – but I was reminded  of  it, and the many interesting conversations I’ve had there in previous years, during a demo of the LEAN Library browser extension by Johan Tilstra (we’d first talked about a related application a couple of years ago at ILI).

The LEAN Library app is a browser extension (available for Chrome, Firefox, Safari) that can offers three services:

  • Library Access: seamless access to subscription content using a Library’s ezproxy service;
  • Library Assist: user support for particular subscription site;
  • Library Alternatives: provide alternative sources for content that a user doesn’t have access to in one subscription service that they do have access to in another service.

Johan also stressed several points in the demo:

  • once installed, and the user authenticated, the extension rewrites URLs on domains the library has subscribed to automatically; as soon as you land on a subscribed to site, you are redirected to a proxied version of the site that lets you download subscription content directly;
  • the extension pops up a branded library panel on subscribed to sites seamlessly, unlike a bookmarklet that requires user action to trigger the proxy behaviour; because the pop-up appears without any other user interaction required, when a user visits a site that they didn’t know was subscribed to, they are informed of the fact. This is really useful for raising awareness amongst library patrons of the service that is being provided by the library.

I used to try to make a similar sort of point back when I used to bait libraries regularly, under the mantle of trying to get folk to think about “invisible library” services (as well as how to make folk realise they were using such services):

The LEAN Library extension is sensitised to sites subscribed to from the library from a whitelist downloaded from a remote configuration site. In fact, LEAN Library host the complete configuration UI, that allows library managers to define and style the content pop-up and define the publisher domains for which a subscription exists. (FWIW, I’m not clear what happens when a journal on a publisher site that is not part of a subscription package is the one the user wants to access?)

This approach has a couple of advantages:

  • the extension doesn’t try every domain the user visits to see if it’s ezproxy-able, it has a local list of relevant domains;
  • if the Library updates its subscription list, so is the extension.

That said, if a user does try to download content, it’s not necessarily obvious how the library knows that the proxy page was “enabled” by the extension. (If the URL rewriter added a code parameter to the URL, would that be trackable in the ezproxy logs?)

It’s interesting to see how this idea has evolved over the years. The LEAN Library approach certainly steps up the ease of use from the administrator side, and arguably for users to. For example, the configuration site makes it easy for admins to customise the properties of the extension, which used to require handcrafting in the original versions of this sort of application.

As to past – I wonder if it’s worth digging through my old posts on a related idea, the OU Library Traveller, to see whether there are any UX nuggets, or possible features, that might be worth exploring again? That started out as a bookmarklet and built on several ideas by Jon Udell, before moving to a Greasemonkey script (Greasemonkey was an extension that let you run your own custom scripts via the extension):

PS In passing, I note that the OU libezproxy bookmarklet is still availableI still use my own bookmarklet several times a week. I also used to have a DOI version that let you highlight a doi and it would resolve it through the proxy (DOI and OpenURL Resolution)? There was a DOI-linkifier too, I think? Here’s a (legacy) related OU Library webservice: EZproxy DOI Resolver

Open Educational Resources from Government and Parliament

Mentioning to a colleague yesterday that the UK Parliamentary library published research briefings and reports on topics of emerging interest, as well as to support legislation, that often provided a handy, informed, and politically neutral  overview of a subject area that could make for a useful learning resource, the question was asked whether or not they might have anything on the “internet of things”. The answer is not much, but it got me thinking a bit more about the range of documents and document types produced across Parliament and Government that can be used to educate and inform, as well as contribute to debate.

In other words, to what extent might such documents be used in an educational sense, whether in the sense of providing knowledge and information about a topic, providing a structured review of a topic area and the issues associated with it, raising questions about an issue, or reporting on an analysis of it. (There are also opportunities for learning from some of the better Parliamentary debates, for example in terms of how to structure an argument, or explore the issues associated with an issue, but Hansard is out of scope of this post!)

(Also note that I’m coming at this as a technologist, interested as much in the social processes, concerns and consequences associated with science and technology as much as the deep equations and principles that tend to be be taught as the core of the subject, at least in HE. And that I’m interested not just on how we can support the teaching and learning of current undergrads, but also how we can enculturate them into the availability and use of certain types of resource that are likely to continue being produced into the future, and as such provide a class of resources that will continue to support the learning and education of students once they leave formal education.)

So using IoT as a hook to provide examples, here’s the range of documents I came up with. (At some point it maybe worth tabulating this to properly summarise the sorts of information these reports might came, the communicative point of the document (to inform, persuade, provide evidence for or against something, etc), and any political bias that may be likely (in policy docs, for example).

Parliamentary Library Research Briefings

The Parliamentary Library produces a range of research briefings to cover matters of general interest (Commons Briefing papers, Lords Library notes), perhaps identified through multiple questions asked of the Library by members?, as well as background to legislation (Commons Debate Packs, Lords in Focus), through the Commons and Lords Libraries respectively.

Some of the research briefings include data sets (do a web search for site: filetype:xlsx) which can also be quite handy.

There are also POSTnotes from the Parliamentary Office of Science and Technology, aka POST.

For access to briefings on matters currently in the House, the Parliament website provides timely/handy pages that list briefings documents for matters in the House today/this week. In addition, there are feeds available for recent briefings from all three: Commons Briefing Papers feed, Lords Library Notes feed, POSTnotes feed. If you’re looking for long reads and still use a feed reader, get subscribing;-)

Wider Parliamentary Documents

The Parliament website also supports navigation of topical issues such as Science and Technology, as well as sub-topics, such as Internet and Cybercrime. (I’m not sure how the topics/sub-topics are identified or how the graph is structured… That may be one to ask about when I chat to Parliamentary Library folk next week?:-)

Within the topic areas, relevant Commons and Lords related Library research briefings are listed, as well as
POSTnotes, Select Committee Reports and Early Day Motions.

(By the by, it’s also worth noting that chunks of the Parliament website are currently in scope of a website redesign.)

Government Documents

Along with legislation currently going through Parliament that is published on the Parliament website (along with Hansard reports that record, verbatim(-ish!) proceedings of debates in either House), explanatory notes provided by the Government department bringing a bill provide additional, supposedly more accessible/readable, information around it.

Reports are also published by government offices. For example, the Blackett review (2014) on the Internet of things was a Government Office for Science report from the UK Government Chief Scientific Adviser at the time (The Internet of Things: making the most of the Second Digital Revolution). Or how about a report from the Intellectual Property Office on Eight great technologies: The internet of things.

Briefing documents also appear in a variety of other guises. For example, competitions (such as the Centre for Defence Enterprise (CDE) competition on security for the internet of things, or Ofcom’s consultation on More radio spectrum for the Internet of Things) and consultations may both provide examples of how to start asking questions about a particular topic area (questions that may help to develop critical thinking, prompt critical reflection, or even provide ideas for assessment!).

Occasionally, you can also turn up a risk assessments or cost benefit analysis, such as this Department for Business, Energy & Industrial Strategy Smart meter roll-out (GB): cost-benefit analysis.

EC Parliamentary Research Service

In passing, it’s also worth noting that the EC Parliamentary Research Service also do briefings, such as this report on The Internet Of Things: Opportunities And Challenges, as well as publishing resources linked from topic based pages, such as the Digital Single Market them topic page on The Internet of Things.


In providing support for all members of the House, the Parliamentary research services must produce research briefings that can be used by both sides of the House. This may stand in contrast to documents produced by Government that may be influenced by particular policy (and political) objectives, or formal reports published by industry bodies and the big consultancies (the latter often producing reports that are either commissioned on behalf of government or published to try to promote some sort of commercial interest that can be sold to government) that may have a lobbying aim.

As I’ve suggested previously, (News, Courses and Scrutiny and Learning Problems and Consultation Based Curricula), maybe we could/should be making more use of them as part of higher education course readings, not just as a way of getting a quick, NPOV view over a topic area, bus also as a way of introduce students to a form of free and informed content, produced in timely way in response to issues of the day. In short, a source that will continue to remain relevant and current over the coming years, as students (hopefully) become lifelong, continuing learners.

And the Library Said: “Thou Shalt Learn to DO Full References But We Will Not Allow You to Search By Them”

OU Library guidance to students on citations for journal articles reads as follows:


Using this reference I should be able to run a pretty good known item search – or not, as the case may be?


So where does the full reference  – Journal, for example – help exactly? On Google, maybe… (Actually, the one search may search all tuples in different fields – so the title as well as a the journal title – and generate retrieval/ranking factors based on that?)

References and search contexts are complementary  – for a reference to be effective,it needs to work with your search context, which typically means the user interface of your search system  – for a specific known item reference this typically means the (hidden away) advanced search interface.

So I wonder: whilst we penalise students from not using full, formal references (even though they often provide enough of a reference to find the item on Google), the officially provided search tools don’t let you use the information in the formal reference in a structured way to retrieve and hopefully access (rather than  discover – the reference is the discovery component) the desired item?

Or am I reading the above search UI incorrectly…?

PS in terms of teaching material design, and referencing the above citation example, erm….?


Because of course I’m not searching for a Journal that has something to do with Frodo Baggins – I’m searching for an article

PPS I’m also finding more and more that the subscription journal content I want to access is from journals that the OU Library doesn’t subscribe to. I’m not sure how many of the journals it does subscribe to that are bundled are never accessed (the data should reveal that)? So I wonder – as academics (and maybe students), should we instead be given a budget code we could use to buy the articles we want? And for articles used by students in courses, get a ‘site license” for some articles?

Now what was the URL of that pirated academic content site someone from the library told me about again…?

PS from the Library – don’s use the reference – just bung the title in the search onebox like you would do on a web search engine..


Hmm… but if I have a full reference, I should be able to run a search that returns just a single result, for exactly the item I want? Or maybe returns links to a few different instances (from different suppliers) of just that resource? But then – which is the preferred one (the Library search ranks different suppliers of the same work according to what algorithm?)

Or perhaps the library isn’t really about supporting know item retrieval – it’s about supporting serendipity and the serendipitous discovery of related items? (Though that begs the question about how the related item list is algorithmically generated?)

Or maybe ease of use has won out – and running a scruffy search then filtering down by facet gives a good chance of effective retrieval with an element of serendipity around similar resources?

(Desperately tries to remember all the arguments libraries used to make against one box searching…)

Libraries Are Where You Go to Help Make Sense of The World

After a break of a couple of years, I’ll be doing a couple of sessions at ILI 2016 next week; for readers with a long memory, the Internet Librarian International conference is where I used to go to berate academic librarians every year about how they weren’t keeping up with the internet, and this year will perhaps be a return to those stomping grounds for me ;-)

One of the questions I used to ask – and probably still could – was where in the university I should go to get help with visualising a data set, either to help me make sense of it, or as part of a communications exercise. Back when IT was new, libraries used to be a place you could go to get help with certain sorts of information skills and study skills (such as essay writing skills), which included bits of training and advice on how to use appropriate software applications. As the base level in digital information skills increases – many people are able to figure out how to open a spreadsheet on their own.


But has the Overton window for librarians offering IT support developed with the times? Where should students wanting to develop more refined skills – how to start cleaning a dataset, for example, or visualising one sensibly, or even just learning how to read a chart properly on the one hand, or tell a story with data on the other – actually go? And what about patrons who want to be able to make better use of automation to help them in their information related tasks (screenscraping, for example, or extracting text, images or tables from hundreds of pages of PDFs); or who want help accessing commodity “AI” services accessed via APIs? Or need support in writing scientific communications that sensibly embed code and its outputs, or mathematical or musical notation, particular for a web based (i.e. potentially interactive) journal or publication? Or who just need to style a chart in a particular way?

Now it’s quite likely that, having been playing with tech for more years than I care to remember, I’m afflicted by “the curse of knowledge“, recently contextualised for libraries by Lorcan Dempsey, quoting Steven Pinker. In the above paragraph, I half-assume readers know what screenscraping is, for example, as well as why it’s blindingly obvious (?!) why you might want to be able to do it, even if you don’t know how to do it? (For librarians, there’s a couple of things to note there: firstly, what it is and why you might want to do it; secondly, which might be a referral to tools, if not training, what sorts of tool might be able to help you with it.)

But the question remains – there’s a lot of tech power tools out there that can help you retrieve, sort, search, analyse, organise and present information out there, but where do I go for help?

If not the library, where?

If not the library, why not?

The end result is often: the internet. For which, for many in the UK, read: Google.

Anyway, via the twitterz (I think…) a couple of weeks ago, I spotted this interesting looking job ad from Harvard:

Visualization Specialist
School/Unit Harvard College Library
Location USA – MA – Cambridge
Job Function Library
Time Status Full-time
Department Harvard College Library – Services for Maps, Media, Data, and Government Information

Duties & Responsibilities – Summary
Reporting to the Head, Social Sciences and Visualization in the unit for Maps, Media, Data and Government Information, the Visualization Specialist works with staff and faculty to identify hardware and software needs, and to develop scalable, sustainable practices related to data visualization services. This position designs and delivers workshops and training sessions on data visualization tools and methods, and develops a range of instructional materials to support library users with data visualization needs in the Social Sciences and Humanities.

The Visualization Specialist will coordinate responsibilities with other unit staff and may supervise student employees.

Duties and Responsibilities
– Advises, consults, instructs, and serves as technical lead with data visualization projects with library, faculty teaching, and courses where students are using data.
– Identifies, evaluates and recommends new and emerging digital research tools for the Libraries and Harvard research community.
– Develops and supports visualization services in response to current trends, teaching and learning–especially as it intersects with Library collections and programs.
– Collaborates in developing ideas and concepts effectively across diverse interdisciplinary audiences and serves as a point person for data visualization and analysis efforts in the Libraries and is attuned to both the quantitative and qualitative uses with datasets. Understands user needs for disseminating their visualizations as either static objects for print publications or interactive online objects to engage with.
– Develops relationships with campus units supporting digital research, including the Center for Government and International Studies, Institute for Quantitative Social Sciences, and Harvard Library Central Services, and academic departments engaged in data analysis and visualization.
– Develops, collects, and curates exemplar data sets from multiple fields to be used in visualization workshops and training materials

Basic Qualifications
– ALA-accredited master’s degree in Library or Information Science OR advanced degree in Social Sciences, Psychology, Design, Informatics, Statistics, or Humanities.
– Minimum of 3 years experience in working with data analysis and visualization in an academic setting.
– Demonstrated experience with data visualization tools and programming libraries.
– Proficiency with at least one programming language (such as Python and R).
– Ability to use a variety of tools to extract and manipulate data from various sources (such as relational databases, XML, web services and APIs).

Additional Qualifications

Experience supporting data analysis and visualization in a research setting.

Proficiency using tools and programming libraries to support text analysis.
Familiarity with geospatial technology.
Experience identifying and recommending new tools, technologies, and online delivery of visualizations.
Graphic design skills and proficiency using relevant software.

Many of the requisite skills resonate with the calls (taunts?) I used to make asking library folk where I should go for support with data related based questions. At which point you may be thinking – “okay, techie geeky stuff… scary…not our gig…”.

But if you focus of data visualisation many of which actually relate to communication issues – representation and presentation – rather than technical calls for help. For example, what sort of chart should I use to communicate this sort of thing? How can I change the look of a chart? How can I redesign a chart to help me communicate with it better?

And it’s not just the presentation of graphical information. Part of the reason I put together the F1 Data Junkie book was that I wanted to explore the RStudio/RMarkdown (Rmd) workflow for creating (stylish) technical documents. Just the other day I noticed that in the same way charts can be themed, themes for Rmd documents are now starting to appear – such as tint (Tint Is Not Tufte); in fact, it seems there’s a whole range of output themes already defined (see also several other HTML themes for Rmd output).


What’s nice about these templates is that they are defined separately from the actual source document. If you want to change from one format to another, things like the R rticles package make it easy. But how many librarians even know such workflows exist? How many have even heard of markdown?

It seems to me that tools around document creation are in a really exciting place at the moment, made more exciting once you start to think about how they fit into wider workflows (which actually makes them harder to promote, because folk are wedded to their current crappy workflows).

So are the librarians on board with that, at least, given their earlier history as word-processor evangelists?

See also: A New Role for the Library – Gonzo Librarian Informationista, including the comment Notes on: Exploring New Roles for Librarians. This also touches on the notion of an embedded librarian.

And this, from Martin Bean, previously VC of the OU, several years ago…

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as instituional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, and application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?


If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).


(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.


The applications are being a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?


Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:


They work, too!


On the Jupyter notebook front, you get access to Python3 and R kernels:



In passing, I notice that RStudio’s RMarkdown now demonstrates some notebook like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?) If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of  the claim that “tmpnb can run any Docker container”.  Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which myBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service anytime soon. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into“technology enhanced research environments”, but from what I can tell, TEL means learning analytics and not practical digital tools used to develop digital skills? (You could probably track the hell of of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

Application Shelves for the Digital Library – Fragments

Earlier today, I came across BioShaDock: a community driven bioinformatics shared Docker-based tools registry (BioShadock registry). This collects together a wide range of containerised applications and tools relevant to the bioinformatics community. Users can take one or more applications “off-the-shelf” and run them, without having to go through any complicated software installation process themselves, even if the installation process is a nightmare confusion of dependency hell: the tools are preinstalled and ready to run…

The container images essentially represent reference images that can be freely used by the community. The application containers come pre-installed and ready to run, exact replicas of their parent reference image. The images can be versioned with different versions or builds of the application, so you can reference the use of a particular version of an application and provide a way of sharing exactly that version with other people.

So could we imagine this as a specialist reference shelf in a Digital Library? A specialist reference application shelf, with off-the-shelf, ready-to-run run tools, anywhere, any time?

Another of the nice things about containers is that you can wire them together using things like Docker Compose or Panamax templates to provide a suite of integrated applications that can work with each other. Linked containers can pass information between each other in isolation from the rest of the world. One click can provision and launch all the applications, wired together. And everything can be versioned and archived. Containerised operations can also be sequenced too (eg using DRAY docker pipelines or OpenAPI).

Sometimes, you might want to bundle a set of applications together in a single, shareable package as a virtual machine. These can be versioned, and shared, so everyone has access to the same tools installed in the same way within a single virtual machine. Things like the DH Box, “a push-button Digital Humanities laboratory” (DHBox on github); or the Data Science Toolbox. These could go on to another part of the digital library applications shelf – a more “general purpose toolbox” area, perhaps?

As a complement to the “computer area” in the physical library that provides access to software on particular desktops, the digital library could have “execution rooms” that will actually let you run the applications taken off the shelf, and access them through a web browser UI, for example. So runtime environments like mybinder or tmpnb. Go the the digital library execution room (which is just a web page, though you may have to authenticate to gain access to the server that will actually run the code for you..), say which container, container collection, or reference VM you want to run, and click “start”. Or take the images home you with (that is, run them on your own computer, or on a third party host).

Some fragments relating to the above to try to help me situate this idea of runnable, packaged application shelves with the context of the library in general…

  • libraries have been, and still are, one of the places you go access IT equipment and learn IT skills;
  • libraries used to be, and still are, a place you could go to get advice on, and training in, advanced information skills, particularly discovery, critical reading and presentation;
  • libraries used to be, and still, a locus for collections of things that are often valuable to community or communities associated with a particular library;
  • libraries provide access to reference works or reference materials that provide a common “axiomatic” basis for particular activities;
  • libraries are places that provide access to commercial databases;
  • libraries provide archival and preservation services;
  • libraries may be organisational units that support data and information management needs of their host organisation.

Some more fragments:

  • the creation of a particular piece of work may involve many different steps;
  • one or more specific tools may be involved in the execution of each step;
  • general purpose tools may support the actions required perform a range of tasks to a “good enough” level of quality;
  • specialist tools may provide a more powerful environment for performing a particular task to a higher level of quality

Some questions:

  • what tools are available for performing a particular information related task or set of tasks?
  • what are the best tools for performing a particular information related task or set of tasks?
  • where can I get access to the tools required for a particular task without having to install them myself?
  • how can I effectively organise a workflow that requires the use of several different tools?
  • how can I preserve, document or reference the workflow so that I can reuse it or share it with others?

Some observations:

  • Docker containers provide a way of packaging an application or tool so that it can be “run anywhere”;
  • Docker containers may be linked together in particular compositions so that they can interoperate with each other;
  • docker container images may be grouped together in collections within a subject specific registry: for example, BioShaDock.

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are names the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see what people have most : SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]

Made-Up eBook Physics

Some reflections on reading a subscription based, “scholarly” ebook just now…

First, I can read the book online, or download it.


If I download it I need to get myself a special pair of spectacles to the read magic ink it’s written it.

I also need to say how long I want to the “loan period” to be. (I don’t know if this is metered according to multiples of 24 hours, or end the of the calendar based return day.) At the end of the loan period, I think I can keep the book but suspect that a library rep will come round to my house and either lock the book in a box to which only they have the key, or run the pages through a shredder (I’m not sure which).

Looking at the online copy of the book, there are various quotas associated with how I can use it.


A tool bar provides me with various options: the Image view is a crappy resolution view, albeit one that provides and infinite scroll through the book.


The PDF view lets me view a PDF version of the current page, though I can’t copy from it. (I do seem to be able to download it though, using the embedded PDF reader, without affecting any quotas?)


If I select the Copy option, it takes me into the PDF view and does let me highlight and copy text from that page. However, if I try to copy from too many pages, that “privilege” is removed…


As far as user experience goes, pretty rubbish on first use, and many of the benefits of having the electronic version, as compared to a print version, have been defensively (aggressively?!) coded against. This doesn’t achieve anything other than introduce inconvenience. So for example, having run out of my copy quota, I manually typed a copy of the sentence I wasn’t allowed to highlight and cmd/ctrl-C.