Category: Infoskills

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, an application-based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?


If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).


(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.


The applications sit behind a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?


Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:


They work, too!


On the Jupyter notebook front, you get access to Python3 and R kernels:



In passing, I notice that RStudio’s RMarkdown now supports some notebook-like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook-style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?). If I get a chance to play with this – and can get it running – I’ll see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of the claim that “tmpnb can run any Docker container”. Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which MyBinder can run. Such as Reclaimed personal hosting environments, perhaps?! ;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service before long. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics rather than practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole

In a PS to Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers, I linked to a demonstration by Jessie Frazelle on how to connect to GUI based apps running in a container via X11. This is all very well if you have an X client, but it would be neater if we could find a way of treating the docker container as a virtual desktop container, and then accessing the app running inside it via the desktop presented through a browser.

Digging around, Guacamole looks like it provides a handy package for exposing a Linux desktop via a browser based user interface [video demo].

Very nice… Which got me wondering: can we run Guacamole inside a container, alongside an X11-producing app, to expose that app?

Via the Dockerfile referenced in Digikam on Mac OS/X or how to use docker to run a graphical app on Mac OS/X I tracked down linuxserver/dockergui, a Docker image that “makes it possible to use any X application on a headless server through a modern web browser such as chrome”.

Exciting:-) [UPDATE: note that that image uses an old version of the guacamole packages; I tried updating to the latest versions of the packages but it doesn’t just work, so I rolled back. Support for Docker was introduced after the version used in the linuxserver/dockergui image, but I don’t fully understand what that support does! Ideally, it’d be nice to run a guacamole container and then use docker-compose to link in the applications you want to expose to it? Is that possible? Anyone got an example of how to do it?]

So I gave it a go with Audacity. The files I used are contained in this gist that should also be embedded at the bottom of this post.

(Because the original linuxserver/dockergui was quite old, I downloaded their source files and built a current one of my own to seed my Audacity container.)

Building the Audacity container with:

docker build -t psychemedia/audacitygui .

and then running it with:

docker run -d -p 8080:8080 -p 3389:3389 -e "TZ=Europe/London" --name AudacityGui psychemedia/audacitygui

this is what pops up in the browser:


If we click through on the app, we’re presented with a launcher:


Select the app, and Hey, Presto!, Audacity appears…


It seems to work, too… Create a chirp, and then analyse it:


We seem to be able to load and save files in the nobody directory:


I tried exposing the /nobody folder by adding VOLUME /nobody to the Dockerfile and running with a -v "${PWD}/files":/nobody switch, but it seemed to break things, which is presumably a permissions issue? There are various user role settings in the linuxserver/dockergui build files, so maybe poking around with those would fix things? Otherwise, we might have to seed the container directly with any files we want in it?:-(

UPDATE: adding RUN mkdir -p /audacityfiles && adduser nobody root to the Dockerfile along with VOLUME /audacityfiles and then adding -v "${PWD}/files":/audacityfiles when I create the container allows me to share files in to the container, but I can’t seem to save to the folder? Nor do variations on the theme work, such as creating a subfolder in the nobody folder and giving it the same ownership and permissions as the nobody folder. I can save into the nobody folder, though. (Just not share it?)

WORKAROUND: in Kitematic, the Exec button on the container view toolbar takes you into the container shell. From there, you can copy files into the shared directory. For example: cp /nobody/test.aup /nobody/share/test.aup. Moving the share to a folder outside /nobody, eg to /audacityfiles, means we can simply copy everything from /nobody to /audacityfiles.

Another niggle is with the sound – one of the reasons I tried the Audacity app… (If we can have the visual of the desktop, we want to try to push for sound too, right?!)

Unfortunately, when I tried to play the audio file I’d created, it wasn’t having any of it:


Looking at the log file of the container launch in Kitematic, it seems that ALSA (the Advanced Linux Sound Architecture project) wasn’t happy?


I suspect trying to fix this is a bit beyond my ken, as too is sorting out the shared folder permissions… (I don’t really do sysadmin – which is why I like the idea of ready-to-run application containers.)

UPDATE 2: using a different build of the image – hurricane/dockergui:x11rdp1.3, from the linuxserver/dockergui x11rdp1.3 branch – audio does work, though at times it seemed to struggle a bit. I still can’t save files to the shared folder though:-(

UPDATE 3: I pushed an image as psychemedia/audacity2. It works from the command line as:
docker run -d -p 8080:8080 -p 3389:3389 -e "TZ=Europe/London" --name AudacityGui -v "${PWD}":/nobody/share psychemedia/audacity2

Anyway – half-way there. If nothing else, we could create and analyse audio files visually in the browser using Audacity, even if we can’t get hold of those audio files or play them!

I’d hope there was a simple permissions fix to get the file sharing to work (anyone? anyone?! ;-) but I suspect the audio bit might be a little bit harder? But if you know of a fix, please let me know:-)

PS I just tried launching psychemedia/audacity2 public Dockerhub image via Docker Cloud, and it seemed to work…

Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers

With its focus on enterprise use, it’s probably with good reason that the Docker folk aren’t that interested in exploring the role that Docker may have to play as a technology that supports the execution of desktop applications, or at least, applications for desktop users. (The lack of significant love for Kitematic seems to be representative of that.)

But I think that’s a shame; because for educational and scientific/research applications, docker can be quite handy as a way of packaging software that ultimately presents itself using a browser based user interface delivered over http, as I’ve demonstrated previously in the context of Jupyter notebooks, OpenRefine, RStudio, R Shiny apps, linked applications and so on.

I’ve also shown how we can use Docker containers to package applications that offer machine services via an http endpoint, such as Apache Tika.

I think this latter use case shows how we can start to imagine things like a “digital humanities application shelf” in a digital library (fragmentary thoughts on this), that allows users to take either an image of the application off the shelf (where an image is a thing that lets you fire up a pristine instance of the application), or a running instance of the application off the shelf. (Furthermore, the application can be run locally, on your own desktop computer, or in the cloud, for example, using something like a MyBinder-like service.) The user can then use the application directly (if it has a browser based UI), or call on it from elsewhere (eg in the case of Apache Tika). Once they’re done, they can keep a copy of whatever files they were working with and destroy their running version of the application. If they need the application again, they can just pull a new copy (of the latest version of the app, or the version they used previously) and fire up a new instance of it.

Another way of using Docker came to mind over the weekend when I saw a video demonstrating the use of the contentmine scientific literature analysis toolset. The contentmine installation instructions are a bit of a fiddle for the uninitiated, so I thought I’d try to pop them into a container. That was easy enough (for a certain definition of easy – it was a faff getting node to work and npm to be found, the Java requirements took a couple of goes, and I’m sure the image is way bigger than it really needs to be…), as the Dockerfile below/in the gist shows.

But the question then was how to access the tools? The tools themselves are commandline apps, so the first thing we want to do is to be able to call into the container to run the command. A handy post by Mike English entitled Distributing Command Line Tools with Docker shows how to do this, so that’s all good then…

The next step is to consider how to retain copies of the files created by the command line apps, or pass files to the apps for processing. If we have a target host directory and mount it into the container as a shared volume, we can keep the files on our desktop or allow the container to create files into the host directory. Then they’ll be accessible to us all the time, even if we destroy the container.

The gist that should be embedded below shows the Dockerfile and a simple batch file that passes the Contentmine tool commands into the container, which then executes them. The batch file idea could be further extended to produce a set of command shortcuts that essentially alias the Contentmine commands (eg a ./getpapers command rather than a ./contentmine getpapers command, or that combine the various steps associated with a particular pipeline or workflow – getpapers/norma/cmine, for example – into a single command).
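The batch file in the gist is shell, but the aliasing idea can be sketched in Python too. A hypothetical helper – the image name psychemedia/contentmine and the /workspace mount point are made up for illustration, not taken from the gist – that builds the docker run invocation, mounting the current directory so the tools’ output survives the container:

```python
import os

def contentmine_cmd(command, *args,
                    image="psychemedia/contentmine", workdir="/workspace"):
    """Build a `docker run` invocation that mounts the current host
    directory into the container and runs a Contentmine command there."""
    return [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:{workdir}",  # share host files with the container
        "-w", workdir,                     # run the command in the shared folder
        image, command, *args,
    ]

# e.g. subprocess.run(contentmine_cmd("getpapers", "-q", "aardvark", "-o", "papers"))
```

Wrapping each Contentmine command name in a one-line function like this gives you the ./getpapers style shortcuts without touching the container itself.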

UPDATE: the CenturyLinkLabs DRAY docker pipeline looks interesting in this respect for sequencing a set of docker containers and passing the output of one as the input to the next.

If there are other folk out there looking at using Docker specifically for self-managed “pull your own container” individual desktop/user applications, rather than as a devops solution for deploying services at scale, I’d love to chat…:-)

PS for several other examples of using Docker for desktop apps, including accessing GUI based apps using X Windows / X11, see Jessie Frazelle’s post Docker Containers on the Desktop.

PPS See also More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole for an attempt at exposing a GUI based app, such as Audacity, running in a container via a browser. Note that I couldn’t get a shared folder or the audio to work, although the GUI bit did…

PPPS I wondered how easy it would be to run command-line containers from within Jupyter notebook itself running inside a container, but got stuck. Related question on Stack Overflow here.

The way the rest of this post is published is something of an experiment – everything below the line is pulled in from a gist using the WordPress embedding-gists shortcode…

Harvesting Searched for Tweets Using Python

Via Tanya Elias/eliast05, a query regarding tools for harvesting historical tweets. I haven’t been keeping track of Twitter related tools over the last few years, so my first thought is often “could Martin Hawksey’s TAGSexplorer do it?”!

But I’ve also had the twecoll Python/command line package on my ‘to play with’ list for a while, so I thought I’d give it a spin. Note that the code requires Python to be installed (which it will be, by default, on a Mac).

On the command line, something like the following should be enough to get you up and running if you’re on a Mac (run the commands in a Terminal, available from the Utilities folder in the Applications folder). If wget is not available, download the twecoll file to the twitterstuff directory, and save it as twecoll (no suffix).

#Change directory to your home directory
$ cd

#Create a new directory - twitterstuff - in your home directory
$ mkdir twitterstuff

#Change directory into that directory
$ cd twitterstuff

#Fetch the twecoll code
$ wget
--2016-05-02 14:51:23--
Connecting to||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31445 (31K) [text/plain]
Saving to: 'twecoll'
twecoll                    100%[=========================================>]  30.71K  --.-KB/s   in 0.05s  
2016-05-02 14:51:24 (564 KB/s) - 'twecoll' saved [31445/31445]

#If you don't have wget installed, download the file from:
#and save it in the twitterstuff directory as twecoll (no suffix)

#Show the directory listing to make sure the file is there
$ ls

#Change the permissions on the file to 'user - executable'
$ chmod u+x twecoll

#Run the command file - the ./ reads as: 'in the current directory'
$ ./twecoll tweets -q "#lak16"

Running the code the first time prompts you for some Twitter API credentials (follow the guidance on the twecoll homepage), but this only needs doing once.

Testing the app, it works – tweets are saved as a text file in the current directory with an appropriate filename and suffix .twt – BUT the search doesn’t go back very far in time. (Is the Twitter search API crippled then…?)

Looking around for an alternative, I found the GetOldTweets python script, which again can be run from the command line; download the zip file from Github, move it into the twitterstuff directory, and unzip it. On the command line (if you’re still in the twitterstuff directory), run:

ls
to check the name of the folder (something like GetOldTweets-python-master) and then cd into it:

cd GetOldTweets-python-master/

to move into the unzipped folder.

Note that I found I had to install pyquery to get the script to run; on the command line, run: easy_install pyquery.

This script does not require credentials – instead it scrapes the Twitter web search. Data limits for the search can be set explicitly.

python --querysearch '#lak15' --since 2015-03-10 --until 2015-09-12 --maxtweets 500

Tweets are saved into the file output_got.csv and are semicolon delimited.
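The semicolon delimiter trips up tools that assume comma-separated values, so it’s worth setting it explicitly when reading the file back in. A minimal sketch using Python’s standard csv module – the column names here are illustrative stand-ins, not checked against the actual output_got.csv header:

```python
import csv
import io

# A couple of sample lines standing in for output_got.csv
sample = io.StringIO(
    "username;date;retweets;favorites;text\n"
    "someuser;2015-03-11 10:00;2;5;Example #lak15 tweet\n"
)

# The key point: tell the reader the delimiter is ';', not the default ','
rows = list(csv.DictReader(sample, delimiter=";"))
print(rows[0]["text"])  # Example #lak15 tweet
```

Swap the StringIO for open("output_got.csv") to read the real file.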

A couple of things I noticed with this script: it’s slow (because it “scrolls” through pages and pages of Twitter search results, which only have a small number of results on each) and on occasion it seems to hang (I’m not sure if it gets stuck in an infinite loop; on a couple of occasions I used ctrl-z to break out of it). In such a case, it doesn’t currently save results as you go along, so you are left with nothing; reduce the --maxtweets value and try again. On occasion, when running the script under the default Mac python 2.7, I noticed that there may be encoding issues in tweets which break the output, so again the file can’t get written.
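GetOldTweets doesn’t save as it goes, but the defensive pattern is cheap to add if you ever wrap it (or any other slow scraper) in your own code. A sketch – not part of the package – that flushes each row to disk as soon as it arrives, so an interrupted run still leaves partial results behind:

```python
import csv

def save_incrementally(tweets, path):
    """Write each tweet as soon as it arrives, flushing after every row,
    so a hung or interrupted run still leaves partial results on disk."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(["username", "date", "text"])
        for tweet in tweets:
            writer.writerow(tweet)
            f.flush()  # don't wait for the whole run to finish before hitting disk

save_incrementally([("someuser", "2015-03-11", "Example #lak15 tweet")], "partial.csv")
```

In real use, tweets would be a generator yielding rows as the scraper fetches them, rather than a pre-built list.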

Both packages run from the command line, or can be scripted from a Python programme (though I didn’t try that). If the GetOldTweets-python package can be tightened up a bit (eg in respect of UTF-8/tweet encoding issues, which are often a bugbear in Python 2.7), it looks like it could be a handy little tool. And for collecting stuff via the API (which requires authentication), rather than by scraping web results from advanced search queries, twecoll looks as if it could be quite handy too.

When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor

Although not necessarily the best way of publishing data, data tables in PDF documents can often be extracted quite easily, particularly if the tables are regular and the cell contents reasonably spaced.

For example, official timing sheets for F1 races are published by the FIA as event and timing information in a set of PDF documents containing tabulated timing data:


In the past, I’ve written a variety of hand crafted scrapers to extract data from the timing sheets, but the regular way in which the data is presented in the documents means that they are quite amenable to scraping using a PDF table extractor such as Tabula. Tabula exists both as a server application, accessed via a web browser, and as a service using the tabula extractor Java application.

I don’t recall how I came across it, but the tabulizer R package provides a wrapper for tabula extractor (bundled within the package) that lets you access the service via its command line calls. (One dependency you do need to take care of is to have Java installed; adding Java into an RStudio docker container would be one way of taking care of this.)

Running the default extractor command on the above PDF pulls out the data of the inner table:

extract_tables('Best Sector Times.pdf')


Where the data is spread across multiple pages, you get a data frame per page.


Note that the headings for the distinct tables are omitted. Tabula’s “table guesser” identifies the body of the table, but not the spanning column headers.

The default settings are such that tabula will try to scrape data from every page in the document.


Individual pages, or sets of pages, can be selected using the pages parameter. For example:

  • extract_tables('Lap Analysis.pdf', pages=1)
  • extract_tables('Lap Analysis.pdf',pages=c(2,3))

Specific areas for scraping can also be specified using the area parameter:

extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 230, 500)))

The area parameter originally appeared to take co-ordinates in the form top, left, width, height, but is now fixed to take co-ordinates in the same form as those produced by the tabula app debug output: top, left, bottom, right.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.


The tabula console output gives co-ordinates in the form: top, left, bottom, right – which is the form the tabulizer area parameter now expects, so the numbers can be passed straight through.


Using a combination of “guess” to find the dominant table, and specified areas, we can extract the data we need from the PDF and combine it to provide a structured and clearly labeled dataframe.

On my to do list: add this data source recipe to the Wrangling F1 Data With R book…

Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose

Pondering the Sunday Times Panama Papers directors/companies database yesterday (Panama Papers, Quick Start in SQLite3), I thought it was about time I got my head round using a graph database to store this sort of relational information.

Getting started with the neo4j database has been on my to do list for some time, so when I was pointed to a post on the neo4j blog, Analyzing the Panama Papers with Neo4j: Data Models, Queries & More, I thought I should give it a go.

So here’s a quick start to the first part – getting a working environment up and running. A quick search turned up a set of examples of how to get started using Neo4j from Jupyter notebooks by Nicole White, so I opted for a notebook/neo4j combination. The following docker-compose.yml file fires up a notebook server and a neo4j server in separate docker containers and links them together:

neo4j:
  image: kbastani/docker-neo4j:latest
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipy:
  image: jupyter/scipy-notebook
  ports:
    - "8888:8888"
  links:
    - neo4j:neo4j
  volumes:
    - .:/home/jovyan/work

Launching the docker CLI from Kitematic, I can cd into the directory containing the docker-compose.yml file and run the command docker-compose up -d to download and launch the containers.


In the browser, launch a Jupyter terminal and pull down the example notebooks by running the command:

git clone

The notebooks will be downloaded into the folder neo4j-jupyter. Still in the terminal, create a figure directory, as required for the hello-world.ipynb notebook.

mkdir -p neo4j-jupyter/figure

You’ll also need to install the py2neo python package. By default, this will be installed into the Python 3 path:

pip install py2neo

The example notebooks run in a Python 2 kernel, so we need to install the package into that environment too:

source activate python2
pip install py2neo


Now you should be able to run the example notebooks. One thing to note, though – you will need to change the connection details for the neo4j database slightly. In the appropriate notebook code cells, change the default graph = Graph() connection to:

graph = Graph("http://neo4j:7474/db/data/")


I’ve run out of time to do any more just now. In the next post on this topic, I’ll see if I can work out how to get the Sunday Times data into neo4j.

Panama Papers, Quick Start in SQLite3

Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well as the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.

Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.

  • Download and install SQLite3
  • download the data from the Sunday Times and unzip it
  • on the command line/terminal, cd into the unzipped directory
  • create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
  • you should now be presented with a SQLite console command line. Run the command: .mode csv
  • we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
  • We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
  • so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
  • let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
  • see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
  • see which people hold the most president/director positions: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
  • exit SQLite console by running: .q
  • to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
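The same quick start can also be scripted using Python’s built-in sqlite3 module – a sketch with a couple of made-up rows standing in for the downloaded Sunday Times CSV:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE panama(company_url TEXT, company_name TEXT,"
    " officer_position_es TEXT, officer_position_en TEXT, officer_name TEXT,"
    " inc_date TEXT, dissolved_date TEXT, updated_date TEXT,"
    " company_type TEXT, mf_link TEXT)"
)

# Made-up rows standing in for the real CSV (use the csv module to load that)
rows = [
    ("u1", "ACME Ltd", "", "Director", "A N Other", "", "", "", "", ""),
    ("u2", "ACME Inc", "", "Director", "A N Other", "", "", "", "", ""),
    ("u3", "Widgets SA", "", "President", "J Doe", "", "", "", "", ""),
]
conn.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# Which officers are named the most?
top = conn.execute(
    "SELECT officer_name, COUNT(*) AS c FROM panama"
    " GROUP BY officer_name ORDER BY c DESC LIMIT 10"
).fetchall()
print(top)  # [('A N Other', 2), ('J Doe', 1)]
```

The SELECTs from the bullet list above drop straight into conn.execute() calls, so the whole console session can become a repeatable script.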



Have fun…

PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]