Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.
Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:
### Show nodejs version and packages
> node -v
v0.10.35
> npm list -g
/usr/local/lib
├─┬ <package>@<version>
│ ├── <package>@<version>
│ ├── <package>@<version>
│ ├── <package>@<version>
│ ├── <package>@<version>
│ ├── <package>@<version>
│ ├── <package>@<version>
...
│ └── <package>@<version>
└─┬ <package>@<version>
  ├─┬ <package>@<version>
  │ ├── <package>@<version>
...
Using this information, I could then use nvm, a node.js version manager, installed via:
curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash
to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js
(I should probably add the tty.js version in there too? npm install tty.js@<version> perhaps?)
The terminal can then be run as a daemon from:
/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize
What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and then recreate it on a new machine? That is, export the node version number and versions of the installed packages, then create an installation script that will recreate that setup?
It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:
### Show packages
And we can get a list of Python packages – and their version numbers – using:
### Show Python packages
Surely there must be some simple tools/utilities out there that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file in a format that allows you to use that file as part of an installation script in a new machine to help rebuild the original one?
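As a minimal sketch of the sort of thing I have in mind, the following Python turns captured version information into an installation script. It assumes the package lists have already been dumped to text, for example from npm list -g --depth=0 and pip freeze, and that nvm has been loaded into the shell on the target machine. The function names and the example package names are hypothetical, made up purely for illustration; this is not an existing tool.

```python
import re


def parse_npm_list(text):
    """Pull (name, version) pairs out of `npm list -g` style tree output."""
    packages = []
    for line in text.splitlines():
        # Tree lines end in name@version; the first line is just the install path.
        match = re.search(r"([@\w./-]+)@(\d[\w.+-]*)\s*$", line)
        if match:
            packages.append((match.group(1), match.group(2)))
    return packages


def build_install_script(node_version, npm_packages, pip_requirements):
    """Build a shell script that reinstalls the captured versions.

    Assumes nvm is already available in the shell that runs the script.
    """
    lines = ["#!/bin/bash", "nvm install {}".format(node_version)]
    for name, version in npm_packages:
        lines.append("npm install -g {}@{}".format(name, version))
    for requirement in pip_requirements:
        requirement = requirement.strip()
        # `pip freeze` lines are already in name==version form.
        if requirement and not requirement.startswith("#"):
            lines.append("pip install {}".format(requirement))
    return "\n".join(lines) + "\n"


# Hypothetical captured output; the package names are placeholders.
npm_output = "/usr/local/lib\n├── examplepkg@1.2.3\n└── otherpkg@0.4.0\n"
print(build_install_script("0.10.35", parse_npm_list(npm_output), ["examplelib==2.0"]))
```

The same pattern would presumably extend to the Linux package list, given a command that emits package names and versions one per line.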
Having got my promotion case through the sub-Faculty level committee (with support and encouragement from senior departmental colleagues), it’s time for another complete rewrite to try to get it through the Faculty committee. Guidance suggests that it is not inappropriate – and may even be encouraged – for a candidate to include something about their academic philosophy, so here are some scribbled thoughts on mine…
One of the declared Charter objects (sic) of the Open University is "to promote the educational well-being of the community generally", as well as "the advancement and dissemination of learning and knowledge". Both as a full-time PhD student with the OU (1993-1997), and then as an academic (1999-), I have pursued a model of open practice, driven by the idea of learning in public, with the aim of communicating academic knowledge into, and as part of, wider communities of practice, modeling learning behaviour through demonstrating my own learning processes, and originating new ideas in a challengeable and open way as part of my own learning journey.
My interest in open educational resources is in part a subterfuge, driven by a desire that educators be more open in demonstrating their own learning and critical practices, including the confusion and misconceptions they grapple with along the way, rather than being seen simply as professors of some sort of inalienable academic truth.
My interest in short course development is based on the belief that for the University to contribute effectively to continued lifelong education and professional development, we need to have offerings that are at an appropriate level of granularity as well as academic level. Degrees represent only one, early, part of that journey. Learners are unlikely to take more than one undergraduate degree in their lifetime, but there is no reason why they should not continue to engage in learning throughout their life. Evidence from the first wave of MOOCs suggests that many participants in those courses were already graduates, with an appreciation of the values of learning and the skills to enable them to engage with those offerings. The characterisation of MOOCs as xMOOCs (traditional course style offerings) or the looser networked modeled "connectivist MOOCs", cMOOCs, [H/T @r3becca in the comments;-)] reflects different educational philosophies: the former may cruelly be described as being based on a model in which the learner expects to be taught (and the instructors expect to profess), whereas the latter requires that participants are engaged in a more personal, yet still collaborative, learning journey, where it is up to each participant to make sense of the world in an open and public way, informed and aided, but also challenged, by other participants. That's how I work every day. I try to make sense of the world to myself, often for a purpose, in public.
Much of my own learning is the direct result of applied problem solving. I try to learn something every day, often as the result of trying to do something each day that I haven't been able to do before. The OUseful.info blog is my own learning diary and a place I can look to refer to things I have previously learned. The posts are written in a way that reinforces my own learning, as a learning resource. The posts often take longer to write than the time taken to discover or originate the thing learned, because in them I try to represent a reflection and retelling of the rationale for the learning event and the context in which it arose: a problem to be solved, my state of knowledge at the time, the means by which I came to make sense of the situation in order to proceed, and the learning nugget that resulted. The thing I can see or do now but couldn't before. Capturing the "I couldn't do X because of Y but now I can, by doing Z" supports a similar form of discovery as the one supported by question and answer sites: the content is auto-optimised to include both naive and expert information, which aids discovery. (It often amused me that course descriptions would often be phrased in the terms and language you might expect to know having completed the course. Which doesn't help the novice discover it a priori, before they have learned those keywords, concepts or phrases that the course will introduce them to...). The posts also try to model my own learning process, demonstrating the confusion, showing where I had a misapprehension or just plain got it wrong. The blog also represents a telling of my own learning journey over an extended period of time, and as such may be thought of as an uncourse, something that could perhaps be looked at post hoc as a course but that was originated as my own personal learning journey unfolded.
Hmmm… 1500 words for the whole begging letter, so I need to cut the above down to a sentence…
It’s been some time now since I drafted most of my early unit contributions to the TM351 Data management and analysis course. Part of the point (for me) in drafting that material was to find out what sorts of thing we actually wanted to say and help identify the sorts of abstractions we wanted to then build a narrative around. Another part of this (for me) means exploring new ways of putting powerful “academic” ideas and concepts into meaningful contexts; finding new ways to describe them; finding ways of using them in conjunction with other ideas; or finding new ways of using – or appropriating them – in general (which in turn may lead to new ways of thinking about them). These contexts are often process based, demonstrating how we can apply the ideas or put them to use (make them useful…) or use the ideas to support problem identification, problem decomposition and problem solving. At heart, I’m more of a creative technologist than a scientist or an engineer. (I aspire to being an artist…;-)
Someone who I think has a great take on conceptualising the data wrangling process – in part arising from his prolific tool building approach in the R language – is Hadley Wickham. His recent work for RStudio is built around an approach to working with data that he’s captured as follows (e.g. “dplyr” tutorial at useR 2014, Pipelines for Data Analysis):
Following an often painful and laborious process of getting data into a state where you can actually start to work with it, you can then enter into an iterative process of transforming the data into various shapes and representations (often in the sense of re-presentations) that you can easily visualise or build models from. (In practice, you may have to keep redoing elements of the tidy step and then re-feed the increasingly cleaned data back into the sensemaking loop.)
Hadley’s take on this is that the visualisation phase can spring surprises on you but doesn’t scale very well, whilst the modeling phase scales but doesn’t surprise you.
To support the different phases of activity, Hadley has been instrumental in developing several software libraries for the R programming language that are particularly suited to the different steps. (For the modeling, there are hundreds of community developed and often very specialised R libraries for doing all manner of weird and wonderful statistics…)
In many respects, I’ve generally found the way Hadley has presented his software libraries to be deeply pragmatic – the tools he’s developed are useful and in many senses naturalistic; they help you do the things you need to do in a way that makes practical sense. The steps they encourage you to take are natural ones, and useful ones. They are the sorts of tools that implement the sorts of ideas that come to mind when you’re faced with a problem and you think: this is the sort of thing I need (to be able) to do. (I can’t comment on how well implemented they are; I suspect: pretty well…)
Just as the data wrangling process diagram helps frame the sorts of things you’re likely to do into steps that make sense in a “folk computational” way (in the sense of folk computing or folk IT (also here), a computational correlate to notions of folk physics, for example), Hadley also has a handy diagram for helping us think about the process of solving problems computationally in a more general, problem solving sense:
A cognitive think it step, identifying a problem, and starting to think about what sort of answer you want from it, as well as how you might start to approach it; a describe it step, where you describe precisely what it is you want to do (the sort of step where you might start scribbling pseudo-code, for example); and the computational do it step where the computational grunt work is encoded in a way that allows it to actually get done by machine.
I’ve been pondering my own stance towards computing lately, particularly from my own context of someone who sees computery stuff from a more technology, tool building and tool using context (that is, using computery things to help you do useful stuff), rather than framing it as a purer computer science or even “trad computing” take on operationalised logic, where the practical why is often ignored.
So I think this is how I read Hadley’s diagram…
Figuring out what the hell it is you want to do (imagining, the what for a particular why), figuring out how to do it (precisely; the programming step; the how); hacking that idea into a form that lets a machine actually do it for you (the coding step; the step where you express the idea in a weird incantation where every syllable has to be the right syllable; and from which the magic happens).
One of the nice things about Hadley’s approach to supporting practical spell casting (?!) is that the transformation or operational steps his libraries implement are often based around naturalistic verbs. They sort of do what they say on the tin. For example, in the dplyr toolkit, there are the following verbs:
These sort of map onto elements (often similarly named) familiar to anyone who has used SQL, but in a friendlier way. (They don’t SHOUT AT YOU for a start.) It almost feels as if they have been designed as articulations of the ideas that come to mind when you are trying to describe (precisely) what it is you actually want to do to a dataset when working on a particular problem.
In a similar way, the ggvis library (the interactive chart reinvention of Hadley’s ggplot2 library) builds on the idea of Leland Wilkinson’s “The Grammar of Graphics” and provides a way of summoning charts from data in an incremental way, as well as a functionally and grammatically coherent way. The words the libraries use encourage you to articulate the steps you think you need to take to solve a problem – and then, as if by magic, they take those steps for you.
If programming is the meditative state you need to get into to cast a computery-thing spell, and coding is the language of magic, things like dplyr help us cast spells in the vernacular.
You know how it goes – you start trying to track down a forward “forthcoming” reference and you end up wending your way through all manner of things until you get back to where you started none the wiser… So here’s a snapshot of several docs I found trying to source the original forward reference for the following table, found in Improving information to support decision making: standards for better quality data (November 2007, first published October 2007) with the crib that several references to it mentioned the Audit Commission…
The first thing I came across was The Use of Information in Decision Making – Literature Review for the Audit Commission (2008), prepared by Dr Mike Kennerley and Dr Steve Mason, Centre for Business Performance Cranfield School of Management, but that wasn’t it… This document does mention a set of activities associated with the data-to-decision process: Which Data, Data Collection, Data Analysis, Data Interpretation, Communication, Decision making/planning.
The data and information definitions from the table do appear in a footnote – without reference – in Nothing but the truth? A discussion paper from the Audit Commission in Nov 2009, but that’s even later… The document does, however, identify several characteristics (cited from an earlier 2007 report (Improving Information, mentioned below…), and endorsed at the time by Audit Scotland, Northern Ireland Audit Office, Wales Audit Office and CIPFA, with the strong support of the National Audit Office), that contribute to a notion of “good quality” data:
Good quality data is accurate, valid, reliable, timely, relevant and complete. Based on existing guidance and good practice, these are the dimensions reflected in the voluntary data standards produced by the Audit Commission and the other UK audit agencies
* Accuracy – data should be sufficiently accurate for the intended purposes.
* Validity – data should be recorded and used in compliance with relevant requirements.
* Reliability – data should reflect stable and consistent data collection processes across collection points and over time.
* Timeliness – data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.
* Relevance – data captured should be relevant to the purposes for which it is used.
* Completeness – data requirements should be clearly specified based on the information needs of the body and data collection processes matched to these requirements.
The document also has some pretty pictures, such as this one of the data chain:
In the context of the data/information/knowledge definitions, the Audit Commission discussion document also references the 2008 HMG strategy document Information matters: building government’s capability in managing knowledge and information, which includes the table in full; a citation link is provided, but it 404s, although the source is given as the November 2008 version of Improving information, the one we originally started with. So the original reference forward refers the table to an unspecified report, but future reports in the area refer back to that “original” without making a claim to the actual table itself?
Just in passing, whilst searching for the Improving information report, I actually found another version of it… Improving information to support decision making: standards for better quality data, Audit Commission, first published March 2007.
The table and the definitions as cited in Information Matters do not seem to appear in this earlier version of the document?
PS Other tables do appear in both versions of the report. For example, both the March 2007 and November 2007 versions of the doc contain this table (here, taken from the 2008 doc) of stakeholders:
Anyway, aside from all that, several more documents for my reading list pile…
PS see also Audit Commission – “In the Know” from February 2008.
So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the names of the people identified in those documents and the companies they have been associated with.
Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of png images of text documents. And you have a stack of them to search through to find a particular name. What do you do?
Apart from cry a little, inside?
If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.
But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.
So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a database, how about we settle for just getting the text (the sort of thing a search engine might extract from a set of documents that it can index and search over, for example).
Something like this Microsoft Office (Word) doc for example:
Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):
Or this image I captured from a fragment of the scanned PDF – no text as such here…:
What are we to do?
Here’s where Apache Tika can help…
Apache Tika is like magic; give it a document and it’ll (try to) give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of Docker containers.
In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…
In my case, I had Apache Tika running at the address http://quicktika-1.psychemedia.cont.tutum.io:8008/tika (that address is no longer working).
I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).
There are two steps:
- PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta > path/to/myfile_txt.json
- Look at the result in the returned JSON file (path/to/myfile_txt.json)
Simple as that…simple as this:
Parse the word doc shown above…
You can see the start of the extracted text in the x-Tika:content element at the bottom…
Parse the PDF doc shown above…
Parse the actual image of the fragment of the PDF doc shown above…
See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)
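Having saved the response to a file with curl, pulling the text out of it only takes a few lines of Python. The sketch below assumes the /rmeta response is a JSON list of metadata records, one for the document plus one for each embedded item, with the extracted text held under a content key; the key is matched case-insensitively because I haven't checked the exact capitalisation across Tika versions. The function names are my own.

```python
import json


def extract_text(rmeta_records):
    """Join the extracted text from a parsed Tika /rmeta response.

    Assumes a list of dicts, each optionally carrying a content key.
    """
    texts = []
    for record in rmeta_records:
        for key, value in record.items():
            # Case-insensitive match on the content key (an assumption, see above).
            if key.lower() == "x-tika:content" and value:
                texts.append(value.strip())
    return "\n".join(texts)


def extract_text_from_file(path):
    """Load a JSON file saved by curl and return the text it contains."""
    with open(path, encoding="utf-8") as f:
        return extract_text(json.load(f))


# Usage, once curl has written the response to disk:
# print(extract_text_from_file("path/to/myfile_txt.json"))
```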
Running Apache Tika in the Cloud
As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]
Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.
Now launch a container – the one you want is logicalspark/docker-tikaserver:
To be able to access the service over the web, you need to make its ports public:
I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:
Having created and deployed the container, look up its address from the Endpoints tab. The address will be something like tcp://thing-1.yourid.cont.tutum.io:NNNN. You can check the service is there by going to thing-1.yourid.cont.tutum.io:NNNN/tika in your browser.
When you’re done, terminate the container and the node cluster so you don’t get billed any more than is necessary.
Running Apache Tika on your own computer
- Install boot2docker
- Launch boot2docker
- In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
- To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
- Enter boot2docker ip to find the address boot2docker is publishing to (eg 192.168.59.103);
- Check the server is there – in your browser, go to eg: http://192.168.59.103:9998/tika
(Don’t be afraid of command lines; you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help..;-) All steps 3 to 5 do is get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. (What do you think a computer actually does when you select a menu option?!))
PS via @Pudo, see also: textract – python library for “extracting text out of any document”.
I came across Apache Tika a few weeks ago, a service that will tell you what pretty much any document type is based on its metadata, and will have a good go at extracting text from it.
With a prompt and a 101 from @IgorBrigadir, it was pretty easy getting started with it – sort of…
First up, I needed to get the Apache Tika server running. As there’s a containerised version available on dockerhub (logicalspark/docker-tikaserver), it was simple enough for me to fire up a server in a click using tutum (as described in this post on how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour; pretty much all you need to do is fire up a server, start a container based on logicalspark/docker-tikaserver, and tick to make the port public…)
His suggested recipe for using the Python requests library borked for me – I couldn’t get Python to open the file to get the data bits to send to the server (file encoding issues; one reason for using Tika is it’ll try to accept pretty much anything you throw at it…)
I had a look at pycurl:
!apt-get install -y libcurl4-openssl-dev
!pip3 install pycurl
but couldn’t make head or tail of how to use it: the pycurl equivalent of curl -T foo.doc http://example.com:9998/rmeta can’t be that hard to write, can it? (Translations appreciated via the comments…;-)
Instead I took the approach of dumping the result of a curl request on the command line into a file:
!curl -T Text/foo.doc http://example.com:9998/rmeta > tikatest.json
and then grabbing the response out of that:
Not elegant, and far from ideal, but a stop gap for now.
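For what it's worth, here is a minimal sketch of how the same PUT might be made from Python using only the standard library, rather than pycurl or requests. Opening the file in binary mode means Python never tries to decode it, which I'd guess sidesteps the encoding problem. The function names are my own, the Content-Type header is an assumption about what the server will accept as "unknown, please detect", and I haven't run this against a live Tika server.

```python
import json
import urllib.request


def build_put_request(path, url):
    """Prepare an HTTP PUT of a file's raw bytes, roughly like curl -T."""
    # "rb" reads bytes, so no text decoding (and no encoding errors) is involved.
    with open(path, "rb") as f:
        payload = f.read()
    # Assumption: a generic binary type leaves Tika to detect the real format.
    headers = {"Content-Type": "application/octet-stream"}
    return urllib.request.Request(url, data=payload, headers=headers, method="PUT")


def tika_rmeta(path, url):
    """Send the file to the server and return the parsed JSON response."""
    request = build_put_request(path, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a running server, so left commented out):
# records = tika_rmeta("Text/foo.doc", "http://example.com:9998/rmeta")
```

Splitting out the request-building step keeps the file handling checkable without a server to hand.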
Part of the response from the Tika server is the text extracted from the document, which can then provide the basis for some style free text analysis…
I haven’t tried with any document types other than crappy old MS Word .doc formats, but this looks like it could be a really easy tool to use.
And with the containerised version available, and tutum and Digital Ocean to hand, it’s easy enough to fire up a version in the cloud, let alone my desktop, whenever I need it:-)
…aka “how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour”…
I managed to get my first container up and running in the cloud today (yeah!:-), using tutum to launch a container I’d defined on Dockerhub and run it on a linked DigitalOcean server (or as they call them, “droplet”).
This sort of thing is probably a “so what?” to many devs, or even folk who do the self-hosting thing, where for example you can launch your own web applications using CPanel, setting up your own WordPress site, perhaps, or an online database.
The difference for me is that the instance of OpenRefine I got up and running in the cloud via a web browser was the result of composing several different, loosely coupled services together:
- I’d already published a container on dockerhub that launches the latest release version of OpenRefine: psychemedia/docker-openrefine. This lets me run OpenRefine in a boot2docker virtual machine running on my own desktop and access it through a browser on the same computer.
- Digital Ocean is a cloud hosting service with simple billing (I looked at Amazon AWS but it was just too complicated) that lets you launch cloud hosted virtual machines of a variety of sizes and in a variety of territories (including the UK). Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. So that’s all I did there – created an account and made a small payment. [Affiliate Link: sign up to Digital Ocean and get $10 credit]
- tutum is an intermediary service that makes it easy to launch servers and containers running inside them. By linking a DigitalOcean account to tutum, I can launch containers on DigitalOcean in a relatively straightforward way…
Launching OpenRefine via tutum
I’m going to start by launching a 2GB machine which comes in at 3 cents an hour, capped at $20 a month.
Now we need to get a container – which I’m thinking of as if it was a personal app, or personal app server:
I’m going to make use of a public container image – here’s one I prepared earlier…
We need to do a tiny bit of configuration. Specifically, all I need to do is ensure that I make the port public so I can connect to it; by default, it will be assigned to a random port in a particular range on the publicly viewable service. I can also set the service name, but for now I’ll leave the default.
If I create and deploy the container, the image will be pulled from dockerhub and a container launched based on it that I should be able to access via a public URL:
The first time I pull the container into a specific machine it takes a little time to set up as the container files are imported into the machine. If I create another container using the same image (another OpenRefine instance, for example), it should start really quickly because all the required files have already been loaded into the node.
Unfortunately, when I go through to the corresponding URL, there’s nothing there. Looking at the logs, I think maybe there wasn’t enough memory to launch a second OpenRefine container… (I could test this by launching a second droplet/server with more memory, and then deploying a couple of containers to that one.)
The billing is calculated on DigitalOcean at an hourly rate, based on the number and size of servers running. To stop racking up charges, you can terminate the server/droplet (so you also lose the containers).
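To get a feel for the numbers, here is a back-of-the-envelope sketch of hourly billing with a monthly cap, using the 3 cents an hour, $20 a month figures quoted for the 2GB machine. The function name is hypothetical, and the model is a simplification that treats the cap as applying to a single server within a single month; working in whole cents avoids floating point rounding.

```python
def droplet_cost_cents(hours, cents_per_hour, monthly_cap_cents=None):
    """Estimate the cost of one server for a number of hours in one month."""
    cost = hours * cents_per_hour
    if monthly_cap_cents is not None:
        # Charges stop accumulating once the monthly cap is reached.
        cost = min(cost, monthly_cap_cents)
    return cost


# An 8 hour session on the 2GB machine (3 cents/hour, $20 cap).
print(droplet_cost_cents(8, 3, 2000))        # 24 cents
# Left running for a 30 day month (720 hours), the cap kicks in.
print(droplet_cost_cents(24 * 30, 3, 2000))  # 2000 cents, i.e. $20
```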
Note that in the case of OpenRefine, we could allow several users all to access the same OpenRefine container (the same URL) and just run different projects within them.
Although this is probably not the way that dev ops folk think of containers, I’m seeing them as a great way of packaging service based applications that I might want to run at a personal level, or perhaps in a teaching/training context, maybe on a self-service basis, maybe on a teacher self-service basis (fire up one application server that everyone in a cohort can log on to, or one container/application server for each of them; I noticed that I could automatically launch as many containers as I wanted – a 64GB 20 core processor costs about $1 per hour on Digital Ocean, so for an all day School of Data training session, for example, with 15-20 participants, that would be about $10, with everyone in such a class of 20 having their own OpenRefine container/server, all started with the same single click? Alternatively, we could fire up separate droplet servers, one per participant, each running its own set of containers? That might be harder to initialise though (i.e. more than one or two clicks?!) Or maybe not?)
One thing I haven’t explored yet is mounting data containers/volumes to link to application containers. This makes sense in a data teaching context because it cuts down on bandwidth. If folk are going to work on the same 1GB file, it makes sense to just load it in to the virtual machine once, then let all the containers synch from that local copy, rather than each container having to download its own copy of the file.
The advantage of the approach described in the walkthrough above over “pre-configured” self-hosting solutions is the extensibility of the range of applications available to me. If I can find – or create – a Dockerfile that will configure a container to run a particular application, I can test it on my local machine (using boot2docker, for example) and then deploy a public version in the cloud, at an affordable rate, in just a couple of steps.
Whilst templated configurations using things like fig or panamax, which would support the 1-click launch of multiple linked container configurations, aren’t supported by tutum yet, I believe they are in the timeline… So I look forward to trying out a click cloud version of Using Docker to Build Linked Container Course VMs when that comes onstream:-)
In an institutional setting, I can easily imagine a local docker registry that hosts images for apps that are “approved” within the institution, or perhaps tagged as relevant to particular courses. I don’t know if it’s similarly possible to run your own panamax configuration registry, as opposed to pushing a public panamax template for example, but I could imagine that being useful institutionally too? For example, I could put container images on a dockerhub style OU teaching hub or OU research hub, and container or toolchain configurations that pull from those on a panamax style course template register, or research team/project register? To front this, something like tutum, though with an even easier interface to allow me to fire up machines and tear them down?
Just by the by, I think part of the capital funding the OU got recently from HEFCE was slated for a teaching related institutional “cloud”, so if that’s the case, it would be great to have a play around trying to set up a simple self-service personal app runner thing ?;-) That said, I think the pitch in that bid probably had the forthcoming TM352 Web, Mobile and Cloud course in mind (2016? 2017??), though from what I can tell I’m about as persona non grata as possible with respect to even being allowed to talk to anyone about that course!;-)