Category: Infoskills

Keeping Up With What’s Possible – Daily Satellite Imagery from AWS

Via @simonw’s rebooted blog, I spotted this – Landsat on AWS: “Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes are available from the start of imagery capture. All new Landsat 8 scenes are made available each day, often within hours of production.”

What do things like this mean for research, and teaching?

For research, I’m guessing we’ve gone from a state 20 years ago – no data [widely] available – via 10 years ago – available under license, with a delay and perhaps as periodic snapshots – to now – daily availability. How does this impact on research, and what sorts of research are possible? And how well suited are legacy workflows and tools to supporting work that can make use of daily updated datasets?

For teaching, the potential is there to do activities around a particular dataset that is current, but this introduces all sorts of issues when trying to write and support the activity (eg we don’t know what specific features the data will turn up in the future). We struggle with this anyway when trying to write activities that give students an element of free choice or open-ended exploration, where we don’t specifically constrain what they do. Which is perhaps why we tend to be so controlling – there is little opportunity for us to respond to something a student discovers for themselves.

The realtime-ish-ness of the data means we could engage students with contemporary issues, and perhaps enthuse them about the potential of working with datasets that we can only hint at, or provide a grounding for, in the course materials. There are also opportunities for introducing students to datasets and workflows that they might be able to use in their workplace, and as such act as a vector for getting new ways of working out of the Academy – and out of the tech hinterland that the Academy may be aware of – and into more SMEs (helping SMEs avail themselves of emerging capabilities via OUr students).

At a more practical level, I wonder: if OU academics (research or teaching related) wanted to explore the Landsat 8 data on AWS, would they know how to get started?

What sort of infrastructure, training or support do we need to make this sort of stuff accessible to folk who are interested in exploring it for the first time (other than Jupyter notebooks, RStudio, and Docker of course!;-) ?
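By way of a starter for ten, here’s a minimal sketch of how you might start poking around the imagery listings from a Python prompt or Jupyter notebook using the boto3 package. (The landsat-pds bucket name and the L8/path/row prefix layout are assumptions on my part – check the Landsat on AWS documentation for the current bucket structure.)

#A minimal sketch - the landsat-pds bucket name and L8/path/row layout are assumptions
import boto3
from botocore import UNSIGNED
from botocore.client import Config

#The bucket is public, so anonymous (unsigned) requests are fine - no AWS account needed
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

#List the scene folders available for one (illustrative) path/row
resp = s3.list_objects_v2(Bucket='landsat-pds', Prefix='L8/201/024/', Delimiter='/')
for scene in resp.get('CommonPrefixes', []):
    print(scene['Prefix'])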

PS Alan Levine /@cogdog picks up on the question of what’s possible now vs. then: http://cogdogblog.com/2017/11/landsat-imagery-30-years-later/. I might also note: this is how the blogosphere used to work on a daily basis 10-15 years ago…

Open Educational Resources from Government and Parliament

Mentioning to a colleague yesterday that the UK Parliamentary Library publishes research briefings and reports – on topics of emerging interest, as well as in support of legislation – that often provide a handy, informed, and politically neutral overview of a subject area, and that could make for a useful learning resource, the question was asked whether or not they might have anything on the “internet of things”. The answer is not much, but it got me thinking a bit more about the range of documents and document types produced across Parliament and Government that can be used to educate and inform, as well as contribute to debate.

In other words, to what extent might such documents be used in an educational sense, whether in the sense of providing knowledge and information about a topic, providing a structured review of a topic area and the issues associated with it, raising questions about an issue, or reporting on an analysis of it? (There are also opportunities for learning from some of the better Parliamentary debates, for example in terms of how to structure an argument, or explore the issues associated with a particular question, but Hansard is out of scope of this post!)

(Also note that I’m coming at this as a technologist, interested as much in the social processes, concerns and consequences associated with science and technology as in the deep equations and principles that tend to be taught as the core of the subject, at least in HE. And that I’m interested not just in how we can support the teaching and learning of current undergrads, but also how we can enculturate them into the availability and use of certain types of resource that are likely to continue being produced into the future, and as such provide a class of resources that will continue to support the learning and education of students once they leave formal education.)

So using IoT as a hook to provide examples, here’s the range of documents I came up with. (At some point it may be worth tabulating this to properly summarise the sorts of information these reports might contain, the communicative point of the document (to inform, persuade, provide evidence for or against something, etc), and any political bias that may be likely (in policy docs, for example).)

Parliamentary Library Research Briefings

The Parliamentary Library produces a range of research briefings to cover matters of general interest (Commons Briefing Papers, Lords Library Notes) – perhaps identified through multiple questions asked of the Library by members? – as well as background to legislation (Commons Debate Packs, Lords In Focus), through the Commons and Lords Libraries respectively.

Some of the research briefings include data sets (do a web search for site:http://researchbriefings.files.parliament.uk/ filetype:xlsx) which can also be quite handy.
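If you wanted a quick look at one of those spreadsheets in a pandas dataframe, something like the following minimal sketch should work (the URL below is a placeholder – swap in a spreadsheet link actually turned up by the search above):

import pandas as pd

#Placeholder URL - substitute a spreadsheet link found via the site: search above
url = 'http://researchbriefings.files.parliament.uk/documents/EXAMPLE/example-data.xlsx'
df = pd.read_excel(url)
df.head()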

There are also POSTnotes from the Parliamentary Office of Science and Technology, aka POST.

For access to briefings on matters currently in the House, the Parliament website provides timely/handy pages that list briefing documents for matters in the House today/this week. In addition, there are feeds available for recent briefings from all three sources: the Commons Briefing Papers feed, the Lords Library Notes feed, and the POSTnotes feed. If you’re looking for long reads and still use a feed reader, get subscribing;-)
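For example, here’s a minimal sketch of polling one of those feeds with the Python feedparser package (the feed URL is a placeholder – substitute the address of the actual feed you want to follow):

import feedparser

#Placeholder feed URL - use the address of the Commons Briefing Papers / Lords Library Notes / POSTnotes feed
FEED_URL = 'http://researchbriefings.parliament.uk/example-feed.rss'
feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:10]:
    print(entry.title + ' - ' + entry.link)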

Wider Parliamentary Documents

The Parliament website also supports navigation of topical issues such as Science and Technology, as well as sub-topics, such as Internet and Cybercrime. (I’m not sure how the topics/sub-topics are identified or how the graph is structured… That may be one to ask about when I chat to Parliamentary Library folk next week?:-)

Within the topic areas, relevant Commons and Lords related Library research briefings are listed, as well as POSTnotes, Select Committee Reports and Early Day Motions.

(By the by, it’s also worth noting that chunks of the Parliament website are currently in scope of a website redesign.)

Government Documents

Along with legislation currently going through Parliament, which is published on the Parliament website (together with Hansard reports that record, verbatim(-ish!), proceedings of debates in either House), explanatory notes provided by the Government department bringing a bill forward offer additional, supposedly more accessible/readable, information around it.

Reports are also published by government offices. For example, the Blackett review (2014) on the Internet of things was a Government Office for Science report from the UK Government Chief Scientific Adviser at the time (The Internet of Things: making the most of the Second Digital Revolution). Or how about a report from the Intellectual Property Office on Eight great technologies: The internet of things.

Briefing documents also appear in a variety of other guises. For example, competitions (such as the Centre for Defence Enterprise (CDE) competition on security for the internet of things, or Ofcom’s consultation on More radio spectrum for the Internet of Things) and consultations may both provide examples of how to start asking questions about a particular topic area (questions that may help to develop critical thinking, prompt critical reflection, or even provide ideas for assessment!).

Occasionally, you can also turn up a risk assessment or cost-benefit analysis, such as this Department for Business, Energy & Industrial Strategy Smart meter roll-out (GB): cost-benefit analysis.

EC Parliamentary Research Service

In passing, it’s also worth noting that the EC Parliamentary Research Service also produces briefings, such as this report on The Internet Of Things: Opportunities And Challenges, as well as publishing resources linked from topic based pages, such as the Digital Single Market theme page on The Internet of Things.

Summary

In providing support for all members of the House, the Parliamentary research services must produce research briefings that can be used by both sides of the House. This may stand in contrast to documents produced by Government that may be influenced by particular policy (and political) objectives, or formal reports published by industry bodies and the big consultancies (the latter often producing reports that are either commissioned on behalf of government or published to try to promote some sort of commercial interest that can be sold to government) that may have a lobbying aim.

As I’ve suggested previously (News, Courses and Scrutiny and Learning Problems and Consultation Based Curricula), maybe we could/should be making more use of them as part of higher education course readings, not just as a way of getting a quick, NPOV view over a topic area, but also as a way of introducing students to a form of free and informed content, produced in a timely way in response to issues of the day. In short, a source that will continue to remain relevant and current over the coming years, as students (hopefully) become lifelong, continuing learners.

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, an application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?

DH_Box0

If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).

DH_Box

(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.

DH_Box6

The applications are behind a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?

DH_Box2

Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:

DH_Box4

They work, too!

DH_Box5

On the Jupyter notebook front, you get access to Python3 and R kernels:

DH_Box3

In passing, I notice that RStudio’s RMarkdown now supports some notebook-like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?). If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of the claim that “tmpnb can run any Docker container”. Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which MyBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service any time now. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics and not practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole

In a PS to Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers, I linked to a demonstration by Jessie Frazelle on how to connect to GUI based apps running in a container via X11. This is all very well if you have an X client, but it would be neater if we could find a way of treating the docker container as a virtual desktop container, and then accessing the app running inside it via the desktop presented through a browser.

Digging around, Guacamole looks like it provides a handy package for exposing a Linux desktop via a browser based user interface [video demo].

Very nice… Which got me wondering: can we run guacamole inside a container, alongside an X11-producing app, to expose that app?

Via the Dockerfile referenced in Digikam on Mac OS/X or how to use docker to run a graphical app on Mac OS/X I tracked down linuxserver/dockergui, a Docker image that “makes it possible to use any X application on a headless server through a modern web browser such as chrome”.

Exciting:-) [UPDATE: note that that image uses an old version of the guacamole packages; I tried updating to the latest versions of the packages but it doesn’t just work, so I rolled back. Support for Docker was introduced after the guacamole version used in linuxserver/dockergui, but I don’t fully understand what that support does! Ideally, it’d be nice to run a guacamole container and then use docker-compose to link in the applications you want to expose through it? Is that possible? Anyone got an example of how to do it?]

So I gave it a go with Audacity. The files I used are contained in this gist that should also be embedded at the bottom of this post.

(Because the original linuxserver/dockergui was quite old, I downloaded their source files and built a current one of my own to seed my Audacity container.)

Building the Audacity container with:

docker build -t psychemedia/audacitygui .

and then running it with:

docker run -d -p 8080:8080 -p 3389:3389 -e "TZ=Europe/London" --name AudacityGui psychemedia/audacitygui

this is what pops up in the browser:

Guacamole_0_9_6

If we click through on the app, we’re presented with a launcher:

Audacity

Select the app, and Hey, Presto!, Audacity appears…

Audacity1

It seems to work, too… Create a chirp, and then analyse it:

Audacity2

We seem to be able to load and save files in the nobody directory:

Audacity3

I tried exposing the /nobody folder by adding VOLUME /nobody to the Dockerfile and running a -v "${PWD}/files":/nobody switch, but it seemed to break things, which is presumably a permissions thing? There are various user role settings in the linuxserver/dockergui build files, so maybe poking around with those would fix things? Otherwise, we might have to seed the container directly with any files we want in it?:-(

UPDATE: adding RUN mkdir -p /audacityfiles && adduser nobody root to the Dockerfile along with VOLUME /audacityfiles, and then adding -v "${PWD}/files":/audacityfiles when I create the container, allows me to share files in to the container, but I can’t seem to save to the folder? Nor do variations on the theme work, such as creating a subfolder in the nobody folder and giving it the same ownership and permissions as the nobody folder. I can save into the nobody folder though. (Just not share it?)

WORKAROUND: in Kitematic, the Exec button on the container view toolbar takes you into the container shell. From there, you can copy files into the shared directory. For example: cp /nobody/test.aup /nobody/share/test.aup. Moving the share to a folder outside /nobody, eg to /audacityfiles, means we can simply copy everything from /nobody to /audacityfiles.

Another niggle is with the sound – one of the reasons I tried the Audacity app… (If we can have the visual of the desktop, we want to try to push for sound too, right?!)

Unfortunately, when I tried to play the audio file I’d created, it wasn’t having any of it:

Audacity4

Looking at the log file of the container launch in Kitematic, it seems that ALSA (the Advanced Linux Sound Architecture project) wasn’t happy?

alsa_no

I suspect trying to fix this is a bit beyond my ken, as too is sorting out the shared folder permissions… (I don’t really do sysadmin – which is why I like the idea of ready-to-run application containers.)

UPDATE 2: using a different build of the image – hurricane/dockergui:x11rdp1.3, from the linuxserver/dockergui x11rdp1.3 branch – audio does work, though at times it seemed to struggle a bit. I still can’t save files to the shared folder though:-(

UPDATE 3: I pushed an image as psychemedia/audacity2. It works from the command line as:
docker run -d -p 8080:8080 -p 3389:3389 -e "TZ=Europe/London" --name AudacityGui -v "${PWD}":/nobody/share psychemedia/audacity2

Anyway – half-way there. If nothing else, we could create and analyse audio files visually in the browser using Audacity, even if we can’t get hold of those audio files or play them!

I’d hope there was a simple permissions fix to get the file sharing to work (anyone? anyone?! ;-) but I suspect the audio bit might be a little bit harder? But if you know of a fix, please let me know:-)

PS I just tried launching psychemedia/audacity2 public Dockerhub image via Docker Cloud, and it seemed to work…


Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers

With its focus on enterprise use, it’s probably with good reason that the Docker folk aren’t that interested in exploring the role that Docker may have to play as a technology that supports the execution of desktop applications, or at least, applications for desktop users. (The lack of significant love for Kitematic seems to be representative of that.)

But I think that’s a shame; because for educational and scientific/research applications, docker can be quite handy as a way of packaging software that ultimately presents itself using a browser based user interface delivered over http, as I’ve demonstrated previously in the context of Jupyter notebooks, OpenRefine, RStudio, R Shiny apps, linked applications and so on.

I’ve also shown how we can use Docker containers to package applications that offer machine services via an http endpoint, such as Apache Tika.

I think this latter use case shows how we can start to imagine things like a “digital humanities application shelf” in a digital library (fragmentary thoughts on this), that allows users to take either an image of the application off the shelf (where an image is a thing that lets you fire up a pristine instance of the application), or a running instance of the application off the shelf. (Furthermore, the application can be run locally, on your own desktop computer, or in the cloud, for example using something like a mybinder-like service.) The user can then use the application directly (if it has a browser based UI), or call on it from elsewhere (eg in the case of Apache Tika). Once they’re done, they can keep a copy of whatever files they were working with and destroy their running version of the application. If they need the application again, they can just pull a new copy (of the latest version of the app, or the version they used previously) and fire up a new instance of it.

Another way of using Docker came to mind over the weekend when I saw a video demonstrating the use of the contentmine scientific literature analysis toolset. The contentmine installation instructions are a bit of a fiddle for the uninitiated, so I thought I’d try to pop them into a container. That was easy enough (for a certain definition of easy – it was a faff getting node to work and npm to be found, the Java requirements took a couple of goes, and I’m sure the image is way bigger than it really needs to be…), as the Dockerfile below/in the gist shows.

But the question then was how to access the tools? The tools themselves are commandline apps, so the first thing we want to do is to be able to call into the container to run the command. A handy post by Mike English entitled Distributing Command Line Tools with Docker shows how to do this, so that’s all good then…

The next step is to consider how to retain copies of the files created by the command line apps, or pass files to the apps for processing. If we have a target host directory and mount it into the container as a shared volume, we can keep the files on our desktop or allow the container to create files into the host directory. Then they’ll be accessible to us all the time, even if we destroy the container.

The gist that should be embedded below shows the Dockerfile and a simple batch file that passes the Contentmine tool commands into the container, which then executes them. The batch file idea could be further extended to produce a set of command shortcuts that essentially alias the Contentmine commands (eg a ./getpapers command rather than a ./contentmine getpapers command), or that combine the various steps associated with a particular pipeline or workflow – getpapers/norma/cmine, for example – into a single command. A Python-flavoured sketch of the same wrapper idea is given below.
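By way of illustration, here’s a minimal sketch of the sort of wrapper I have in mind, written in Python rather than as a batch file so the same script works across platforms; the psychemedia/contentmine image name and the /workspace mount point are illustrative assumptions rather than the names used in the gist:

#!/usr/bin/env python
#contentmine.py - pass a contentmine command into a disposable container
#Usage example: python contentmine.py getpapers -q "internet of things" -o iot
import os
import subprocess
import sys

IMAGE = 'psychemedia/contentmine'  #hypothetical image name

cmd = ['docker', 'run', '--rm',
       '-v', os.getcwd() + ':/workspace',  #share the current directory with the container
       '-w', '/workspace',                 #run the tool from the shared directory
       IMAGE] + sys.argv[1:]               #eg getpapers / norma / cmine arguments

subprocess.call(cmd)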

UPDATE: the CenturyLinkLabs DRAY docker pipeline looks interesting in this respect for sequencing a set of docker containers and passing the output of one as the input to the next.

If there are other folk out there looking at using Docker specifically for self-managed “pull your own container” individual desktop/user applications, rather than as a devops solution for deploying services at scale, I’d love to chat…:-)

PS for several other examples of using Docker for desktop apps, including accessing GUI based apps using X Windows / X11, see Jessie Frazelle’s post Docker Containers on the Desktop.

PPS See also More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole for an attempt at exposing a GUI based app, such as Audacity, running in a container via a browser. Note that I couldn’t get a shared folder or the audio to work, although the GUI bit did…

PPPS I wondered how easy it would be to run command-line containers from within a Jupyter notebook itself running inside a container, but got stuck. Related question on Stack Overflow here.

The way the rest of this post is published is something of an experiment – everything below the line is pulled in from a gist using the WordPress gist-embedding shortcode…


Harvesting Searched for Tweets Using Python

Via Tanya Elias/eliast05, a query regarding tools for harvesting historical tweets. I haven’t been keeping track of Twitter related tools over the last few years, so my first thought is often “could Martin Hawksey’s TAGSexplorer do it?“!

But I’ve also had the twecoll Python/command line package on my ‘to play with’ list for a while, so I thought I’d give it a spin. Note that the code requires python to be installed (which it will be, by default, on a Mac).

On the command line, something like the following should be enough to get you up and running if you’re on a Mac (run the commands in a Terminal, available from the Utilities folder in the Applications folder). If wget is not available, download the twecoll file to the twitterstuff directory, and save it as twecoll (no suffix).

#Change directory to your home directory
$ cd

#Create a new directory - twitterstuff - in your home directory
$ mkdir twitterstuff

#Change directory into that directory
$ cd twitterstuff

#Fetch the twecoll code
$ wget https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
--2016-05-02 14:51:23--  https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
Resolving raw.githubusercontent.com... 23.235.43.133
Connecting to raw.githubusercontent.com|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31445 (31K) [text/plain]
Saving to: 'twecoll'
 
twecoll                    100%[=========================================>]  30.71K  --.-KB/s   in 0.05s  
 
2016-05-02 14:51:24 (564 KB/s) - 'twecoll' saved [31445/31445]

#If you don't have wget installed, download the file from:
#https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
#and save it in the twitterstuff directory as twecoll (no suffix)

#Show the directory listing to make sure the file is there
$ ls
twecoll

#Change the permissions on the file to 'user - executable'
$ chmod u+x twecoll

#Run the command file - the ./ reads as: 'in the current directory'
$ ./twecoll tweets -q "#lak16"

Running the code the first time prompts you for some Twitter API credentials (follow the guidance on the twecoll homepage), but this only needs doing once.

Testing the app, it works – tweets are saved as a text file in the current directory with an appropriate filename and suffix .twt – BUT the search doesn’t go back very far in time. (Is the Twitter search API crippled then…?)

Looking around for an alternative, I found the GetOldTweets python script, which again can be run from the command line; download the zip file from Github, move it into the twitterstuff directory, and unzip it. On the command line (if you’re still in the twitterstuff directory), run:

ls

to check the name of the folder (something like GetOldTweets-python-master) and then cd into it:

cd GetOldTweets-python-master/

to move into the unzipped folder.

Note that I found I had to install pyquery to get the script to run; on the command line, run: easy_install pyquery.

This script does not require credentials – instead it scrapes the Twitter web search. Data limits for the search can be set explicitly.

python Exporter.py --querysearch '#lak15' --since 2015-03-10 --until 2015-09-12 --maxtweets 500

Tweets are saved into the file output_got.csv and are semicolon delimited.
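Here’s a minimal sketch of pulling that output file into a pandas dataframe for onward analysis, assuming the default filename and semicolon separator mentioned above:

import pandas as pd

#Default GetOldTweets output file; semicolon separated
tweets = pd.read_csv('output_got.csv', sep=';', encoding='utf-8')
tweets.head()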

A couple of things I noticed with this script: it’s slow (because it “scrolls” through pages and pages of Twitter search results, which only have a small number of results on each) and on occasion it seems to hang (I’m not sure if it gets stuck in an infinite loop; on a couple of occasions I used ctrl-z to break out). In such a case, it doesn’t currently save results as you go along, so you have nothing; reduce the --maxtweets value, and try again. On occasion, when running the script under the default Mac python 2.7, I also noticed that there may be encoding issues in tweets which break the output, so again the file doesn’t get written.

Both packages run from the command line, or can be scripted from a Python programme (though I didn’t try that). If the GetOldTweets-python package can be tightened up a bit (eg in respect of UTF-8/tweet encoding issues, which are often a bugbear in Python 2.7), it looks like it could be a handy little tool. And for collecting stuff via the API (which requires authentication), rather than by scraping web results from advanced search queries, twecoll looks as if it could be quite handy too.

When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor

Although not necessarily the best way of publishing data, data tables in PDF documents can often be extracted quite easily, particularly if the tables are regular and the cell contents reasonably well spaced.

For example, official timing sheets for F1 races are published by the FIA as event and timing information in a set of PDF documents containing tabulated timing data:

R_-_Best_Sector_Times_pdf__1_page_

In the past, I’ve written a variety of hand crafted scrapers to extract data from the timing sheets, but the regular way in which the data is presented in the documents means that they are quite amenable to scraping using a PDF table extractor such as Tabula. Tabula exists both as a server application, accessed via a web browser, and as a service using the tabula-extractor Java application.

I don’t recall how I came across it, but the tabulizer R package provides a wrapper for tabula-extractor (bundled within the package), that lets you access the service via its command line calls. (One dependency you do need to take care of is to have Java installed; adding Java into an RStudio docker container would be one way of taking care of this.)

Running the default extractor command on the above PDF pulls out the data of the inner table:

extract_tables('Best Sector Times.pdf')

fia_pdf_sector_extract

Where the data is spread across multiple pages, you get a data frame per page.

R_-_Lap_Analysis_pdf__page_3_of_8_

Note that the headings for the distinct tables are omitted. Tabula’s “table guesser” identifies the body of the table, but not the spanning column headers.

The default settings are such that tabula will try to scrape data from every page in the document.

fia_pdf_scrape2

Individual pages, or sets of pages, can be selected using the pages parameter. For example:

  • extract_tables('Lap Analysis.pdf', pages=1)
  • extract_tables('Lap Analysis.pdf', pages=c(2,3))

Specified areas for scraping can also be specified using the area parameter:

extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 230, 500)))

The area parameter originally appeared to take co-ordinates in the form top, left, width, height, but is now fixed to take co-ordinates in the same form as those produced by the tabula app debug console: top, left, bottom, right.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.

Select_Tables___Tabula_concole

The tabula console output gives co-ordinates in the form top, left, bottom, right, so these values can now be passed directly as the arguments to the tabulizer area parameter.

fia_pdf_head_scrape

Using a combination of “guess” to find the dominant table, and specified areas, we can extract the data we need from the PDF and combine it to provide a structured and clearly labeled dataframe.

On my to do list: add this data source recipe to the Wrangling F1 Data With R book…