So I finally got round to pushing a revised (and typo corrected!) version of Wrangling F1 Data With R: A Data Junkie’s Guide, that also includes a handful of new sections and chapters, including descriptions of how to detect undercuts, the new style race history chart that shows the on-track position of each driver for each lap of a race relative to the lap leader, and a range of qualifying session analysis charts that show the evolution of session cut off times and drivers’ personal best times.
Code is described for every data manipulation and chart that appears in the book, along with directions for how to get hold of (most of) the data required to generate the charts. (Future updates will cover some of the scraping techniques required to get hold of the rest of it!)
As well as the simple book, there’s also a version bundled with the R code libraries that are loaded in as a short-cut in many of the chapters.
The book is published on Leanpub, which means you can access several different electronic versions of the book, and once you’ve bought a copy, you get access to any future updates for no further cost…
There is a charge on the book, with a set minimum price, but you also have the freedom to pay more! Any monies received for this book go to cover costs (I’ve started trying to pay for the webservices I use, rather than just keep using their free plan). If the monthly receipts bump up a little, I’ll try to get some services that generate some of the charts interactively hosted somewhere…
Part of the vision behind the Jupyter notebook ecosystem seems to be the desire to create a literate computing infrastructure that supports “the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components” (Fernando Perez, “Literate computing” and computational reproducibility: IPython in the age of data-driven journalism, 19/4/13).
The notebook approach complements other live document approaches such as the use of Rmd in applications such as RStudio, providing an interactive, editable rendered view of the live document, including inlined outputs, rather than just the source code view.
Notebooks don’t just have to be used for analysis though. A few months ago, I spotted a notebook being used to configure a database system, db-introspection-notebook – my gut reaction to which was to ponder Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?. (A problem with that approach, of course, is that it requires notebook machinery to get started, whereas you might typically want to run configuration scripts in as bare bones a system as possible.)
[a] web server that supports different mechanisms for spawning and communicating with Jupyter kernels, such as:
- A Jupyter Notebook server-compatible HTTP API for requesting kernels and talking the Jupyter kernel protocol with them over Websockets
- A[n] HTTP API defined by annotated notebook cells that maps HTTP verbs and resources to code to execute on a kernel
Tooling to support the creation of a literate API then, that fully respects Fernando Perez’ description of literate computing?!
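As a rough sketch of what one of those annotated cells might look like: as I understand it, a cell in a notebook-defined API starts with a comment naming the HTTP verb and resource, the incoming request is handed to the cell as a JSON string in a variable called REQUEST, and whatever the cell prints is used as the response body. The /hello/:person route, the hello_handler helper and the simulated REQUEST value are all my own illustrative assumptions, added so the cell can also be run outside the gateway.

```python
# GET /hello/:person
import json


def hello_handler(request_json):
    """Build a JSON response body from the JSON-encoded request."""
    request = json.loads(request_json)
    # Path parameters (the ":person" part of the route) are assumed to
    # arrive under the "path" key of the request.
    person = request.get("path", {}).get("person", "stranger")
    return json.dumps({"greeting": "hello " + person})


# Outside the gateway there is no REQUEST variable, so simulate one.
# (This fallback is illustrative only.)
if "REQUEST" not in globals():
    REQUEST = json.dumps({"path": {"person": "world"}, "args": {}, "body": ""})

# Whatever the cell prints is treated as the response body.
print(hello_handler(REQUEST))
```

Keeping the logic in a small function, rather than inline in the cell, also means it can be tested without having to start a gateway at all.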
At first glance it looks like all the API functions need to be defined within a single notebook – the notebook run by the kernel gateway. But another Jupyter project in incubation allows notebooks to be imported into other notebooks, as this demo shows: Notebooks as Reusable Modules and Cookbooks. Which means that a parent API defining notebook could pull in dependent child notebooks that each define a separate API call.
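To get a feel for the general idea (this is not the incubating project's actual API, just a minimal sketch of the principle), a notebook file is a JSON document, so a parent could in principle pull the code cells out of a child notebook and execute them into a module-like namespace. The load_notebook_as_module name and the toy child notebook are hypothetical.

```python
import json
import types


def load_notebook_as_module(notebook_json, module_name="child_notebook"):
    """Run a notebook's code cells in a fresh module and return that module."""
    notebook = json.loads(notebook_json)
    module = types.ModuleType(module_name)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown (narrative) cells
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        exec(source, module.__dict__)
    return module


# A toy child notebook, written out as the JSON a .ipynb file would contain.
child = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# A child notebook"},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["def lap_delta(lap_time, leader_time):\n",
                    "    return lap_time - leader_time\n"]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 0,
})

api_parts = load_notebook_as_module(child)
print(api_parts.lap_delta(92.5, 91.0))
```

Note that this only copes with plain Python in the code cells; anything using IPython magics would need rather more machinery.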
And because the Jupyter server can talk to a wide range of language kernels, this means the API can be implemented using an increasing range of languages (though I think that all the calls will need to be implemented using the same language kernel?). Indeed, the demo code has notebooks showing how to define notebook powered APIs in python and R.
The news today was led in part by a story broken by the BBC and BuzzFeed News – The Tennis Racket – about match fixing in Grand Slam tennis tournaments. (The BBC contribution seems to have been done under the ever listenable File on Four: Tennis: Game, Set and Fix?)
One interesting feature of this story was that “BuzzFeed News began its investigation after devising an algorithm to analyse gambling on professional tennis matches over the past seven years”, backing up evidence from leaked documents with “an original analysis of the betting activity on 26,000 matches”. (See also: How BuzzFeed News Used Betting Data To Investigate Match-Fixing In Tennis, and an open access academic paper that inspired it: Rodenberg, R. & Feustel, E.D. (2014), Forensic Sports Analytics: Detecting and Predicting Match-Fixing in Tennis, The Journal of Prediction Markets, 8(1).)
Feature detecting algorithms such as this (where the feature is an unusual betting pattern) are likely to play an increasing role in the discovery of stories from data, step 2 in the model described in this recent Tow Center for Digital Journalism Guide to Automated Journalism.
Another interesting aspect of the story behind the story was the way in which BuzzFeed News opened up the analysis they had applied to the data. You can find it described on Github – Methodology and Code: Detecting Match-Fixing Patterns In Tennis – along with the data and a Jupyter notebook that includes the code used to perform the analysis: Data and Analysis: Detecting Match-Fixing Patterns In Tennis.
You can even run the notebook to replicate the analysis yourself, either by downloading it and running it using your own Jupyter notebook server, or by using the online mybinder service: run the tennis analysis yourself on mybinder.org.
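To give a flavour of what a feature detector of this sort might look like, here's a minimal sketch that flags matches where the implied win probability shifts a lot between the opening and closing odds. This is not BuzzFeed's code: the field names, the made-up data and the 10 percentage point threshold are all assumptions of mine, and the sketch ignores things like the bookmaker's margin.

```python
def implied_probability(decimal_odds):
    """Convert decimal betting odds into an implied win probability."""
    return 1.0 / decimal_odds


def flag_unusual_movement(matches, threshold=0.10):
    """Return ids of matches whose implied probability moved by more than threshold."""
    flagged = []
    for match in matches:
        opening = implied_probability(match["opening_odds"])
        closing = implied_probability(match["closing_odds"])
        if abs(closing - opening) > threshold:
            flagged.append(match["match_id"])
    return flagged


# Made-up example data (not taken from any real match).
matches = [
    {"match_id": "A", "opening_odds": 2.0, "closing_odds": 1.9},
    {"match_id": "B", "opening_odds": 2.0, "closing_odds": 1.25},
]
print(flag_unusual_movement(matches))
```

A flag like this is only a prompt for further digging, of course, not evidence of anything in itself.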
(I’m not sure if the BuzzFeed or BBC folk tried to do any deeper analysis, for example poking into point summary data as captured by the Tennis Match Charting Project? See also this Tennis Visuals project that makes use of the MCP data. Tennis betting data is also collected here: tennis-data.co.uk. If you’re into the idea of analysing tennis stats, this book is one way in: Analyzing Wimbledon: The Power Of Statistics.)
So what are these notebooks anyway? They’re magic, that’s what!:-)
The Jupyter project is an evolution of an earlier IPython (interactive Python) project that included a browser based notebook style interface for allowing users to write and execute code, as well as seeing the result of executing the code, a line at a time, all in the context of a “narrative” text document. The Jupyter project funding proposal describes it thus:
[T]he core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.
[C]omputation in science is ultimately in service of a result that needs to be woven into the bigger narrative of the questions under study: that result will be part of a paper, will support or contest a theory, will advance our understanding of a domain. And those insights are communicated in papers, books and lectures: narratives of various formats.
The problem the Jupyter project tackles is precisely this intersection: creating tools to support in the best possible ways the computational workflow of scientific inquiry, and providing the environment to create the proper narrative around that central act of computation. We refer to this as Literate Computing, in contrast to Knuth’s concept of Literate Programming, where the emphasis is on narrating algorithms and programs. In a Literate Computing environment, the author weaves human language with live code and the results of the code, and it is the combination of all that produces a computational narrative.
At the heart of the entire Jupyter architecture lies the idea of interactive computing: humans executing small pieces of code in various programming languages, and immediately seeing the results of their computation. Interactive computing is central to data science because scientific problems benefit from an exploratory process where the results of each computation inform the next step and guide the formation of insights about the problem at hand. In this Interactive Computing focus area, we will create new tools and abstractions that improve the reproducibility of interactive computations and widen their usage in different contexts and audiences.
The Jupyter notebooks include two types of interactive cell – editable text cells into which you can write simple markdown and HTML text that will be rendered as text; and code cells into which you can write executable code. Once executed, the results of that execution are displayed as cell output. Note that the output from a cell may be text, a datatable, a chart, or even an interactive map.
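Under the hood, a notebook file is just a JSON document containing a list of cells. As a minimal sketch (the cell contents here are made up for illustration), a notebook with one markdown cell and one executed code cell looks something like this:

```python
import json

# A notebook is a JSON document: a list of cells plus some metadata.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Lap times\nSome *narrative* text.",
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": "1 + 1",
            # Outputs are stored alongside the code that produced them.
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": "2"},
                    "metadata": {},
                }
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}


def count_cells(nb):
    """Count how many cells of each type a notebook contains."""
    counts = {}
    for cell in nb["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts


print(count_cells(notebook))

# Serialising the dict gives the sort of text you'd find in an .ipynb file.
ipynb_text = json.dumps(notebook, indent=1)
```

Richer outputs such as charts or maps are stored in the same way, as additional MIME-typed entries in a cell's outputs.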
There are multiple ways of running Jupyter notebooks, including the mybinder approach described above; I describe several of them in the post Seven Ways of Running IPython Notebooks.
As well as having an important role to play in reproducible data journalism and reproducible (scientific) research, notebooks are also a powerful, and expressive, medium for teaching and learning. For example, we’re just about to start using Jupyter notebooks, delivered via a virtual machine, for the new OU course Data management and analysis.
We also used them in the FutureLearn course Learn to Code for Data Analysis, showing how code could be used a line at a time to analyse a variety of opendata sets from sources such as the World Bank Indicators database and the UN Comtrade (import /export data) database.
PS for sports data fans, here’s a list of data sources I started to compile a year or so ago: Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post).
One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.
(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the wayside and I suspect the Jupyter container (IPython notebook server, as was) probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-( )
Anyway, the IBM Data Scientist Workbench now sits alongside things like KMi’s longstanding KMi Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes I know there are cost implications…. but, erm, sponsorship, cough… get-started tokens then PAYG, cough…
It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:
The toolbar also suggests that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is on the cards.
The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…
So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)
See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.
So have you been looking for something like RStudio, but for Python?
It’s been out for some time, but a recently updated release of Rodeo gives an increasingly workable RStudio-like environment for Python users.
The layout resembles the RStudio layout – file editor top left, interactive console bottom left, variable inspector and history top right, charts, directory view and plugins bottom right. (For plugins, read: packages).
The preferences panel lets you set the initial working directory as well as the path to the required python executable.
Code selected in the file editor can be run in the console. Charts can be generated using matplotlib and are displayed in the chart view area bottom right.
As with RStudio, you can write reproducible research documents that blend markdown and code and render the result as HTML or PDF.
As you might expect, charts can be embedded as outputs in the document too.
Whilst the first version of Rodeo was a flask app viewable via a browser, and installable via pip, the latest version is an electron app, like RStudio. I found the ability to run Rodeo directly in the browser really useful, but the RStudio folks appear to have found a way of running RStudio via a browser using their RStudio server, so I’m hoping there’ll also be an open source version of Rodeo server available too?
One thing I’m wondering is whether Rodeo is a front end that can run against other Jupyter kernels? I notice that there is already a branch on the Rodeo github repo called r-backend, for example…?
Another thing I haven’t really clarified for myself is the difference between authoring (and teaching/learning) using the “Rmd/knitr” RStudio/Rodeo style workflow, and authoring in Jupyter notebooks. Notebook extensions are available that can suppress cell output etc to provide some level of control over what gets rendered from a notebook used as an authoring environment. I guess what I’d like for Jupyter notebooks is a simple dropdown that lets me specify the equivalent of knitr text result options that control how code cells are rendered in an output document.
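By way of a sketch of the sort of thing I mean: suppose each code cell carried knitr-like options in its metadata, and a renderer consulted them when deciding what to emit. The "render" metadata key and its echo/results settings are entirely hypothetical (they're modelled loosely on the knitr chunk options and are not a Jupyter standard), as is the render_code_cell function.

```python
def render_code_cell(cell):
    """Return the pieces of text a code cell contributes to an output document."""
    # "render" is a hypothetical metadata key, not part of the notebook format.
    options = cell.get("metadata", {}).get("render", {})
    echo = options.get("echo", True)          # show the source code?
    results = options.get("results", "show")  # "show" or "hide" the outputs
    pieces = []
    if echo:
        pieces.append(cell["source"])
    if results != "hide":
        for output in cell.get("outputs", []):
            pieces.append(output.get("text", ""))
    return pieces


# A made-up cell that asks for its code to be hidden but its output shown.
cell = {
    "cell_type": "code",
    "source": "print(6 * 7)",
    "outputs": [{"output_type": "stream", "name": "stdout", "text": "42\n"}],
    "metadata": {"render": {"echo": False}},
}
print(render_code_cell(cell))
```

The dropdown I'm imagining would then just be a friendlier way of setting those per-cell options.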
And if you do prefer the notebook route, here are Seven Ways of Running IPython / Jupyter Notebooks.
Via RBloggers, I spotted this post on Deploying Your Very Own Shiny Server. I’ve been toying with the idea of running some of my own Shiny apps, so that post provided a useful prompt, though way too involved for me;-)
So here’s what seems to me to be an easier, rather more pointy-clicky, wiring stuff together way using Docker containers (though it might not seem that much easier to you the first time through!). The recipe includes: github, Dockerhub, Tutum and Digital Ocean.
To begin with, I created a minimal shiny app to allow the user to select a CSV file, upload it to the app and display it. The ui.R and server.R files, along with whatever else you need, should be placed into an application directory, for example shiny_demo within a project directory, which I’m confusingly also calling shiny_demo (I should have called it something else to make it a bit clearer – for example, shiny_demo_project.)
The shiny server comes from a prebuilt docker container on dockerhub – rocker/shiny.
This shiny server can run several Shiny applications, though I only want to run one: shiny_demo.
I’m going to put my application into its own container. This container will use the rocker/shiny container as a base, and simply copy my application folder into the shiny server folder from which applications are served. My Dockerfile is really simple and contains just two lines – it looks like this and goes into a file called Dockerfile in the project directory:
FROM rocker/shiny
ADD shiny_demo /srv/shiny-server/shiny_demo
The ADD command simply copies the contents of the child directory into a similarly named directory in the container’s /srv/shiny-server/ directory. You could add as many applications as you wanted to the server as long as each is in its own directory. For example, if I have several applications:
I can add the second application to my container using:
ADD shiny_other_demo /srv/shiny-server/shiny_other_demo
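If you end up with lots of applications, you could even generate the Dockerfile rather than writing the ADD lines by hand. The following is just a sketch: make_dockerfile is a hypothetical helper of my own, and it assumes each application sits in its own directory directly inside the project directory.

```python
def make_dockerfile(app_dirs, base_image="rocker/shiny", serve_root="/srv/shiny-server"):
    """Generate Dockerfile text with one ADD line per application directory."""
    lines = ["FROM " + base_image]
    for app in app_dirs:
        # Copy each application directory into a same-named directory
        # under the folder the shiny server serves applications from.
        lines.append("ADD {app} {root}/{app}".format(app=app, root=serve_root))
    return "\n".join(lines) + "\n"


print(make_dockerfile(["shiny_demo", "shiny_other_demo"]))
```

The printed text could then be saved as the Dockerfile in the project directory.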
The next thing I need to do is check-in my shiny_demo project into Github. (I don’t have a how to on this, unfortunately…) In fact, I’ve checked my project in as part of another repository (docker-containers).
I can then create an Automated Build that will build a container image from my Github repository. First, identify the repository on my linked Github account and name the image:
Then add the path to the project directory that contains the Dockerfile for the image you’re interested in:
Click on Trigger to build the image the first time. In the future, every time I update that folder in the repository, the container image will be rebuilt to include the updates.
So now I have a Docker container image on Dockerhub that contains the Shiny server from the rocker/shiny image and a copy of my shiny application files.
Now I need to go to Tutum (also part of the Docker empire), which is an application for launching containers on a range of cloud services. If you link your Digital Ocean account to tutum, you can use tutum to launch docker containers on Dockerhub on a Digital Ocean droplet.
Within tutum, you’ll need to create a new node cluster on Digital Ocean:
(Notwithstanding the below, I generally go for a single 4GB node…)
Now we need to create a service from a container image:
I can find the container image I want to deploy on the cluster that I previously built on Dockerhub:
Select the image and then configure it – you may want to rename it, for example. One thing you definitely need to do though is tick to publish the port – this will make the shiny server port visible on the web.
Create and deploy the service. When the container is built, and has started running, you’ll be told where you can find it.
Note that if you click on the link to the running container, the default URL starts with tcp:// which you’ll need to change to http://. The port will be dynamically allocated unless you specified a particular port mapping on the service creation page.
To view your shiny app, simply add the name of the folder the application is in to the URL.
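If you find yourself doing that URL fiddling a lot, it's easy enough to script. Here's a minimal sketch; the hostname and port are made-up placeholders rather than anything a real service gave me, and shiny_app_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def shiny_app_url(endpoint, app_folder):
    """Swap the tcp:// scheme for http:// and append the app folder as the path."""
    parts = urlsplit(endpoint)
    path = "/" + app_folder.strip("/") + "/"
    # Keep the host and port from the original endpoint as they are.
    return urlunsplit(("http", parts.netloc, path, "", ""))


# The endpoint below is a placeholder, not a real service address.
print(shiny_app_url("tcp://shinydemo.example.com:49153", "shiny_demo"))
```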
When you’ve finished running the app, you may want to shut the container down – and more importantly perhaps, switch the Digital Ocean droplet off so you don’t continue paying for it!
As I said at the start, the first time round seems quite complicated. After all, you need to:
- create a Github account
- create a Dockerhub account
- link your Github account to your Docker account
- create a Digital Ocean account [affiliate link: sign up to Digital Ocean and get $10 credit, and they’ll tip me some credit too…]
- create a tutum account
- link your Digital Ocean account to your tutum account
(Actually, you can miss out the dockerhub steps, and instead link your github account to your tutum account and do the automated build from the github files within tutum: Tutum automatic image builds from GitHub repositories. The service can then be launched by finding the container image in your tutum repository)
However, once you do have your project files in github, you can then easily update them and easily launch them on Digital Ocean. In fact, you can make it even easier by adding a deploy to tutum button to a project README.md file in Github.
PS to test the container locally, I launch a docker terminal from Kitematic, cd into the project folder, and run something like:
docker build -t psychemedia/shinydemo .
docker run --name shinydemo -i -t psychemedia/shinydemo
I can then set the port map and find a link to the server from within Kitematic.
Prompted by a joint course/module team to look at options surrounding a “virtual computing lab” to support a couple of new level 1 (first year equivalent) IT and computing courses (they should know better?!;-), I had another scout around and came across SageMathCloud, which looks at first glance to be just magical:-)
An open source, cloud hosted system [code], the free plan allows users to log in with social media credentials and create their own account space:
Once you’re in, you have a project area in which you can define different projects:
I’m guessing that projects could be used by learners to split out different projects with a course, or perhaps use a project as the basis for a range of activities within a course.
Within a project, you have a file manager:
The file manager provides a basis for creating application-linked files; of particular interest to me is the ability to create Jupyter notebooks…
Notebook files are opened into a tab. Multiple notebooks can be open in multiple tabs at the same time (though this may start to hit performance from the server? pandas dataframes, for example, are held in memory, and the SMC default plan could mean memory limits get hit if you try to hold too much data in memory at once?).
Notebooks are autosaved regularly – and a time slider that allows you to replay and revert to a particular version is available, which could be really useful for learners? (I’m not sure how this works – I don’t think it’s a standard Jupyter offering? I also imagine that the state of the underlying Python process gets dislocated from the notebook view if you revert? So cells would need to be rerun?)
Several users can collaborate on a project. I created another me by creating an account using a different authentication scheme (which leads to a name clash – and I think an email clash – but SMC manages to disambiguate the different identities).
As soon as a collaborator is added to a project, they share the project and the files associated with the project.
Live collaborative editing is also possible. If one me updates a notebook, the other me can see the changes happening – so a common notebook file is being updated by each client/user (I was typing in the browser on the right with one account, and watching the live update in the browser on the left, authenticated using a different account).
Real-time chatrooms can also be created and associated with a project – they look as if they might persist the chat history too?
The SageMathCloud environment seems to have been designed by educators for educators. A project owner can create a course around a project and assign students to it.
(It looks as if students can’t be collaborators on a project, so when I created a test course, I uncollaborated with my other me and then added my other me as a student.)
A course folder appears in the project area of the student’s account when they are enrolled on a course. A student can add their own files to this folder, and these can be inspected by the course administrator.
A course administrator can also add one or more of their other project folders, by name, as assignment folders. When an assignment folder is added to a course and assigned to a student, the student can see that folder, and its contents, in their corresponding course folder, where they can then work on the assignment.
The course administrator can then collect a copy of the student’s assignment folder and its contents for grading.
The marker opens the folder collected from the student, marks it, and may add feedback as annotations to the notebook files, returning the marked assignment back to the student – where it appears in another “graded” folder, along with the grade.
At first glance, I have to say I find this whole thing pretty compelling.
In an OU context, it’s easy enough imagining that we might sign up a cohort of students to a course, and then get them to add their tutor as a collaborator who can then comment – in real time – on a notebook.
A tutor might also hold a group tutorial by creating their own project and then adding their tutor group students to it as collaborators, working through a shared notebook in real time as students watch on in their own notebooks, perhaps making direct contributions back in response to a question from the tutor.
(I don’t think there is an audio channel available within SMC, so that would have to be managed separately? [UPDATE: seems there is some audio support – via William Stein, “if you click on the chat to the right of most file types (e.g., make a .md file), then there is a video camera, and if you click on that, you can broadcast yourself to other viewers of the file”.])
So what else would be nice? I’ve already mentioned audio collaboration, though that’s not essential and could be easily managed by other means.
For a course like TM351, it would be nice to be able to create a composition of linked applications within a project – for example, it would be nice to be able to start a PostgreSQL or MongoDB server linked to the Jupyter server so that notebooks could interact directly with a DBMS within a project or course setting. I also note that the IPython kernel being used appears to be the 2.7 version, and wonder how easy it is to tweak the settings on the back-end, or via an administration panel somewhere, to enable other Jupyter kernels?
I also wonder how easy it would be to add in other applications that are viewable through a browser, such as OpenRefine or RStudio?
In terms of how the backend works, I wonder if the Sandstorm.io encapsulation would be useful (eg in context of Why doesn’t Sandstorm just run Docker apps?) compared to a simpler docker container model, if that indeed is what is being used?