Archive for the ‘Infoskills’ Category
By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).
The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.
As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks, in which we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week over the course of a few hours to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.
To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form what I’ve started referring to as a conversation with data.
In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the data set or datasets selected for the conversation, and may be transferrable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of a chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).
If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.
Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:
– downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).
The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.
Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into an out of an appropriately chosen database management system, for example.
Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).
The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.
Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012″) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
In this respect, something that’s been on my to do list for way to long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).
Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.
For the longer term, on my “to ponder” list is what might something like “The Grammar of Graphics” be for data textualisation? (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)
For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?
I think I perhaps need to mock-up some examples to see if anything comes to mind and that the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!
If you’ve ever had to draw “blocks and arrows” diagrams, you’ll know how irritating it can be if you spend hours laying out the diagram using a presentation editor or drawing tool, only to find you need to edit the drawing, add another box, and lay the whole thing out again.
Surely there must be a better way?
Let’s just think about what a box and arrow diagram is intended to show: when describing a process, the connections typically represent a flow from one thing to another; furthermore, the layout is often rectilinear, laid out along straight lines, the boxes tidily spaced and their edges lined up with each other. In a diagram such as a mindmap, different ideas or concepts are related to each other by drawing a line between them and the layout may be more fluid, with like or related concepts grouped together in space, or by the additional use of colour themes, for example.
The primary information contained in the diagram are the text elements and the connections between them. The positioning on the page often reflects the structure of these connections. When we lay out a diagram, we unconsciously favour layouts that minimise the number of crossed lines (to keep the diagram “clean” looking), and group connected items close together (unless some other information requires us to separate them – for example, we might be using a timeline basis for a horizontal x-axis and placing boxes in areas of the canvas we are working on that are associated with a particular month).
The online Google Drawing document type is typical of drawing tools included in many office applications. As well as being able to draw boxes and connect registration points on each box by lines or arrows, a range of layout tools provides support for aligning and spacing boxes.
Tools such as popplet provide a friendlier environment for generating similar sorts of diagram:
Whilst drawing tools such as these allow you to craft your diagram by hand, building it up as you go along, actually putting previously collected information into blocks on the canvas, let alone connecting the blocks together and laying them out nicely, may be quite an involved and error prone affair.
In these circumstances, it may make more sense to take a raw representation of the block contents and a simple representation of connections between appropriate blocks and just write the relationships down, letting a drawing tool do the hard work of drawing the blocks, connecting them together and laying them out, at least in draft layout form. To provide for a final layer of customisation, it might also be useful to be able to take a vector/SVG representation of the automatically sketched layout into a drawing package where it can be tidied up by hand and the application of a human designer’s eye.
There are several online tools available that you can use to sketch box and arrow diagrams from simple text descriptions.
Text2Mindmap allows you to construct tree based mindmaps from a simple outline style description of the mindmap.
The layout has a radial basis. Designs can be saved and images downloaded as JPG or PDF files.
Diagrammr allows you to draw simple graph based network structures in which text labelled block elements can be connected to other blocks by labelled edges.
Designs are given a persistent URL, but anyone with access to the URL can edit the diagram.
JS Sequence Diagrams
Diagrams can be saved as SVG diagrams and associated with a URL that contains all the information used to recreate the diagram. As such, large diagrams are not supported if they make the URL too long. Source code is available.
Several diagram types are available using blockdiag, including graphviz diagrams constructed using the DOT language.
GraphvizFiddle is a fiddler style application that lets you enter ,a href=”http://www.graphviz.org/content/dot-language”>Graphviz DOT language descriptions and preview the result.
Files can be generated in SVG format, or a textual definitions (for example, in the dot layout language).
Generating boxes and arrows style diagrams can be a pain at times because the semantics of the diagram – how one item is related to another – is represented in a graphical rather than data based form. By writing down the relations and then automatically generating visual representations of them, we retain access to the data representation whilst letting the hard work of generating the initial draft, at least, of the layout to a machine.
Several tools are available to support the creation of such literally described box and arrow diagrams using a variety of description languages and generating a range of output image formats (SVG probably being the most useful if you need to edit the sketch diagram for yourself to tweak the layout for its final presentation). Code for some of the tools (JS Sequence diagrams, blockdiag) is available.
Arguably the most powerful tools allow you to “write” diagrams using the Graphviz DOT layout language. Whilst there is a certain overhead associated with learning this language, it does save time in the long run if you regularly need to create network style diagrams. Graphviz also supports a range of layout algorithms – see the Graphviz gallery for examples.
PS If you want to write your own diagramming application, the JointJS library looks like a handy library to have on hand… The Venn.js library also looks quite pretty – if you have to generate Venn diagrams, that is!
With the UK national curriculum for schools set to include a healthy dose of programming from September 2014 (Statutory guidance – National curriculum in England: computing programmes of study) I’m wondering what the diff will be on the school day (what gets dropped if computing is forced in?) and who’ll be teaching it?
A few years ago I spent way too much time engaged in robotics related school outreach activities. One of the driving ideas was that we could use practical and creative robotics as a hands-on platform in a variety of curriculum context: maths and robotics, for example, or science and robotics. We also ran some robot fashion shows – I particularly remember a two(?) day event at the Quay Arts Centre on the Isle of Wight where a couple of dozen or so kids put on a fashion show with tabletop robots – building and programming the robots, designing fashion dolls to sit on them, choosing the music, doing the lights, videoing the show, and then running the show itself in front of a live audience. Brilliant.
On the science side, we ran an extended intervention with the Pompey Study Centre, a study centre attached to the Portsmouth Football Club, that explored scientific principles in the context of robot football. As part of the ‘fitness training’ programme for the robot footballers, the kids had to run scientific experiments as they calibrated and configured their robots.
The robot platform – mechanical design, writing control programmes, working with sensors, understanding interactions with the real world, dealing with uncertainty – provided a medium for creative problem solving that could provide a context for, or be contextualised by, the academic principles being taught from a range of curriculum areas. The emphasis was very much on learning by doing, using an authentic problem solving context to motivate the learning of principles in order to be able to solve problems better or more easily. The idea was that kids should be able to see what the point was, and rehearse the ideas, strategies and techniques of informed problem solving inside the classroom that they might then be able to draw on outside the classroom, or in other classrooms. Needless to say, we were disrespectful of curriculum boundaries and felt free to draw on other curriculum areas when working within a particular curriculum area.
In many respects, robotics provides a great container for teaching pragmatic and practical computing. But robot kit is still pricey and if not used across curriculum areas can be hard for schools to afford. There are also issues of teacher skilling, and the set-up and tear-down time required when working with robot kits across several different classes over the same school day or week.
So how is the new computing curriculum teaching going to be delivered? One approach that I think could have promise if kids are expected to used text based programming languages (which they are required to do at KS3) is to use a notebook style programming environment. The first notebook style environment I came across was Mathematica, though expensive license fees mean I’ve never really used it (Using a Notebook Interface).
More recently, I’ve started playing with IPython Notebooks (“ipynb”; for example, Doodling With IPython Notebooks for Education).
(Start at 2 minutes 16 seconds in – I’m not sure that WordPress embeds respect the time anchor I set. Yet another piece of hosted WordPress crapness.)
For a history of IPython Notebooks, see The IPython notebook: a historical retrospective.
Whilst these can be used for teaching programming, they can also be used for doing simple arithmetic, calculator style, as well as simple graph plotting. If we’re going to teach kids to use calculators, then maybe:
1) we should be teaching them to use “found calculators”, such as on their phone, via the Google search box, in those two-dimensional programming surfaces we call spreadsheets, using tools such as WolframAlpha, etc;
2) maybe we shouldn’t be teaching them to use calculators at all? Maybe instead we should be teaching them to use “programmatic calculations”, as for example in Mathematica, or IPython Notebooks?
Maths is a tool and a language, and notebook environments, or other forms of (inter)active, executable worksheets that can be constructed and or annotated by learners, experimented with, and whose exercises can be repeated, provide a great environment for exploring how to use and work with that language. They’re also great for learning how the automated execution of mathematical statements can allow you to do mathematical work far more easily than you can do by hand. (This is something I think we often miss when teaching kids the mechanics of maths – they never get a chance to execute powerful mathematical ideas with computational tool support. One argument against using tools is that kids don’t learn to spot when a result a calculator gives is nonsense if they don’t also learn the mechanics by hand. I don’t think many people are that great at estimating numbers even across orders of magnitude even with the maths that they have learned to do by hand, so I don’t really rate that argument!)
Maybe it’s because I’m looking for signs of uptake of notebook ideas, or maybe it’s because it’s an emerging thing, but I noticed another example of notebook working again today, courtesy of @alexbilbie: reports written over Neo4J graph databases submitted to the Neo4j graph gist winter challenge. The GraphGist how to guide looks like they’re using a port of, or extensions to, an IPython Notebook, though I’ve not checked…
Note that IPython notebooks have access to the shell, so other languages can be used within them if appropriate support is provided. For example, we can use R code in the IPython notebook context.
Note that interactive, computationaal and data analysis notebooks are also starting to gain traction in certain areas of research under the moniker “reproducible research”. An example I came across just the other day was The Dataverse Network Project, and an R package that provides an interface to it: dvn – Sharing Reproducible Research from R.
In much the same way that I used to teach programming as a technique for working with robots, we can also teach programming in the context of data analysis. A major issue here is how we get data in to and out of a programming environment in an seamless way. Increasingly, data sources hosted online are presenting APIs (programmable interfaces) with wrappers that provide a nice interface to a particular programming language. This makes it easy to use a function call in the programming language to pull data into the programme context. Working with data, particularly when it comes to charting data, provides another authentic hook between maths and programming. Using them together allows us to present each as a tool that works with the other, helping answer the question “but why are learning this?” with the response “so now you can do this, see this, work with this, find this out”, etc. (I appreciate this is quite a utilitarian view of the value of knowledge…)
But how far can we go in terms of using “raw”, but very powerful, computational tools in school? The other day, I saw this preview of the Wolfram Language:
There is likely to be a cost barrier to using this language, but I wonder: why shouldn’t we use this style of language, or at least the notebook style of computing, in KS3 and 4? What are the barriers (aside from licensing cost and machine access) to using such a medium for teaching computing in context (in maths, in science, in geography, etc)?
Programming puritans might say that notebook style computing isn’t real programming… (I’m not sure why, but I could imagine they might… erm… anyone fancy arguing that line in the comments?!:-) But so what? We don’t want to teach everyone to be a programmer, but we do maybe want to help them realise what sorts of computational levers there are, even if they don’t become computational mechanics?
I was pleased to be invited back to the University of Lincoln again yesterday to give a talk on data journalism to a couple of dozen or so journalism students…
I was hoping to generate a copy of the slides (as images) embedded in a markdown version of the notes but couldn’t come up with a quick recipe for achieving that…
When I get a chance, it looks as if the easiest way will be to learn some VBA/Visual Basic for Applications macro scripting… So for example:
If anyone beats me to it, I’m actually on a Mac, so from the looks of things on Stack Overflow, hacks will be required to get the VBA to actually work properly?
One of the big issues when it comes to working with data is that things are often done most easily using small fragments of code. Something I’ve been playing with recently are IPython notebooks, which provide an interactive browser based interface to a Python shell.
The notebooks are built around the idea of cells of various types, including header and markdown/HTML interpreted cells and executable code cells. Here are a few immediate thoughts on how we might start to use these notebooks to support distance and self-paced education:
The output from executing a code cell is displayed in the lower part of the cell when the cell is run. Code execution causes state changes in the underlying IPython session, the current state of which is accessible to all cells.
Graphical outputs, such as chart objects generated using libraries such as matplotlib, can also be displayed inline (not shown).
There are several ways we might include a call to action for students:
* invitations to run code cells;
* invitations to edit and run code cells;
* invitations to enter commentary or reflective notes.
We can also explore ways of providing “revealed” content. One way is to make use of the ability to execute code, for example by importing a crib class that we can call…
Here’s how we might prompt its use:
All code cells in a notebook can be executed one after another, or a cell at a time (You can also execute all cells above or below the cursor). This makes for easy testing of the notebook and self-paced working through it.
A helper application, nbconvert, allows you to generate alternative versions of a notebook, whether as HTML, python code, latex, or HTML slides (reveal.js). There is also a notebook viewer available that displays an HTML view of a specified ipynb notebook. (There is also an online version: IPython notebook viewer.)
Another advantage of the browser based approach is that the IPython shell can run in a virtual machine (for example, Cursory Thoughts on Virtual Machines in Distance Education Courses) and expose the notebook as a service that can be accessed via a browser on a host machine:
A simple configuration tweak allows notebook files and data files to occupy a folder that is shared across both the host machine and the guest virtual machine.
It is also possible to run remote notebook servers that can be accessed via the web. It would be nice if institutional IT services could support the sort of Agile EdTech that Jim Groom has been writing about recently, that would allow course teams developers to experiment with this sort of technology quickly and easily, but in the meantime, we can still do it ourselves…
For example, I am currently running a couple of virtual machines whose configurations I have “borrowed” from elsewhere – Matthew Russell’s Mining the Social Web 2nd Edition VM, and @datamineruk’s pandas infinite-intern machine.
I’ve only really just started exploring what may be possible with IPython notebooks, and how we might be able to use them as part of an e-learning course offering. If you’ve seen any particularly good (or bad!) examples of IPython notebooks being used in an educational context, please let me know via the comments…:-)
As part of a new course I’m working on, the course team has been making use of shared Google docs for working up the course proposal and “D0″ (zero’th draft; key topics to be covered in each of the weekly sessions). Although the course production hasn’t been approved yet, we’ve started drafting the actual course materials, with an agreement to share them for comment via Google docs.
The approach I’ve taken is to created a shared folder with the rest of the course teams, and set up documents for each of the weekly sessions I’ve taken the lead on.
The documents in this folder are all available to other members of the course team – for reference and /or comment – at any time, and represent the “live”/most current version of each document I’m working on. I suspect that others in the course team may take a more cautious approach, only sharing a doc when it’s in a suitable state for handover – or at least, comment – but that’s fine too. My docs can of course be used that way as well – no-one has to look at them until I do “hand them over” for full comment at the end of the first draft stage.
But what if others, such as the course team chair or course manager, do want to keep check on progress over the coming weeks?
The file listing shown above doesn’t give a lot away about the stare of each document, not even a file size, only when it was last worked on. So it struck me that it might be useful to have a visual indicator (such as a horizontal progress bar) about the progress on each document so that someone looking at the listing would know whether there was any point opening a document to have a look inside at all…
..because at the current time, a lot of the docs are just stubs, identifying tasks to be done.
Progress could be measured by proxy indicators, such as file size, “page count” equivalent, or line count. In these cases, the progress meter could be updated automatically. Additional insight could be provided by associating a target line count or page length metadata element, providing additional feedback to the author about progress with respect to that target. If a document exceeds the planned length, the progress meter should carry on going, possibly with a different colour denoting the overrun.
There are a couple of problems at least with this approach – documents that are being worked on may blend scruffy working notes along with actual “finished” text; several versions of the same paragraph may exist as authors try out different approaches, all adding to the line count. Long copied chunks from other sources may be in the text as working references, and so on.
So how about an additional piece of metadata for docs additionally tagged as “task” type in which a user can set a quick progress percentage estimate (a slider widget would make this easy to update) that is displayed in a bar on the file listing. Anyone checking the folder could then – at a glance – see which docs were worth looking at based on progress within the document-as-task. (Of course, having metadata available also opens up the possibility of additional mission creeping features, rulesets for generating alerts when a doc hits a particular percentage completion, for example.)
I’m not looking for more project management tools to take time away from a task, but in this case think the simple addition of a “progress” metadata element could weave an element of project management support into this sort of workflow? (changing the title of the doc would be another way – eg adding (20% done) to the title…
Thinks: hmm, I procrastinating, aren’t I? I should really be working on one of those docs…;-)