OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Infoskills’ Category

its the Gramma an punctuashun wot its’ about, Rgiht?

leave a comment »

This is another of those confluence style posts, where a handful of things I’ve read in quick succession seem to phase lock in my mind:

(brought to mind in part via @downes a week or so ago: How to Synch 32 Metronomes)

The first was a post by Alan Levine on Making Text Work, which describes a simple technique for making text overlays on photographs create a more coherent image than just slapping text on:

One of the techniques I use on presentation slides from time to time is a solid filled banner or stripe containing text, some opaque, sometimes transparent. I wouldn’t claim to be much of an artist but it makes the slides slightly more interesting than a header with an image in the middle of the slide, surrounded by whitespace…

(Which reminds me: maybe I should look through Presentation Zen Design and Presentation Zen: Simple Ideas on Presentation Design and Delivery again…)

Reading Alan’s post, it occurred to me that once you get the idea of using a solid or semi-transparent solid filled background to a text label, you tend to remember it and add it to your toolbox of presentation ideas (of course, you might also forget and later rediscover this sort of trick… My own slides tend to crudely follow particular design ideas I’ve recently picked up on, albeit in a crude and often not very well polished way, that I’ve decided I want to try to explore…) In the slide above, several tricks are evident: the solid filled text label, the positioning of it, the backgrounded blog post that actually serves as a reference for the slide (you can do a web search for the post title to learn more about the topic), the foregrounded image, rotated slightly, and so on.

The thing that struck me about Alan’s post was that it reminded me of a time before I was really aware of using a solid filled label to foreground a piece of text, which in turn caused me to reflect on other things I now take for granted as ideas that I can draw on and combine with other ideas.

In the same way we learn to spell, and learn to use punctuation, and start to pick up on the grammer that structures a language, so we can use those rules to construct ever more complex sentences. Once we know the rules, it becomes a choice as to whether or not we employ them.

Here’s an example of how we might come to acquire a new design idea, drawn from a brief conversation with @mediaczar a couple of nights ago when Mat asked if anyone knew the name od this sort of chart combination:

I didn’t know what to call the chart, but thought it should be easy enough to try to wrangle one together using ggplot in R, guessing that a geom_errorbarh() might work; Mat came back suggesting geom_crossbar().

Here’s a minimal code fragment I used to explore it:

#plot a horizontal bar across a bar chart
a=data.frame(x=10,y=43)
b=stack(a)
c=data.frame(d=c('x','y'),f=c(3,22))
library(ggplot2)
g=ggplot(b)+geom_bar(aes(x=ind,y=values),stat='identity')
g=g+geom_crossbar(data = c, aes(x=d,y=f,ymax=f+1,ymin=f-1), colour = NA, fill = "red", width = 0.9, alpha = 0.5)
print(g)

Here’s an example of how I used it – in an as yet unlabelled sketch showing for a particular F1 driver, their gird position for each race (red bar) the the number of places they gained (or lost) during the first lap:

So now I know how to achieve that effect…

Now for one two more things… Just after reading Alan’s post, I read a post by James Allen on possible race strategies for the Japanese Grand Prix:

The first thing that struck me was that even if you vaguely understand how a race chart works, the following statement may not be readily obvious to you from the top chart (my emphasis):

Three stops is actually faster [taking new softs on lap 12], as the [upper] graph … shows, but it requires the driver to pass the two stoppers in the final stint. If there is a safety car, it will hand an advantage to the two stoppers.

So, can you see why the three stopper (the green line) “requires the driver to pass the two stoppers in the final stint”? Let’s step back for a moment – can you see which bit of the graphs represent an overtake?

This is actually quite a complex graph to read – the axes are non-obvious, and not that easy to describe, though you soon pick up a feeling for how the chart works (honest!). Getting a sensible interpretation working for the surprising feature – the sharp vertical drops – is one way of getting into this chart, as well as looking at how the lines are postioned at the extreme x values (that is, at the end of the first and last laps).

The second thing that occurred to me was that we could actually remove the fragment of the line that shows the pitstop and instead show a separate line segment for each stint for each driver, and hence the line crossings that do not represent required overtakes. I’ve used this technique before, for example to show the separate stints on a chart of laptimes for a particular driver over the course of a race:

And as to where I got that trick? I think it was a bastardisation of a cycle plot, which can be used to show monthly, weekly or seasonal trends over a series of years:

…but it could equally have been a Stephen Few highlighted trick of disconnecting a timeseries line at the crossing point of one month to the next…

Whatever the case, one of the ideas I always have in mind is whether it may be possible to introduce white space in the form of a break in a line in order to separate out different groups of data in a very simple way.

Written by Tony Hirst

October 5, 2012 at 12:15 am

Therapy Time: Networked Personal Learning and a Reflection on the Urban Peasant…

with 3 comments

Way back when I was a postgrad, I used to spend a coffee fuelled morning reading in bed, and then get up to eat a cooked breakfast whilst watching the Urban Peasant:

My abiding memory, in part confirmed by several of the asides in the above clip (can you guess which?!), was that of “agile cooking” and flexible recipes. A chicken curry (pork’s fine too, or beef, even fish if you like; or potato if you want a vegetarian version) could be served with rice (or bread, or a baked potato); if you didn’t like curry, you could leave out the spices or curry powder, and just use a stock cube. If a recipe called for chopped vegetables, you could also grate them or slice them or dice them or…”it’s your decision”. Potato and peas could equally well be carrot or parsnip and beans. If you needed to add water to a recipe, you could add wine, or beer, or fruit juice or whatever instead; if you wanted to have scrambled egg on toast, you could also fry it, or poach it, or boil it. And the toast could be a crumpet or a muffin or just use “whatever you’ve got”.

The ethos was very much one of: start with an idea, and/or see what you’ve got, and then work with it – a real hacker ethic. It also encouraged you to try alternative ideas out, to be adaptive. And I’m pretty sure mistakes happened too – but that was fine…

When I play with data, I often have a goal in mind (albeit a loose one), used to provide a focus for exploring a data set I want to explore a little (typically using Schneiderman’s “Overview first, zoom and filter, then details-on-demand” approach), to see what potential it might hold, or to act a testbed for a tool or technique I want to try out. The problem then becomes one of coming up with some sort of recipe that works with the data and tools I have to hand, as well as the techniques and processes I’ve used before. Sometimes, a recipe I’m working on requires me to get another ingredient out of the fridge, or another utensil out of the cupboard. Sometimes I use a tea towel as an oven glove, or a fork as a knife. Sometimes I taste the food-in-process to know when it’s done, sometimes I go by the colour, texture, consistency, shape, smell or clouds of smoke that have started to appear.

Because I haven’t had any formal training in any of this “stuff”, using “approved” academic sources (I’ve recently been living by R-Bloggers (which is populated by quite a few academics) and Stack Overflow, for example), I suffer from a lack of confidence in talking about it in an academic way (see for example For My One Thousandth Blogpost: The Un-Academic), and a similar lack of confidence in feeling that I could ever charge anybody a fee for telling them what I (think I) know (leave aside for the moment that I effectively charge the OU my salary, benefits and on-costs… hmmm?!). I used to do the academic thing way back when as a postgrad and early postdoc, but fell out of the habit over the last few years because there seemed to me to be a huge amount of investment of time required for very little impact or consequence of what I was doing. Yes, it’s important for things be “right”, but I’m not sure my maths is up to generating formal proofs of new algorithms. I may be able to do the engineering or technologist thing of getting something working, -ish, good enough “for now”, research-style coding, but it’s always mindful of an engineering style trade-off: that it might not be “right” and is just something I figured out that seems to work, but that it’ll do because it lets me get something done… As Artur Bergman puts it using rather colourful language – “yes, correlation isn’t causation, but…”


(This clip was originally brought to mind by a recent commentary from Stephen Downes on The Internet Blowhard’s Favorite Phrase, and the original post it refers to.)

Also mixed up in the notion of “right” is seeing things as “right” if they are formally recognised or accepted as such, which is where assessment and peer review come in: you let other people you trust make an assessment about whatever it is you do/have done, publicly recognising your achievements which in turn allows you to make a justifiable claim to them. (I am reminded here of the definition of knowledge as justified true belief. That word “justified” is interesting, isn’t it…?)

As well as resisting getting in the whole grant bidding cycle for turnover generating, public money cycling projects that are set up to fail, I’ve also recently started to fall out of OU-style formal teaching roles… again, in part because of the long lead times involved with producing course materials and my preference for network based, rather than teamwork based, working style. (I so need to revisit formal models of teamwork and try to come up with a corresponding formulation for networks rather than teams…Or do a lit review to find one that’s already out there…!) I tend to write in 1 hour chunks based on 3-4 hours work, then post whatever it is I’ve done. One reason for doing this is becuase I figure most people read or do things in 5 to 15 minutes or one to two hour chunks and that in a network-centric, distributed online online educational setting small chunks are more likely to be discoverable and immediately useful (directly and immediately learnable from) chunks. There’s no shame in using a well crafted Wikipedia as a starting point for discovering more detailed – and academic – resources: at least you stand a good chance of finding the Wikipedia page! In the same way, I try to link out out to supporting resources from most of my posts so that readers (including myself as a revisitor to these pages in that set) have some additional context, supporting or conflicting material to help get more value from it. (Related: Why I (Edu)Blog.)

Thinking about my own personal lack of confidence, which in part arises from the way I have informally learned whatever it is that I have actually learned over the last few years and not had it formally validated by anybody else, my interest in espousing an informal style networked learning on others is an odd one… Because based on my own experience, it doesn’t give me the feeling that what I know is valid (justified..?), or even necessarily trustable by anybody other than me (because I know how it’s caveated because of what I have personally learned about it, rather than just being told about it), even if it is pragmatic and at least occasionally appears to be useful. (Hmm… I don’t think an OU editor would let me get away with a sentence like that in a piece of OU course material!) Maybe I need to start keeping a second, formalised reflective learning journal as the OU Skills for OU Study suggests to log what I learn, and provide some sort of indexable and searchable metadata around it? In fact, this approach might be a useful approach if I do another uncourse? (It also brings to mind the word equation: Learning material + metadata = learning object (it was something like that, wasn’t it?!))

To the extent that this blog is an act of informal, open teaching, I think it offers three main things: a) “knowledge transferring” discoverable resources on a variety of specialised topics; b) fragmentary records of created knowledge (I *think* I’ve managed to make up odd bits of new stuff over the last few years…); c) a model of some sort of online distributed network centric learning behaviour (see also the Digital Worlds Uncourse Blog Experiment in this respect).

I guess one of the things I do get to validate against is the real world. When I used to go into schools doing robotics activities*, kids would ask me if their robot or programme was “right”. In many cases, there wasn’t really a notion of “right”, it was more a case of:

  • were there things that were obviously wrong?
  • did the thing work as anticipated (or indeed, did any elements of it work at all?!;-)?
  • were there any bits that could be improved, adapted or done in another more elegant way?

So it is with some of my visualisation “experiments” – are they not wrong (is the data right, is there a sensible relationship between the data and the visual mappings)? do they “work” at all (eg in the sense of communicating a particular trend, or revealing a particular anomaly)? could they be improved? By running the robot program, or trying to read the story a data visualisation appears to be telling us, we can get a sense of how “right” it is; but there is often no single “right” for it to be. Which is where doubt can crop in… Because if something is “not right”, then maybe it’s “wrong”…?

In the world of distributed, networked learning, I think one thing we need to work on is developing an appropriate sense of validation and legitimisation of personal learning. Things like badges are weak extrinsic signs that some would claim have a role in this, but I wonder how networks and communities can be shaped and architected, or how their dynamics might work, so that learners develop not only a well-founded intrinsic confidence about what they have self-learned, but also a feeling that what they have self-learned is as legitimate as something they have been formally taught? (I know, I know: “I was at the University of Life, me”… As I am, now… which reminds me, I’ve a Coursera video and Feynman lecture on Youtube to watch, and a couple of code mentor answers to questions I’ve raised on various Stack Exchange sites to read through; and I probably should check to see if there are any questions recently posted to Stack Overflow that I may be able to answer and use to link out to other, more academic “open educational” resources…)

[Rereading this post, I think I am suffering from a lack of formality and the sense of justification that comes with it. Hmmm...]

* This is something I’ve recently been asked to do again for an MK local primary school in the new year; the contact queried how much I might charge and whilst in the past I would have said “no need”, for some reason this time I felt obliged to seek advice about from the Deanery about whether I should charge, and if so how much. This a huge personal cultural shift away from my traditional “of course/pro bono” attitude, and it felt wrong, somehow. To the extent that universities are public bodies, they should work with other public services in their local and extended communities. But of course, I get the sense we’re not really being encouraged to think of ourselves as public bodies very much any more, we’re commercial services… And that feeling affects the personal responsibility I feel when acting for and on behalf of the university. As it turns out, the Deanery seems keen that we participate freely in community events… But I note here that I felt (for the first time) as if I had to check first. So what’s in the air?

See also: Terran Lane’s On Leaving Academia and (via @boyledsweetie) Inspirational teaching: since when did entertainment not matter?

Written by Tony Hirst

October 4, 2012 at 1:32 pm

Appropriating IT: Glue Steps

with 2 comments

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

To generalise things a little, I have labelled the circuits as applications in the figure. But you can think of them as circuits if you prefer.

A piece of custom digital circuitry than can talk to both original circuits, and translate the outputs of each into a form that can be used as input to the other, is placed between them to take on this interfacing role: glue logic.

Sometimes, we might not need to transform all the data that comes out of the first circuit or application:

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

So what has this got to do with anything? In a post yesterday, I described a recipe for Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi. The recipe does what it says on the tin… but it actually isn’t really about that at all. It’s about using Google Refine as glue logic, for taking data out of one system in one format (JSON data from the Twitter search API, via a hand assembled URL) and getting it in to another (Gephi, using a simple CSV file).

The Twitter example was a contrived example… the idea of using Google Refine as demonstrated in the example is much more powerful. Because it provides you way of thinking about how we might be able to decompose data chain problems into simple steps using particular tools that do certain conversions for you ‘for free’.

Simply by knowing that Google Refine can import JSON (and if you look at the import screen, you’ll also notice it can read in Excel files, CSV files, XML files, etc, not just JSON, tidy it a little, and then export some or all of it as a simple CSV file means you now have a tool that might just do the job if you ever need to glue together an application that publishes or exports JSON data (or XML, or etc etc) with one that expects to read CSV. (Google Refine can also be persuaded to generate other output formats too – not just CSV…)

You can also imagine chaining separate tools together. Yahoo Pipes, for example, is a great environment for aggregating and filtering RSS feeds. As well as publishing RSS/XML via a URL, it also publishes more comprehensive JSON feeds via a URL. So what? So now you know that if you have data coming through a Yahoo Pipe, you can pull it out of the pipe and into Google Refine, and then produce a custom CSV file from it.

PS Here’s another really powerful idea: moving in and out of different representation spaces. In Automagic Map of #opened12 Attendees, Scott Leslie described a recipe for plotting out the locations of #opened12 attendees on a map. This involved a process of geocoding the addresses of the home institutions of the participants to obtain the latitude and longitude of those locations, so that they could be appropriately placed. So Scott had to hand: 1) institution names and or messy addresses for those institutions; 2) latitude and longitude data for those addresses. (I use the word “messy” to get over the idea that the addresses may be specified in all manner of weird and wonderful ways… Geocoders are built to cope with all sorts of variation in the way addresses are presented to them, so we can pass the problem of handling these messy addresses over to the geocoder.)

In a comment to his original post, Scott then writes: [O]nce I had the geocodes for each registrant, I was also interested in doing some graphs of attendees by country and province. I realized I could use a reverse geocode lookup to do this. That is, by virtue of now having the lat/long data, Scott could present these co-ordinates to a reverse geo-coding service that takes in a latitude/longitude pair and returns an address for it in a standardised form, including, for example, an explicit country or country code element. This clean data can then be used as the basis for grouping the data by country, for example. The process is something like this:

Messy address data -> [geocode] -> latitude/longitude data -> [reverse geocode] -> structured address data.

Beautiful. A really neat, elegant solution that dances between ways of representing the data, getting it into different shapes or states that we can work in different particular ways. :-)

PPS Tom Morris tweeted this post likening Google Refine to an FPGA. For this to work more strongly as a metaphor, I think we might have to take the JSON transcript of a set of operations in Google Refine as the programming, and then bake them into executable code, as for example we can do with the Stanford Data Wrangler, or using Pipe2Py, with Yahoo Pipes?

Written by Tony Hirst

October 3, 2012 at 10:36 am

Posted in Anything you want, Infoskills

Tagged with

ILI2012 Workshop Prep – Appropriating IT: innovative uses of emerging technologies

with 6 comments

Given that workshops at ILI2012 last a day (10 till 5), I thought I’d better start prepping the workshop I’m delivering with Martin Hawksey at this year’s Internat Librarian International early… W2 – Appropriating IT: innovative uses of emerging technologies:

Are you concerned that you are not maximising the potential of the many tools available to you? Do you know your mash-ups from your APIs? How are your data visualisation skills? Could you be using emerging technologies more imaginatively? What new technologies could you use to inspire, inform and educate your users? Learn about some of the most interesting emerging technologies and explore their potential for information professionals.

The workshop will combine a range of presentations and discussions about emerging information skills and techniques with some practical ‘makes’ to explore how a variety of free tools and applications can be appropriated and plugged together to create powerful information handling tools with few, if any, programming skills required.

Topics include:

- Visualisation tools
- Maps and timelines
- Data wrangling
- Social media hacks
- Screenscraping and data liberation
- Data visualisation

(If you would like to join in with the ‘makes’, please bring a laptop)

I have some ideas about how to fill the day – and I’m sure Martin does too – but I thought it might be worth asking what any readers of this blog might be interested in learning about in a little more detail and using slightly easier, starting from nowhere baby steps than I usually post.

My initial plan is to come up with five or six self contained elements that can also be loosely joined, structuring the day something like this:

  • opening, and an example of the sort of thing you’ll be able to do by the end of the day – no prior experience required, handheld walkthroughs all the way; intros from the floor along with what folk expect to get out of the day/want to be able to do at the day (h/t @briankelly in the comments; of course, if folks’ expectations differ from what we had planned….;-). As well as demo-ing how to use tools, we’ll also discuss why you might want to do these things and some of the strategies involved in trying to work out how to do them, knowing what you already know, or how to find out/work out how to do them if you don’t..
  • The philosophy of “appropriation”, “small pieces, lightly joined”, “minimum viability” and ‘why Twitter, blogs and Stack Overflow are Good Things”;
  • Visualising Data – because it’s fun to start playing straight away…
    • Google Fusion Tables – visualisations and queries
    • Google visualisation API/chart components

    Payoff: generate some charts and dashboards using pre-provided data (any ideas what data sets we might use…? At least one should have geo-data for a simple mapping demo…)

  • — Morning coffee break? —
  • Data scraping:
    • Google spreadsheets – import CSV, import HTML table;
    • Google Refine – import XLS, import JSON, import XML
    • (Briefly) – note the existence of other scraper tools, incl. Scraperwiki, and how they can be used

    Payoff: scrape some data and generate some charts/views… Any ideas what data to use? For the JSON, I thought about finishing with a grab of Twitter data, to set up after lunch…

  • — Lunch? —
  • (Social) Network Analysis with Gephi
    • Visually analyse Twitter data and/or Facebook data grabbed using Google Refine and/or TAGSExplorer
    • Wikipedia graphing using DBPedia
    • Other examples of how to think in graphs…
  • The scary session…
    • Working with large data files – examples of some simple text processing command line tools
    • Data cleansing and shaping – Google Refine, for the most part, including the use of reconciliation; additional examples based on regular expressions in a text editor, Google spreadsheets as a database, Stanford Data Wrangler, and R…
  • — Afternoon coffee break? —
  • Writing Diagrams – examples referring back to Gephi, mentioning Graphviz, then looking at R/ggplot2, finishing with R’s googleVis library as a way of generating Google Visualisation API Charts…
  • Wrap up – review of the philosophy, showing how it was applied throughout the exercises; maybe a multi-step mashup as a final demo?

Requirements: we’d need good wifi/network connections; also, it would help if participants pre-installed – and checked the set up of: a) a Google account; b) a modern browser (standardising on Google Chrome might be easiest?) c) Google Refine; d) Gephi (which may also require the installation of a Java runtime, eg on a new-ish Mac); e) R; f) RStudio and a raft of R libraries (ggplot2, plyr, reshape, RCurl, stringr, googleViz); g) a good text editor (?I use TextWrangler on a Mac); h) commandline tools (Windows machines);

Throughout each session, participants will be encouraged to identify datasets or IT workflow issues they encounter at work and discuss how the ideas presented in the workshop may be appropriated for use in those contexts…

Of course, this is all subject to change (I haven’t asked Martin how he sees the day panning out yet;-), but it gives a flavour of my current thinking… So: what sorts of things would you like to see? And would you like to book any of the sessions for a workshop at your place…?!;-)

Written by Tony Hirst

September 25, 2012 at 1:05 pm

Posted in Anything you want, Infoskills

Tagged with

Open Research Data Processes: KMi Crunch – Hosted RStudio Analytics Environment

with 2 comments

One of the possible barriers to widespread adoption of open notebook science is knowing where to start. Video reports of lab experiments hosted on Youtube can be easily embedded in a hosted WordPress blog; a MediaWiki wiki can be used to provide one page per experiment, with change tracking/history on each page and a shadow page for commentary and discussion; Github can be used to provide a version control environment for software code, results data, project pages and documentation. For tabulated data, Google Spreadsheets provides a hosting environment and an API that lets you treat the data as a database and also explore it dashboard style via a range of interactive visual filtering and charting components. Alternatively, a CKAN instance (such as is used to run thedatahub.org) offers data management and preview tools.

Keeping track of data analysis in an open way is also getting easier. In An R-chitecture for Reproducible Research/Reporting/Data Journalism, I briefly mentioned RPubs.com, a site that can be used to 1-click publish HTML reports of statistical analyses executed within the RStudio environment (I really need to do a proper post about this). But now there’s an example of another hosted solution from Fridolin Wild of the OU’s KMi: Crunch.

Crunch offers a hosted RStudio environment (so you can access RStudio via a browser) with public and private areas. The public areas allow you to post datasets, run scripts as a service, or publish results (Sweave generated PDFs, or knitr generated HTML reports, for example).

Crunch also incorporates a MySQL database for each user. (Scheduling and pipelining are also on the cards…)

Whilst developed as an application to support learning analytics (I think?), Crunch also provides a great demonstration of a more general open research data workbench. You can store – and publish – data sets, along with analysis scripts and reports generated by executing those scripts over your data set. Version control isn’t available at the moment (I think?) but RSTudio does have git/github support, so that may be coming. The provision of a MySql database means that data collections can be managed within a database environment. (From a data journalism, rather than an open/reproducible research, perspective, I did wonder whether it would be possible to situate something like Scraperwiki on the same platform and replace its SQLite support with MySQL support, so a Scraperwiki scraper could be used to scrape data into a MySQL database that was then accessed from RStudio? Being able to wire MySQL read/write access into Google Refine on the same platform could also be interesting..;-)

I’m not sure about the extent to which the OU LIbrary is taking an interest in the development of Crunch, but providing best practice support and advice in the orchestration of information and data handling tools seems to me to be in-scope for the academic research librarian, in much the same way as advising on the use of bibliography data management tools used to be…? (For a recent take on this, see Dorothea Salo’s recent Ariadne article Retooling Libraries for the Data Challenge.)

Written by Tony Hirst

August 23, 2012 at 10:24 am

Quick and Dirty Recipe: Merging (Concatenating) Multiple CSV files (ODA Spending)

with 4 comments

There’s been a flurry of tweets over the last few days about LOCOG’s exemption from FOI (example LOCOG response to an FOI request), but the Olympic Delivery Authority (ODA, one of the owner stakeholders) is rather more open, and publishes its spends over £25k: ODA Finance: Transparency Reports.

CSV files containing spend on a monthly basis are available from the site, using a consistent CSV file format each time (I think…). For what it’s worth, I thought it might be worth sharing a pragmatic, though not ideal, Mac/Linux/unixtools commandline recipe for generating a single file containing all this data.

  1. Right-click and download each of the CSV files on the page to the same directory (eg odaSpending) on your local machine. (There are easier ways of doing this – I tried wget on the command line, but got an Access Denied response (workaround anyone?); there are probably more than a few browser extensions/plugins that will also download all the files linked to from a page. If so, you just want to grab the csv files; if you get them all, from the command line, just copy the csv files to a new directory: eg mkdir csvfiles;cp *.csv csvfiles)
  2. On the commandline, change directory to the files directory – eg cd odaSpending/csvfiles; then join all the files together: files=*; cat $files > odaspending.csv
  3. You should now have a big file, odaspending.csv, containing all the data, although it will also contain multiple header rows (each csv file had its own header row). Open the file in a text editor (I use TextWrangler), copy from the start of the first line to the start of the second (ie copy the header row, including the end of line/carriage return), then do a Find on the header and global Replace with nothing replacing the search string. Then, depending where you started the replace, maybe paste the header (if required) back into the first row

To turn the data file into something you can explore more interactively, upload it to something like Google Fusion Tables, as I did here (data to May 2012): ODA Spending in Google Fusion Tables

Note that this recipe is a pragmatic one. Unix gurus would surely be able to work out far more efficient scripts that concatenate the files after stripping out the header in all but the first file, for example, or that maybe even check the columns are the same etc etc. But if you want something quick and dirty, this is one way of doing it… (Please feel free to add alternative recipes for achieving the same thing in the comments…)

PS here’s an example of one sort of report you can then create in Fusion Tables – ODA spend with G4S; here’s another: Seconded staff

Written by Tony Hirst

July 16, 2012 at 10:16 am

Mapping How Programming Languages Influenced Each Other According to Wikipedia

with 9 comments

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/ProgrammingLanguage&gt; relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(I think a (as in ?x a ?y and rdf:type are synonyms?)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return, the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in and brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher&gt;.

PS see how to use a similar technique to map out musical genres ascribed to bands on WIkipedia

Written by Tony Hirst

July 3, 2012 at 12:08 pm

Visualising Related Entries in Wikipedia Using Gephi

with 13 comments

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugin.)

To get DBpedia data into Gephi, we need to do three things:

- tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
- tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
- tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT *
WHERE {
?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or EigenVector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchtermann-Rheingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – http://dbpedia.org/page/Martin_Heidegger – for example, but we can tidy this up. Going to one of the DPpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/&gt; as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/&gt;, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced&gt; ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHEER part of the query:

?p <foaf:name> ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a
  <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I'll maybe post more on this later - in the meantime, if you're new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, on online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

Written by Tony Hirst

July 3, 2012 at 10:05 am

Sketching Substantial Council Spending Flows to Serco Using OpenlyLocal Aggregated Spending Data

with 3 comments

An article in today’s Guardian (Serco investigated over claims of ‘unsafe’ out-of-hours GP service) about services provided by Serco to various NHS Trusts got me thinking about how much local councils spend with Serco companies. OpenlyLocal provides a patchy(?) aggregating service over local council spending data (I don’t think there’s an equivalent aggregator for NHS organisations’ spending, or police authority spending?) so I thought I’d have a quick peek at how the money flows from councils to Serco.

If we search the OpenlyLocal Spending Dashboard, we can get a summary of spend with various Serco companies from local councils whose spending data has ben logged by the site:

Using the local spend on corporates scraper I used to produce Inter-Council Payments Network Graph, I grabbed details of payments to companies returned by a search on OpenlyLocal for suppliers containing the keyword serco, and then generated a directed graph with edges defined: a) from council nodes to company nodes; b) from company nodes to canonical company nodes. (Where possible, OpenlyLocal tries to reconcile companies identified for payment by councils with canonical company identifiers so that we can start to get a feeling for how different councils make payments to the same companies.)

I then exported the graph as a json node/edge list so that it could be displayed by Mike Bostock’s d3.js Sankey diagram library:

(Note that I’ve filtered the edges to only show ones above a certain payment amount (£10k).)

As a presentation graphic, it’s really tatty, doesn’t include amount labels (though they could be added) and so on. But as a sketch, it provides an easy to digest view over the data as a starting point for a deeper conversation with the data. We might also be able to use the diagram as a starting point for a data quality improvement process, by identifying the companies that we really should try to reconcile.

Here are flows associated with speend to g4s identified companies:

I also had a quick peek at which councils were spending £3,500 and up (in total) with the OU…

Digging into OpenlyLocal spending data a little more deeply, it seems we can get a breakdown of how total payments from council to supplier are made up, such as by spending department.

Which suggests to me that we could introduce another “column” in the Sankey diagram that joins councils with payees via spending department (I suspect the Category column would result in data that’s a bit too fine grained).

See also: University Funding – A Wider View

Written by Tony Hirst

May 26, 2012 at 10:33 pm

Posted in Data, Infoskills, Visualisation

Tagged with

Follow

Get every new post delivered to your Inbox.

Join 340 other followers