My ILI2012 Presentation – Derived Products from OpenLearn/OU XML Documents

FWIW, a copy of the slides I used in my ILI2012 presentation earlier this week – Making the most of structured content: data products from OpenLearn XML:

I guess this counts as a dissemination activity for my related eSTEeM project on course related custom search engines, since the work(?!) sort of evolved out of that idea…

The thesis is this:

  1. Course Units on OpenLearn are available as XML docs – a URL pointing to the XML version of a unit can be derived from the Moodle URL for the HTML version of the course (the same is true of “closed” OU course materials). The OU machine uses the XML docs as a feedstock for a publication process that generates HTML views, ebook views, etc, etc of a course.
  2. We can treat XML docs as if they were database records; sets of structured XML elements can be viewed as if they define database tables; the values taken by the structured elements are like database table entries. Which is to say, we can treat each XML doc as a mini-database, or we can trivially extract the data and pop it into a “proper”/“real” database.
  3. Given a list of courses, we can grab all the corresponding XML docs and build a big database of their contents; that is, a single database that contains records pulled from course XML docs.
  4. The sorts of things that we can pull out of a course include: links, images, glossary items, learning objectives, and section and subsection headings (a sketch of this sort of extraction appears after this list).
  5. If we mine the (sub)section structure of a course from the XML, we can easily provide an interactive treemap version of the sections and subsections in a course; by generating Freemind mindmap documents, we can automatically produce course-section mindmap files that students can view – and annotate – in Freemind (see the mindmap sketch after this list). We can also generate bespoke mindmaps, for example based on sections across OpenLearn courses that contain a particular search term.
  6. By disaggregating individual course units into “typed” elements or faceted components, and then reaggregating items of a similar class or type across all course units, we can provide faceted search across, as well as a university-wide “meta” view over, different classes of content. For example:
    • by aggregating learning objectives from across OpenLearn units, we can trivially create a search tool that provides a faceted search over just the learning objectives associated with each unit; the search returns learning outcomes associated with a search term and links to course units associated with those learning objectives; this might help in identifying reusable course elements based around reuse or extension of learning outcomes;
    • by aggregating glossary items from across OpenLearn units, we can trivially create a meta glossary for the whole of OpenLearn (or similarly across all OU courses). That is, we could produce a monolithic OpenLearn, or even OU-wide, glossary; or maybe it’s useful to be able to redefine the same glossary terms using different definitions, rather than reuse the same definition(s) consistently across different courses? As with learning objectives, we can also create a search tool that provides a faceted search over just the glossary items associated with each unit; the search returns glossary items associated with a search term and links to course units associated with those glossary items;
    • by aggregating images from across OpenLearn units, we can trivially create a search tool that provides a faceted search over just the descriptions/captions of images associated with each unit; the search returns the images whose descriptions/captions are associated with the search term and links to the course units associated with those images. This disaggregation provides a direct way of searching for images that have been published through OpenLearn. Rights information may also be available, allowing users to search for images that have been rights cleared, as well as for openly licensed images.
  7. The original route in was the extraction of links from course units that could be used to seed custom search engines that search over resources referenced from a course (see the final sketch after this list). This could in principle also include books, using Google Book Search.
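To make the extraction idea a bit more concrete, here’s a minimal sketch of the sort of scraper I have in mind: grab the XML for a couple of units and pop the glossary items and learning outcomes into a single SQLite database. Caveat: the XML URLs are placeholders, and the element names (GlossaryItem, Term, Definition, LearningOutcome) are guesses at the OU XML schema rather than gospel – check them against an actual unit’s XML.

```python
import sqlite3
import urllib.request

from lxml import etree

# Hypothetical XML URLs - in practice these would be derived from the Moodle
# URLs of the HTML versions of the units.
unit_xml_urls = [
    "http://openlearn.open.ac.uk/exampleunit1/content.xml",
    "http://openlearn.open.ac.uk/exampleunit2/content.xml",
]

conn = sqlite3.connect("openlearn.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS glossary (unit TEXT, term TEXT, definition TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS outcomes (unit TEXT, outcome TEXT)")

for url in unit_xml_urls:
    tree = etree.parse(urllib.request.urlopen(url))
    # Element names below are guesses at the OU XML schema - adjust to match
    # what the real course XML actually uses.
    for item in tree.findall(".//GlossaryItem"):
        conn.execute("INSERT INTO glossary VALUES (?, ?, ?)",
                     (url, item.findtext("Term", ""), item.findtext("Definition", "")))
    for lo in tree.findall(".//LearningOutcome"):
        text = etree.tostring(lo, method="text", encoding="unicode").strip()
        conn.execute("INSERT INTO outcomes VALUES (?, ?)", (url, text))

conn.commit()
conn.close()
```

With everything in one database, the cross-unit queries behind the meta glossary and the faceted search tools mentioned above become one-liners (e.g. SELECT term, definition, unit FROM glossary WHERE term LIKE '%energy%').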
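In a similar vein, here’s a sketch of how the (sub)section structure of a unit might be turned into a Freemind mindmap file – the .mm format is just nested node elements with TEXT attributes inside a map root. Again, the Section/SubSection/Title element names are assumptions about the OU XML, not the real schema.

```python
from lxml import etree

def unit_to_mindmap(unit_xml, unit_title, outfile):
    """Write a Freemind .mm file mirroring a unit's section structure."""
    tree = etree.parse(unit_xml)
    mm_root = etree.Element("map", version="0.9.0")
    unit_node = etree.SubElement(mm_root, "node", TEXT=unit_title)
    for section in tree.findall(".//Section"):  # assumed element name
        sec_node = etree.SubElement(
            unit_node, "node", TEXT=section.findtext("Title", "Untitled"))
        for subsection in section.findall(".//SubSection"):  # assumed element name
            etree.SubElement(
                sec_node, "node", TEXT=subsection.findtext("Title", "Untitled"))
    etree.ElementTree(mm_root).write(outfile, pretty_print=True)

unit_to_mindmap("content.xml", "Example OpenLearn unit", "unit.mm")
```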
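And for the custom search engine route, a sketch of how the outbound links in a unit’s XML could be rewritten as a Google Custom Search Engine annotations file. The label value is a placeholder that would need to match the label defined in your own CSE.

```python
from lxml import etree

def links_to_cse_annotations(unit_xml, cse_label, outfile):
    """Write the external links found in a unit's XML as a CSE annotations file."""
    tree = etree.parse(unit_xml)
    annotations = etree.Element("Annotations")
    # Grab href attributes wherever they appear; the real OU XML may mark
    # weblinks up with a dedicated element instead.
    for href in sorted(set(tree.xpath("//@href"))):
        if href.startswith("http"):
            ann = etree.SubElement(annotations, "Annotation", about=href)
            etree.SubElement(ann, "Label", name=cse_label)
    etree.ElementTree(annotations).write(outfile, pretty_print=True)

# "_cse_examplelabel" is a placeholder for the label in your CSE's context file.
links_to_cse_annotations("content.xml", "_cse_examplelabel", "annotations.xml")
```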

I also briefly described an approach for appropriating Google custom search engine promotions as the basis for a search engine mediated course, something I think could be used in a sMoocH (search mediated MOOC hack). But then MOOCs as popularised have f**k all to do with innovation, don’t they, other than in a marketing sense for people with very little imagination.

During questions, @briankelly asked if any of the reported dabblings/demos (and there are several working demos) were just OUseful experiments, or whether they could in principle be adopted within the OU, or even more widely across HE. The answers are ‘yes’ and ‘yes’, but in reality ‘yes’ and ‘no’. I haven’t even been able to get round to writing up (or persuading someone else to write up) any of my dabblings as ‘proper’ research, let alone fight the interminable rounds of lobbying and stakeholder acquisition it takes to get anything rolled out as an adopted innovation. If any of the ideas were/are useful, they’re Googleable and folk are free to run with them… but because they had no big-budget-holding champion associated with their creation, and hence no one with a stake (even defensively) in seeing some sort of use come from them, they’re unlikely to register anywhere.

Appropriating IT: Glue Steps

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

To generalise things a little, I have labelled the circuits as applications in the figure. But you can think of them as circuits if you prefer.

A piece of custom digital circuitry that can talk to both original circuits, and translate the outputs of each into a form that can be used as input to the other, is placed between them to take on this interfacing role: glue logic.

Sometimes, we might not need to transform all the data that comes out of the first circuit or application:

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

So what has this got to do with anything? In a post yesterday, I described a recipe for Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi. The recipe does what it says on the tin… but it actually isn’t really about that at all. It’s about using Google Refine as glue logic, for taking data out of one system in one format (JSON data from the Twitter search API, via a hand-assembled URL) and getting it into another (Gephi, using a simple CSV file).

The Twitter example was a contrived one… the idea of using Google Refine as demonstrated in the example is much more powerful, because it provides you with a way of thinking about how we might decompose data chain problems into simple steps, using particular tools that do certain conversions for you ‘for free’.

Simply knowing that Google Refine can import JSON (and if you look at the import screen, you’ll also notice it can read in Excel files, CSV files, XML files, etc., not just JSON), tidy it a little, and then export some or all of it as a simple CSV file means you now have a tool that might just do the job if you ever need to glue together an application that publishes or exports JSON data (or XML, etc.) with one that expects to read CSV. (Google Refine can also be persuaded to generate other output formats too – not just CSV…)
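Just to show how little is actually going on in that glue step, here’s the same JSON-to-CSV conversion written out as a few lines of Python rather than clicked through in Refine. The endpoint URL and the field names are made up for the sake of the example.

```python
import csv
import json
import urllib.request

# Made-up endpoint, purely for illustration.
data = json.load(urllib.request.urlopen("http://example.com/api/results.json"))

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "value"])
    for record in data["results"]:  # assumes a top-level "results" list
        writer.writerow([record.get("id"), record.get("name"), record.get("value")])
```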

You can also imagine chaining separate tools together. Yahoo Pipes, for example, is a great environment for aggregating and filtering RSS feeds. As well as publishing RSS/XML via a URL, it also publishes more comprehensive JSON feeds via a URL. So what? So now you know that if you have data coming through a Yahoo Pipe, you can pull it out of the pipe and into Google Refine, and then produce a custom CSV file from it.

PS Here’s another really powerful idea: moving in and out of different representation spaces. In Automagic Map of #opened12 Attendees, Scott Leslie described a recipe for plotting out the locations of #opened12 attendees on a map. This involved a process of geocoding the addresses of the home institutions of the participants to obtain the latitude and longitude of those locations, so that they could be appropriately placed. So Scott had to hand: 1) institution names and/or messy addresses for those institutions; and, once geocoded, 2) latitude and longitude data for those addresses. (I use the word “messy” to get over the idea that the addresses may be specified in all manner of weird and wonderful ways… Geocoders are built to cope with all sorts of variation in the way addresses are presented to them, so we can pass the problem of handling these messy addresses over to the geocoder.)

In a comment to his original post, Scott then writes: [O]nce I had the geocodes for each registrant, I was also interested in doing some graphs of attendees by country and province. I realized I could use a reverse geocode lookup to do this. That is, by virtue of now having the lat/long data, Scott could present these co-ordinates to a reverse geo-coding service that takes in a latitude/longitude pair and returns an address for it in a standardised form, including, for example, an explicit country or country code element. This clean data can then be used as the basis for grouping the data by country, for example. The process is something like this:

Messy address data -> [geocode] -> latitude/longitude data -> [reverse geocode] -> structured address data.
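If you wanted to play with this dance yourself, here’s a rough sketch using the Nominatim web service for both steps. (Scott’s recipe used a different geocoder, so treat this as an illustration of the shape of the process – and of what I believe Nominatim’s JSON responses look like – rather than a reproduction of his method.)

```python
import json
import urllib.parse
import urllib.request

# Nominatim's usage policy asks for an identifying User-Agent.
def fetch_json(url):
    req = urllib.request.Request(url, headers={"User-Agent": "opened12-demo"})
    return json.load(urllib.request.urlopen(req))

def geocode(messy_address):
    # Messy address -> latitude/longitude (take the first candidate match)
    url = ("http://nominatim.openstreetmap.org/search?format=json&q="
           + urllib.parse.quote(messy_address))
    result = fetch_json(url)[0]
    return result["lat"], result["lon"]

def reverse_geocode(lat, lon):
    # Latitude/longitude -> structured address, with an explicit country element
    url = ("http://nominatim.openstreetmap.org/reverse?format=json&lat=%s&lon=%s"
           % (lat, lon))
    return fetch_json(url)["address"]

lat, lon = geocode("The Open University, Walton Hall, Milton Keynes")
address = reverse_geocode(lat, lon)
print(address.get("country"), address.get("country_code"))
```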

Beautiful. A really neat, elegant solution that dances between ways of representing the data, getting it into different shapes or states that we can work with in particular ways. :-)

PPS Tom Morris tweeted this post, likening Google Refine to an FPGA. For this to work more strongly as a metaphor, I think we might have to take the JSON transcript of a set of operations in Google Refine as the programming, and then bake it into executable code, as we can do, for example, with the Stanford Data Wrangler, or with Yahoo Pipes using Pipe2Py?

ILI2012 Workshop Prep – Appropriating IT: innovative uses of emerging technologies

Given that workshops at ILI2012 last a day (10 till 5), I thought I’d better start prepping the workshop I’m delivering with Martin Hawksey at this year’s Internet Librarian International early… W2 – Appropriating IT: innovative uses of emerging technologies:

Are you concerned that you are not maximising the potential of the many tools available to you? Do you know your mash-ups from your APIs? How are your data visualisation skills? Could you be using emerging technologies more imaginatively? What new technologies could you use to inspire, inform and educate your users? Learn about some of the most interesting emerging technologies and explore their potential for information professionals.

The workshop will combine a range of presentations and discussions about emerging information skills and techniques with some practical ‘makes’ to explore how a variety of free tools and applications can be appropriated and plugged together to create powerful information handling tools with few, if any, programming skills required.

Topics include:

– Visualisation tools
– Maps and timelines
– Data wrangling
– Social media hacks
– Screenscraping and data liberation
– Data visualisation

(If you would like to join in with the ‘makes’, please bring a laptop)

I have some ideas about how to fill the day – and I’m sure Martin does too – but I thought it might be worth asking what any readers of this blog might be interested in learning about in a little more detail, using slightly easier, starting-from-nowhere baby steps than I usually post.

My initial plan is to come up with five or six self-contained elements that can also be loosely joined, structuring the day something like this:

  • opening, and an example of the sort of thing you’ll be able to do by the end of the day – no prior experience required, hand-held walkthroughs all the way; intros from the floor, along with what folk expect to get out of the day/want to be able to do on the day (h/t @briankelly in the comments; of course, if folks’ expectations differ from what we had planned… ;-). As well as demo-ing how to use tools, we’ll also discuss why you might want to do these things and some of the strategies involved in trying to work out how to do them, knowing what you already know, or how to find out/work out how to do them if you don’t…
  • The philosophy of “appropriation”, “small pieces, lightly joined”, “minimum viability”, and “why Twitter, blogs and Stack Overflow are Good Things”;
  • Visualising Data – because it’s fun to start playing straight away…
    • Google Fusion Tables – visualisations and queries
    • Google visualisation API/chart components

    Payoff: generate some charts and dashboards using pre-provided data (any ideas what data sets we might use…? At least one should have geo-data for a simple mapping demo…)

  • — Morning coffee break? —
  • Data scraping:
    • Google spreadsheets – import CSV, import HTML table;
    • Google Refine – import XLS, import JSON, import XML
    • (Briefly) – note the existence of other scraper tools, incl. Scraperwiki, and how they can be used

    Payoff: scrape some data and generate some charts/views… Any ideas what data to use? For the JSON, I thought about finishing with a grab of Twitter data, to set up after lunch…

  • — Lunch? —
  • (Social) Network Analysis with Gephi
    • Visually analyse Twitter data and/or Facebook data grabbed using Google Refine and/or TAGSExplorer
    • Wikipedia graphing using DBPedia
    • Other examples of how to think in graphs…
  • The scary session…
    • Working with large data files – examples of some simple text processing command line tools
    • Data cleansing and shaping – Google Refine, for the most part, including the use of reconciliation; additional examples based on regular expressions in a text editor, Google spreadsheets as a database, Stanford Data Wrangler, and R…
  • — Afternoon coffee break? —
  • Writing Diagrams – examples referring back to Gephi, mentioning Graphviz, then looking at R/ggplot2, finishing with R’s googleVis library as a way of generating Google Visualisation API Charts…
  • Wrap up – review of the philosophy, showing how it was applied throughout the exercises; maybe a multi-step mashup as a final demo?

Requirements: we’d need good wifi/network connections; also, it would help if participants pre-installed – and checked the set up of: a) a Google account; b) a modern browser (standardising on Google Chrome might be easiest?); c) Google Refine; d) Gephi (which may also require the installation of a Java runtime, eg on a new-ish Mac); e) R; f) RStudio and a raft of R libraries (ggplot2, plyr, reshape, RCurl, stringr, googleVis); g) a good text editor (I use TextWrangler on a Mac); h) command line tools (on Windows machines);

Throughout each session, participants will be encouraged to identify datasets or IT workflow issues they encounter at work and discuss how the ideas presented in the workshop may be appropriated for use in those contexts…

Of course, this is all subject to change (I haven’t asked Martin how he sees the day panning out yet;-), but it gives a flavour of my current thinking… So: what sorts of things would you like to see? And would you like to book any of the sessions for a workshop at your place…?!;-)