Using browser-based data analysis toolkits such as pandas in IPython notebooks, or R in RStudio, means you need access to python or R and the corresponding application server, either on your own computer or running on a remote server that you have access to.
When running occasional training sessions or workshops, this can cause several headaches. Either a remote service needs to be set up that is capable of supporting the expected number of participants – security may need putting in place, accounts configured (or account management tools supported), network connections guaranteed so that participants can access the server, and so on – or participants need to install software on their own computers. Ideally that would be done in advance of a training session, otherwise training time is spent installing, configuring and debugging software installs; and some computers may have security policies that prevent users from installing software, or require an IT person with admin privileges to install it, and so on.
That’s why the coLaboratory Chrome extension looks like an interesting innovation – it runs an IPython notebook fork, with pandas and matplotlib as a Chrome Native Client application. I posted a quick walkthrough of the extension over on the School of Data blog: Working With Data in the Browser Using python – coLaboratory.
Via a Twitter exchange with @nativeclient, it seems that there’s also the possibility that R could run as a dependency-free Chrome extension. Native Client seems to like things written in C/C++, which underpins R, although I think R also has some Fortran dependencies. (One of the coLaboratory talks mentioned a to-do list item of getting scipy (I think? – or whatever the package was) running in the coLaboratory extension, the major challenge being the Fortran source; so there may be synergies in working on the Fortran components there?)
Within a couple of hours of the twitter exchange starting, Brad Nelson/@flagxor posted a first attempt at an R port to the Native Client. I don’t pretend to understand what’s involved in moving from this to an extension with some sort of usable UI, even if only a command line, but it represents an interesting possibility: being able to run R in the browser (or at least, in Chrome). Package availability would, of course, be limited to packages compiled to run using PNaCl.
For training events, there is still the requirement that users install a Chrome browser on their computer and then install the extension into that. However, I think it is possible to run Chrome as a portable app – that is, from a flash drive such as a USB memory stick: Google Chrome Portable (Windows).
I’m not sure how fast it would run, but it suggests there may be a way of carrying a portable, dependency-free pandas environment around that you can run on a Windows computer from a USB key?! And maybe R too…?
Whenever a new open data dataset is released, the #opendata wires hum a little more. More open data is a Good Thing, right? Why? Haven’t we got enough already?
In a blog post a few weeks ago, Alan Levine, aka @cogdog, set about Stalking the Mythical OER Reuse: Seeking Non-Blurry Videos. OERs are open educational resources, openly licensed materials produced by educators and released to the world so others could make use of them. Funding was put into developing and releasing them and then, … what?
OERs. People build them. People house them in repositories. People do journal articles, conference presentations, research on them. I doubt never their existence.
But the ultimate thing they are supposed to support, maybe their raison d’être – the re use by other educators, what do we have to show for that except whispered stories, innuendo, and blurry photos in the forest?
Alan went in search of the OER reuse in his own inimitable way…
… but came back without much success. He then used the rest of the post to put out a call for stories about how OERs have actually been used in the world… Not just mythical stories, not coulds and mights: real examples.
So what about opendata – is there much use, or reuse, going on there?
It seems as if more datasets get opened every day, but is there more use every day? First-day use of newly released datasets, incremental reuse of the datasets that are already out there, linkage between the new datasets and the previously released ones?
Yesterday, I spotted via @owenboswarva the release of a dataset that aggregated and normalised data relating to charitable grant awards: A big day for charity data. Interesting… The supporting website – 360 Giving – (self-admittedly in its early days) allows you to search by funder, recipient or keyword. You have to search using the right keywords, though, and the right capitalisation of keywords…
And you may have to add in white space.. so *University of Oxford * as well as *University of Oxford*.
I don’t want to knock the site, but I am really interested to know how this data might be used. Really. Genuinely. I am properly interested. How would someone working in the charitable sector use that website to help them do something? What thing? How would it support them? My imagination may be able to go off on crazy flights of fancy in certain areas, but my lack of sector knowledge or a current headful of summer cold leaves me struggling to work out what this website would tangibly help someone to do. (I tried to ask a similar question around charities data before, giving the example of Charities Commission data grabbed from OpenCharities, but drew a blank then.) Like @cogdog in his search for real OER use case stories, I’d love to hear examples of real questions – no matter how trivial – that the 360 Giving site could help answer.
As well as the website, 360 Giving folk provide a data download as a CSV file containing getting on for a quarter of a million records. The date stamp on the file I grabbed is 5th June 2014. Skimming through the data quickly – my own opening conversation with it can be found here: 360 Giving Grant Navigator – Initial Data Conversation – I noticed through comparison with the data on the website some gaps…
- this item doesn’t seem to appear in the CSV download, perhaps because it doesn’t appear to have a funder?
- this item on the website has an address for the recipient organisation, but the CSV document doesn’t have any address fields. In fact, on close inspection, the record relates to a grant by the Northern Rock Foundation, and I see no records from that body in the CSV file?
- Although there is a project title field in the CSV document, no project titles are supplied. Looking through a sample of grants on the website, it’s not clear that any titles are provided there either?
- The website lists the following funders:
Arts Council England
Arts Council Wales
Heritage Lottery Fund
Northern Rock Foundation
Paul Hamlyn Foundation
Sport Northern Ireland
The CSV file has data from these funders:
Arts Council England
Arts Council Wales
Sport Northern Ireland
That is, the CSV contains a subset of the data on the website; data from Heritage Lottery Fund, Indigo Trust, Northern Rock Foundation, Paul Hamlyn Foundation doesn’t seem to have made it into the data download? I also note that data from the Research Councils’ Gateway to Research (aside from the TSB data) doesn’t seem to have made it into either dataset. For anyone researching grants to universities, this could be useful information. (Could?! Why?!;-)
- No company numbers or charity numbers are given. Using opendata from Companies House, a quick join on recipient names and company names from the Companies House register (without any attempt at normalising out things like LTD and LIMITED – that is, purely looking for an exact match) gives me just over 15,000 matched company names (which means I now have their address, company number, etc. too). And presumably if I try to match on names from the OpenCharities data, I’ll be able to match some charity numbers. Now both these annotations will be far from complete, but they’d be more than we have at the moment. A question to then ask is: is this better or worse? Does the dataset only have value if it is in some way complete? One of the clarion calls for open data initiatives has been to ‘just get the data out there’ so that it can be worked on, and improved. So presumably having some company numbers or charity numbers matched is a plus?
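As a rough sketch of what that exact-match join looks like in pandas – with toy stand-in frames and made-up column names rather than the actual 360 Giving or Companies House schemas – something like:

```python
import pandas as pd

# Toy stand-ins for the grants download and the Companies House register;
# the column names here are illustrative, not the real schemas.
grants = pd.DataFrame({
    "Recipient": ["ACME COMMUNITY ARTS LTD", "Oswaldtwistle Heritage Trust"],
    "Amount": [5000, 12000],
})
companies = pd.DataFrame({
    "CompanyName": ["ACME COMMUNITY ARTS LTD", "SOME OTHER COMPANY LIMITED"],
    "CompanyNumber": ["01234567", "07654321"],
    "RegAddress": ["1 High St, Kensington", "2 Low Rd, Anytown"],
})

# Purely exact matching on name - no normalising of LTD vs LIMITED etc. -
# so we only annotate recipients whose names match the register verbatim.
matched = grants.merge(
    companies, left_on="Recipient", right_on="CompanyName", how="left"
)

print(matched[["Recipient", "CompanyNumber"]])
```

A left join keeps every grant row, with the company number (and address) filled in only where an exact name match was found – which is exactly why the annotation is partial.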
Now I know there is a risk to this. Funders may want not to release details of the addresses of the charities they are funding, because that data may be used to plot maps that say “this is where the money’s going” when it isn’t. The charity may have a Kensington address but have received funding for an initiative in Oswaldtwistle; the map might see all the money sinking into Kensington, which would be wrong. But that’s where you have to start educating the data users. Or releasing data fields like “address of charity” and “postcode area of point of use”, or whatever, even if the latter is empty. As it is, if you give me a charity or company name, I can look up its address. And its company or charity number, if it has one.
As I mentioned, I don’t want to knock the work 360 Giving have done, but I’m keen to understand what it is they have done, what they haven’t done, and what the opendata they have aggregated and re-presented could – practically, tractably, tangibly – be used for. Really used for.
Time to pack my bags and head out into the wood, maybe…
Last week, a post on the ONS (Office of National Statistics) Digital Publishing blog caught my eye: Introducing the New Improved ONS API which apparently “mak[es] things much easier to work with”.
Ooh… exciting…. maybe I can use this to start hacking together some notebooks?:-)
It was followed a few days later by this one – ONS-API, Just the Numbers which described “a simple bit of code for requesting some data and then turning that into ‘just the raw numbers’” – a blog post that describes how to get a simple statistic, as a number, from the API. The API that “mak[es] things much easier to work with”.
After a few hours spent hacking away over the weekend, looking round various bits of the API, I still wasn’t really in a position to discover where to find the numbers, let alone get numbers out of the API in a reliable way. (You can see my fumblings here.) Note that I’m happy to be told I’m going about this completely the wrong way and didn’t find the baby steps guide I need to help me use it properly.
So FWIW, here are some reflections, from a personal standpoint, about the whole API thing from the perspective of someone who couldn’t get it together enough to get the thing working …
Most data users aren’t programmers. And I’m not sure how many programmers are data junkies, let alone statisticians and data analysts.
For data users who do dabble with programming – in R, for example, or python (for example, using the pandas library) – the offer of an API is often seen as providing a way of interrogating a data source and getting the bits of data you want. The alternative to this is often having to download a huge great dataset yourself and then querying it or partitioning it yourself to get just the data elements you want to make use of (for example, Working With Large Text Files – Finding UK Companies by Postcode or Business Area).
That’s fine, insofar as it goes, but it starts to give the person who wants to do some data analysis a data management problem too. And for data users who aren’t happy working with gigabyte data files, it can sometimes be a blocker. (Big file downloads also take time, and incur bandwidth costs.)
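For what it’s worth, the download-and-filter approach can at least be done without loading the whole file into memory – pandas will read a CSV in chunks, so you can keep just the rows you want. Here a tiny in-memory file stands in for a multi-gigabyte download, and the column names and postcode are made up for illustration:

```python
import io

import pandas as pd

# A small in-memory file standing in for a huge downloaded register;
# column names are illustrative only.
big_csv = io.StringIO(
    "CompanyName,Postcode\n"
    "ALPHA LTD,PO30 1AA\n"
    "BETA LIMITED,MK7 6AA\n"
    "GAMMA LTD,PO30 2BB\n"
)

# Stream the file in chunks, keeping only rows in the postcode area we care about.
wanted = pd.concat(
    chunk[chunk["Postcode"].str.startswith("PO30")]
    for chunk in pd.read_csv(big_csv, chunksize=2)
)

print(len(wanted))
```

The chunk size would be thousands of rows in practice; the point is that only one chunk is ever in memory at a time, so the data management overhead is bounded even if the download itself isn’t.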
For me, a stereotypical data user might be someone who typically wants to be able to quickly and easily get just the data they want from the API into a data representation that is native to the environment they are working in, and that they are familiar with working with.
This might be a spreadsheet user or it might be a code (R, pandas etc) user.
In the same way that spreadsheet users want files in XLS or CSV format that they can easily open (formats that can also be directly opened into appropriate data structures in R or pandas), I increasingly look not for APIs, but for API wrappers that bring API calls, and the results from them, directly into the environment I’m working in, in a form appropriate to that environment.
So for example, in R, I make use of the FAOstat package, which also offers an interface to the World Bank Indicators datasets. In pandas, a remote data access handler for the World Bank Indicators portal allows me to make simple requests for that data.
At a level up (or should that be “down”?) from the API wrapper are libraries that parse typical response formats. For example, Statistics Norway seem to publish data using the json-stat format, the format used in the new ONS API update. This IPython notebook shows how to use the pyjstat python package to parse the json-stat data directly into a pandas dataframe (I couldn’t get it to work with the ONS data feed – not sure if the problem was me, the package, or the data feed; which is another problem – working out where the problem is…). For parsing data returned from SPARQL Linked Data endpoints, packages such as SPARQLwrapper get the data into Python dicts, if not pandas dataframes directly. (A SPARQL i/o wrapper for pandas could be quite handy?)
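To give a feel for what a json-stat parser like pyjstat is doing under the hood, here’s a minimal flattening of a much-simplified, hand-rolled json-stat style response into rows. The real format has more machinery (category indexes, roles, units), so treat this as a sketch of the idea rather than a conformant parser:

```python
import itertools

# A drastically simplified json-stat style dataset: an ordered list of
# dimension ids, category labels per dimension, and a flat value array
# in row-major order. (Real json-stat also carries explicit category
# indexes; here we lean on dict insertion order for brevity.)
dataset = {
    "dimension": {
        "id": ["region", "year"],
        "region": {"category": {"label": {"N": "North", "S": "South"}}},
        "year": {"category": {"label": {"2013": "2013", "2014": "2014"}}},
    },
    "value": [10, 12, 20, 24],
}

dims = dataset["dimension"]["id"]
labels = [
    list(dataset["dimension"][d]["category"]["label"].values()) for d in dims
]

# Cross the category labels in dimension order and zip against the value
# array - essentially what a to_dataframe() call ends up doing.
rows = [
    dict(zip(dims, combo), value=v)
    for combo, v in zip(itertools.product(*labels), dataset["value"])
]

print(rows[0])
```

Each row pairs one combination of dimension categories with one cell of the value array – which is the tabular form you actually want in a dataframe.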
At the user level, IPython Notebooks (my current ‘can be used to solve all known problems’ piece of magic tech!;-) provide a great way of demonstrating not just how to get started with an API, but also encourage the development within the notebook of reusable components, as well as demonstrations of how to use the data. The latter demonstrations have the benefit of requiring that the API demo does actually get the data into a form that is usable within the environment. It also helps folk see what it means to be able to get data into the environment (it means you can do things like the things done in the demo… and if you can do that, then you can probably also do other related things…)
So am I happy when I see APIs announced? Yes and no… I’m more interested in having API wrappers available within my data wrangling environment. If that’s a fully blown wrapper, great. If that sort of wrapper isn’t available, but I can use a standard data feed parsing library to parse results pulled from easily generated RESTful URLs, I can just about work out how to create the URLs, so that’s not too bad either.
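Generating those URLs and parsing the results is exactly the sort of glue code in question – something like the following, where the endpoint and parameter names are entirely made up for illustration, and the response is stubbed out inline rather than fetched over the network:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters - not a real ONS URL.
base = "https://api.example.gov/v1/datasets"
url = base + "?" + urlencode({"q": "unemployment", "format": "json"})

# In real use this would be urllib.request.urlopen(url).read();
# here the response is stubbed so the sketch is self-contained.
response = '{"items": [{"id": "LMS-001", "title": "Labour Market Statistics"}]}'
items = json.loads(response)["items"]

print(url)
print([item["id"] for item in items])
```

Even this trivial amount of glue is what every one of those ten, or a hundred, or a thousand users would otherwise write for themselves – which is the argument for a shared wrapper.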
When publishing APIs, it’s worth considering who can address them and use them. Just because you publish a data API doesn’t mean a data analyst can necessarily use the data, because they may not be (are likely not to be) a programmer. And if ten, or a hundred, or a thousand potential data users all have to implement the same sort of glue code to get the data from the API into the same sort of analysis environment, that’s not necessarily efficient either. (Data users may feel they can hack some code to get the data from the API into the environment for their particular use case, but may not be willing to release it as a general, tested and robust API wrapper, certainly not a stable production level one.)
This isn’t meant to be a slight against the ONS API, more a reflection on some of the things I was thinking as I hacked my weekend away…
PS I don’t know how easy it is to run Python code in R, but the R magic in IPython notebooks supports the running of R code within a notebook running a Python kernel, with the handing over of data from R dataframes to python dataframes. Which is to say, if there’s an R package available, for someone who can run R via an IPython context, it’s available via python too.
PPS I notice that from some of the ONS API calls we can get links to URLs of downloadable datasets (though when I tried some of them, I got errors trying to unzip the results). This provides an intermediate way of providing API access to a dataset – search based API calls that allow discovery of a dataset, then the download and automatic unpacking of that dataset into a native data representation, such as one or more data frames.
Some rambling but possibly associated thoughts… I suggest you put Alice’s Restaurant on…
For some time now, I’ve had an uncomfortable feeling about the asymmetries that exist in the open data world as well as total confusion about the notion of transparency.
Part of the nub of the problem (for me) lies with the asymmetric disclosure requirements of public and private services. Public bodies have disclosure requirements (eg the Local Government Transparency Code); private companies don’t. Public bodies disclose metrics and spend data, data that can be used in public contract tendering processes by private bodies bidding against the public ones tendering for the same service. The private body can use this information when making its bid – and price in a discount associated with not having to carry the cost of public reporting. The next time the contract is tendered, the public body won’t have access to the (previously publicly disclosed) information that the private body originally had when making its bid. Possibly. I don’t know how tendering works. But from the outside, that’s how it appears to me. (Maybe there needs to be more transparency about the process?)
Open data is possibly a Big Thing. Who knows? Maybe it isn’t. Certainly the big consulting firms are calling it as something worth squillionty billionty of pounds. I’m not sure how they cost it. Maybe I need to dig through the references and footnotes in their reports (Cap Gemini’s Open Data Economy: Unlocking Economic Value by Opening Government and Public Data, Deloitte’s Open growth: Stimulating demand for open data in the UK or McKinsey’s Open data: Unlocking innovation and performance with liquid information). I don’t know how much those companies have received in fees for producing those reports, or how much they have received in consultancy fees associated with public open data initiatives – somehow, that spend data doesn’t seem to have been curated in a convenient way, or as a #opendatadata bundle? – but I have to assume they’re not doing it to fleece the public bodies and tee up benefits for their other private corporate clients.
Reminds me – I need to read Owen Boswarva’s Who supports the privatisation of Land Registry? and ODUG benefits case for open data release of an authoritative GP dataset again… And remind myself of who sits on the Open Data User Group (ODUG), and other UK gov departmental transparency boards…
And read the FTC’s report Data Brokers: A Call For Transparency and Accountability…
Just by the by, one thing I’ve noticed about a lot of opendata releases is that, along with many other sorts of data, they are most useful when aggregated over time or space, and/or combined with other data sets. Looking at the month on month reports of local spending data from my local council is all very well, but it gets more interesting when viewed over several months or years. It also gets more interesting when looking at spend across councils, as for example in the case of looking at spend to particular companies.
Aggregating public data is one of the business models that helps create some of the GDP figure that contributes to the claimed, anticipated squillionty billionty pounds of financial benefit that will arise from open data – companies like opencorporates aggregating company data, or Spend Network aggregating UK public spending data, hoping to start making money selling products off the back of public open data they have curated. Yes – I know a lot of work goes in to cleaning and normalising that data, and that exploiting the data collection as a whole is what their business models are about – and why they don’t offer downloads of their complete datasets; though maybe licenses require that they make links to, or downloads of, the original (“partial”) datasets available?
But you know where I think the real value of those companies lies? In being bought out. By Experian, or Acxiom (if there’s even a hint of personally identifiable data through reverse engineering in the mix), or whoever… A weak, cheap, cop-out business model. Just like this: Farmers up in arms over potential misuse of data. (In case you missed it, Climate Corporation was one of the OpenData500 companies that aggregated shedloads of open data – according to Andrew Stott’s Open Data for Economic Growth report for the World Bank, Climate Corp “uses 60 years of detailed crop yield data, weather observations from one million locations in the United States and 14 terabytes of soil quality data – all free from the US Government – to provide applications that help farmers improve their profits by making better informed operating and financing decisions”. It was also recently acquired by Monsanto – Monsanto – for just under a billion US $. That’s part of the squillionty billionties I guess. Good ol’ open data. Monsanto.)
Sort of related to this – that is, companies buying others to asset strip them for their data – you know all that data of yours locked up in Facebook and Google? Remember MySpace? Remember NNTP? According to the Sophos blog, Just Because You Don’t Give Your Personal Data to Google Doesn’t Mean They Can’t Acquire It. Or that someone else might buy it.
And as another aside – Google – remember Google? They don’t really “read” your email, at least, people don’t, they just let algorithms process it so the algorithms can privately use that data to send you ads, and no-one will ever know what the content of the email was that triggered you getting that ad (‘cos the cookie tracking, cookie matching services can’t unpick ad bids, ad displays, click thrus, surely, can they?!), well – maybe there are side effects: Google tips off cops after spotting child abuse images in email (for some reason, after initially being able to read that article, my browser can’t load it atm. Server fatigue?). Of course, if Google reads your email for blind business purposes, and ad serving is part of that blind process, you accept it. But how does the law enforcement ‘because we can, even though you didn’t warrant us to?’ angle work? Does the Post Office look inside the envelope? Is surveillance actually part of Google’s business model?
If you want to up the paranoia stakes, this (from Ray Corrigan, in particular: “Without going through the process of matching each government assurance with contradictory evidence, something I suspect would be of little interest, I would like to draw your attention to one important misunderstanding. It seems increasingly to be the belief amongst MPs that blanket data collection and retention is acceptable in law and that the only concern should be the subsequent access to that data. Assertions to this effect are simply wrong.”) + that. Because one day, one day, they may just find your name on an envelope of some sort under a tonne of garbage. Or an algorithm might… Kid.
But that’s not what this post is about – what this post is about is… Way back when, so very long ago, not so very long ago, there was a license called GPL. GPL. And GPL was a tainting license. FindLaw describes the consequences of reusing GPL licensed code as follows: Kid, ‘if a user of GPL code decides to distribute or publish a work that “in whole or in part contains or is derived from the [open source] or any part thereof,” it must make the source code available and license the work as a whole at no charge to third parties under the terms of the GPL (thereby allowing further modification and redistribution).
‘In other words, this can be a trap for the unwary: a company can unwittingly lose valuable rights to its proprietary code.’
Now, friends, GPL scared people so much that another license called LGPL was created, and LGPL allowed you to use LGPL licensed code without fear of tainting your own code with the requirement to open up your own code as GPL would require of it. ‘Cos licenses can be used against you.
And when it comes to open data licenses, they seem to be like LGPL. You can take open public data and aggregate it, and combine it, and mix it and mash it and do what you like with it and that’s fine… And then someone can come along buy that good work you’ve done and do what they want with it. Even Monsanto. Even Experian. And that’s good and right, right? Wrong. The ODUG. Remember the ODUG? The ODUG is the Open Data User Group that lobbies government for what datasets to open up next. And who’s on the ODUG? Who’s there, sitting there, on the ODUG bench, right there, right next to you?
Kid… you wanna be the all-open, all-liberal open data advocate? You wanna see open data used for innovation and exploitation and transparency and all the Good Things (big G, big T) that open data might be used for? Or you wanna sit down on the ODUG bench? With Deloitte, and Experian, and so on…
And if you think that using a tainting open data license so anyone who uses that data has to share it likewise, aggregated, congregated, conjugated, disaggregated, mixed, matched, joined, summarised or just otherwise and anyways processed, is a Good Thing…? Then kid… they’ll all move away from you on the bench there…
Because when they come to buy you, they won’t want your data to be tainted in any way that means they’ll have to give up the commercial advantage they’d get from buying up your work on that open data…
But this post? That’s not what this post is about. This post is about holding companies to account. Open data used to hold companies to account. There’s a story to be told that’s not been told about Dr Foster, and open NHS data, and fear-mongering, and the privatisation of the NHS, and that’s one thing…
But another thing is how government might use data to help us protect ourselves. Because government can’t protect us. Government can’t make companies pay taxes and behave responsibly and not rip off consumers. Government needs our help to do that. But can government help us do that too? Protect and Survive.
There’s a thing that DECC – the Department of Energy and Climate Change – do, and that’s publish domestic energy price statistics and industrial energy price statistics and road fuel and other petroleum product price statistics, and they’re all meaningless. Because they bear little resemblance to the spot prices consumers actually pay in their domestic energy bills and road fuel and other petroleum product bills.
To find out what those prices are you have to buy the data from someone like Experian – from something like Experian’s Catalist fuel price data (daily site retail fuel prices) product. You may be able to calculate the DECC statistics from that data (or you may not), but you certainly can’t go the other way, from the DECC statistics to anything like the Experian data.
But can you go into your local library and ask to look at a copy of the Experian data? A copy of the data that may or may not be used to generate the DECC road fuel and other petroleum product price statistics (how do they generate those statistics anyway? What raw data do they use to generate those statistics?)
Can you imagine ant-eye-ant-eye-consumer data sets being published by your local council or your county council or your national government that can be used to help you hold companies to account and help you tell them that you know they’re ripping you off and your council off and your government off and that together, you’re not going to stand for it?
Can you imagine your local council publishing the forecourt fuel prices for one petrol station, just one petrol station, in your local council area every day? And how about if they do it for two petrol stations, two petrol stations, each day? And if they do it for three forecourts, three, can you imagine if they do it for three petrol stations…? And can you, can you imagine prices for 50 petrol stations a day being published by your local council, your council helping you inform yourself about how you’re being manipulated, can you imagine…? (It may not be so hard – food hygiene ratings are published for food retail environments across England, Northern Ireland and Wales…)
So let’s hear it for open data, and how open data can be used to hold corporates to account, and how public bodies can use open data to help you make better decisions (which is a good neoliberal position to take, and one which the other folk on the bench tell you is what you want, and that markets work – though they fall short of telling you that the models say markets work with full information, but you don’t have the information, and even if you did, you wouldn’t understand it, because you don’t really know how to make a good decision; but at the end of the day you don’t want a decision, you just want a good service fairly delivered, only they don’t tell you it’s all right to just want that…)
And let’s hear it for public bodies making data available whether it’s open or not, making it available by paying for it if they have to and making it available via library services so that we can start using it to start holding companies to account and start helping our public services, and ourselves, protect ourselves from the attacks being mounted on us by companies, and their national government supporters, who take on debt, and who allow them to take on debt, to make dividend payouts but not capital investment and subsidise the temporary driving down of prices (which is NOT a capital investment) through debt subsidised loss leading designed to crush competition in a last man standing contest that will allow monopolistic last man standing price hikes at the end of it…
And just remember, if there’s anything you want, you know where you can get it… At Alice’s… or the library… only they’re shutting them down, aren’t they…? So that leaves what..? Google?
Killer post title, eh?
Some time ago I put in an FOI request to the Isle of Wight Council for the transaction logs from a couple of ticket machines in the car park at Yarmouth. Since then, the Council made some unpopular decisions about car parking charges, got a recall and then in passing made the local BBC news (along with other councils) in respect of the extent of parking charge overpayments…
Here’s how hyperlocal news outlet OnTheWight reported the unfolding story…
- 11 new ways the council propose to make car parking more expensive
- Look again at parking and leisure centre charges, say Island Conservatives
- Increased car parking charges revealed
- Council could face legal action over car parking increases
- Council gives their view on the legal uses of car parking income
- Council claim they don’t yet know how many people wrote to them about parking changes
- Executive vote: Free parking in 24 car parks goes, including Appley and Puckpool and parking charges up
- Councillors ‘call-in’ decision on parking changes
- Date set for scrutiny of changes to parking charges
- Follow live coverage of parking changes being scrutinised (Updated) (includes a copy of the call-in notice)
- Isle of Wight car parkers overpaid £186,706.35 between 2011-13
I really missed a trick not getting involved in this process – because there is, or could be, a significant data element to it. And I had a sample of data that I could have doodled with, and then gone for the whole data set.
Anyway, I finally made a start on looking at the data I did have with a view to seeing what stories or insight we might be able to pull from it – the first sketch of my conversation with the data is here: A Conversation With Data – Car Parking Meter Data.
It’s not just the parking meter data that can be brought to bear in this case – there’s another set of relevant data too, and I also had a sample of that: traffic penalty charge notices (i.e. traffic warden ticket issuances…)
With a bit of luck, I’ll have a go at a quick conversation with that data over the next week or so… Then maybe put in a full set of FOI requests for data from all the Council operated ticket machines, and all the penalty notices issued, for a couple of financial years.
Several things I think might be interesting to look at Island-wide:
- in much the same way as Tube platforms suffer from loading problems, where folk surge around one entrance or another, do car parks “fill up” in some sort of order, e.g. within a car park (does one meter lag the other in terms of tickets issued?) or does one car park lag another overall;
- do different car parks have a different balance of ticket types issued (are some used for long stay, others for short stay?) and does this change according to what day of the week it is?
- how does the issuance of traffic penalty charge notices compare with the sorts of parking meter tickets issued?
- from the timestamps of when traffic penalty charge notices are issued, can we work out the rounds of different traffic warden patrols?
The last one might be a little bit cheeky – just like you aren’t supposed to share information about the mobile speed traps, perhaps you also aren’t supposed to share information that there’s a traffic warden doing the rounds…?!
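The sorts of question in the list above lend themselves to simple aggregations once the transaction logs are in a data frame. Here’s a minimal pandas sketch, using made-up car park names, timestamps and a hypothetical `ticket_type` coding – the real FOI data will have its own column names and categories:

```python
import pandas as pd

# Hypothetical sample of parking meter transactions; column names and
# values are invented for illustration, not taken from the FOI data.
transactions = pd.DataFrame({
    "car_park": ["Yarmouth", "Yarmouth", "Newport",
                 "Newport", "Yarmouth", "Newport"],
    "timestamp": pd.to_datetime([
        "2013-07-01 09:05", "2013-07-01 10:15", "2013-07-01 09:40",
        "2013-07-02 11:00", "2013-07-02 14:30", "2013-07-06 10:20",
    ]),
    "ticket_type": ["short", "long", "short", "short", "long", "long"],
})

# Balance of ticket types issued by each car park
by_type = (transactions
           .groupby(["car_park", "ticket_type"])
           .size()
           .unstack(fill_value=0))

# ...and how issuance varies with the day of the week
transactions["day"] = transactions["timestamp"].dt.day_name()
by_day = (transactions
          .groupby(["car_park", "day"])
          .size()
          .unstack(fill_value=0))

print(by_type)
print(by_day)
```

Cross-tabulations like these would be the starting point for the “long stay vs short stay” and “day of week” questions; the lag/ordering questions would need the timestamps resampled into time bins per meter.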
First up, Downes suggests that:
The traditional course is designed like a book – it is intended to run in a sequence, the latter bits build on the first bits, and if you start a book and abandon it part way through there is a real sense in which you can say the book has failed, because the whole point of a book is to read it from beginning to end.
But our MOOCs are not designed like that. Though they have a beginning and an end and a range of topics in between, they’re not designed to be consumed in a linear fashion the way a book is. Rather, they’re much more like a magazine or a newspaper (or an atlas or a city map or a phone book). The idea is that there’s probably more content than you want, and that you’re supposed to pick and choose from the items, selecting those that are useful and relevant to your present purpose.
And so here’s the response to completion rates: nobody ever complained that newspapers have low completion rates. And yet no doubt they do. Probably far below the ‘abysmal’ MOOC completion rates (especially if you include real estate listings and classified ads). People don’t read a newspaper to complete it, they read a newspaper to find out what’s important.
Martin (Weller) responds:
Stephen Downes has a nice analogy, (which he blogged at my request, thankyou Stephen) in that it’s like a newspaper, no-one drops out of a newspaper, they just take what they want. This has become repeated rather like a statement of fact now. I think Stephen’s analogy is very powerful, but it is really a statement of intent. If you design MOOCs in a certain way, then the MOOC experience could be like reading a newspaper. The problem is 95% of MOOCs aren’t designed that way. And even for the ones that are, completion rates are still an issue.
Here’s why they’re an issue. MOOCs are nearly always designed on a week by week basis (which would be like designing a newspaper where you had to read a certain section by a certain time). I’ve blogged about this before, but from Katy Jordan’s data we reckon 45% of those who sign up, never turn up or do anything. It’s hard to argue that they’ve had a meaningful learning experience in any way. If we register those who have done anything at all, eg just opened a page, then by the end of week 2 we’re down to about 35% of initial registrations. And by week 3 or 4 it’s plateauing near 10%. The data suggests that people are definitely not treating it like a newspaper. In Japan some research was done on what sections of newspapers people read.
He goes on:
… Most MOOCs are about 6-7 weeks long, so 90% of your registered learners are never even looking at 50% of your content. That must raise the question of why are you including it in the first place? If a subject requires a longer take at it, beyond 3 weeks say, then MOOCs really may not be a very good approach to it. There is a hard, economic perspective here, it costs money to make and run MOOCs, and people will have to ask if the small completion rates are the most effective way to get people to learn that subject. You might be better off creating more stand alone OERs, or putting money into better supported outreach programmes where you can really help people stay with the course. Or maybe you will actually design your MOOC to be like a newspaper.
I buy three newspapers a week – the Isle of Wight County Press (to get a feel for what’s happened and is about to happen locally, as well as seeing who’s currently recruiting), the Guardian on a Saturday (see what news stories made it as far as Saturday comment, do the Japanese number puzzles, check out the book reviews, maybe read the odd long form interview and check a recipe or two), and the Observer on a Sunday (read colleagues’ columns, longer form articles by journalists I know or have met, check out any F1 sports news that made it into that paper, book reviews, columns, and Killer again…).
So I skim bits, have old faithfuls I read religiously, and occasionally follow through on a long form article that was maybe advertised on the cover and I might have missed otherwise.
Newspapers are organised in a particular way, and that lets me quickly access the bits I know I want to access, and throw the rest straight onto the animal bedding pile, often unread and unopened.
So MOOCs are not really like that, at least, not for me.
For me MOOCs are freebie papers I’ve picked up and then thrown, unread, onto the animal bedding pile. For me.
What I can see, though, is MOOCs as partworks. Partworks are those titles you see week on week in the local newsagent with a new bit on the cover that, if collected over weeks and months and assembled in the right way, result in a flimsy plastic model you’ve assembled yourself with an effective cost price running into hundreds of pounds.
[Retro: seems I floated the MOOC as partwork idea before - Online Courses or Long Form Journalism? Communicating How the World Works… - and no-one really bit then either...]
In the UK, there are several notable publishers of partwork titles, including, for example, Hachette, De Agostini and Eaglemoss. Check out their homepages – then check out the homepages of a few MOOC providers. (Note to self – see if any folk working in marketing of MOOC platform providers came from a partwork publishing background.)
Here’s a riff reworking the Wikipedia partwork page:
~~A partwork is a written publication~~ A MOOC is an online course released as a series of planned ~~magazine-like issues~~ lessons over a period of time. ~~Issues~~ Lessons are typically released on a weekly, fortnightly or monthly basis, and often a completed set is designed to form a ~~reference work on~~ complete course in a particular topic. ~~Partwork series~~ MOOCs run for a determined length and have a finite life. Generally, ~~partworks~~ MOOCs cover specific areas of interest, such as ~~sports, hobbies, or children’s interest and stories such as PC Ace and the successful The Ancestral Trail series by Marshall Cavendish Ltd~~ random university module subjects, particularly ones that tie in to the telly or hyped areas of pseudo-academic interest. They are ~~generally sold at newsagents and are mostly supported by massive television advertising campaigns for the launch~~ hosted on MOOC platforms because exploiting user data and optimising user journeys through learning content is something universities don’t really understand and avoid trying to do. In the United Kingdom, ~~partworks~~ MOOCs are ~~usually launched by heavy television advertising each January~~ mentioned occasionally in the press, often following a PR campaign by the UK MOOC platform, FutureLearn. ~~Partworks~~ MOOCs often include cover-mounted items with each issue that build into a complete set over time. For example, a ~~partwork about art~~ MOOC might include ~~a small number of paints or pencils that build into a complete art-set~~ so-called “badges” that can be put into an online “backpack” to show off to your friends, family, and LinkedIn trawlers~~; a partwork about dinosaurs might include a few replica bones that build a complete model skeleton at the end of the series; a partwork about films may include a DVD with each issue. In Europe, partworks with collectable models are extremely popular; there are a number of different publications that come with character figurines or diecast model vehicles, for example: The James Bond Car Collection.~~
In addition, completed ~~partworks~~ MOOCs have sometimes been used as the basis for receiving a non-academic credit bearing course completion certificate, or ~~to create case-bound reference works and encyclopedias~~ a basis for a piece of semi-formal assessment and recognition. ~~An example is the multi-volume Illustrated Science and Invention Encyclopedia which was created with material first published in the How It Works partwork~~ NEED TO FIND A GOOD EXAMPLE.
In the UK, ~~partworks~~ MOOCs are ~~the fourth-best selling magazine sector, after TV listing guides, women’s weeklies and women’s monthlies~~ NEED SOME NUMBERS HERE*… ~~A common inducement is a heavy discount for the first one or two issues~~ ??HOW DO MOOCs ~~SELL~~ GET SOLD? The same ~~series~~ MOOC can be sold worldwide in different languages and even in different variations.
* Possibly useful starting point? BBC News Magazine: Let’s get this partwork started
The Wikipedia page goes on to talk about serialisation (ah, the good old days when I still had hoped for feeds and syndication… eg OpenLearn Daily Learning Chunks via RSS and then Serialised OpenLearn Daily RSS Feeds via WordPress) and the Pecia System (new to me), which looks like it could provide an interesting starting point on a model of peer-co-created learning, or somesuch. There’s probably a section on it in this year’s Innovating Pedagogy report. Or maybe there isn’t?!;-)
Sort of related but also not, this article from icrossing on ‘Subscribe is the new shop.’ – Are subscription business models taking over? and John Naughton’s column last week on the (as then, just leaked) Kindle subscription model – Kindle Unlimited: it’s the end of losing yourself in a good book, I’m reminded of Subscription Models for Lifelong Students and Graduate With Who (Whom?!;-), Exactly…?, which several people argued against and which I never really tried to defend, though I can’t remember what the arguments were, and I never really tried to build a case with numbers in it to see whether or not it might make sense. (Because sometimes you think the numbers should work out in your favour, but then they don’t… as in this example: Restaurant Performance Sunk by Selfies [via RBloggers].)
Erm, oh yes – back to the MOOCs… and the partworks models. Martin mentioned the economics – just thinking about the partwork model (pun intended, or maybe not) here, how are parts costed? Maybe an expensive loss leader part in week 1, then cheap parts for months, then the expensive parts at the end when only two people still want them? How will print on demand affect partworks (newsagent has a partwork printer round the back to print off the bits that are needed for whatever magazines are sold that week)? And how do the partwork costing models then translate to MOOC production and presentation models?
Big, expensively produced materials in front-loaded weeks, then maybe move to smaller presentation methods, get the forums working a little better with smaller, more engaged groups? How about the cMOOC ideas – up front in early weeks, or pushed back to later weeks, where different motivations, skills, interests and engagement models play out?
MOOCs are newspapers? Nah… MOOCs as partwork – that works better as a model for me. (You can always buy a partwork mid-way through because you are interested in that week’s content, or the content covered by the magazine generally, not because you are interested in the plastic model or badge.)
Thinks: hmm, partworks come in at least two forms, don’t they – one where you get pieces to build a big model of a boat or a steam train or whatever; the other where you get a different superhero figurine each week and the aim is to attract the completionist. Which isn’t to say that part 37 might not be stupidly popular because it has a figure that is just generally of interest, quite apart from being part of a set?
One of the comment themes I’ve noticed around the first Challenge in the Tata F1 Connectivity Innovation Prize, a challenge to rethink what’s possible around the timing screen given only the data in the real time timing feed, is that the non-programmers don’t get to play. I don’t think that’s true – the challenge seems to be open to ideas as well as practical demonstrations – but it got me thinking about what technical ways in there might be for non-programmers who wouldn’t know where to start when it came to working with the timing stream messages.
The answer is surely the timing screen itself… One of the issues I still haven’t fully resolved is a proven way of getting useful information events from the timing feed – it updates the timing screen on a cell by cell basis, so we have to finesse the way we associate new laptimes or sector times with a particular driver, bearing in mind cells update one at a time, in a potentially arbitrary order, and with potentially different timestamps.
So how about if we work with a “live information model” by creating a copy of an example timing screen in a spreadsheet? If we know how, we might be able to parse the real data stream to directly update the appropriate cells, but that’s largely by the by. At least we have something we can work with to start playing with the timing screen in terms of a literal reimagining of it. So what can we do if we put the data from an example timing screen into a spreadsheet?
If we create a new worksheet, we can reference the cells in the “original” timing sheet and pull values over. The timing feed updates cells on a cell by cell basis, but spreadsheets are really good at rippling through changes from one or more cells which are themselves referenced by one or more others.
The first thing we might do is just transform the shape of the timing screen. For example, we can take the cells in a column relating to sector 1 times and put them into a row.
The second thing we might do is start to think about some sums. For example, we might find the difference between each of those sector times and (for practice and qualifying sessions at least) the best sector time recorded in that session.
The third thing we might do is to use a calculated value as the basis for a custom cell format that colours the cell according to the delta from the best session time.
Simple, but a start.
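For anyone who prefers code to cell references, the three steps above can be sketched in a few lines of pandas – driver codes and sector times here are made up for illustration, and the colour thresholds stand in for the spreadsheet’s conditional formatting rules:

```python
import pandas as pd

# Hypothetical sector 1 times pulled from a timing screen snapshot;
# driver codes and times are invented for illustration.
s1 = pd.Series({"HAM": 29.3, "ROS": 29.1, "VET": 29.6, "ALO": 29.4},
               name="s1")

# Step 1: reshape the column of sector times into a row
s1_row = s1.to_frame().T

# Step 2: delta of each driver's time from the session-best sector time
delta = s1 - s1.min()

# Step 3: a simple "format" derived from the delta, standing in for
# the spreadsheet's conditional cell colouring
colour = delta.apply(
    lambda d: "purple" if d == 0 else ("green" if d < 0.3 else "yellow"))

print(delta)
print(colour)
```

The same shape-transform / derived-value / derived-format pattern is exactly what the spreadsheet version is doing, just with formulas and custom cell formats instead of `Series` operations.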
I’ve not really tried to push this idea very far – I’m not much of a spreadsheet jockey – but I’d be interested to know how folk who are might be able to push this idea…
PS FWIW, my entry to the competition is here: #f1datajunkie challenge 1 entry. It’s perhaps a little off-brief, but I’ve been meaning to do this sort of summary for some time, and this was a good starting point. If I get a chance, I’ll have a go at getting the parsers to work properly properly!