
Fragments – Wikipedia to Markdown

I’ve been sketching some ideas, pondering the ethics of producing an F1 review style book that blends (openly licensed) content from Wikipedia race reports with some of my own f1datajunkie charts, and also wondering about the extent to which I could automatically generate Wikipedia style race report sentences from the data. I think the sentence generation should, in general, be quite easy – the harder part would be identifying the “interesting” sentences (that is, the ones that make it into the report, rather than the totality of ones that could be generated).

So far, my sketches have been based around just grabbing the content from Wikipedia, and transforming to markdown, the markup language used in the Leanpub workflow:

In Python 3.x at least, I came across some encoding issues, and couldn’t seem to identify Wikipedia page sections. For what it’s worth, a minimal scribble looks something like this:

!pip3 install wikipedia
import wikipedia

#Search for page titles on Wikipedia
wikipedia.search('2014 Australian grand prix')

#Load a page
f1 = wikipedia.page('2014 Australian Grand Prix')

#Preview page content
f1.content

#Preview a section's content by section header
##For some reason, f1.sections shows an empty list for me?
f1.section('Race') #section name guessed from the page headings

#pandoc supports Wikimedia to markdown conversion
!apt-get -y install pandoc
!pip3 install pypandoc
import pypandoc

#To work round encoding issues, write the content to a file and then convert it...
f = open('delme1.txt', 'w', encoding='utf8')
f.write(f1.content)
f.close()

md = pypandoc.convert('delme1.txt', 'md', format='mediawiki')

If the Formula One race report pages follow similar templates and use similar headings, then it should be straightforward enough to pull down sections of the reports and interleave them with charts and tables. (As well as the issue of parsing out section headers to fill the sections list, the tables on the page don’t appear to be grabbed into the .content field – assuming the API wrapper does manage to grab that content down? However, I can easily recreate those from things like the ergast API.)

Looking at the construction of sentences in the race reports, many of them are formulaic. However, as noted above, generating sentences is one thing, but generating interesting sentences is another. For that, I think we need to identify sets of rules that mark data features out as interesting or not before generating sentences from them.
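As a rough illustration of how formulaic such sentences are, here's a minimal Python sketch of template-based sentence generation; the result data and template are invented for illustration:

```python
# Minimal sketch of template-based race report sentence generation.
# The result dict and template below are made up for illustration.
result = {
    "driver": "Nico Rosberg",
    "team": "Mercedes",
    "race": "2014 Australian Grand Prix",
    "margin": 26.7,
}

template = "{driver} ({team}) won the {race} by {margin} seconds."
print(template.format(**result))
```

The "interestingness" step might then be a set of rules over the data (only report winning margins above some threshold, say) that decides which of the generated sentences are worth keeping.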

Tracking Anonymous Wikipedia Edits From Specific IP Ranges

Via @davewiner’s blog, I spotted a link to @congressedits, “a bot that tweets anonymous Wikipedia edits that are made from IP addresses in the US Congress”. (For more info, see why @congressedits?, /via @ostephens.) I didn’t follow the link to the home page for that account (doh!), but in response to a question about whether white label code was available, @superglaze pointed me to a script that “will watch Wikipedia for edits from a set of named IP ranges and will tweet when it notices one”.

It turns out the script was inspired by @parliamentedits, a bot by @tomscott that “tracks edits to Wikipedia made from Parliamentary IP addresses”, built using IFTTT and possibly a list of IP ranges operated by the House of Commons gleaned from this FOI request?


My immediate thought was to set up something to track edits made to Wikipedia from OU IP addresses; I then idly wondered whether a set of feeds for tracking edits from HEIs in general might also be useful (something to add to the UK University Web Observatory, for example?)
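The IP-range matching part of such a tracker is straightforward; here's a minimal sketch (the OU CIDR block shown is an assumption that would need checking against the actual allocation):

```python
# Sketch: test whether an anonymous edit's IP address falls within an
# institution's IP range. The OU CIDR block below is an assumption.
import ipaddress

def ip_in_ranges(ip, cidr_ranges):
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)

OU_RANGES = ["137.108.0.0/16"]  # hypothetical Open University allocation

print(ip_in_ranges("137.108.25.1", OU_RANGES))   # True
print(ip_in_ranges("8.8.8.8", OU_RANGES))        # False
```

A bot would apply this sort of test to the usernames of anonymous edits (which are IP addresses) in Wikipedia's recent changes feed.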

To the extent that Wikipedia represents an authoritative source of information, for some definition of authoritative(?!), it could be interesting to track the “impact” of our foolish universities in terms of contributing to the sum of human knowledge as represented by Wikipedia.

It’d also be interesting to track the sorts of edits made from anonymous and named editors from HEI IP ranges. I wonder what classes they may fall into?

  1. edits from the marketing and comms folk?
  2. ego and peer ego edits, eg from academics keeping the web pages of other academics in their field up to date?
  3. research topic edits – academics maintaining pages that relate to their research areas or areas of scholarly interest?
  4. teaching topic edits – academics maintaining pages that relate to their teaching activities?
  5. library edits – edits made from the library?
  6. student edits – edits made by students as part of a course?
  7. “personal” edits – edits made by folk who class themselves as Wikimedians in general, and just happen to make edits while they are on an HEI network?

My second thought was to wonder to what extent might news and media organisations be maintaining – or tweaking – Wikipedia pages? The BBC, for example, who have made widespread use of Wikipedia in their Linked Data driven music and wildlife pages.

Hmmm… news.. reminds me: wasn’t a civil servant who made abusive edits to a Wikipedia page sacked recently? Ah, yes: Civil servant fired after Telegraph investigation into Hillsborough Wikipedia slurs, as my OU colleague Andrew Smith suggested might happen.

Or how about other cultural organisations – museums and galleries for example?

Or charities?

Or particular brands? Hmm…

So I wonder: could we try to identify areas of expertise on, or attempted/potential influence over, particular topics by doing reverse IP lookups from pages focussed on those topics? This sort of mapping activity pivots the idea of visualising related entries in Wikipedia to map IP ranges, and perhaps from that locations and individuals associated with maintaining a set of resources around a particular topic area (cf. Visualising Delicious Tag Communities).

I think I started looking at how we might start to map IP ranges for organisations once….? Erm… maybe not, actually: it was looking up domains a company owned from its nameservers.

Hmm.. thinks… webstats show IP ranges of incoming requests – can we create maps from those? In fact, are there maps/indexes that give IP ranges for eg companies or universities?

I’m rambling again…

PS Related: Repository Googalytics – Visits from HEIs which briefly reviews the idea of tracking visits to HEI repositories from other HEIs…

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values of the same classification type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genres, we might draw an edge between two nodes to show that a particular band has been labelled as falling within each of those two genres.
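The pairing logic itself is simple; here's a small Python sketch of it, using made-up band/genre labels:

```python
# Sketch: build genre-genre co-occurrence edges from band -> genres labels.
# The band/genre data below is invented for illustration.
from itertools import combinations

band_genres = {
    "Band A": ["Electronic music", "New Wave music", "Ambient music"],
    "Band B": ["Ambient music", "Electronic music"],
}

edges = set()
for genres in band_genres.values():
    # one edge for each pair of genres co-occurring on the same band
    for pair in combinations(sorted(genres), 2):
        edges.add(pair)

for a, b in sorted(edges):
    print(a, "<->", b)
```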

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:<http://gephi.org/>
CONSTRUCT {
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA <http://example.org/edge> ?genreB .
  ?genreB <http://example.org/edge> ?genreA .
} WHERE {
?band <http://dbpedia.org/property/genre> <http://dbpedia.org/resource/Psychedelic>.
?band <http://dbpedia.org/property/background> "group_or_band"@en.
?band <http://dbpedia.org/property/genre> ?genreA.
?band <http://dbpedia.org/property/genre> ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en") && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band <http://dbpedia.org/property/background> "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a <http://dbpedia.org/ontology/Band>.
?band a <http://schema.org/MusicGroup>.

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

SELECT DISTINCT ?genre WHERE {
  ?band <http://dbpedia.org/property/background> "group_or_band"@en.
  ?band <http://dbpedia.org/property/genre> ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipedia, if you can work out how to narrow down your search to find them… (My starting point is to browse DBpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peek at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

Mapping How Programming Languages Influenced Each Other According to Wikipedia

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/influencedBy> relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(Note that a, as in ?x a ?y, is just SPARQL shorthand for rdf:type.)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return; the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we only see unique results.
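In other words, paging is just a matter of stepping the OFFSET in units of the LIMIT; a hypothetical little helper for generating the paged queries might look like this:

```python
# Sketch: generate paged versions of a SPARQL query by stepping the offset.
# The query string is illustrative; each page would be sent to the endpoint.
def paged_query(base_query, page, page_size=50):
    return "{} LIMIT {} OFFSET {}".format(base_query, page_size, page * page_size)

q = "SELECT DISTINCT ?c WHERE { ?a a ?c }"
print(paged_query(q, 0))  # first page: OFFSET 0
print(paged_query(q, 3))  # fourth page: OFFSET 150, results from the 151st on
```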

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in angle brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher>.

PS see how to use a similar technique to map out musical genres ascribed to bands on Wikipedia.

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugins tab.)

To get DBpedia data into Gephi, we need to do three things:

– tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
– tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
– tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint, with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.
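The shape of the graph we're after can be sketched in a few lines of Python (the names here are purely illustrative):

```python
# Sketch: a directed influence graph as an adjacency mapping, with an
# edge A -> B meaning "A influenced B". Example names are illustrative.
influences = [("Socrates", "Plato"), ("Plato", "Aristotle")]

graph = {}
for influencer, influenced in influences:
    graph.setdefault(influencer, []).append(influenced)

print(graph)  # {'Socrates': ['Plato'], 'Plato': ['Aristotle']}
```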

The following construction should achieve this:

CONSTRUCT {
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or EigenVector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchterman-Reingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – with the full DBpedia resource identifier, for example – but we can tidy this up. Going to one of the DBpedia pages for a philosopher, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/> as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf: <http://xmlns.com/foaf/0.1/>, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced> ?influenced. So it follows that we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHERE part of the query:

?p foaf:name ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

Looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a
  <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I’ll maybe post more on this later – in the meantime, if you’re new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, an online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

PPPPS related – a large scale version of this? Wikipedia Mining Algorithm Reveals The Most Influential People In 35 Centuries Of Human History

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow at an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“url”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):
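So assuming the settlements data is in the second table on that page (the table index may well need tweaking, and the page URL is the current one for that article title), the formula would look something like:

```
=importHTML("https://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)
```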

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:


And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original header row of the feed; alternatively, tweak the CSV URL so it starts with row 2). We might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…


Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:

regex uk

This block says: in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.
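The same rewrite, expressed in Python regex terms as a quick sanity check of the pattern:

```python
# Sketch: the "append ', UK'" rewrite from the pipe, done with Python's re
# module. count=1 stops the empty match at the end of the string from
# appending ", UK" a second time.
import re

title = "Cambridge"
geocodable = re.sub(r"(.*)", r"\1, UK", title, count=1)
print(geocodable)  # Cambridge, UK
```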


And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Google map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Impressions from Data Sketches

By chance, I spotted a tweet this evening from @owenboswarva pointing to a #DVLA data release: number of licence holders with penalty points, broken down by postcode district [link] | #FOI #opendata.

A quick search turned up some DfT driving licence open data that includes a couple of postcode district related datasets – one giving a count of the number of licence holders with a particular number of points in each district, the other breaking out the number of full and provisional licence holders by gender in each district. The metadata in the sheets also suggests that the datasets are monthly releases, but that doesn’t seem to be reflected by what’s on the site.


I haven’t done any data sketches for a bit, so I thought I’d have a quick play with the data to see whether any of the Isle of Wight postcode areas seemed to have a noticeably higher percentage rate of points holders than other bits of the island, dividing the number of licence holders with points by the total number of licence holders in each postcode district…
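The calculation is just a simple rate per district; a quick sketch with invented counts (the real figures come from the DfT tables):

```python
# Sketch: percentage of licence holders with points, by postcode district.
# The counts below are invented; the real figures come from the DfT data.
districts = {
    "PO30": {"with_points": 1200, "holders": 15000},
    "PO41": {"with_points": 450, "holders": 3800},
}

for district, d in districts.items():
    rate = 100 * d["with_points"] / d["holders"]
    print("{}: {:.1f}%".format(district, rate))
```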


If you’re not familiar with Isle of Wight postal geography, Wikipedia identifies the postcode districts as follows:


So Seaview and Yarmouth, then, are places to watch out for (places that tend to have older, well to do populations…)

I then wondered how the number of license holders might compare based on population estimates. From a quick search, I could only find population estimates for postcode districts based on 2011 census figures, which are five years out of date now.


The main thing that jumped out at me was that for the number of licence holders to exceed the population, there must have been a noticeable population increase in the area… The really high percentages perhaps also suggest that those areas have older populations (something we could check from the population demographics from the last census). Secondly, we might note that the proportions rank differently to the first table, though again Yarmouth and Seaview head the leaderboard. This got me thinking that there are perhaps age effects making a difference here. This is something we could start to explore in a little more detail using a couple of the other DfT tables, one that describes the number of licences issued by gender and age, and another that counts the number of licence holders carrying a particular number of points by age and gender. (These two tables are at a national level, though, rather than broken out by postcode district.)

I guess I should really have coloured the numbers using a choropleth map, or using a new trick I learned earlier this week, displayed the numbers on labels located at the postcode district centroid…


(The map shows Land Registry House Prices Paid data for November, 2015.)

Maybe another day…

MOOCs as Partworks

A couple of posts crossed my feeds recently, mooching around the idea that doing a MOOC is Like Reading a Newspaper; or not: MOOC Completion Rates DO Matter.

First up, Downes suggests that:

The traditional course is designed like a book – it is intended to run in a sequence, the latter bits build on the first bits, and if you start a book and abandon it part way through there is a real sense in which you can say the book has failed, because the whole point of a book is to read it from beginning to end.

But our MOOCs are not designed like that. Though they have a beginning and an end and a range of topics in between, they’re not designed to be consumed in a linear fashion the way a book is. Rather, they’re much more like a magazine or a newspaper (or an atlas or a city map or a phone book). The idea is that there’s probably more content than you want, and that you’re supposed to pick and choose from the items, selecting those that are useful and relevant to your present purpose.

And so here’s the response to completion rates: nobody ever complained that newspapers have low completion rates. And yet no doubt they do. Probably far below the ‘abysmal’ MOOC completion rates (especially if you include real estate listings and classified ads). People don’t read a newspaper to complete it, they read a newspaper to find out what’s important.

Martin (Weller) responds:

Stephen Downes has a nice analogy, (which he blogged at my request, thankyou Stephen) in that it’s like a newspaper, no-one drops out of a newspaper, they just take what they want. This has become repeated rather like a statement of fact now. I think Stephen’s analogy is very powerful, but it is really a statement of intent. If you design MOOCs in a certain way, then the MOOC experience could be like reading a newspaper. The problem is 95% of MOOCs aren’t designed that way. And even for the ones that are, completion rates are still an issue.

Here’s why they’re an issue. MOOCs are nearly always designed on a week by week basis (which would be like designing a newspaper where you had to read a certain section by a certain time). I’ve blogged about this before, but from Katy Jordan’s data we reckon 45% of those who sign up, never turn up or do anything. It’s hard to argue that they’ve had a meaningful learning experience in any way. If we register those who have done anything at all, eg just opened a page, then by the end of week 2 we’re down to about 35% of initial registrations. And by week 3 or 4 it’s plateauing near 10%. The data suggests that people are definitely not treating it like a newspaper. In Japan some research was done on what sections of newspapers people read.

He goes on:

… Most MOOCs are about 6-7 weeks long, so 90% of your registered learners are never even looking at 50% of your content. That must raise the question of why are you including it in the first place? If a subject requires a longer take at it, beyond 3 weeks say, then MOOCs really may not be a very good approach to it. There is a hard, economic perspective here, it costs money to make and run MOOCs, and people will have to ask if the small completion rates are the most effective way to get people to learn that subject. You might be better off creating more stand alone OERs, or putting money into better supported outreach programmes where you can really help people stay with the course. Or maybe you will actually design your MOOC to be like a newspaper.


I buy three newspapers a week – the Isle of Wight County Press (to get a feel for what’s happened and is about to happen locally, as well as seeing who’s currently recruiting), the Guardian on a Saturday (see what news stories made it as far as Saturday comment, do the Japanese number puzzles, check out the book reviews, maybe read the odd long form interview and check a recipe or two), and the Observer on a Sunday (read colleagues’ columns, longer form articles by journalists I know or have met, check out any F1 sports news that made it into that paper, book reviews, columns, and Killer again…).

So I skim bits, have old faithfuls I read religiously, and occasionally follow through on a long form article that was maybe advertised on the cover and I might have missed otherwise.

Newspapers are organised in a particular way, and that lets me quickly access the bits I know I want to access, and throw the rest straight onto the animal bedding pile, often unread and unopened.

So MOOCs are not really like that, at least, not for me.

For me MOOCs are freebie papers I’ve picked up and then thrown, unread, onto the animal bedding pile. For me.

What I can see, though, is MOOCs as partworks. Partworks are those titles you see week on week in the local newsagent with a new bit on the cover that, if collected over weeks and months and assembled in the right way, result in a flimsy plastic model you’ve assembled yourself with an effective cost price running into hundreds of pounds.

[Retro: seems I floated the MOOC as partwork idea before – Online Courses or Long Form Journalism? Communicating How the World Works… – and no-one really bit then either…]

In the UK, there are several notable publishers of partwork titles, including for example Hachette, De Agostini, and Eaglemoss. Check out their homepages – then check out the homepages of a few MOOC providers. (Note to self – see if any folk working in marketing at MOOC platform providers came from a partwork publishing background.)

Here’s a riff reworking the Wikipedia partwork page:

A ~~partwork~~MOOC is a ~~written publication~~an online course released as a series of planned ~~magazine-like issues~~lessons over a period of time. ~~Issues~~Lessons are typically released on a weekly, fortnightly or monthly basis, and often a completed set is designed to form a ~~reference work on~~complete course in a particular topic.

~~Partwork series~~MOOCs run for a determined length and have a finite life. Generally, ~~partworks~~MOOCs cover specific areas of interest, such as ~~sports, hobbies, or children’s interest and stories such as PC Ace and the successful The Ancestral Trail series by Marshall Cavendish Ltd~~random university module subjects, particularly ones that tie in to the telly or hyped areas of pseudo-academic interest. They are ~~generally sold at newsagents and are mostly supported by massive television advertising campaigns for the launch~~hosted on MOOC platforms because exploiting user data and optimising user journeys through learning content is something universities don't really understand and avoid trying to do. In the United Kingdom, ~~partworks~~MOOCs are ~~usually launched by heavy television advertising each January~~mentioned occasionally in the press, often following a PR campaign by the UK MOOC platform, FutureLearn.

~~Partworks~~MOOCs often include cover-mounted items with each issue that build into a complete set over time. For example, a ~~partwork about art~~MOOC might include ~~a small number of paints or pencils that build into a complete art-set~~so-called "badges" that can be put into an online "backpack" to show off to your friends, family, and LinkedIn trawlers~~; a partwork about dinosaurs might include a few replica bones that build a complete model skeleton at the end of the series; a partwork about films may include a DVD with each issue. In Europe, partworks with collectable models are extremely popular; there are a number of different publications that come with character figurines or diecast model vehicles, for example: The James Bond Car Collection~~.

In addition, completed ~~partworks~~MOOCs have sometimes been used as the basis for receiving a non-academic credit-bearing course completion certificate, or ~~to create case-bound reference works and encyclopedias~~a basis for a piece of semi-formal assessment and recognition. An example is ~~the multi-volume Illustrated Science and Invention Encyclopedia which was created with material first published in the How It Works partwork~~NEED TO FIND A GOOD EXAMPLE.

In the UK, ~~partworks~~MOOCs are ~~the fourth-best selling magazine sector, after TV listing guides, women’s weeklies and women’s monthlies~~NEED SOME NUMBERS HERE*.... A common inducement is ~~a heavy discount for the first one or two issues~~??HOW DO MOOCs ~~SELL~~ GET SOLD?. The same ~~series~~MOOC can be sold worldwide in different languages and even in different variations.

* Possibly useful starting point? BBC News Magazine: Let’s get this partwork started

The Wikipedia page goes on to talk about serialisation (ah, the good old days when I still had hopes for feeds and syndication… eg OpenLearn Daily Learning Chunks via RSS and then Serialised OpenLearn Daily RSS Feeds via WordPress) and the Pecia System (new to me), which looks like it could provide an interesting starting point on a model of peer-co-created learning, or somesuch. There’s probably a section on it in this year’s Innovating Pedagogy report. Or maybe there isn’t?!;-)

Sort of related, but also not: this article from icrossing on ‘Subscribe is the new shop.’ – Are subscription business models taking over? – and John Naughton’s column last week on the (then just leaked) Kindle subscription model – Kindle Unlimited: it’s the end of losing yourself in a good book – reminded me of Subscription Models for Lifelong Students and Graduate With Who (Whom?!;-), Exactly…?, which several people argued against and which I never really tried to defend, though I can’t remember what the arguments were; and I never really tried to build a case with numbers in it to see whether or not it might make sense. (Because sometimes you think the numbers should work out in your favour, but then they don’t… as in this example: Restaurant Performance Sunk by Selfies [via RBloggers].)

Erm, oh yes – back to the MOOCs… and the partwork models. Martin mentioned the economics – just thinking about the partwork model (pun intended, or maybe not) here: how are parts costed? Maybe an expensive loss-leader part in week 1, then cheap parts for months, then the expensive parts at the end when only two people still want them? How will print on demand affect partworks (does the newsagent have a partwork printer round the back to print off the bits that are needed for whatever magazines are sold that week)? And how do the partwork costing models then translate to MOOC production and presentation models?
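The loss-leader costing question can be sketched as a toy model – all figures here are invented for illustration, not drawn from any real partwork’s accounts:

```python
# Toy partwork pricing model (all numbers invented for illustration).
# Issue 1 is a discounted loss leader; later issues carry the margin,
# while the audience decays issue by issue.

def partwork_revenue(prices, retention, initial_buyers=10000):
    """Total revenue given per-issue cover prices and a per-issue retention rate."""
    buyers = initial_buyers
    total = 0.0
    for price in prices:
        total += buyers * price
        buyers = int(buyers * retention)  # some buyers drop out before the next part
    return total

# A 99p loss-leader first issue, then nine parts at a full price
prices = [0.99] + [7.99] * 9
print(partwork_revenue(prices, retention=0.7))
```

Playing with the `retention` parameter is the interesting bit: with steep drop-off (which the MOOC completion data suggests), almost all the revenue has to be earned in the first few parts – which maps suspiciously well onto front-loaded MOOC production budgets.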

Big, expensively produced materials in front-loaded weeks, then maybe a move to smaller presentation methods, getting the forums working a little better with smaller, more engaged groups? How about the cMOOC ideas – up front in early weeks, or pushed back to later weeks, where different motivations, skills, interests and engagement models play out?

MOOCs are newspapers? Nah… MOOCs as partwork – that works better as a model for me. (You can always buy a partwork mid-way through because you are interested in that week’s content, or the content covered by the magazine generally, not because you are interested in the plastic model or badge.)

Thinks: hmm, partworks come in at least two forms, don’t they – one where you get pieces to build a big model of a boat or a steam train or whatever, the other where you get a different superhero figurine each week and the aim is to attract the completionist. Which isn’t to say that part 37 might not be stupidly popular because it has a figure that is just generally of interest, independently of being part of a set?

Innovation’s End

In that oft-referred-to work on innovation, The Innovator’s Dilemma, Clayton Christensen suggested that old guard companies struggle to innovate internally because of the value networks they have built up around their current business models. Upstart companies compete around the edges, providing cheaper but lower quality alternative offerings that allow the old guard to retreat to the higher value, higher quality products. As the upstarts improve their offerings, they grow market share and start to compete for the higher value customers. The upstarts also develop their own value networks, which may be better adapted to an emerging new economy than the old guard’s network.

I don’t know if this model is still in favour, or whether it has been debunked by a more recent business author with an alternative story to sell, but in its original form it was a compelling tale, easily co-opted and reused, as I have done here. I imagine over the years, the model has evolved and become more refined, perhaps offering ever more upmarket consultancy opportunities to Christensen and his acolytes.

The theory was also one of the things I grasped at this evening to try to help get my head round why the great opportunities for creative play around the technologies being developed by companies such as Google, Amazon and Yahoo five or so years ago don’t seem to be there any more. (See for example this post mourning the loss of a playful web.)

The following screenshots – taken from Data Scraping Wikipedia with Google Spreadsheets – show how the original version of Google spreadsheets used to allow you to generate different file output formats, with their own URL, from a particular sheet in a Google spreadsheet:

[Screenshots: the old Google Spreadsheets “Publish to the web” dialog, offering a choice of file output formats – including CSV – with a separate URL for each sheet.]

In the new Google spreadsheets, this is what you’re now offered from the Publish to Web options:

[Screenshot: the new Publish to Web options, with the choice of output formats much reduced.]

[A glimmer of hope – there’s still a route to CSV URLs in the new Google spreadsheets. But the big question is – will the Google Query language still work with the new Google spreadsheets?]
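For the record, here’s a minimal sketch of the sort of URL-hacking involved – the URL patterns are assumptions based on what worked against the new-style spreadsheets at the time of writing, and Google may change them at any point:

```python
# Sketch: constructing CSV and Google Query Language URLs for a
# (new-style) Google spreadsheet. The URL patterns are assumptions
# and may stop working if Google changes the service.
from urllib.parse import quote

def sheet_csv_url(key, gid=0):
    """Direct CSV download URL for one sheet of a published spreadsheet."""
    return ("https://docs.google.com/spreadsheets/d/{key}"
            "/export?format=csv&gid={gid}").format(key=key, gid=gid)

def sheet_query_url(key, query, gid=0):
    """Run a Google Visualization API query against a sheet, CSV out."""
    return ("https://docs.google.com/spreadsheets/d/{key}/gviz/tq"
            "?tqx=out:csv&gid={gid}&tq={tq}").format(
                key=key, gid=gid, tq=quote(query))

# "SPREADSHEET_KEY" is a placeholder for a real spreadsheet key
print(sheet_csv_url("SPREADSHEET_KEY"))
print(sheet_query_url("SPREADSHEET_KEY", "select A, B where C > 100"))
```

The resulting URLs can then be pulled straight into something like pandas’ `read_csv()` – assuming, of course, that the sheet is published to the web and the Query Language endpoint keeps answering.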

embed changes everything

(For some reason, WordPress won’t let me put angle brackets round that phrase. #ffs)

That’s what I said in this video put together for a presentation to a bunch of publishers visiting the OU Library at an event I couldn’t be at in person (back when I used to be invited to give presentations at events…)

I saw embed as a way that the publishers could retain control over content whilst still allowing people to access the content, and make it accessible, in ways that the publishers wouldn’t have thought of.

Where content could be syndicated but remain under control of the publisher, the idea was that new value networks could spring up around legacy content, and the publishers could then find a way to take a cut. (Publishers don’t see it that way of course – they want it all. However big the pie, they want all of it. If someone else finds a way to make the pie bigger, that’s not interesting. My pie. All mine. My value network, not yours, even if yours feeds mine. Because it’s mine. All mine.)

I used to build things around Amazon’s API, and Yahoo’s APIs, and Google APIs, and Twitter’s API. As those companies innovated, they built bare-bones services that they let others play with. Against the established value network order of SOAP and enterprise service models, the RESTful upstarts played with their toys. And the upstarts let us play with their toys too. And we did, because they were easy to play with.

But they’re not anymore. The upstarts started to build up their services, improve them, entrench them. And now they’re not something you can play with. The toys became enterprise warez and now you need professional tools to play with them. I used to hack around URLs and play with the result using a few lines of Javascript. Now I need credentials and heavyweight libraries, programming frameworks and tooling.

Christensen saw how the old guard, with their entrenched value networks couldn’t compete. The upstarts had value networks with playful edges and low hanging technological fruit we could pick up and play with. The old guard entrenched upwards, the upstarts upped their technology too, their value networks started to get real monetary value baked in, grown up services, ffs stop playing with our edges and bending our branches looking for low hanging fruit, because there isn’t any more. Go away and play somewhere else.

Go away and play somewhere else.

Go somewhere else.

Lock (y)our content in, Google, lock it in. Go play with yourself. Your social network sucked and your search is getting ropey. You want to lock up content, well so does every other content generating site, which means you’re all gonna be faced with the problem of ranking content that all intranets face. And their searches universally suck.

The innovator’s dilemma presented incumbents with the problem of how to generate new products and business models that might threaten their current ones. The upstarts started scruffy and let people play alongside, let people innovate along with them. The upstarts moved upwards and locked out the innovation networks around them. Innovations end. Innovation’s end. Innovation send. Away.

< embed > changes everything. Only this time it’s gone the wrong way. I saw embed as a way for us to get their closed content. Now Google’s gone the other way – open data has become an embedded package.

“God help us.” Withnail.

PS Google – why did my, sorry, your Chrome browser ask for my contacts today? Why? #ffs, why?