A Couple of Things from Last Week’s Independent on Sunday…

Really struggling to do anything creative or thought-requiring this week, but as I declutter my pockets I find a couple of things I tore out from last weekend’s Independent on Sunday that I wanted to put a marker on.

Firstly, in an article about a fully fledged in-store Tesco Bank (Cheerio then, Sir Terry – you’re a hard act to follow) I spotted this:

Tesco Personal Finance, now wholly owned and renamed Tesco Bank, is being primed to become a major competitor to the high-street banks. Already, six million customers buy Tesco financial products and have deposited £4.4bn with the company, but later this year the group intends to launch a savings account and a range of mortgages; then in the second half of 2011 it will launch a current account.

[W]hereas firms such as Virgin or Metro are seeking branches to build banking empires, Tesco will use its existing stores. The bank will also exploit the data from customers’ Clubcards. [my emphasis]

Oh good… if anyone knows how that might work in practice, I’d love to know… The nightmare scenario (and one I’m sure won’t happen) would be Clubcard operator and data cruncher Dunnhumby trawling through your current account data and working out what you are spending, but not at Tesco, so they can send ever more targeted promotions to you. (So for example, I think they already count the calories you buy from Tesco so they can have a guess at how many you don’t buy… Knowing how much you’re spending at other supermarkets would be a nice trick to have up your sleeve, I should think.)

Spending information might also help with price setting, something it seems Dunnhumby are looking to develop further if their acquisition of KSS Retail is anything to go by… (e.g. Tesco Clubcard company Dunnhumby buys KSS Retail).

Supermarkets are wary of price wars, of course (reduces margins and cuts into profits), so finding optimal pricing models (just ponder what “optimal” might mean there…;-), and using those models to also influence shopping behaviour, can generate useful returns.

The other clip I have from the IoS is also price-related, a comment by Robert Chalmers in a profile of one-time editor of The Sunday Times, Harold Evans (Harold Evans: ‘All I tried to do was shed a little light’):

“So how do you feel about the Murdoch empire now?”

Evans pauses. “I’m not that familiar with the British… OK. Let’s take an alternative scenario. Murdoch never arrives. I manage to take control of The Sunday Times with the management buyout. Then I get defeated by the unions. The Independent wouldn’t be here. Rival papers survived because they got the technology. Thanks to Murdoch.”

Thanks to a man who, by starting the price war, created a situation where profit is driven not by a newspaper’s retail price but by its advertising, to the point that advertisers risk dictating editorial content. Haute-couture houses don’t fancy the idea of photographs of dead Congolese babies next to their latest tanning oil, do they? [My emphasis]

“Thanks to a man who, by starting the price war, created a situation where profit is driven not by a newspaper’s retail price but by its advertising” – brilliant…

My reading of that is that maybe at one time the papers had a paywall of sorts that kept their reliance on advertising income in balance. Price wars increased the percentage returns from advertising, and then the internet arrived. The first-generation online ad containers – banner ads – were as ineffective as any other sort of advertising (I guess – maybe more so?), and I’m guessing didn’t pose a huge threat to ad spend in the newspapers (note to self: look at the financial state of news sales and ad industry returns over the last 20 years…); but the AdWords container did, because you could start to track what happened to any interest raised by the ad. With ever more sophisticated forms of personalised behavioural advertising (which isn’t just online – it’s what Clubcard does, right?), the route that newspaper advertising provides is threatened from the other side. (That is, AdWords provides trackability, which newspaper ads don’t; behavioural marketing provides more sophisticated segmentation than the crude ABC demographic reach that newspapers provide.)

I’ve no idea how the Times paywall is doing at the moment, but as News Corp makes moves on BSkyB, I wonder if we’re going to see folk taking up (exclusive?) membership subscriptions to cross-platform content providers, who maybe also run content access points (iDevices, Sky boxes, etc.) and optionally content generation/commissioning. And who would be in the running? Apple and NewsCorp/Sky at the very least. (Not Tesco, yet… Err… maybe: Tesco sets up film studio to adapt hit novels;-). I’m certainly watching out for signs of someone making moves on buying up online game distributor Steam, though…

Scribbled Ideas for “Research” Around the OU Online Conference…

So it seems I missed a meeting earlier this week planning a research strategy around the OU’s online conference, which takes place in a couple of weeks or so… (sigh: another presentation to prepare…;-)…

For what it’s worth, here are a few random thoughts about things I’ve done informally around confs before, or have considered doing… I’ve got the lurgy, though, so this is pretty much a raw core dump and is likely to have more typos and quirky grammatical constructions than usual (can’t concentrate at all:-(

Twitter hashtag communities: I keep thinking I should grab a bit more data (e.g. friends and followers details) around folk using a particular hashtag, and then do some social network analysis on the resulting network so I can write a “proper research paper” about it, but that would be selfish, because I suspect what would be more useful would be to spend that time making it easier for folk to decide whether or not they want to follow other hashtaggers, provide easy ways to create lists of hashtaggers, and so on. (That said, it would be really handy to get hold of the script that Dave Challis cobbled together around Dev8D (here and here) and then used to plot the growth of a twitter community over the course of that event.) What’s required? Find the list of folk using the hashtag and then several times a day just grab a list of all their friends and followers (so we require two scripts: one to grab hashtaggers every hour or so and produce a list of “new” hashtaggers; one to grab the friends and followers of every hashtagger once an hour or so, or every half day, or whatever… if this is a research project, I guess it’d make sense to set quite a high sample rate and then learn from that what an effective sample rate would be?). Then at our leisure we can plot the network, and I guess run some SNA stats on it. (We could also use a hashtagger list to create a twitter map view of where folk might be participating from?) One interesting network view would be to show the growth of new connections between two time periods. I’m not sure if the temporal graphs Gephi supports would be handy here, but it’d be a good reason to learn how to use Gephi’s temporal views:-) If the conf is mainly hashtagged by OU users, then it won’t be interesting, because I suspect the OU hashtag community is already pretty densely interconnected. As the conference is being held (I think) in Elluminate, it might be that a lot of the backchannel chatter occurs in that closed environment…? Is it possible to set up Elluminate with a panel showing part of someone’s desktop that is streaming the conference hashtag, I wonder – i.e. showing backchannel chat within the Elluminate environment using a backchannel that exists outside Elluminate? (Thinks: would it be worth having a conference twitter user that autofollows anyone using the conf hashtag?) Other twitter stuff we can do is described in Topic and Event based Twittering – Who’s in Your Community?. E.g. from the list of hashtaggers, we could see what other hashtags they were using/have recently used, helping identify the situation of the OU conf in other contexts according to the interests of people talking about the OU conf.
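A minimal sketch of the sort of snapshot script I have in mind is below – the fetch_hashtag_users() and fetch_friend_ids() helpers are just placeholders for whatever Twitter API plumbing you have to hand, and the hashtag and filename are made up; the bit that matters is the periodic sampling loop, which is what lets you replay the growth of the community afterwards.

# Sketch: periodically snapshot a hashtag community so its growth can be replayed later.
# fetch_hashtag_users() and fetch_friend_ids() are hypothetical placeholders for
# whatever Twitter API client you have; only the snapshot logic matters here.

import json
import time
from datetime import datetime

def fetch_hashtag_users(hashtag):
    """Placeholder: return a set of screen names seen using the hashtag."""
    raise NotImplementedError

def fetch_friend_ids(screen_name):
    """Placeholder: return a list of user ids the given user follows."""
    raise NotImplementedError

def snapshot(hashtag, seen_users, outfile):
    """Grab current hashtaggers, note any new ones, and dump everyone's friends list."""
    now = datetime.utcnow().isoformat()
    current = fetch_hashtag_users(hashtag)
    new_users = current - seen_users
    seen_users |= current
    record = {
        "time": now,
        "new_users": sorted(new_users),
        "friends": {user: fetch_friend_ids(user) for user in seen_users},
    }
    with open(outfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return seen_users

if __name__ == "__main__":
    seen = set()
    while True:
        seen = snapshot("#ouconf10", seen, "ouconf_snapshots.jsonl")
        time.sleep(60 * 60)  # sample once an hour; tune once you know the traffic

Each line of the output file is one timestamped snapshot, so diffing consecutive snapshots gives you the “new connections between two time periods” view, and the whole file can be munged into something Gephi will eat.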

Facebook communities might be another thing to look at. The Netvizz app will grab an individual’s network, and the connections between members of that network (unless recent privacy changes have broken things?). This data is trivially visualised in Gephi, which can also determine various SNA stats. Again it would make sense to grab regular dumps of data in maybe two cases: 1) create a faux Facebook user and get folk to friend it, then grab a dump of its network every hour or so (is it possible to autofriend people back? Or maybe that’s a job for a research monkey…?!); 2) alternatively, get folk to join a conference group and grab a dump of the members of the group every hour or so (or every whenever or so). The only problem with that is that if the group has more than 200 members, you only get a dump of a randomly selected 200 members.
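On the 200-member sampling issue, one crude workaround is simply to merge the repeated random samples over time and watch how quickly the merged list stops growing. A rough sketch, assuming you’ve been dumping each sampled member list to a text file, one member id per line (the filenames here are made up):

# Sketch: merge repeated random 200-member dumps of a Facebook group to approximate
# the full membership over time. Assumes each dump is a text file of one member id
# per line, named e.g. groupdump_2010-06-04T10.txt (filenames are made up).

import glob

def merged_membership(pattern="groupdump_*.txt"):
    members = set()
    growth = []  # (filename, cumulative unique members) so you can see when it plateaus
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            members.update(line.strip() for line in f if line.strip())
        growth.append((path, len(members)))
    return members, growth

if __name__ == "__main__":
    members, growth = merged_membership()
    for path, count in growth:
        print(path, count)
    print("unique members seen so far:", len(members))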

– link communities – by which I mean look at activity around links that are being shared via e.g. twitter (extract the links from the (archived) hashtag feed), or bookmarked on delicious. I’ve doodled social life of URL ideas before that might help provide macroscopic views over what links folk are sharing, and who else might be interested in those links (e.g. delicious URL History: Users by Tag or edupunk chatter). From socially bookmarked links, we can also generate tag clouds. (There’s a rough sketch of the link extraction idea after this list.)

– chatter clouds: term extraction word clouds based on stuff that’s being tweeted with a particular hashtag (again, see the sketch after this list).

– blog post communities: just cobble together a list of blogs that contain posts written around conf sessions.

– googalytics, bit.lytics: not sure what Google analytics you’d collect from where, but an obvious thing to do with them would be to look at the incoming IP addresses/domains to see whether the audience was largely coming in from educational institutions. (Is there a list of IP ranges for UK HEIs, I wonder?) If any links are shared in the conference context, e.g. by backchannel, it might make sense to shorten all those links on bit.ly with a conf API key, so you could track all click-throughs on bit.ly shortened versions of that target link. The point would be just to be able to produce a chart of something like “most clicked-through links for this conf”.
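By way of a rough illustration of the link community and chatter cloud bullets above, here’s a quick sketch – assuming the hashtag feed has been archived as a plain text file, one tweet per line (filename, hashtag and stopword list all made up for the purpose) – that pulls out the most shared links and the most common terms:

# Sketch: pull the most-shared links and most common terms out of an archived
# hashtag feed (assumed here to be a plain text file, one tweet per line).

import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "at", "rt", "with", "that", "this", "i", "you"}

def links_and_terms(archive="hashtag_archive.txt", hashtag="#ouconf10"):
    links, terms = Counter(), Counter()
    with open(archive) as f:
        for tweet in f:
            links.update(URL_RE.findall(tweet))
            words = re.findall(r"[a-z']+", tweet.lower())
            terms.update(w for w in words if w not in STOPWORDS and w != hashtag.strip("#"))
    return links, terms

if __name__ == "__main__":
    links, terms = links_and_terms()
    print("Most shared links:", links.most_common(10))
    print("Chatter cloud terms:", terms.most_common(25))

The term counts are exactly what you’d paste into a word cloud generator, and the link counts give a crude “most shared links for this conf” chart even without the bit.ly click-through data.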

Bleurghhhhh….

Don’t Tell Us What To Do – Let Us Surprise You…

The dreaded lurgy beat me yesterday and so far today, so I’ve spent the last 36 hours drifting in and out of fitful sleep with a podcast backing track…

Here’s something I woke up to a few minutes in that might well have an audio quote or two worth grabbing – Clay Shirky on why you shouldn’t tell people what to do if you want them to participate collaboratively… this maybe has consequences for the way we think of designing forum-related discussion activities in online courses?

Clay Shirky, Technology Insight, O’Reilly Media Gov 2.0 Summit, September 2009 [via IT Conversations]

Liberating Data from the Guardian… Has it Really Come to This?

When the data is the story, should a news organisation make it available? When the Telegraph started trawling through MPs’ expenses data it had bought from a source, industry commentators started asking questions around whether it was the Telegraph’s duty to release that data (e.g. Has Telegraph failed by keeping expenses process and data to itself?).

Today, the Guardian released its University guide 2011: University league table, as a table:

Guardian university tables, sort of

Yes, this is data, sort of (though the javascript applied to the table means that it’s hard to just select and copy the data from the page – unless you turn javascript off, of course):

Data grab

but it’s not like the data that the Guardian republish in their datastore, as they did with these league tables…:

Guardian datastore

…which was actually a republication of data from the THES… ;-)

I’ve been wondering for some time when this sort of apparent duplicity was going to occur… the Guardian datastore has been doing a great job of making data available (as evidenced by its award from the Royal Statistical Society last week, which noted: “there was commendable openness with data, providing it in easily accessible ways”), but when the data is “commercially valuable” to the Guardian, presumably in terms of being able to attract eyeballs to Guardian Education web pages, there seems to be some delay in getting the data onto the datastore… (at least, it isn’t there yet/wasn’t published contemporaneously with the original story…)

I have to admit I’m a bit wary about writing this post – I don’t want to throw any spanners in the works as far as harming the work being done by the Datastore team – but I can’t not…

So what do we learn from this about the economics of data in a news environment?

– data has creation costs;
– there may be a return to be had from maintaining limited, privileged or exclusive access to the data as data OR as information, where information is interpreted, contextualised or visualised data, or is valuable in the short term (as, for example, in the case of financial news). By withholding access to data, publishers maintain the ability to generate views or analysis of the data that they can create stories, or attractive website content, around. (Just by the by, I noticed that an interactive Many Eyes widget was embedded in a Guardian Datablog post today:-)
– if you’ve incurred the creation cost, maybe you have a right to a limited period of exclusivity with respect to profiting from that content. This is what intellectual property rights try to guarantee, at least until the Mickey Mouse lawyers get upset about losing their exclusive right to profit from the content.

I think (I think) what the Guardian is doing is not so different to what the Telegraph did. A cost was incurred, and now there is a (hopefully limited) period in which some sort of return is being sought. But there’s a problem, I think, with the way it looks, especially given the way the Guardian has been championing open data access. Maybe the data should have been posted to the datablog, but with access permissions denied until a stated date, so that at least people could see the data was going to be made available.

What this has also thrown up, for me at least, is the question of what sort of “contract” the datablog might have, implied or otherwise, with third parties who develop visualisations based on data in the Guardian Datastore, particularly if those visualisations are embeddable and capable of generating traffic (i.e. eyeballs = ad impressions = income…).

It also gets me wondering: does there need to be a separate datastore? Or is the ideal case where the stories themselves link out to datasets directly? (I suppose that would make it hard to locate the data? On second thoughts, the directory datastore approach is much better…)

Related: Time for data.ac.uk? Or a local data.open.ac.uk?

PS I toyed with the idea of republishing all the data from the Guardian Education pages in a spreadsheet somewhere, and then taking my chances with the lawyers in the court of public opinion, but instead, here’s a howto:

Scraping data from the Grauniad

So just create a Google spreadsheet (you don’t even need an account: just go to docs.google.com/demo), double click on cell A1 and enter:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table","table",1)

and then you’ll be presented with the data, in a handy spreadsheet form, from:
http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table

For the subject pages – e.g. Agriculture, Forestry and Food, paste in something like:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-guide-agriculture-forestry-and-food","table",1)

You can probably see the pattern… ;-)

(You might want to select all the previously filled cells and clear them first so you don’t get the data sets messed up. If you’ve got your own spreadsheet, you could always create a new sheet for each table. It is also possible to automate the scraping of all the tables using Google Apps Script: Screenscraping With Google Spreadsheets App Script and the =importHTML Formula gives an example of how…)
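If you’d rather work outside a spreadsheet altogether, here’s a rough equivalent sketch in Python – pandas’ read_html does much the same job as =ImportHtml, though this is an assumption on my part that it copes with these particular pages, and the table index may need a bit of fiddling:

# Sketch: grab the Guardian league tables with pandas instead of =ImportHtml.
# read_html returns a list of DataFrames, one per <table> found on the page.

import pandas as pd

MAIN_TABLE = "http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table"
SUBJECT_TABLE = "http://www.guardian.co.uk/education/table/2010/jun/04/university-guide-agriculture-forestry-and-food"

def grab_table(url):
    tables = pd.read_html(url)
    return tables[0]  # the league table is (I think) the first table on the page

if __name__ == "__main__":
    overall = grab_table(MAIN_TABLE)
    agriculture = grab_table(SUBJECT_TABLE)
    overall.to_csv("guardian_university_table_2011.csv", index=False)
    print(overall.head())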

An alternative route to the data is via YQL:

Scraping HTML table data in YQL

Enjoy…;-) And if you do grab the data and produce some interesting visualisations, feel free to post a link back here… ;-) To give you some ideas, here are a few examples of education data related visualisations I’ve played around with previously.

PPS it’ll be interesting to see if this post gets picked up by the Datablog, or popped into the Guardian Technology newsbucket… ;-) Heh heh…

Ba dum… Education for the Open Web Fellowship: Uncourse Edu

A couple of weeks ago, I started getting tweets and emails linking to a call for an Education for the Open Web Fellowship from the Mozilla and Shuttleworth Foundations.

The way I read the call was that the fellowship provides an opportunity for an advocate of open ed on the web to do their thing with the backing of a programme that sees value in that approach…

…and so, I’ve popped an (un)application in (though it wasn’t helped by my having spent the weekend in a sick bed… bleurrrgh… man flu ;-) It’s not as polished as it should be, and it could be argued that it’s unfinished, but that is, erm, part of the point… After all, my take on the Fellowship is that the funders are seeking to act as a patron to a person, helping them achieve as much as they can, howsoever they can, rather than supporting a very specific project? (And if I’m wrong, then it’s right that my application is wrong, right?!;-)

The proposal – Uncourse Edu – is just an extension of what it is I spend much of my time doing anyway, as well as an attempt to advocate the approach through living it: trying to see what some of the future consequences of emerging tech might be, and demonstrating them (albeit often in a way that feels too technical to most) in a loosely educational context. As well as being my personal notebook, an intended spin-off of this blog is to try to help drive down barriers to the use of web technologies, or demonstrate how technologies that are currently only available to skilled developers are becoming more widely usable, and how access to them as building blocks is being “democratised”. As to what the barriers to adoption are, I see them as being at least two-fold: one is ease of use (how easy the technology is to actually use); the second is attitude: many people just aren’t, or don’t feel they’re allowed to be, playful. This stops them innovating in the workplace, as well as learning for themselves. (So for example, I’m not an auto-didact, I’m a free player…;-)

The Fellowship applications are templated (loosely) and submitted via the Drumbeat project pitching platform. This platform allows folk to pitch projects and hopefully gather support around a project idea, as well as soliciting (small amounts of) funding to help run a project. (It’d be interesting if, in any future rounds of JISC Rapid Innovation Funding, projects were solicited this way and one of the marking criteria was the amount of support a pitched proposal received?)

I’m not sure if my application is allowed to change, but if it doesn’t get locked by the Drumbeat platform it may well do so… (Hopefully I’ll get to do at least another iteration of the text today…) In particular, I really need to post my own video about the project (that was my undone weekend task:-(

Of course, if you want to help out producing the video, and maybe even helping shape the project description, then why not join the project? Here’s the link again: Uncourse Edu.

PS I think there’s a package on this week’s OU co-produced episode of Digital Planet on BBC World Service (see also: Digital Planet on open2) that includes an interview with Mark Shuttleworth and a discussion about some of the work the Shuttleworth Foundation gets up to… (first broadcast is tomorrow, with repeats throughout the week).

DISCLAIMER: I’m the OU academic contact for the Digital Planet.

Manchester Digital Presentation – Open Council Data

I spent much of yesterday trying to come up with some sort of storyline for a presentation I’m giving at a Manchester Digital Open Data: Concept & Practice event tomorrow evening on Open Civic Data, quite thankful that I’d bought myself a bit of slack with the original title: “Open Data Surprise”…

Anyway, a draft of the slides is here (Manchester opendata presentation):

though as ever, they don’t necessarily give the full picture without me talking over them…

The gist of the presentation is basically as follows: there is an increasing number of local council websites out there making data available. By data, I mean stuff that developers, or tinkerers such as myself, can wrangle with, or council officers actually use as part of their job. The data that’s provided can be thought of along a spectrum ranging from fixed archival data (i.e. reports, things that aren’t going to change) to timely data, such as dates of forthcoming council elections, or details of current or planned roadworks. Somewhere in-between is data that changes on a regular cycle, such as the list of councillors, for example. The most timely of all data is live data, such as bus locations, or their estimated arrival time at a particular bus stop.

Lots of councils are starting to offer “data” via maps. What they are actually doing is providing information in a natural way, which is not a Bad Thing, although it’s not a Data Thing – if I wanted to create my own map view of the data, it’s generally not easy for me to do so. Providing data files that contain latitude and longitude is one way that councils can make the data available to users, but there is then a barrier to entry in terms of who can realistically make use of that data. Publishing geo-content in the KML format is one way we can improve this, because tools such as Google Earth provide a ready way of rendering KML feeds that is accessible to many more users.
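To give a feel for how low that barrier could be, here’s a rough sketch that turns a simple CSV containing latitude and longitude columns into a KML file that opens directly in Google Earth (the input filename and column names are made up for illustration):

# Sketch: turn a council CSV with latitude/longitude columns into a KML file.
# The input filename and column names here are made up for illustration.

import csv
from xml.sax.saxutils import escape

KML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
KML_FOOTER = "</Document></kml>"

def csv_to_kml(infile="council_assets.csv", outfile="council_assets.kml"):
    placemarks = []
    with open(infile, newline="") as f:
        for row in csv.DictReader(f):
            # KML wants coordinates as longitude,latitude
            placemarks.append(
                "<Placemark><name>{name}</name>"
                "<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>".format(
                    name=escape(row["name"]), lon=row["longitude"], lat=row["latitude"]
                )
            )
    with open(outfile, "w") as f:
        f.write("\n".join([KML_HEADER] + placemarks + [KML_FOOTER]))

if __name__ == "__main__":
    csv_to_kml()

Pop the resulting file on the web and the URL can also be pasted into a Google Maps search box, which is the sort of one-step reuse I have in mind.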

As a pragmatist, I believe that most people who use data do so in the context of a spreadsheet. This suggests that we need to make data available in that format, notwithstanding the problems that might arise from the difficulties of keeping an audit trail of the origins of that data once files become merged. As a realist, I appreciate that most people don’t know that it’s possible to visualise their data, or what sorts of insight might be afforded by visualising the data. Nor do they know how it is becoming increasingly easy to create visualisations on top of data presented in an appropriate way. Just as using KML formats allows almost anyone to create their own Google Map by pasting a KML file URL into a Google Maps search box, so the use of simple data formats such as CSV allows users to pull data into visualisation environments such as IBM’s Many Eyes. (For more on this line of thinking, see Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)). As a data junkie, I think that the data should be “linkable” across data sets, and also queryable. As a contrarian, I think that Linked Data is maybe not the way forward at this time, at least in terms of consumer end-user evangelism/advocacy… (see also: So What Is It About Linked Data that Makes it Linked Data™?)

The data that is starting to be published by many councils typically corresponds to the various functions of the council – finance, education, transport, planning, cultural and leisure services, for example. Through publishing this data in an open way, third party vertical providers can both aggregate data as information across councils, as well as adding contextual value. Some councils are entering into partnership with other councils to develop vertical services to which the council can publish its data, before pulling it back into the council’s own website via a data feed. And as to whose data it is anyway, it might be ours, but it’s also theirs: data as the business of government. Which makes me think: the most effective council data stores will be the ones that are used by councils as data consumers in their own right, rather than just as data publishers*.

(* There is a parallel here with open educational resources, I think? Institutions that make effective use of OERs are institutions who use at least their own OERs, as well as publishing them…?)

Recent communications from Downing Street suggest the new coalition government is serious in its aim to open up public data (though as @lesteph points out, this move towards radical transparency is not without its own attendant risks), so data releases are going to happen. The question is, are we going to help that data flow so that it can get to where it needs to go?

Back From the Count…

Yesterday, I took an hour out to do a spot of Telling outside a polling station for a local Isle of Wight council election (recounted here: Thoughts on Telling…). Today I went along to the count…

The venue was a sports hall; along the back wall were the ballot boxes, in full sight. Numbered tables were arranged in a U shape, with counters sitting on the inside and observers standing outside the U, and announcements were made as to which ballot was to be counted on which table. Boxes were brought to the appropriate table, and the parcel ties securing the lids cut off; ballot papers were then placed on the table and taken up by the counters. The first pass was just to count the ballot papers and arrange them so they were all the same way up. The papers were then taken away from the table, before being returned to it (I’m not sure why?)

The second pass was the actual count. Papers were sorted into piles corresponding to the vote cast (only one vote per paper), and bundled in piles of 25 votes, each secured by a bulldog clip with its clips open. When all the votes were counted, the open-clipped bundles (corresponding to bundles for a particular candidate) were reallocated to different counters on the same table and counted again. If the tally was correct (25 papers in a bundle), the bundle was reclipped with the clips closed.

Bundles were then arranged by candidate in piles of four (i.e. 100 votes per pile) and checked off as bundles for the final count value. (A final unclipped pile in each column represented the remaining votes for that candidate.)

Where there was uncertainty as to the validity of a ballot paper, the counter placed it into a wire in-tray. The returning officer adjudicated on the validity of votes in agreement with the agents or party reps. Typical rejected papers had no vote cast/no mark made, or two marks made. When rejection was agreed on, the paper was stamped with a red “rejected” mark and placed into a second wire in-tray. Where uncertain marks were made, e.g. a cross to the left of the ballot paper rather than in the square on the right, a decision was made by the returning officer, again acquiesced to by the agents, as to whether the voter had made a clear preference for one of the candidates by the way they had made their mark. If agreed on, the vote was accepted.

A second sort of ballot was also being counted for local town council elections, where voters could make two marks on one ballot paper (which was a different colour to the Island council ballot paper); that is, they could vote for two candidates.

Piling ballot papers by candidate obviously doesn’t work in this case, so pieces of A4 paper, in landscape view with the names of the candidates down the left hand side in the same order as they appeared on the ballot paper, were used to tally results. Each piece of paper had 25 (or was it 20?) columns, with each column corresponding to one ballot paper. For piles of 25 (or 20?) papers, the votes on the top ballot paper were recorded in the first column (a 1 marking a vote cast for the appropriate candidate), the second ballot paper in the next column, and so on. Tallying across the rows gave the number of votes cast in that bundle for a particular candidate. Again, a bulldog clip with its clips open was used to secure a bundle of papers to a tally sheet. These were then recounted and, if a match was made, reclipped with the clips closed.

A second form (the ‘master sheet’), again with candidates’ names defining rows arranged down the left hand side, was used to collate totals from the tally sheets. Each column represented the votes cast as recorded on a particular tally sheet. A desk calculator was then used to total across the rows. As two master sheets were required for the count I saw, the totals from the first sheet were added as an additional column on the second master sheet. Summing across the rows on this second sheet, and adding in the votes carried over from the first sheet, gave the final count. As a witness to the event, I was thinking it’d be handy to have something like, err, an iPad, laid out like the forms, but offering a simple spreadsheet view so that row totals could be calculated automatically, just to offer a quick check. As it was, I did a mental summation to check overall totals were right-ish…
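For what it’s worth, the sort of automatic check I have in mind is trivial to sketch: each tally sheet is just a candidates-by-ballot-papers grid of 0s and 1s, and the row totals and an overall marks count drop straight out (the candidate names and numbers below are made up):

# Sketch: the tally-sheet arithmetic as a quick automatic check.
# Each tally sheet is a grid: one row per candidate, one column per ballot paper,
# a 1 where that paper carried a vote for that candidate. Names/data are made up.

tally_sheet = {
    "Candidate A": [1, 0, 1, 1, 0],
    "Candidate B": [0, 1, 1, 0, 1],
    "Candidate C": [1, 1, 0, 0, 0],
}

def check_sheet(sheet):
    row_totals = {name: sum(marks) for name, marks in sheet.items()}
    papers = len(next(iter(sheet.values())))
    marks_made = sum(row_totals.values())
    # In a two-vote ballot, marks_made should lie between papers and 2 * papers.
    print("Votes per candidate:", row_totals)
    print("Papers in bundle:", papers, "| total marks made:", marks_made)
    return row_totals

if __name__ == "__main__":
    check_sheet(tally_sheet)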

The scope for error in this second style of vote count seemed much greater (several points of failure: creating the tally, counting the tally, transferring tally scores, summing rows, transferring master totals from one sheet to another, final summation). The checking process is further complicated by the fact that there is a mismatch between the number of ballot papers and the number of valid voting marks made (not all voters voted for two candidates). It struck me that during the initial sort, ballot papers could be put into piles corresponding to whether 1 or 2 marks had been made, as a way of trying to keep a matching overall vote count check in place (i.e. so that the total votes cast in the final result would be more easily auditable from the ballot paper bundles).

All in all, a fascinating experience…