Name (Date) Title, Available at: URL (Accessed: DATE): So What?

Academic referencing is designed, in part, to support the retrieval of the material being referenced, as well as to acknowledge its provenance.

The following guidance, taken from the OU Library’s Academic Referencing Guidelines, is, I imagine, typical:

That page appears in an OU hosted Moodle course (OU Harvard guide to citing references) that requires authentication. So whilst the citation states the provenance, it won’t necessarily support the retrieval of the content from that site for most people.

Where an (n.d.) — no date — citation is provided, it also becomes hard for someone checking the page in the future to tell whether or not the content has changed, and if so, which parts.

Looking at the referencing scheme for organisational websites, there’s no suggestion that a note that authentication is required should be included in the citation (the same is true in the guidance for citing online newspaper articles).

 

I also didn’t see guidance, offhand, on how to reference pages where the presentation is likely customised by “an algorithm” according to personal preferences or interaction history. The placement of things like ads is generally dynamic, and often personalised; personalisation may be based on multiple things, such as the cookie state of the browser with which you are looking at a page, or the history of transactions (sites visited) from the IP address you are connecting to the site from.

This doesn’t matter for static content, but it does matter if you want to reference something like a screenshot / screencapture, for example showing the results of a particular search on a web search engine. In this case, adding a date and citing the page publisher (that is, the web search engine, for example) is about as good as you can get, but it misses a huge amount of context. The fact that you got extremist results might be because your web history reveals you to be a raging fanatic, and the fact that you grabbed the screenshot from the premises of your neo-extremist clubhouse just added more juice to the search. One partial solution to disabling personalisation features might be to run a search in a “private” browser session where cookies are disabled, and cite that fact, although this still won’t stop IP address profiling and browser fingerprinting.

I’ve pondered related things before, eg when asking Could Librarians Be Influential Friends? And Who Owns Your Search Persona?, as well as in a talk given 10 years ago and picked up at the time by Martin Weller on his original blog site (Your Search Is Valuable To Us; or should that be: Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html (Accessed 26 September 2018).?).

Most of the time, however, web references are to static content, so what role does the Accessed on date play here? I can imagine discussions way back when, when this form was being agreed on (is there a history of the discussion that took place when formulating and adopting this form?), where someone said something like “what we need is to record the date the page was accessed on and capture it somewhere”, and then the second part of that phrase was lost, or disregarded as being too “but how would we do that?”…

One of the issues we face in maintaining OU courses, where content starts being written two years before a course starts and is expected to last for 5+ years of presentation, is maintaining the integrity of weblinks. Over that period of time, you might expect pages to change in a couple of ways, even if the URL persists and the “content” part remains largely the same:

  • the page style (that is, the view as presented) may change;
  • the surrounding navigation or context (for example, sidebar content) may change.

But let’s suppose we can ignore those. Instead, let’s focus on how we can try to make sure that a student can follow a link to the resource we intend.

One of the things I remember from years ago is conversations around keeping locally archived copies of webpages and presenting those copies to students, but I’m not sure this ever happened. (Instead, there was a sort of middle ground compromise of running link checkers, but I think that was just to spot 404 page not found errors rather than checking a hash made on the content you were interested in, which would be difficult.)

At one point, I religiously kept archived copies of pages I referenced in course materials, so that if a page died I could check back on my own copy to see what the now lost page had said and find a sensible alternative; but a year or two off course production and that practice slipped.

Back to the (Accessed DATE) clause. So what? In Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs I mentioned a couple of Wikipedia bots that check link integrity on Wikipedia (see also: Internet Archive blog: More than 9 million broken links on Wikipedia are now rescued). These can perform actions like archiving web pages, checking that links are still working, and changing broken links to point to an archived copy of the same page. I hinted that it would be useful if the VLE offered the same services. They don’t; at least, not going by reports from early starters on this year’s TM351 presentation who are already flagging up broken links (do we not run a link checker any more? I think I asked that in the Broken URLs post a year ago, too…).

Which is where (Accessed DATE) comes in. If you do accede to that referencing convention, why not also make sure that an archived copy of that page exists, ideally one made on that date? Someone chasing the reference can then see what you accessed and perhaps, if they are visiting the page somewhen in the future, see how the future page compares with the original. (This won’t help with authentication controlled content or personalised page content, though.)

An easy way of archiving a page in a way that others can access it is to use the Internet Archive’s Wayback Machine (for example, If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine).

From the Wayback Machine homepage, you can simply add a link to a page you want to archive:

 

hit SAVE NOW (note, this is saving a different page; I forgot to save the screenshot of the previous one, even though I had grabbed it. Oops…):

and then you have access to the archived page, on the date it was accessed:

A more useful complete citation would now be Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html (Accessed 26 September 2018. Archived at https://web.archive.org/web/20180926102430/http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html).
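If you want to script that archiving step rather than use the web form, here’s a minimal sketch. It assumes the requests package and the Internet Archive’s public “Save Page Now” endpoint at web.archive.org/save/; the handling of the Content-Location response header reflects how that endpoint has tended to respond for me rather than any documented contract:

import requests

def save_page_now(url):
    """Ask the Wayback Machine to capture url; return the snapshot URL if one is reported."""
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    resp.raise_for_status()
    # The snapshot path has tended to come back in the Content-Location header.
    snapshot = resp.headers.get("Content-Location")
    return "https://web.archive.org" + snapshot if snapshot else None

print(save_page_now("http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html"))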

Two more things…

Firstly, my original OUseful.info blog was hosted on an OU blog server; when that was decommissioned, I archived the posts on a subdomain of .open.ac.uk I’d managed to grab. That subdomain was deleted a few months ago, taking with it the original blog archive. Step in the Wayback Machine. It didn’t have a full copy of the original blog site, but I did manage to retrieve quite a few of the pages using this wayback machine downloader, with the command wayback_machine_downloader http://blogs.open.ac.uk/Maths/ajh59 (or, for a slightly later archive, wayback_machine_downloader http://ouseful.open.ac.uk/blogarchive). I made the original internal URLs relative (find . -name '*.html' | xargs perl -pi -e 's/http:\/\/blogs.open.ac.uk\/Maths\/ajh59/./g', or as appropriate for http://ouseful.open.ac.uk/blogarchive), used a similar approach to remove tracking scripts from the pages, uploaded the pages to Github (psychemedia/original-ouseful-blog-archive), enabled the repo as a Github Pages site, and the pages are now at https://psychemedia.github.io/original-ouseful-blog-archive/pages/. It looks like the best archive is at the UK Web Archive, but I can’t see a way of getting a bulk export from that? https://www.webarchive.org.uk/wayback/archive/20170623023358/http://ouseful.open.ac.uk/blogarchive/010828.html

Secondly, bots; VLE bots… Doing some maintenance on TM351, I notice it has callouts to other OU courses, including TU100, which has been replaced by TM111 and TM112. It would be handy to be able to automatically discover references to other courses made from within a course to support maintenance. Using some OU-XML schema markup to identify such references would be sensible? The OU-XML document source structure should provide a veritable playground for OU bots to scurry around. I wonder if there are any, and if so, what do they do?
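For what it’s worth, here’s a minimal sketch of the sort of bot I have in mind: a regular expression scan over OU-XML source files for anything that looks like an OU course code. The directory layout and the code pattern are my own assumptions for illustration, not part of any actual OU-XML schema:

import re
from pathlib import Path

# Assumed pattern: one or two capital letters followed by three digits (TM351, TU100, TM112, ...).
COURSE_CODE = re.compile(r"\b[A-Z]{1,2}\d{3}\b")

def find_course_references(srcdir):
    """Report the course codes mentioned in each OU-XML file under srcdir."""
    refs = {}
    for path in Path(srcdir).glob("**/*.xml"):
        codes = set(COURSE_CODE.findall(path.read_text(errors="ignore")))
        if codes:
            refs[str(path)] = sorted(codes)
    return refs

for fname, codes in find_course_references("./ouxml").items():
    print(fname, codes)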

 

PS via Richard Nurse, reminding me that Memento is also useful when trying to track down original content and/or retrieve content for broken link pages from the Internet Archive: UK Web Archive Mementos search and time travel.

Richard also comments that “OU modules are being web archived by OU Archive – 1st, mid point and last presentation -only have 1 in OUDA currently staff login only – on list to make them more widely available but prob only to staff given 3rd party rights in OU courses“. Interesting…

PPS And via Herbert Van de Sompel, a list of archives accessed via time travel, as well as a way of decorating web links to help make them a bit more resilient: Robust Links – Link Decoration.

By the by, Richard, Kevin Ashley and @cogdog/Alan also point me to the various browser extensions that make life easier when adding pages to archives or digging into their history. Examples here: Memento tools. I’m not sure what advice the OU Library gives to students about things like this; certainly my experience of interactions with students, academics and editors alike around broken links suggests that not many of them are aware of the Internet Archive / UK Web Archive, Wayback Machine, etc etc?

OUseful.info – where the lede is usually buried…

Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs

Ish-via @opencorporates, I came across the “Virtues of a Programmer”, referenced from a Wikipedia page, in a Nieman Lab post by Brian Boyer on Hacker Journalism 101, and stated as follows:

  • Laziness: I will do anything to work less.
  • Impatience: The waiting, it makes me crazy.
  • Hubris: I can make this computer do anything.

I can buy into those… Whilst also knowing (from experience) that any of the above can lead to a lot of, erm, learning.

For example, whilst you might think that something is definitely worth automating:

the practical reality may turn out rather differently:

The reference has (currently) disappeared from the Wikipedia page, but we can find it in the Wikipedia page history:

[Screenshot: an older version of the Larry Wall Wikipedia page, including the reference]

The date of the NiemanLab article was…

[Screenshot: the revision history of the Larry Wall Wikipedia page]

So here’s one example of a linked reference to a web resource that we know is subject to change and that has a mechanism for linking to a particular instance of the page.
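For anyone wondering, the mechanism is Wikipedia’s “permanent link”, which addresses a specific revision via an oldid parameter. A trivial sketch, with a made-up revision id:

# Wikipedia "permanent link": a specific revision is addressed via its oldid.
page, revid = "Larry_Wall", 123456789  # the revision id here is made up for illustration
permalink = "https://en.wikipedia.org/w/index.php?title={}&oldid={}".format(page, revid)
print(permalink)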

Academic citation guides tend to suggest that URLs are referenced along with the date that the reference was (last?) accessed by the person citing the reference, but I’m not sure that any guidance is given relating to securing the retrievability of that resource, as it was accessed, at a later date. (I used to bait librarians a lot for not getting digital in general and the web in particular. I think they still don’t…;-)

This is an issue that also hits us with course materials, when links are made to third party references by URI, rather than more indirectly via a DOI.

I’m not sure to what extent the VLE has tools for detecting link rot (certainly, they used to; now it’s more likely that we get broken link reports from students failing to access a particular resource…) or mitigating against broken links.

One of the things I’ve noticed from Wikipedia is that it has a couple of bots for helping maintain link integrity: InternetArchiveBot and Wayback Medic.

Bots help preserve link availability in several ways:

  • if a link is included in a page, that link can be submitted to an archiving site such as the Wayback Machine (or, if it’s a UK resource, the UK Web Archive);
  • if a link is spotted to be broken (an HTTP 404 error code, for example), it can be redirected to an archived copy of the page.

One of the things I think we could do in the OU is add an attribute to the OU-XML template that points to an “archive-URL”, and tie this in with a service that automatically makes sure that linked pages are archived somewhere.

If a course link rots in presentation, students could be redirected to the archived link, perhaps via a splash screen (“The original resource appears to have disappeared – using the archived link”), with the course team also being informed that the original link is down.
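Here’s a minimal sketch of that sort of check-and-fall-back behaviour, using the Wayback Machine’s availability API; the splash screen and course team reporting are left out, the requests package is assumed, and treating anything other than a 200 response as a rotted link is a simplification:

import requests

def resolve_link(url):
    """Return url if it still resolves, else the closest archived copy (or None)."""
    try:
        ok = requests.head(url, allow_redirects=True, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    if ok:
        return url
    # Look up the closest snapshot held by the Wayback Machine.
    avail = requests.get("https://archive.org/wayback/available",
                         params={"url": url}, timeout=10).json()
    closest = avail.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url")  # None if nothing has been archived

print(resolve_link("http://blogs.open.ac.uk/Maths/ajh59"))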

Having access to the original copy can be really helpful when it comes to trying to find out:

  • whether a simple update to the original URL is required (for example, the page still exists in its original form, just at a new location, perhaps because of a site redesign); or,
  • whether a replacement resource needs to be found, in which case, being able to see the content of the original resource can help identify what sort of replacement resource is required.

Does that count as “digital first”, I wonder???

PS See also: https://blog.ouseful.info/2017/03/14/computer-spirits/

PPS Handy tool for creating backup links – create a link that submits the link to the Internet Archive and adds a versionurl attribute to your anchor: https://robustlinks.mementoweb.org/# About: https://www.infodocket.com/2021/02/10/new-journal-article-robustifying-links-to-combat-reference-rot/ /via Stephen’s Lighthouse.

 

https://robustlinks.mementoweb.org/robustify/?anchor_text=Visualising%20Rally%20Stages&url=https%3A%2F%2Frallydatajunkie.com%2Fvisualising-rally-stages%2F

Practical DigiSchol – Refinding Lost Recent Web History

I don’t know to what extent our current courses equip folk for the future (or even the present), but here’s a typical example of something where I needed to figure out the solution to a problem (actually, more of a puzzle) today using the resources I had to hand…

Here’s the puzzle:

I couldn’t remember offhand what this referred to other than “body shape” and “amazon” (I’ve had a cold recently and it seems to have affected my mental indexing a bit…!), and a quick skim of my likely pinboard tags didn’t turn anything up…

Hmmm… I recall looking at a related page, so I should be able to search my browser history. An easy way to do this is using Simon Willison’s datasette app to search the SQLite database that contains my Chrome browser history:

pip3 install datasette
datasette ~/Library/Application\ Support/Google/Chrome/Default/History

This fires up a simple server with a browser based UI:
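If you’d rather skip the UI, a few lines of Python will poke the same database directly; this assumes the usual urls table with title, url and last_visit_time columns, and copies the file first because Chrome keeps the live one locked:

import os, shutil, sqlite3

history = os.path.expanduser("~/Library/Application Support/Google/Chrome/Default/History")
shutil.copy(history, "History_copy")  # work on a copy: Chrome locks the live file

conn = sqlite3.connect("History_copy")
query = ("SELECT title, url FROM urls "
         "WHERE title LIKE ? ORDER BY last_visit_time DESC LIMIT 20")
for title, url in conn.execute(query, ("%body%",)):
    print(title, url)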

Body Labs – that was it…

Sigh… So what was there?

You can browse the archived site on the Wayback Machine.

On reflection (as I said: head cold, not focussing/concentrating properly), I could have web searched:

No closer to the video but it sets the context…

That said, searching for bodylabs video brings up some likely candidates for alternatives, such as this one (pixelfondue: Body + Measure + Scan + AI = Bodylabs delicacy):

Anyway, back to the question: what did I need to know in order to be able to do that? And where do folk learn that sort of thing, whether or not they are formally taught it? Indeed, if there is a related thing to be formally taught, what is it, at what level, and in what context?

PS I had actually also kept a note of the company and the video in a scribble pad draft post on the Digital Worlds blog where I collect irreality related stuff:

[Screenshot: draft post in the Digital Worlds blog editor]

Programming, Coding & Digital Skills

I keep hearing myself in meetings talking about the “need” to get people coding, but that’s not really what I mean, and it immediately puts people off because I’m not sure they know what programming/coding is or what it’s useful for.

So here’s an example of the sort of thing I regularly do, pretty much naturally – automating simple tasks, a line or two at a time.

The problem was generating some data files containing weather data for several airports. I’d already got a pattern for the URL for the data file, now I just needed to find some airport codes (for airports in the capital cities of the BRICS countries) and grab the data into a separate file for each [code]:


[Embedded gist: the notebook code grabbing the weather data files; the embed no longer renders here]
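Since the embed doesn’t render here, the following is a sketch of the sort of recipe the notebook contained; the airport codes and the URL pattern are stand-ins rather than the actual data source I was using:

import requests

# Illustrative airport codes (stand-ins, not necessarily the list I used).
airports = ["SBBR", "UUEE", "VIDP", "ZBAA", "FAOR"]

# Hypothetical URL pattern for the weather data files.
url_pattern = "https://example.com/weatherdata/{code}.csv"

for code in airports:
    url = url_pattern.format(code=code)
    print("Fetching", url)  # running commentary as we go...
    data = requests.get(url).text
    with open("weather_{}.csv".format(code), "w") as f:
        f.write(data)
    print("Saved weather_{}.csv".format(code))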

In other words – figuring out what steps I need to do to solve a problem, then writing a line of code to do each step – often separately – looking at the output to check it’s what I expect, then using it as the input to the next step. (As you get more confident, you can start to bundle several lines together.)

The print statements are a bit overkill – I added them as commentary…

On its own, each line of code is quite simple. There are lots of high level packages out there to make powerful things happen with a single command. And there are lots of high level data representations that make it easier to work with particular things. pandas dataframes, for example, allow you to work naturally with the contents of a CSV data file or an Excel spreadsheet. And if you need to work with maps, there are packages to help with those too. (So for example, as an afterthought I added a quick example to the notebook showing how to add markers for the airports to a map… I’m not sure if the map will render in the embed or the gist?) That code represents a recipe that can be copied and pasted and used with other datasets more or less directly.

So when folk talk about programming and coding, I’m not sure what they mean by it. The way we teach it in computing departments sucks, because it doesn’t represent the sort of use case above: using a line of code at a time, each one a possible timesaver, to do something useful. Each line of code is a self-made tool to do a particular task.

Enterprise software development has different constraints to the above, of course, and more formalised methods for developing and deploying code. But the number of people who could make use of code – doing the sorts of things demonstrated as per the example above – is far larger than the number of developers we’ll ever need. (If more folk could build their own single line tools, or work through tasks a line of code at a time, we may not need so many developers?)

So when it comes to talk of developing “digital skills” at scale, I think of the above example as being at the level we should be aspiring to. Scripting, rather than developer coding/programming (h/t @RossMackenzie for being the first to comment back with that mention). Because it’s within the reach of many people, and it allows them to start putting together their own single line code apps from the start, as well as developing more complex recipes, a line of code at a time.

And one of the reasons folk can become productive is because there are lots of helpful packages and examples of cribbable code out there. (Often, just one or two lines of code will fix the problem you can’t solve for yourself.)

Real programmers don’t write a million lines of code at a time – they often write a functional block – which may be just a line or a placeholder function – one block at a time. And whilst these single lines of code or simple blocks may combine to create a recipe that requires lots of steps, these are often organised in higher level functional blocks – which are themselves single steps at a higher level of abstraction. (How does the joke go? Recipe for world domination: step 1 – invade Poland etc.)

The problem solving process then becomes one of both top-down and bottom up: what do I want to do, what are the high-level steps that would help me achieve that, within each of those: can I code it as a single line, or do I need to break the problem into smaller steps?

Knowing some of the libraries that exist out there can help in this problem solving / decomposing the problem process. For example, to get Excel data into a data structure, I don’t need to know how to open a file, read in a million lines of XML, parse the XML, figure out how to represent that as a data structure, etc. I use the pandas.read_excel() function and pass it a filename.
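For example, assuming a local spreadsheet called results.xlsx:

import pandas as pd

# One line gets the spreadsheet contents into a dataframe; no XML wrangling required.
df = pd.read_excel("results.xlsx")
print(df.head())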

If we want to start developing digital skills at scale, we need to get the initiatives out of the computing departments and into the technology departments, and science departments, and engineering departments, and humanities departments, and social science departments…

Communicating Data – Different Takes

A couple of recent articles on bias in the justice system caught my eye, showing different models of engagement around data analysis in a particular topic area:

  • Hester, Rhys, and Todd K. Hartman, “Conditional Race Disparities in Criminal Sentencing: A Test of the Liberation Hypothesis from a Non-Guidelines State”, Journal of Quantitative Criminology, pp. 1-24: an academically published, peer reviewed article that will cost you £30 to look at.
  • Uncovering Big Bias with Big Data, by David Colarusso, May 31st, 2016, The Lawyerist blog: a recreational data blog post.

The blog post comes complete with links to a github repo containing a Jupyter notebook describing the analysis. The data is not provided, for data protection/privacy compliance, although a link to the original source of the data, and a brief description of it, is (but not a deep link to the actual data?). I’m not sure if any data associated with the academic paper is publicly or openly available, or whether any of the analysis scripts are (see below – analysis scripts are available).

The blog post is open to comments (there are a few) and the author has responded to some of them. The academic post authors made themselves available via a Reddit AMA (good on them:-): Racial Bias AMA (h/t @gravityvictims/Alastair McCloskey for the tip).

The Reddit AMA notes the following: an ungated (i.e., not behind paywall) version of our research at the SSRN or Dr. Hartman’s website. The official publication was First online: 29 February 2016. The SSRN version is dated as Date posted: November 6, 2014 ; Last revised: January 4, 2016. The SSRN version includes a small amount of Stata code at the end (the “official” version of the paper doesn’t?), but I’m not sure what data it applies to or whether it’s linked to from the data (I only skimmed the paper.) Todd Hartman’s website includes a copy of the published paper and a link to the replication files (7z compressed, so how many folk will be able to easily open that, I wonder?!).

Downloads

So Stata, R and data files. Good stuff. But from just the paper homepage on the Springer Journal site, I wouldn’t have got that?

Of course, the Springer paper reference gets academic brownie points.

PS By the by, in the UK the Ministry of Justice’s Justice Data Lab publish regular reports on who’s using their data. For example: Justice Data Lab statistics: June 2016.

Authoring Dynamic Documents in IPython / Jupyter Notebooks?

One of the reasons I started writing the Wrangling F1 Data With R book was to see how it felt writing combined text, code and code output materials in the RStudio/RMarkdown context. For those of you that haven’t tried it, RMarkdown lets you insert executable code elements inside a markdown document, either as code blocks or inline. The knitr library can then execute the code and display the code output (which includes tables and charts), and pandoc transforms the output to a desired output document format (such as HTML, or PDF, for example). And all this at the click of a single button.

In IPython (now Jupyter) notebooks, I think we can start to achieve a similar effect using a combination of extensions. For example:

  • python-markdown allows you to embed (and execute) python code inline within a markdown cell by enclosing it in double braces (for example, I could say “{{ print('hello world') }}”);
  • hide_input_all is an extension that will hide code cells in a document and just display their executed output; it would be easy enough to tweak this extension to allow a user to select which cells to show and hide, capturing that cell information as cell metadata;
  • Readonly allows you to “lock” a cell so that it cannot be edited; using a notebook server that implements this extension means you can start to protect against accidental changes being made to a cell by mistake within a particular workflow; in a journalistic context, assigning a quote to a python variable, locking that code cell, and then referencing that quote/variable via python-markdown might be one way of working, for example (see the sketch after this list).
  • Printview-button will call nbconvert to generate an HTML version of the current notebook – however, I suspect this does not respect the extension based customisations that operate on cell metadata. To do that, I guess we need to generate our output via an nbconvert custom template? (The Download As... notebook option doesn’t seem to save the current HTML view of a notebook either?)
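Here’s a minimal sketch of that quote-handling pattern, assuming the python-markdown and Readonly extensions are enabled; the quote and attribution are made-up placeholders:

# Code cell: assign the quote to a variable. Lock the cell with the Readonly extension
# and hide it with hide_input_all so readers only see the rendered text.
quote = "This is a placeholder quote, captured verbatim from an interview."
source = "Interview notes, made-up attribution"

# Markdown cell (python-markdown extension enabled), referencing the variables inline:
#
#   As my interviewee put it: "{{ quote }}" ({{ source }})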

So – my reading is: tools are there to support the editing side (inline code, marking cells to be hidden, etc.) of dynamic document generation, but not necessarily the rendering-to-hard-copy side, which needs to be done via nbconvert extensions?

Related: Seven Ways of Running IPython Notebooks

Paying for Dropbox and Other Useful Bits… (The Cost of Doing Business…)

A couple of years ago or so, Dropbox ran a promotion for academic users granting 15GB of space. Yesterday, I got an email:

As part of your school’s participation in Space Race, you received 15 GB of additional Dropbox space. The Space Race promotional period expires on March 4, 2015, at which point your Dropbox limit will automatically return to 5 GB.

As a friendly reminder, you’re currently using 14.6 GB of Dropbox space. If you’re over your 5 GB limit after March 4, you’ll no longer be able to save new photos, videos, and docs to Dropbox.

Need more space? Dropbox Pro gives you 1 TB of space to keep everything safe, plus advanced sharing controls, remote wipe for lost devices, and priority support. Upgrade before March 4 and we’ll give you 30% off your first year.

My initial thought was to tweet:

but then I thought again… The discounted price on a monthly payment plan is £5.59/month, which on PayPal converted this month to $8.71. I use Dropbox all the time, and it forms part of my workflow for using Leanpub. As it’s the start of the month, I received a small royalty payment for the Wrangling F1 Data With R book. The Dropbox fee is about the amount I’m getting per book sold, so it seems churlish not to subscribe to Dropbox – it is part of the cost of doing business, as it were.

The Dropbox subscription gets me 1TB, so this also got me thinking:

  • space is not now an issue, so I can move the majority of my files to Dropbox, not just a selection of folders;
  • space is not now an issue, so I can put all my github clones into Dropbox;
  • space is not now an issue, so though it probably goes against the terms of service, I guess I could set up top-level “family member” folders and we could all share the one subscription account, just selectively syncing our own folders?

In essence, I can pretty much move to Dropbox (save for those files I don’t want to share/expose to US servers etc etc; just in passing, one thing Dropbox doesn’t seem to want to let me do is change the account email to another email address that I have another Dropbox account associated with. So I have a bit of an issue with juggling accounts…)

When I started my Wrangling F1 Data With R experiment, the intention was always to make use of any royalties to cover the costs associated with that activity. Leanpub pays out if you are owed more than $40 collected in the run up to 45 days ahead of a payment date (so the Feb 1st payout was any monies collected up to mid-December and not refunded since). If I reckon on selling 10 books a month, that gives me about $75 at current running. Selling 5 a month (so one a week) means it could be hit or miss whether I make the minimum amount to receive a payment for that month. (I could of course put the price up. Leanpub lets you set a minimum price but allows purchasers to pay what they want. I think $20 is the highest amount paid for a copy I’ve had to date, which generated a royalty of $17.50 (whoever that was – thank you :-)) You can also give free or discounted promo coupons away.) As part of the project is to explore ways of identifying and communicating motorsport stories, I’ve spent royalties so far on:

  • a subscription to GP+ (not least because I aspire to getting a chart in there!;-);
  • a subscription to the Autosport online content, in part to gain access to forix, which I’d forgotten is rubbish;
  • a small donation to sidepodcast, because it’s been my favourite F1 podcast for a long time.

Any books I buy in future relating to sports stats or motorsport will be covered henceforth from this pot. Any tickets I buy for motorsport events, and programmes at such events, will also be covered from this pot. Unfortunately, the price of an F1 ticket/weekend is just too much. A Sky F1 Channel subscription or day passes are also ruled out because I can’t for the life of me work out how much it’ll cost or how to subscribe; but I suspect it’ll be more than the £10 or so I’d be willing to pay per race (where race means all sessions in a race weekend). If my F1 iOS app subscription needs updating that’ll also count. Domain name registration (for example, I recently bought f1datajunkie.com) is about £15/$25 a year from my current provider. (Hmm, that seems a bit steep?) I subscribe to Racecar Engineering (£45/$70 or so per year), the cost of which will get added to the mix. A “big ticket” item I’m saving for (my royalties aren’t that much) on the wants list is a radio scanner to listen in to driver comms at race events (I assume it’d work?). I’d like to be able to make a small regular donation to help keep the ergast site going, but can’t see how to… I need to bear in mind tax payments, but also consider the above as legitimate costs of a self-employed business experiment.

I also figure that as an online publishing venture, any royalties should also go to supporting other digital tools I make use of as part of it. Some time ago, I bought in to the pinboard.in social bookmarking service, I used to have a flickr pro subscription (hmm, I possibly still do? Is there any point…?!) and I spend $13 a year with WordPress.com on domain mapping. In the past I have also gone ad-free ($30 per year). I am considering moving to another host such as Squarespace ($8 per month), because WordPress is too constraining, but am wary of what the migration will involve and how much will break. Whilst self-hosting appeals, I don’t want the grief of doing my own admin if things go pear shaped.

I’m a heavy user of RStudio, and have posted a couple of Shiny apps. I can probably get by on the shinyapps.io free plan for a bit (10 apps) – just – but the step up to the basic plan at $39 a month is too steep.

I used to use Scraperwiki a lot, but have moved away from running any persistent scrapers for some time now. morph.io (which is essentially Scraperwiki classic) is currently free – though it looks like a subscription will appear at some point – so I may try to get back into scraping in the background using that service. The Scraperwiki commercial plan is $9/month for 10 scrapers, $29 per month for 100. I have tended in the past to run very small scrapers, which means the number of scrapers can explode quickly, but $29/month is too much.

I also make use of github on a free/open plan, and while I don’t currently have any need for private repos, the entry level micro-plan ($7/month) offers 5. I guess I could use a (private?) github rather than Dropbox for feeding Leanpub, so this might make sense. Of course, I could just treat such a subscription as a regular donation.

It would be quite nice to have access to IPython notebooks online. The easiest solution to this is probably something like wakari.io, which comes in at $25/month, which again is a little bit steep for me at the moment.

In my head, I figure £5/$8/month is about one book per month, £10/$15 is two, £15/$20 is three, £25/$40 is 5. I figure I use these services and I’m making a small amount of pin money from things associated with that use. To help guarantee continuity in provision and maintenance of these services, I can use the first step of a bucket brigade style credit apportionment mechanism to redistribute some of the financial benefits these services have helped me realise.

Ideally, what I’d like to do is spend royalties from 1 book per service per month, perhaps even via sponsored links… (Hmm, there’s a thought – “support coupons” with minimum prices set at the level to cover the costs of running a particular service for one month, with batches of 12 coupons published per service per year… Transparent pricing, hypothecated to specific costs!)

Of course, I could also start looking at running my own services in the cloud, but the additional time cost of getting up and running, as well as hassle of administration, and the stress related to the fear of coping in the face of attack or things properly breaking, means I prefer managed online services where I use them.

Ephemeral Citations – When Presentations You Have Cited Vanish from the Public Web

A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)

Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.

Since I discovered the presentation, I’ve been recommending it to various folk, particularly slides 20-22(?), which refer to the educational example. Trying to find the slidedeck today, a websearch failed to turn it up, so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:

[Screenshot: the Slideshare page reporting that the Narrative Science slideshow is no longer available]

The Wayback machine had grabbed bits and pieces of text, but not the actual slides…

[Screenshot: the Wayback Machine capture of the Narrative Science presentation page]

Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(

For what it’s worth, here’s the commentary:

Introduction to Narrative Science — Presentation Transcript

We Transform Data Into Stories and Insight… In Seconds
Automatically, Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products, Improve Decision-Making, Optimize Customer Interactions
Customer Types: Media and Publishing, Data Companies, Business Reporting
How Does It Work? The Data / The Facts / The Angles / The Structure / Stats / Tests / Calls / The Narrative / Language / Completed Text. Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data In (Transforming Data into Stories)
Sports Journalism: Big Ten Network – Narrative (Transforming Data into Stories)
Financial Journalism: Forbes – Data In (Transforming Data into Stories)
Financial Journalism: Forbes – Narrative (Transforming Data into Stories)
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview. North America Consumer Services Short Interest Update: There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the company’s stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the company’s shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – Overview. Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY: The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P 500 was up 0.4%. Here are a few Healthcare stocks that bucked the sector’s downward trend. MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRK’s volume was 86.1% lower than usual with 2.5 million shares trading hands. Today’s gains still leave the stock about 11.1% lower than its price three months ago. LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volume was 34,155. Today’s gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago. Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data In (Transforming Data into Stories)
Real Estate: Hanley Wood – Narrative (Transforming Data into Stories)
BI: Leading Fast Food Chain – Data In (Transforming Data into Stories)
BI: Leading Fast Food Chain – Store Level Report. January Promotion Falling Behind Region: The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match Sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year. (Transforming Data into Stories)
Education: Standardized Testing – Data In (Transforming Data into Stories)
Education: Standardized Testing – Study Recommendations (Transforming Data into Stories)
Marketing Services & Digital Media: Data In (Transforming Data into Stories)
Marketing Services & Digital Media: Narrative (Transforming Data into Stories)

Bah…:-(

PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?

A Question About Klout…

I’ve no idea how Klout works out its scores, but I’m guessing that there is an element of PageRank style algorithmic bootstrapping going on, in which a person’s Klout score is influenced by the Klout scores of the folk who interact with that person.

So for example, if we look at @briankelly, we see how he influences other influential (or not) folk on Klout:

One thing I’ve noticed about my Klout score is that it tends to be lower than most of the folk I have an OU/edtech style relationship with; and no, I don’t obsess about it… I just occasionally refer to it when Klout is in the news, as it was today with an announced tie up with Bing: Bing and Klout Partner to Strengthen Social Search and Online Influence. In this case, if my search results are going to be influenced by Bing, I want to understand what effect that might have on the search results I’m presented with, and how my content/contributions might be being weighted in other people’s search results.

So here’s a look at the Klout scores of the folk I’ve influenced on Klout:

Hmm… seems like many of them are sensible and are completely ignoring Klout. So I’m wondering: is my Klout score depressed relative to other ed-tech folk who are on Klout because I’m not interacting with folk who are playing the Klout game? Which is to say: if you are generating ranking scores based at least in part on the statistics of a particular network, it can be handy to know what network those stats are being measured on. If Klout stats are dominated by components based on network statistics calculated from membership of the Klout network, that is very different to the sorts of scores you might get if the same stats were calculated over the whole of the Twitter network graph…

Sort of, but not quite, related: a few articles on sampling error and sample bias – Is Your Survey Data Lying to You? and The Most Dangerous Profession: A Note on Nonsampling Error.

PS Hmmm.. I wonder how my Technorati ranking is doing today…;-)

In Passing, Quick Comments On An OER Powered Digital Scholarship Resource Site

digitalscholarship.ac.uk is a new OER powered resource site intended to support the use of digital tools in academic study. Resources are both tagged and organised into a series of learning topics:

(I’m not sure how the tags relate to the learning topics? That’s one for my to do list…;-)

Some really quick observations that struck me about the pages used to describe resources that are linked to from the site:

  1. I couldn’t immediately work out where the link to the resource actually was (it’s the dark blue backgrounded area top left, the sort of colour that for me gets pushed into the background of a page and completely ignored, or the “open link in new window” link); I expected the value of the “Type of Media” attribute (eg “PDF download” in the above case) to be the link to the resource, rather than just a metadata statement.
  2. What on earth is going on with the crumb trail… [ 1 ] [ 2 ] etc to me represents page numbers of a kind (eg first page of results, second page of results) not depth in some weird hierarchy, as seems to be the case in this site?
  3. The comment captcha asks you “What is the code in the image?” Erm, code???? Do I have to decipher the captcha characters somehow (some captchas offer a simple sum for you to calculate, for example)? Erm… What do I do? What do I do??!?!!?
  4. I got a bit confused at first that the page was just using redirects rather than direct links to resources – the “Visit and rate this resource” link is a redirect that loads the target resource into a frame topped by a comment bar. (I didn’t spot, at first, that the ‘open in new window’ link points directly to the resource. And, erm, why would I open in a new window? New tab, maybe, though typically I choose to do that via a right-click contextual menu action if I don’t want a page to load in the current tab?)

The “Visit and Rate” link adds a topbar view over the actual resource page:

The “Add a comment” call to action (again, deep blue background which pushed it away from my attention) opens a comment page in a new tab… I think I’d have preferred this to open within the context of the top bar, so that I could refer directly to the resource within the context of the same tab whilst actually making a comment?

One final comment – and something I’ll try to get round to at some point: how do the resources look as a graph…?;-) It would be great if all the resources were available as data via an OPML feed, with one item per resource and also metadata describing the tags and Learning Topics identified, and then we could map how tags related to Learning Topics and maybe try to learn something from that. As a data file listing the resources doesn’t seem to be available, a scrape is probably called for… Here’s a recipe I’ll maybe try out at some point:

– scrape the tagcloud page for: tags and individual tag page URL; learning topic page URLs
– for each tag page and learning topic page, grab the URLs to resource pages, deduping along the way; this should grab us links to all resource pages, include pathological ones that have a tag but no topic, or a topic but no tag;
– for each resource page, scrape everything ;-)
– and finally: play with the data… maybe do a few graphs, maybe generate a custom search engine over the resources (or add it to my UK HEI Library Community CSE [about] etc etc)
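And here’s a quick sketch of the first couple of steps, assuming requests and BeautifulSoup; the tag cloud URL and the page structure are guesses that would need checking against the actual site:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://www.digitalscholarship.ac.uk/"

def get_links(page_url):
    """Grab (text, absolute URL) pairs for every anchor on a page."""
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    return [(a.get_text(strip=True), urljoin(page_url, a["href"]))
            for a in soup.find_all("a", href=True)]

# Step 1: scrape the (assumed) tag cloud page for tag and learning topic page URLs.
tag_pages = get_links(urljoin(BASE, "tags"))  # the path is a guess

# Step 2: collect candidate resource page URLs from each tag/topic page, deduping as we go.
resource_pages = set()
for _text, url in tag_pages:
    resource_pages.update(u for _t, u in get_links(url))

print(len(resource_pages), "candidate resource pages found")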

PS Martin Weller has blogged a short note about the process used to identify OERs included in the site here: Two OER sites for researchers.