When Will It Just Be Machines Talking to Machines?

Do you ever get the feeling that the machines are trying to influence the way you think, or take charge of your communicative acts?

One of the things I noticed for the first time today was that my WordPress editor seems to have started converting some pasted-in URLs to actual links, using the pasted URL as the href attribute value and the link text pulled from the referenced page (WordPress Editor Generates Page Title Links From Pasted URLs). Thinking about it, this is an example of an auto*completion* behaviour in which the machine has detected some pattern and “completed” it, based on the assumption that I intend to “complete” the pattern by turning it from a URL to a web hyperlink.

That is, I paste in X but actually want to represent it as [Y](X) (a link represented in markdown, where Y is the link text and X the target URL) or <a href="X">Y</a> (an HTML link).
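
As a minimal sketch of that sort of pattern completion, here’s one way a machine might fill in the Y for you (the title-scraping regex is purely illustrative; a real implementation would use a proper HTML parser and fall back gracefully):

import re
import requests

def markdown_link(url):
    # Fetch the page and pull out its <title> to use as the link text
    html = requests.get(url, timeout=10).text
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    title = m.group(1).strip() if m else url
    return '[{}]({})'.format(title, url)

print(markdown_link('https://blog.ouseful.info/'))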

I imagine most people are familiar with the notion that Google offers a range of autocompletion and autosuggestion terms when you start to type in a Google web search (I don’t think voice search (yet?) starts to interrupt you as you ‘ok Google’ it; I don’t knowingly have any voice interfaces activated…)…

What I’ve also noticed over the last few days is that a Gmail update seems to have come along with a new, positively set default that opts me in to an autocomplete service there, at least when I’m replying to an email:

This service has been available since May 2018, at least: Write emails faster with Smart Compose in Gmail.

In look and feel, it’s very reminiscent of code autocompletion support in programming code editors. If you aren’t a programmer, know that computer programmes are essentially composed of fixed vocabulary terms (whether imposed by the language or defined within the programme itself), so code completion makes absolute sense to the people who built the Gmail software application and user interface. Why on earth wouldn’t you want it everywhere…

Without even thinking too hard about it, a couple of things concern me:

  1. What could possibly go wrong…
  2. Does autocomplete change what people intend to write?

In the paper Responsible epistemic technologies: A social-epistemological analysis of autocompleted web search, Miller & Record write:

[U]sers’ exposure to autosuggestions is involuntary. Users cannot type a search without encountering autosuggestions. Once seen, they cannot “unsee” the results. …

Psychology suggests two likely consequences of involuntary exposure. First, initially disregarded associations sometimes transform into beliefs because humans are prone to source-monitoring errors: subjects mistake the original information source and may put more or less credence in the information than they would have given the correct source (e.g. Johnson et al., 1993). Someone might read [something] and initially disregard it, but later, having forgotten the source, recall having read it. This is supported by the sleeper effect, according to which when people receive a message with a discounting cue, they are less persuaded by it immediately than later in time (Kumkale and Albarracín, 2004). Second, involuntary exposure to certain autosuggestions may reinforce unwanted beliefs. Humans are bad at identifying and rooting out their implicit biases (Kenyon, 2014). Because exposure is involuntary, even subjects hygienic in their epistemic practices may be negatively affected.

[A]utosuggestions interactively affect a user’s inquiry, leading to paths she might not have pursued otherwise. Effectively, if a user looks at the screen, she can’t help but see the autosuggestions, and these impressions can affect her inquiry. Autosuggestions may seem to a user to delimit the possible options or represent what most people find relevant, either of which may change her search behavior. She may change her search terms for one of the suggestions, add or subtract additional terms to rule out or in suggested results. She may abandon her search altogether because the autosuggestions seem to provide the answer or indicate that there is no answer to be found; that is, she may assume that because nothing is being suggested, no results for the query exist. Furthermore, because the displayed information may be incomplete or out of context, she might reach a different conclusion on the basis of autosuggestions than if she actually visited the linked page.

Altering a user’s path of inquiry can have positive effects, as when he is exposed to relevant information he might not have encountered given his chosen search terms. But the effects may also be negative. … Such derails in inquiry may be deleterious… .

Finally, autosuggestions affect users’ belief formation process in a real-time interactive and responsive manner. “It helps to complete a thought,” as one user put this (Ward et al., 2012: 12). They may thus generate beliefs the user might not have had. Based on autosuggestions, I might erroneously believe [X]. Alternatively, I might come to believe that these things are possible, where before I held no beliefs about them, or I might give these propositions more credence than I would otherwise. Autocomplete is like talking with someone constantly cutting you off trying to finish your sentences. This can be annoying when the person is way off base or pleasant when he seems like your mind-reading soulmate. Either way, it has a distracting, attention-shifting effect that other interactive interface technologies lack.

As an aside, I also note that as well as offering autosuggestion possibilities that intrude on our personal communicative acts, the machine is also acting as a proxy that can buffer us from having to engage in those acts at all. Spam filtering is one example. (I tend not to review my spam filter folders, so I’m not sure how many legitimate emails get diverted there. Hmm, thinks: does a contemporary version of the OSS Simple Sabotage Field Manual include suggestions to train corporate spam filters on legitimate administrative internal emails?)

A good example of creeping intermediation comes in the form of Google Duplex, a voice agent / assistant demoed earlier this year that can engage in certain phone-based, voice interactions on your behalf. It’s about to start appearing in the wild on Pixel phones (Pixel 3 and on-device AI: Putting superpowers in your pocket).

One of the on-device features that will be supported is a new Call Screen service:

You can see who’s calling and why before you answer a call with the help of your Google Assistant. …

  1. When someone calls, tap Screen call.
  2. The Google Assistant will … ask who’s calling and why. Then you’ll see a real-time transcript of how the caller responds.
  3. Once the caller responds, choose a suggested response or an action. Here are some responses and what the caller will hear:
    • Is it urgent? – “Do you need to get a hold of them urgently?”
    • Report as spam – “Please remove this number from your mailing and contact list. Thanks, and goodbye.”
    • I’ll call you back – “They can’t talk right now, but they’ll give you a call later. Thanks, and goodbye.”
    • I can’t understand – “It’s difficult to understand you at the moment. Could you repeat what you just said?”

But not actually “transfer” the call to the user so they can answer it?!

According to Buzzfeed (The Pixel 3: Everything You Need To Know About Google’s New Phone), the Call Screen bot will answer the phone for you and challenge the caller: “The person you’re calling is using a screening service and will get a copy of this conversation. Go ahead and say your name and why you’re calling.” This raises the interesting question of how another (Google) bot on the calling side might respond…

(By the by, thinks: phone receptionists – the automated voice assistants will be after your job…)

It’s probably also worth remembering that:

[s]ometimes Call Screen may not understand the caller. To ask the caller to repeat themselves, tap I can’t understand. The caller will hear, “It’s difficult to understand you at the moment. Could you repeat what you just said?”

So now rather than you spending a couple of seconds to answer the phone, realise it’s a spam caller, and hang up, you have to take even more time out waiting on Call Screen, reading the Call Screen messages and training it a bit further when it gets stuck? But I guess that’s how you pay for its freeness.

Anyway, as part of your #resistance defense toolkit, maybe add that phrase to your growing list of robot tells. (Is there a full list anywhere?)

As well as autocomplete and autosuggest, I note the ever-engaging Pete Warden blogging recently on the question of Will Compression Be Machine Learning’s Killer App?:

One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each one out as a series of lines in a log file. That would create a very simplistic story of what the camera sees over time, I think of it as a narrative sensor.

The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.

Which is to say: compress the image by creating a description of it, then generate an image from that description at the other end. A picture may save a thousand words, but if the thousand words compress smaller than the picture in terms of bits and bytes, that makes sense to the data storage and transmission folk, albeit at the trade-off of increased compute requirements at either end.
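
As a back-of-the-envelope illustration of the arithmetic (the caption and the image size are both made up):

import gzip

# Made-up example values: a short caption versus a nominal small JPEG
caption = 'a black dog catching a red frisbee on a beach at sunset'
caption_bytes = len(gzip.compress(caption.encode('utf-8')))
nominal_jpeg_bytes = 150000  # assumed size of a typical small photo

print('caption, gzipped:', caption_bytes, 'bytes')
print('nominal JPEG:    ', nominal_jpeg_bytes, 'bytes')
print('ratio: ~{:.0f}x'.format(nominal_jpeg_bytes / caption_bytes))

# On the receiving side, a (hypothetical) generative model call such as
#   reconstructed = text_to_image_model(caption)
# would regenerate an image carrying the same meaning, not the same pixels.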

Hmm, this reminds me of a thinkses from over a decade ago on The Future of Music:

My expectation over the last 5 years or so was that CD singles/albums would start to include remix applications/software studios on that medium – but I’ve been tracking it as a download reality on and off for the last 6 months or so (though it’s been happening for longer).

That said – my expectation of getting the ‘src’ on the CD was predicated on the supply of the remix application on the CD too, rather than it being pre-installed on the users’ computer.

The next thing I’m looking out for is a ‘live by machine’ gig, where a club franchise has real hardware/synths being played at a distance by the band, who are maybe in another venue owned by that club chain?

For this, you have to imagine banks of synths receiving (MIDI) control signals over the net from the real musicians playing live elsewhere.

This is not so much online jamming (or here: eJamming) – where you mix realtime audio feeds from other musicians on the web with your own efforts – as real time creation of the music from audio generators…

It’s also interesting to note that the “reproducibility” requirement, in the sense of shipping the software tooling needed to make use of the data as well as the data itself (“predicated on the supply of the remix application on the CD too”), was in my thinking even then…

Pete Warden goes on:

It’s not just images

There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.

I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.

Ah ha…

Tangentially related, ramblings on Google languaging: Translate to Google Statistical (“Google Standard”?!) English? and Google Translate Equilibrium Finder. (FWIW, these aren’t machine generated “related” items: they’re old thoughts I remembered blogging about before…)

WordPress Editor Generates Page Title Links From Pasted URLs

Noting that if I paste a URL into my WordPress.com visual editor, behind the scenes it can look up the link, pull a page title back, and create a link using the title as link text, with the href set to the URL I pasted in:

I’m not sure if this requires any particular metadata on the page referenced by the link? Certainly, it doesn’t seem to work for every URL? But then, Pete Warden’s blog – what do you expect?! ;-)

Will Compression Be Machine Learning’s Killer App?

Here’s a closer look, watching the page traffic that’s returned using browser developer tools (View->Developer in Chrome):

This is what’s returned:

{"success":true,"data":{"body":"<a href="https:\/\/petewarden.com\/2018\/10\/16\/will-compression-be-machine-learnings-killer-app\/">Will Compression Be Machine Learning’s Killer App?","attr":{"width":676,"height":1000}}}


And this is what was sent:

I wonder if the same mechanic is used to embed YouTube videos when you paste in a YouTube URL? Although that could be done in the web page itself (you can generate the YouTube embed code simply by extracting the video ID from the pasted URL and constructing the embed code from a template).
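
For example, a minimal sketch of that client-side route (the regex only covers the common watch and short URL forms):

import re

def youtube_embed(url, width=560, height=315):
    # Pull the 11-character video ID out of a pasted YouTube URL
    m = re.search(r'(?:v=|youtu\.be/)([\w-]{11})', url)
    if not m:
        return None
    # Drop the ID into a boilerplate iframe embed template
    return ('<iframe width="{}" height="{}" '
            'src="https://www.youtube.com/embed/{}" '
            'frameborder="0" allowfullscreen></iframe>').format(width, height, m.group(1))

print(youtube_embed('https://www.youtube.com/watch?v=dQw4w9WgXcQ'))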

Styling Python and SQL Code in Jupyter Notebooks

One of the magics we use in the TM351 Jupyter notebooks is the ipython-sql magic that lets you create a connection to a database server (in our case, a PostgreSQL database) and then run queries on it:
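
If you haven’t seen it, usage looks something like the following (two separate notebook cells; the connection string is made up for illustration):

# Cell 1: load the ipython-sql extension and connect to the database
%load_ext sql
%sql postgresql://testuser:testpass@localhost:5432/tm351

# Cell 2: run a query using the %%sql block magic
%%sql
SELECT table_name FROM information_schema.tables LIMIT 5;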

Whilst we try to use consistent code styling across the notebooks, such as capitalisation of SQL reserved words (SELECT, FROM, WHERE etc), sometimes inconsistencies can crop in. (The same is true when formatting Python code.)

One of the notebook extensions can help in this respect: code prettifier. This extension allows you to style one or all code cells in a notebook using a templated recipe:

The following snippet applies to Python cells and will apply the yapf Python code formatter to Python code cells by default, or the sqlparse SQL code formatter if the cell starts with a %%sql block magic. (It needs a bit more work to cope with %sql line magic, in which case the Python formatter needs to be applied first, and then the SQL formatter applied from the start of the %sql line magic to the end of the line.)

#"library":'''
import json
def code_reformat(cell_text):
    import yapf.yapflib.yapf_api
    import sqlparse
    import re
    comment = '--%--' if cell_text.startswith('%%sql') else '#%#'
    cell_text = re.sub('^%', comment, cell_text, flags=re.M)
    reformated_text = yapf.yapflib.yapf_api.FormatCode(cell_text)[0] if comment=='#%#' else sqlparse.format(cell_text, keyword_case='upper')
    return re.sub('^{}'.format(comment), '%', reformated_text, flags=re.M)
#''',
#"prefix":"print(json.dumps(code_reformat(u",
#"postfix": ")))"

Or as a string:

"python": {\n"library": "import json\ndef code_reformat(cell_text):\n import yapf.yapflib.yapf_api\n import sqlparse\n import re\n comment = '--%--' if cell_text.startswith('%%sql') else '#%#'\n cell_text = re.sub('^%', comment, cell_text, flags=re.M)\n reformated_text = yapf.yapflib.yapf_api.FormatCode(cell_text)[0] if comment=='#%#' else sqlparse.format(cell_text, keyword_case='upper')\n return re.sub('^{}'.format(comment), '%', reformated_text, flags=re.M)",
"prefix": "print(json.dumps(code_reformat(u",
"postfix": ")))"\n}

On my to do list is to find a way of running the code prettifier over notebooks from the command line using nbconvert. If you have a snippet that shows how to do that, please share via the comments:-)
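
In the meantime, here’s one possible route that sidesteps nbconvert altogether: read the notebook with nbformat, pass each code cell through the code_reformat() function defined above, and write the result back (an untested sketch):

import sys
import nbformat

# Usage: python prettify_nb.py notebook.ipynb
nb = nbformat.read(sys.argv[1], as_version=4)
for cell in nb.cells:
    if cell.cell_type == 'code' and cell.source.strip():
        cell.source = code_reformat(cell.source)
nbformat.write(nb, sys.argv[1])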

PS clunky, but this sort of handles the line magic?

import json

def code_reformat(cell_text):
    import yapf.yapflib.yapf_api
    import sqlparse
    import re

    def sqlmatch(match):
        # Reformat just the SQL part of a %sql line magic
        return '%sql' + sqlparse.format(match.group(1), keyword_case='upper')

    # Mask any magics so the formatter doesn't choke on the % prefix
    comment = '--%--' if cell_text.startswith('%%sql') else '#%#'
    cell_text = re.sub('^%', comment, cell_text, flags=re.M)
    reformatted_text = yapf.yapflib.yapf_api.FormatCode(cell_text)[0] if comment == '#%#' else sqlparse.format(cell_text, keyword_case='upper')
    reformatted_text = re.sub('^{}'.format(comment), '%', reformatted_text, flags=re.M)
    if comment == '#%#':
        # In a Python cell, also run the SQL formatter over any %sql line magics
        reformatted_text = re.sub('%sql(.*)', sqlmatch, reformatted_text, flags=re.MULTILINE)
    return reformatted_text

NB the sqlparse function doesn’t seem to handle functions (e.g. count(*)) [bug?] but this horrible workaround hack, substituting for sqlparse.format(), may provide a stopgap?

# Replace sqlparse.format with sqlhackformat(), defined as follows:
import re
import sqlparse

def sqlhackformat(sql):
    # Escape the brackets, parse, then unescape the brackets
    return re.sub(r'\\(.)', r'\1', sqlparse.format(re.escape(sql), keyword_case='upper'))

If there is no semi-colon at the end of the final statement, we could handle any extra white space and add one with something like: s = '{};'.format(s.strip()) if not s.strip().endswith(';') else s.strip().

Fragment – ROER: Reproducible Open Educational Resources

Fragment, because I’m obviously not making sense with this to anyone…

In the words of David Wiley (@opencontent), in defining the “open” in open content and open educational resources [link], he identifies “the 5R activities” that are supported by open licensing:

  1. Retain – the right to make, own, and control copies of the content (e.g., download, duplicate, store, and manage)
  2. Reuse – the right to use the content in a wide range of ways (e.g., in a class, in a study group, on a website, in a video)
  3. Revise – the right to adapt, adjust, modify, or alter the content itself (e.g., translate the content into another language)
  4. Remix – the right to combine the original or revised content with other material to create something new (e.g., incorporate the content into a mashup)
  5. Redistribute – the right to share copies of the original content, your revisions, or your remixes with others (e.g., give a copy of the content to a friend)

Whilst the legal framework is the one that has to be in place for educational institutions as publishers of (third party) content, where there is particular emphasis on citing others and not reusing content in an unacknowledged way, I have always been more interested in the practice of reusing content, particularly when that means reuse with modification.

Various others have suggested sixth Rs. For example, Chris Aldrich’s The Sixth “R” of Open Educational Resources identifies “Request update (or maybe pull Request, Recompile, or Report to keep it in the R family?)”, in the sense of keeping stuff in a repo so people can fork the repo, update it and track changes (maybe Revisions is a better R for that?). Rather than being an R relating to a right you can assert, Revisions is more about the practice.

As a trawl through the history of this blog suggests (for example, Open Content Anecdotes from nigh on a decade ago), I’ve also been less interested in the legal framework around OERs than I am in the practical reuse (with modification) of particular (micro? atomic?) assets, such as diagrams, or problem sets. That is, the things you are more likely to spot as relevant, useful or interesting and weave into your own materials, or use to replace your own crappy diagrams.

To a large extent, the legal bit doesn’t stop me, particularly if no-one finds out. The blocker is in the practicalities associated with reversioning the physical asset, making the actual changes or modifications to it that make it appropriate for my course.

(You can always redraw diagrams, which can also help you get round copyright on non-openly licensed works, but that takes time, skill, and maybe a drawing package you don’t have access to.)

So the idea that I started trying to crystallise out almost a year ago now — OERs in Practice: Re-use With Modification — is based around another R, a leading R, or a +1 R, which (as with Aldrich’s suggested R) is a practice based R: Reproducibility. (Hmm… maybe this post should have been The 5+2Rs of Open Content?)

By their very nature, these resources are resources that include their own “source code” for creating assets, so that if you want to create a modified version of the asset, you can modify the source and then regenerate the asset. A trivial example is to use diagrams that are “written diagrams” – diagrams generated from textual, written descriptions or codified versions of them and rendered from the description by a particular chart generating tool (for example, Writing Diagrams, Writing Diagrams – Boxes and Arrows and Writing Diagrams (Incl. Mathematical Diagrams)).
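
For example, here’s a trivial written diagram using the Python graphviz package (the workflow it draws is just an illustration): to change the diagram, you edit the text and re-render.

# pip install graphviz (also needs the Graphviz binaries installed)
from graphviz import Digraph

dot = Digraph(comment='A written diagram: edit the text, regenerate the image')
dot.edge('source description', 'rendering tool')
dot.edge('rendering tool', 'diagram asset')
dot.edge('diagram asset', 'course materials')

dot.render('workflow', format='png')  # writes workflow.png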

As to why this is a fragment, I’m stopping here… discussion about reproducibility is elsewhere, and will be to follow too, along with why this approach is opens up new opportunities for educators as well as learners. For now, see these other fragments on that topic in date order.

[Update: via a comment, @opencontent reminds me that he also made the distinction between legal and practical issues, with practical concerns raised in the ALMS framework – I should have read on from the legal issues to the Poor Technical Choices Make Open Content Less Open section… See the comment thread to this post for more, as well as this related post from the previous time the ALMS model was raised to my attention: Open ALMS.  I also note this recent IRRODL paper on Defining OER-Enabled Pedagogy, which I need to read through…]

PS some more related fragments from a hastily written, unsuccessful internal Esteem Project bid:

Description

One problem associated with producing rich educational materials is that inconsistencies can occur when cross referencing text with media assets such as charts, tables, diagrams and computer code produced via different production routes. The project will explore and demonstrate how emerging technologies and workflows developed to support reproducible work practices can be adopted for the development of reproducible educational resources, including but not limited to educational materials rich in mathematical content, scientific / engineering diagrams, exploratory and explanatory statistics, maps and geospatial analysis, music theory and analysis, interactive browser based activities, animations and dynamically created audio assets.

The aim is to demonstrate:

  • The range of assets that can be produced / directly authored by academics including static, animated and interactive elements such as print quality drawings, animated scientific diagrams, and interactive web activities and applications (eg interactive maps, 3D models, etc.)
  • The workflows associated with demonstrating the production, maintenance, and reuse with modification of the assets and works derived from them, including but not limited to variations on a theme in the production of parameterised assessment materials
  • The potential for using reproducible materials to facilitate maintenance, reversioning / updating and reuse with modification of module materials

The outputs will include:

  • A range of OU module and OpenLearn unit materials reworked using the proposed technologies and workflows
  • A library of reproducible educational resource templates for a range of topic areas, capable of being reused with modification in order to produce a range of assets from a common template.

The project will demonstrate how freely available, open source technologies can be used to support the direct authoring of rich and interactive media assets in a reproducible way.

Rationale

Reproducible research tools increasingly support the direct authoring of rich documents that blend text, data, code, code outputs, with media assets (audio, video, static and animated images, interactives) generated from text based computer scripts.

The project proposes the co-option of such tools for use as authoring tools for reproducible educational materials. A single “source document” can include text as well as scripts that generate tables and charts, for example, from data contained within the document itself, minimising the distance between the production of assets and the materials they are used in.
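
A minimal sketch of the idea (the data values here are invented): the chart is generated from numbers embedded in the same source document, so updating the numbers and re-running regenerates the asset.

import matplotlib.pyplot as plt

# Illustrative data, embedded in the source document itself
years = [2014, 2015, 2016, 2017, 2018]
enrolments = [120, 135, 150, 142, 160]

plt.plot(years, enrolments, marker='o')
plt.xlabel('Year')
plt.ylabel('Enrolments')
plt.title('Chart regenerated from in-document data')
plt.savefig('enrolments.png')  # the asset is a by-product of the source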

The resulting workflow supports consistency in production and maintenance as well as reuse with modification thereafter by allowing updates in situ that can be used to recreate modified assets (diagrams created dynamically using updated values, for example). Materials will also be modifiable by ALs for tutorial use.

Authoring tools also support the direct authoring and creation of interactive components based around templated third party widgets such as 3D molecule viewers, or interactive maps, allowing authors to directly author interactive components that can be embedded in online / browser accessed materials.

Examples of the sorts of assets I had in mind to rework can be found in several of the notebooks available here.

Jupyter Notebooks Seep into the Everyday…

If you ever poke around in developer documentation, you’ll be familiar with the idea that it often contains code fragments, and things to copy and paste into the command line.

This morning, I noticed an announcement around the Android Management API, and in particular the detail of the Quick Start Guide:

To get started with the Android Management API, we’ve created a Colab notebook that you can follow to enroll (sic) an enterprise, create a policy, and provision a device.

 

Colab notebook [their strong emphasis]…

Which is to say, a Jupyter notebook running on Google’s collaborative Colab notebook platform.

Here’s what the start of the quick start docs, notebook style, look like:

As you’d expect, a blend of text and code. You can see the first code block at the bottom of the screenshot, where you can enter your own personal project id; later code cells walk you through connecting to the API and working with it:

 

So – that’s interesting thing number one… Google casually using notebooks as part of “operationalised” tutorial materials (interactive documentation? operationalised documentation?).

See also: [Jupyter Notebooks for Scheduling HESA Data Returns? If It’s Good Enough for Netflix…](https://blog.ouseful.info/2018/09/25/jupyter-notebooks-for-hesa-data-returns/) for how notebooks can be used in production environments for scheduling operations, and not just for analysis.

Second thing: Colab. It’s been some time since I looked at it, but one thing that jumped out at me was the inclusion of code snippets in the sidebar…

These working demo code snippets can be added to the document at the cursor point:

Running them produces the example output, as you’d expect, in the output part of the cell:

Snippets to add linking and brushing (data vis 101; look it up…) or linked charts are also available:

I’m not sure if this sort of interaction (in terms of getting interactive assets into a document) is what the OpenCreate team are looking to provide, but this may be worth looking at just to see how the interaction feels, and the way in which it allows authors to add live interactive charts, for example, to a content notebook.

 

In other news, my belated and hastily written project bid to the internal Esteem group to look at creating “reproducible educational materials” as demonstrated by reworking examples of OpenLearn materials was rejected as “not a clearly defined scholarship project”. I’m going to carry on with it anyway, because I think we can learn a lot from it about how notebook style environments can be used to produce:

  • new forms of interactive educational material, opened up by the availability of “generative” content creation code/magics;
  • educational resources that are by their very “source included” nature capable of reuse with modification;
  • educational resources that can provide students with a rich range of interactive activities, presented inline/contextualised in a narrative document, added to the document by authors directly with little, if any, technical / programming skills;
  • an interactive environment where students/learners can create their own interactives and / or generative examples to explore a particular topic or idea, again without the need for much in the way of technical / programming skills.

I’m also going to start working on a set of training resources around Jupyter notebooks using my 0.2FTE not OU time; depending on how that goes, I may try to turn Jupyter training / development into an exit plan. Please get in touch — tony.hirst@open.ac.uk — if this sounds of interest…

Name (Date) Title, Available at: URL (Accessed: DATE): So What?

Academic referencing is designed, in part, to support the retrieval of material that is being referenced, as well as recognising provenance.

The following guidance, taken from the OU Library’s Academic Referencing Guidelines, is, I imagine, typical:

That page appears in an OU hosted Moodle course (OU Harvard guide to citing references) that requires authentication. So whilst the citation states the provenance, it won’t necessarily support the retrieval of content from that site for most people.

Where an (n.d.) — no date — citation is provided, it also becomes hard for someone checking the page in the future to tell whether or not the content has changed, and if so, which parts.

Looking at the referencing scheme for organisational websites, there’s no suggestion that any authentication requirement should be listed in the citation (the same is true in the guidance for citing online newspaper articles).

 

I also didn’t see any guidance offhand on how to reference pages whose presentation is likely customised by “an algorithm” according to personal preferences or interaction history; placement of things like ads is generally dynamic, and often personalised (personalisation may be based on multiple things, such as the cookie state of the browser with which you are looking at a page, or the history of transactions (sites visited) from the IP address you are connecting to a site from).

This doesn’t matter for static content, but it does matter if you want to reference something like a screenshot / screencapture, for example showing the results of a particular search on a web search engine. In this case, adding a date and citing the page publisher (that is, the web search engine, for example) is about as good as you can get, but it misses a huge amount of context. The fact that you got extremist results might be because your web history reveals you to be a raging fanatic, and the fact that you grabbed the screenshot from the premises of your neo-extremist clubhouse just added more juice to the search. One partial solution to disabling personalisation features might be to run a search in a “private” browser session where cookies are disabled, and cite that fact, although this still won’t stop IP address profiling and browser fingerprinting.

I’ve pondered related things before, eg when asking Could Librarians Be Influential Friends? And Who Owns Your Search Persona?, as well as in a talk given 10 years ago and picked up at the time by Martin Weller on his original blog site (Your Search Is Valuable To Us; or should that be: Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html (Accessed 26 September 2018).?).

Most of the time, however, web references are to static content, so what role does the Accessed on date play here? I can imagine discussions way back when, when this form was being agreed on (is there a history of the discussion that took place when formulating and adopting this form?), where someone said something like “what we need is to record the date the page was accessed on and capture it somewhere”, and then the second part of that phrase was lost, or disregarded as being too “but how would we do that?”…

One of the issues we face in maintaining OU courses, where content starts being written two years before a course starts and is expected to last for 5+ years of presentation, is maintaining the integrity of weblinks. Over that period of time, you might expect pages to change in a couple of ways, even if the URL persists and the “content” part remains largely the same:

  • the page style (that is, the view as presented) may change;
  • the surrounding navigation or context (for example, sidebar content) may change.

But let’s suppose we can ignore those. Instead, let’s focus on how we can try to make sure that a student can follow a link to the resource we intend.

One of the things I remember from years ago were conversations around keeping locally archived copies of webpages and presenting those copies to students, but I’m not sure this ever happened. (Instead, there was a sort of middle ground compromise of running link checkers, but I think that was just to spot 404 page not found errors, rather than checking a hash made on the content you were interested in, which would be difficult.)
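
A content-hash based check needn’t be much more work than a 404 check, at least in sketch form (the hard part, stripping out dynamic boilerplate before hashing, is ignored here):

import hashlib
import requests

def check_link(url, known_hash=None):
    # Fetch the page, test the status code, and hash the content so a
    # later run can spot changes as well as outright breakages
    r = requests.get(url, timeout=15)
    content_hash = hashlib.sha256(r.content).hexdigest()
    return {'url': url,
            'status': r.status_code,
            'changed': known_hash is not None and content_hash != known_hash,
            'hash': content_hash}

print(check_link('https://blog.ouseful.info/'))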

At one point, I religiously kept archived copies of pages I referenced in course materials so that if the page died, I could check back on my own copy to see what the sense of the page now lost was so I could find a sensible alternative, but a year or two off course production and that practice slipped.

Back to the (Accessed DATE) clause. So what? In Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs I mentioned a couple of Wikipedia bots that check link integrity on Wikipedia (see also: Internet Archive blog: More than 9 million broken links on Wikipedia are now rescued). These can perform actions like archiving web pages, checking links are still working, and changing broken links to point to an archived copy of the same link. I hinted that it would be useful if the VLE offered the same services. They don’t; at least, not going by reports from early starters on this year’s TM351 presentation, who are already flagging up broken links. (Do we not run a link checker anymore? I think I asked that in the Broken URLs post a year ago, too…)

Which is where (Accessed DATE) comes in. If you do accede to that referencing convention, why not make sure that an archived copy of that page exists, ideally made on that date. Someone chasing the reference can then see what you accessed, and perhaps, if they are visiting the page somewhen in the future, see how the future page compares with the original. (This won’t help with authentication controlled content or personalised page content, though.)

An easy way of archiving a page in a way that others can access it is to use the Internet Archive’s Wayback Machine (for example, If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine).
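
You can also trigger a capture programmatically: a simple GET request to web.archive.org/save/ followed by the target URL asks the Wayback Machine to make a snapshot (a sketch; behaviour and rate limits may vary):

import requests

url = 'http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html'

# Ask the Wayback Machine's Save Page Now service to archive the page
r = requests.get('https://web.archive.org/save/{}'.format(url), timeout=60)
print(r.status_code)
print(r.url)  # the archived copy's URL, if the capture succeeded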

From the Wayback Machine homepage, you can simply add a link to a page you want to archive:

 

hit SAVE NOW (note, this is saving a different page; I forgot to save the screenshot of the previous one, even though I had grabbed it. Oops…):

and then you have access to the archived page, on the date it was accessed:

A more useful complete citation would now be Weller, M. (2008) 'Your Search Is Valuable To Us' *The Ed Techie*, 9 September [Blog] Available at http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html (Accessed 26 September 2018. Archived at https://web.archive.org/web/20180926102430/http://nogoodreason.typepad.co.uk/no_good_reason/2008/10/your-search-is-valuable-to-us.html).

Two more things…

Firstly, my original OUseful.info blog was hosted on an OU blog server; when that was decommissioned, I archived the posts on a subdomain of .open.ac.uk I’d managed to grab. That subdomain was deleted a few months ago, taking with it the original blog archive. Step in the Wayback Machine. It didn’t have a full copy of the original blog site, but I did manage to retrieve quite a few of the pages using this wayback machine downloader, via the command wayback_machine_downloader http://blogs.open.ac.uk/Maths/ajh59 or, for a slightly later archive, wayback_machine_downloader http://ouseful.open.ac.uk/blogarchive. I made the original internal URLs relative (find . -name '*.html' | xargs perl -pi -e 's/http:\/\/blogs.open.ac.uk\/Maths\/ajh59/./g', or as appropriate for http://ouseful.open.ac.uk/blogarchive), used a similar approach to remove tracking scripts from the pages, uploaded the pages to Github (psychemedia/original-ouseful-blog-archive), enabled the repo as a Github Pages site, and the pages are now at https://psychemedia.github.io/original-ouseful-blog-archive/pages/. It looks like the best archive is at the UK Web Archive (for example, https://www.webarchive.org.uk/wayback/archive/20170623023358/http://ouseful.open.ac.uk/blogarchive/010828.html), but I can’t see a way of getting a bulk export from that?

Secondly, bots; VLE bots… Doing some maintenance on TM351, I notice it has callouts to other OU courses, including TU100, which has been replaced by TM111 and TM112. It would be handy to be able to automatically discover references to other courses made from within a course, to support maintenance. Using some OU-XML schema markup to identify such references would be sensible? The OU-XML document source structure should provide a veritable playground for OU bots to scurry around. I wonder if there are any, and if so, what do they do?
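
Failing explicit markup, even a crude scan over the OU-XML source would surface candidates for checking (the course code pattern and the file layout here are both assumptions):

import re
from pathlib import Path

# Assumed pattern for OU course codes, e.g. TU100, TM351, TM112
COURSE_CODE = re.compile(r'\b(T[MU]\d{3})\b')

for xml_file in Path('ou-xml-source').glob('**/*.xml'):
    codes = set(COURSE_CODE.findall(xml_file.read_text(errors='ignore')))
    if codes:
        print(xml_file.name, '->', sorted(codes))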

 

PS via Richard Nurse, reminding me that Memento is also useful when trying to track down original content and/or retrieve content for broken link pages from the Internet Archive: UK Web Archive Mementos search and time travel.

Richard also comments that “OU modules are being web archived by OU Archive – 1st, mid point and last presentation -only have 1 in OUDA currently staff login only – on list to make them more widely available but prob only to staff given 3rd party rights in OU courses“. Interesting…

PPS And via Herbert Van de Sompel, a list of archives accessed via time travel, as well as a way of decorating web links to help make them a bit more resilient: Robust Links – Link Decoration.

By the by, Richard, Kevin Ashley and @cogdog/Alan also point me to the various browser extensions that make life easier adding pages to archives or digging into their history. Examples here: Memento tools. I’m not sure what advice the OU Library gives to students about things like this; certainly my experience of interactions with students, academics and editors alike around broken links suggests that not many of them are aware of the Internet Archive / UK Web Archive, Wayback Machine, etc etc?

OUseful.info – where the lede is usually buried…