More Thoughts On Jupyter Notebook Search

Following on from the initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it, and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it appears in a document.
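A minimal sketch of the shape of the thing (the gist does it more completely) might look like the following, assuming an FTS5 enabled SQLite build and with demo.ipynb as a placeholder filename:

import json
import sqlite3

conn = sqlite3.connect("notebooks.db")
# An FTS5 virtual table: every column is searchable via MATCH
conn.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS cells
                USING fts5(notebook, cell_type, source)""")

def index_notebook(path):
    """Add the code and markdown cells of one notebook to the index."""
    with open(path) as f:
        nb = json.load(f)
    for cell in nb["cells"]:
        # Cell source may be a list of lines or a single string;
        # "".join() copes with either
        src = "".join(cell["source"])
        conn.execute("INSERT INTO cells VALUES (?, ?, ?)",
                     (path, cell["cell_type"], src))
    conn.commit()

index_notebook("demo.ipynb")

# A simple full text query over the indexed cells, with a snippet()
# of the source column (column 2) shown around each match
for row in conn.execute(
        "SELECT notebook, snippet(cells, 2, '[', ']', '…', 8) "
        "FROM cells WHERE cells MATCH ?", ("pandas",)):
    print(row)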

The concordancer means we can offer a results listing more in keeping with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display than dumping the contents of a complete cell into the results listing.
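For reference, the stock NLTK concordancer displays every occurrence of the term in a fixed width window, which suggests multiple matches would just mean multiple result lines. A minimal sketch (the modified concordancer adapts this display; the punkt tokenizer models need to be downloaded first via nltk.download):

import nltk
from nltk.text import Text

cell_text = ("We can use pandas to load the data, "
             "then use pandas again to reshape it.")
tokens = nltk.word_tokenize(cell_text)

# Prints each match centred in a fixed width context window
Text(tokens).concordance("pandas", width=40)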

The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming we could apply is better suited to the markdown prose than to indexing code.
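(FTS5 does support Porter stemming, but it’s a per-table choice declared at creation time, which is part of the problem – what helps the prose probably hinders the code. A sketch, reusing the connection from the example above:)

# Stemming is opted into when the virtual table is declared;
# the porter tokenizer wraps the default unicode61 tokenizer
conn.execute("""CREATE VIRTUAL TABLE prose_cells
                USING fts5(source, tokenize='porter unicode61')""")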

Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens: % and load_ext.
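A quick check shows the behaviour (again assuming the punkt models are installed):

import nltk

print(nltk.word_tokenize("%load_ext sql"))
# -> ['%', 'load_ext', 'sql']   (the magic prefix is split off)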

A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.
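One possible way in, though I haven’t tested it: the Jupyter notebook server supports a post-save hook, configured in jupyter_notebook_config.py, which could re-index a notebook each time it’s saved (index_notebook() here is the hypothetical indexing function sketched above); deletes and renames would still need handling separately.

# In jupyter_notebook_config.py
def post_save(model, os_path, contents_manager, **kwargs):
    # Re-index notebooks (but not other file types) on each save
    if model["type"] == "notebook":
        index_notebook(os_path)

c.FileContentsManager.post_save_hook = post_save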

Surveillance Art?

An interesting sounding site, Artificial Senses, which “visualizes sensor data of the machines that surround us to develop an understanding how they experience the world”.

Artificial Senses is a project by Kim Albrecht in collaboration with metaLAB (at) Harvard, and supported by the Berkman Klein Center for Internet & Society. The project is part of a larger initiative researching the boundaries between artificial intelligence and society.

But along the way, so you can “participate”, it prompts you for access to various sensors on the device you are viewing the page from. So for example, your location:

To your camera:

And to your microphone:

Here’s the JavaScript:

var touching = true;
var seeing = false;
var hearing = false;
var orienting = false;
var moving = false;
var locating = false;

var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

// // // // // // // // // // // // // // // // // // // // Touching

document.getElementById('liveTouching').style.visibility = "visible";
document.getElementById('touchingLiveButton').style.visibility = "visible";

// // // // // // // // // // // // // // // // // // // // Seeing

var constraintsSee = {
  audio: false,
  video: { }
};

function handleSuccessSee() {
  seeing = true;
  document.getElementById('liveSeeing').style.visibility = "visible";
  document.getElementById('seeingLiveButton').style.visibility = "visible";
}
function handleErrorSee(error) {
  console.log('navigator.getUserMedia error: ', error);
}

if (!isSafari) {
  navigator.mediaDevices.getUserMedia(constraintsSee).then(handleSuccessSee).catch(handleErrorSee);
}

// // // // // // // // // // // // // // // // // // // // Hearing

var constraintsHear = {
  audio: true,
  video: false
};

function handleSuccessHear() {
  hearing = true;
  document.getElementById('liveHearing').style.visibility = "visible";
  document.getElementById('hearingLiveButton').style.visibility = "visible";
}
function handleErrorHear(error) {
  console.log('navigator.getUserMedia error: ', error);
}

if (!isSafari) {
  navigator.mediaDevices.getUserMedia(constraintsHear).then(handleSuccessHear).catch(handleErrorHear);
}

// // // // // // // // // // // // // // // // // // // // Orienting

if (!orienting) {
  window.addEventListener('deviceorientation', function(event) {

    if (event.alpha !== null) {
      // orienting = true;
      document.getElementById('liveOrienting').style.visibility = "visible";
      document.getElementById('orientingLiveButton').style.visibility = "visible";
    }

  });
}


// // // // // // // // // // // // // // // // // // // // Moving

if (!moving) {
  window.addEventListener('devicemotion', function(event) {

    if (event.acceleration.x !== null) {
      moving = true;
      document.getElementById('liveMoving').style.visibility = "visible";
      document.getElementById('movingLiveButton').style.visibility = "visible";
    }

  });
}

// // // // // // // // // // // // // // // // // // // // Locating

navigator.geolocation.getCurrentPosition(function(position) {

  locating = true;

  document.getElementById('liveLocating').style.visibility = "visible";
  document.getElementById('locatingLiveButton').style.visibility = "visible";

});

One of the things I wanted to do in my (tiny) bit of the new OU level 1 course, a section on “location based computing”, was to try to get folk to reflect on how easily tracked we are through our computational devices. (If you want to play along, try this browser based activity: sign up for a Microsoft Live account (OU staff/ALs can sign in with their OUCU@open.ac.uk credentials) and try these notebooks: TM112 Geo Activity Notebooks.)

The same course has a section on the mobile phone system more generally. I’m not sure if it has similarly minded activities that demonstrate the full range of sensors that can be found on most of today’s smartphones? If not, the Artificial Senses site might be worth adding as a resource – with a reminder for folk to disable site access to the sensors once they’re done playing…

OER Methods – Generative Designs for Reuse-With-Modification

Via my feeds (The stuff ain’t enough), I notice Martin pointing to some UNESCO draft OER Recommendations.

Martin writes:

… the resources are a necessary starting point, but they are not an end point. Particularly if your goal is to “ensure inclusive and equitable quality education and promote lifelong opportunities for all”, then it is the learner support that goes around the content that is vital.

And on this, the recommendations are largely silent. There is a recommendation to develop “supportive policy” but this is focused on supporting the creation of OER, not the learners. Similarly the “Sustainability models for OER” are aimed at finding ways to fund the creation of OER. I think we need to move beyond this now. Obviously having the resources is important, and I’d rather have OER than nothing, but unless we start recognising, and promoting, the need for models that will support learners, then there is a danger of perpetuating a false narrative around OER – that content is all you need to ensure equity. It’s not, because people are starting from different places.

I’ve always thought that too much focus has been placed on “the resources”, but I’ve never really got to grips with how the resources are supposed to be (re)used, either by educators or learners.

For educators, reuse can often come in the form of “assign that thing someone else wrote, and wrap it with your own teaching context”, or “pinch that idea and modify it for your own use”. So if I see a good diagram, I might “reuse” it by inserting it in my own materials or I might redraw it with some tweaks.

Assessment reuse (“open assessment resources”?) can be handy too: a question form that someone else has worked up that I can make use of. In some cases, the question may include either exact, or ‘not drawn to scale’ media assets. But in many cases, I would still need to do work to generalise or customise the answer, and work out my own correct answer or marking guide.

(See for example Generative Assessment Creation.)
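The shape of the idea, as a hypothetical sketch (the names and numbers are purely illustrative): the question text and the marking answer are generated from the same randomised parameters, rather than being fixed in the asset.

import random

def make_question(seed=None):
    rng = random.Random(seed)
    resistance = rng.choice([10, 22, 47, 100])  # ohms
    current = rng.choice([0.1, 0.2, 0.5])       # amps
    question = (f"A current of {current} A flows through a "
                f"{resistance} ohm resistor. What is the voltage?")
    answer = resistance * current               # V = IR
    return question, answer

q, a = make_question(seed=42)
print(q)
print(f"Answer: {a} V")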

If an asset is not being reused directly, but the idea is, with some customisation, or change in parameter values, then creating the new asset may require significant effort, as well as access to, and skills in using, particular drawing packages. In some cases the liquid paper method works: Tipp-Ex out the original numbers, write in your own, photocopy to produce the new asset. Digital cut or crop alternatives are available.

Another post in my feeds today – Enterprise Dashboards with R Markdown, via Rbloggers – described a rationale for using reproducible methods to generate dashboards:

We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like Microsoft Excel make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. …

[A particular] Excel dashboard attempts to function as a real application by allowing its users to filter and visualize key metrics about customers. It took dozens of hours to build. The intent was to hand off maintenance to someone else, but the dashboard was so complex that the author was forced to maintain it. Every week, the author copied data from an ETL tool and pasted it into the workbook, spot checked a few cells, and then emailed the entire workbook to a distribution list. Everyone on the distribution list got a new copy in their inbox every week. There were no security controls around data management or data access. Anyone with the report could modify its contents. The update process often broke the brittle cell dependencies; or worse, discrepancies between weeks passed unnoticed. It was almost impossible to guarantee the integrity of each weekly report.

Why coding is important

Excel workbooks are hard to maintain, collaborate on, and debug because they are not reproducible. The content of every cell and the design of every chart is set without ever recording the author’s actions. There is no simple way to recreate an Excel workbook because there is no recipe (i.e., set of instructions) that describes how it was made. Because Excel workbooks lack a recipe, they tend to be hard to maintain and prone to errors. It takes care, vigilance, and subject-matter knowledge to maintain a complex Excel workbook. Even then, human errors abound and changes require a lot of effort.

A better approach is to write code. … When you create a recipe with code, anyone can reproduce your work (including your future self). The act of coding implicitly invites others to collaborate with you. You can systematically validate and debug your code. All of these things lead to better code over time.

Many of the issues described there are to do with maintenance. Many of the issues associated with “reusing OERs with modification” are akin to maintenance issues. (When an educator updates their materials year on year – maintenance – they are reusing materials they have permission to use, with modification.)

In both the maintenance and the wider reuse-with-modification activity, it can really help if you have access to the recipe that created the thing you are trying to maintain. Year on year reuse is not buying 10 exact clone pizzas in the first year, freezing 9, taking one out each year, picking off the original topping and adding this year’s topping du jour for the current course presentation. It’s about saving and/or sharing the recipe and generating a fresh version of the asset each year, perhaps with some modification to the recipe.

In other words, the asset created under the reuse-with-modification licence is not subtractive/additive to the original asset, it is (re)generative from the original recipe.

This is where things like Jupyter notebooks or Rmd documents come in – they can be used to deliver educational resources that are in principle reusable-with-modification because they are generative of the final asset: the asset is produced from a modifiable recipe contained within the asset.
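By way of a trivial example, a minimal sketch (assuming matplotlib; everything here is illustrative): the chart asset is (re)generated from parameter values in the recipe, so reuse-with-modification means editing a couple of lines and re-running, rather than redrawing an image.

import numpy as np
import matplotlib.pyplot as plt

# The parameters a reuser might change for their own teaching context
amplitude = 2.0
frequency = 1.5

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, amplitude * np.sin(frequency * x))
plt.title(f"y = {amplitude} sin({frequency}x)")
plt.savefig("asset.png")  # the generated asset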

I’ve started trying to put together some simple examples of topic based recipes as Jupyter notebooks that can run on Microsoft’s (free) Azure Notebooks service: Getting Started With OER notebooks.

To run the notebooks, you need to create a Microsoft Live account, log in to notebooks.azure.com, and then clone the above linked repository.

OU staff and ALs should be able to log in using their oucu@open.ac.uk credentials. If you work for a company that uses Office 365 / Live online applications, ask them to enable notebooks too…

Once you have cloned the notebooks, you should be able to run them…

PS if you have examples of other things I should include in the demos, please let me know via the comments. I’m also happy to do demos, etc.

Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs, it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couple of related PRs:

I’m not sure about the export to a db?

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

  private DatabaseConfiguration mysqlDbConfig;
  private DatabaseConfiguration pgsqlDbConfig;
  private DatabaseConfiguration mariadbDbConfig;

It also looks like an upgrade to the internal data representation may be being considered: Research Apache Arrow to improve in-memory data model. FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that a full API can be exposed for all OpenRefine operations and commands. Once the decoupling is done, we can move to a modern front end framework and …
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
https://github.com/OpenRefine/OpenRefine/issues/1226
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?
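(Under the hood, what such client libraries wrap is OpenRefine’s HTTP API; here’s a minimal sketch listing the projects on a local instance – the endpoint name is from memory, so treat it as an assumption rather than gospel:)

import requests

resp = requests.get(
    "http://127.0.0.1:3333/command/core/get-all-project-metadata")
for project_id, meta in resp.json()["projects"].items():
    print(project_id, meta["name"])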

But how about that Edit cells > Transform > Language support for R #1226 issue? “This is a feature-request to add R support in Edit cells > Transform > Language.”

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform > Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather, or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations (see the sketch after this list)?
  • if OpenRefine was exposed as a panel in JupyterLab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it could have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.
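As a sketch of the sort of handoff imagined in the first bullet (the file name is illustrative; pandas’ feather support needs pyarrow installed, and an R process could equally read or write the same file via the arrow package):

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
df.to_feather("shared.feather")          # write from Python...

df2 = pd.read_feather("shared.feather")  # ...read back (or from R)
print(df2)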

But then let’s imagine something a bit more extreme (I’m not sure if / how this might fit into the JupyterLab architecture, indeed whether it’s possible or just imagined magic – I’m just riffing…): if the data being manipulated within OpenRefine could be synced with a representation of the data being manipulated elsewhere in the JupyterLab environment, then we could be viewing a dataset in one panel (JupyterLab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem BinderHub machinery, here’s a proof of concept from @betatim: openrefineder.

Using Your Photocopier to Share Data…

Via Charles Arthur’s Overspill, an interesting story about Digital Photocopiers Loaded With Secrets, telling a tale of how you can buy scrapped photocopiers for their hard drives and then trawl them for data, as you might do with old office computers, or phones…

A quick skim of the Xerox website turns up a photocopier product line listing that includes details of whether a photocopier includes a hard drive, along with some general guidance information:

Security Features

Jobs may be written to nonvolatile memory (e.g. to a hard drive) during processing. Generally, when a job finishes, this data is deleted, but may still be recoverable using forensic tools. Image overwrite is effective at eliminating this job data from the hard drive once the data is no longer needed. Xerox also scrambles the data with the user data encryption feature.

This further protects data at rest from unauthorized access. Xerox recommends that the following features be enabled.

Fortunately, countermeasures are built into products to reduce this risk.

• Immediate Job Overwrite or Immediate Image Overwrite is a feature that deletes and overwrites (with a specific data pattern) disk sectors that temporarily contained electronic image data. Products that use hard disk drives to store job data initiate this process at the completion of each job. … This should be enabled (and is by default on many products).
• On Demand Image Overwrite is a manually initiated (can also be scheduled) feature that deletes and overwrites (with a specific data pattern) every sector of any partitions of the hard drive that may contain customer job data. The device will be offline for a period of 20 minutes to one hour while this completes. [Makes me think of coffee machine self-clean cycles, –Ed.]
• Disk or User Data Encryption is a feature which encrypts all partitions of the hard drive that may contain customer job data with AES encryption. This should be enabled (and is by default on many products). Encryption can be used in combination with either overwrite feature.

Hard Disk Drive Retention Offering

If the security features built into Xerox products do not meet your security requirements, Xerox offers another alternative.
Hard Drive Retention Offering is a service that can be requested by a customer who wants to retain a hard drive for security reasons. A Xerox technician will remove the hard drive and leave it with the customer.

Things to Remember
• Not all products have hard disk drives.
• Some products have hard disk drives, but do not use the hard disk drive to save document images.
• If a Xerox product is powered off before an Overwrite operation completes, there may be remnants of data left on the drive. A persistent message will appear on the device indicating the incomplete overwrite operation. In this event, it is recommended that an On Demand Image Overwrite be performed.
• Image overwrite features are available for hard drive equipped devices only. Currently it is not possible to overwrite images on solid-state nonvolatile memory.

• NOTE: Xerox strongly recommends the default Administrator password be changed on all devices to prevent unauthorized access to configuration settings.

Xerox does not offer sanitization or cleansing services for returned disk drives.

Many photocopiers nowadays are intended to be accessed over a network (they double up as network printers), and may incorporate a webserver to facilitate that. Which means they may also be a network security hazard. Which is why photocopiers should be regarded as part of the IT estate, so that IT can be responsible for regularly checking a vendor’s photocopier security bulletins. (As computers, photocopiers are also susceptible to hardware/processor vulnerabilities.)

PS think also of connected vending machines?!

A Couple of Days Out Attending the Corbeau Seats Rally, 2018

This time last week, I was waiting for a boat in advance of heading off to Clacton for the weekend. The event? The Corbeau Seats Rally, 2018, England’s first closed road rally (which is to say: #firstontheroad)…

…made possible as a result of changes to the law last year. Here’s a copy of the permission granted:

Part of the service area was along the sea front, which was closed to public road traffic, but open to pedestrians, for the duration of the event:

Scrutineering was also on the front and lasted for much of Saturday, just down the road from Rally HQ, which was also doing a sideline in tea and cake. (I’m starting to think rally events are a bit like library events: they’re equally friendly, and there always seems to be cake available:-)

On Saturday evening I went to find my hotel for the night, checking out the special stage I was marshalling on beforehand. The special stage itself had already been prepared.

The MSA Blue Book, the de facto regulatory handbook for motorsport in the UK, goes so far as to define what sort of tape to use:


I also found the FIA Rally Safety Guidelines a fascinating read… (Whilst there were a couple of official spectator areas, viewing on most of the special stages was unofficial – local residents standing away from the road at the end of marshal controlled footpaths, for example. One of the key learnings I think can be taken away from the event is sighting where spectators unofficially congregated, with a view to helping more people see more of the rally safely in future.)

Fortuitously, the hotel was only 15 minutes or so from the special stage start, where I had to sign on just after 7am on Sunday morning. Knowing the likely special stage locations beforehand might be useful when booking my hotel next year.

Several competitor groups as well as officials were in the hotel I was in, and also in the local pub that night for dinner where the bar staff were quizzing drivers about the event and then passing on that information to locals.

I imagine one of the arguments made for the rally to the Tendring local council was the economic benefit the rally would bring to the area. A Clacton Gazette piece on the rally, Motor rally unveils maps of 5 stages around Tendring, also picked up this cause:

Mr Clements [event director] said the event is set to bring in hundreds of thousands of pounds to the Tendring area.

He said: “There will be 120 competing teams with each team having a driver, co-driver and support team of three or four people.

“We probably have 50 senior organisers and between 500 and 1000 marshals coming in, spending money and maybe staying over night.

“Just from the organisers side of things we have calculated that the spend will be £150,000 to £250,000.

“In addition, our best guess is that we will have five to ten thousand spectators coming.”

He said they could bring in about £250,000 to £350,000 through paying for parking, food, drinks and accommodation over the weekend of the rally.

It will be interesting to see what the final estimated economic contribution was. My spend in the area was of the order of £100, for example.

(If you search for academic papers around motorsport, you often find economic impact analyses.)

And so to rally day, and sign on, where I was given a briefing pack for my post showing where to park and where to stand,  the special stage schedule, a can of pop and a Mars bar, and an official whistle (for warning others of an approaching car when the stage was live). I’m gutted I forgot to collect an official rally T-shirt though…

The post itself was at the end of a straight run:

…just before a slight S:

…with a no-go area alongside a footpath:

…and a couple of concrete blocks to stop cars from flying up a slight earth ramp just before the corner entry:

So then it was time to sit down and wait…

After an hour or so, an incoming text message that brought a tear to my eye (honestly!) announced the start of the rally proper:

The view from my chair, watching the officials and safety cars go by, and then the full speed rally cars themselves…

Unfortunately, I don’t have any photos of those – so here’s a picture of the view from my chair, watching out for incoming vehicles. It looks like there could be some really nice walks and bike rides around there…

As to why no photos of the cars: cameras are distracting and accidents can happen quickly. Here are the tyre tracks left from a wobbly incident by car 73 on their first run through the stage:

Between stage runs, there were a couple of hours to kill, so I wandered up and down the stage to chat to folk on the posts next to mine. Up the road, Anthony Concannon of the Southern Car Club, who organise the rally stage at the Goodwood Festival of Speed (I need to sign up for that…), mentioned that they had been working with the Isle of Wight Council to run the first closed road rally when the change in legislation was originally proposed, but that delays in the legislative process had led to it falling through. It would be great if the plan to bring a closed road rally to the island could be revived though. I’d certainly help put some hours in…

After over 8 hours on post, with three runs of the stage completed, it was time to help clear away tape and signage in the vicinity, and head off. Along the way, chats with supportive local residents, several of whom were interested in how to get started marshalling. It struck me that post event advertising by the likes of GoMotorsport, making people aware of how to get involved, might pay dividends. As might giving marshals fliers and recruitment promo materials that we could hand out to spectators who might be interested in getting involved.

I’d like to thank the sponsors, organisers, local council and local residents for helping make this event possible: it was a great day out. And as for my fellow marshals: I’ll see you around…

If you are interested in getting involved, check out the GoMotorsport or Volunteers in Motorsport websites and sign up for the (free) MSA online rally marshals’ training. And definitely sign up for a Rally Marshal Taster Event if you spot one running.

PS this weekend, it’s back to my automated race / rally data journalism hacks… For example: F1Datajunkie Azerbaijan 2018 F1 Race Weekend Review or the RallyDatajunkie WRC Argentina Rally Review.