OpenRefine Style Data Wrangling Tool for VS Code?

I’ve been following tech for long enough to know that many of the shiny toys and tech tools reported in academic conferences never actually work on anybody else’s machine, and that if they ever did, the code has rotted in an unmaintained repo somewhere in the year between the submission of the paper and its actual publication.

Corps also tease product announcements, particularly in conference sessions, with releases due “any time now”, that get a bit of social media hype at the time (they used to get blog mentions…) but then never actually appear.

I’m hopeful that the following VS Code extension will appear this side of the New Year, but the release schedule is “lag a month” rather than magazine style “lead a month” cover dates (I’m guessing the issue of Racecar Engineering Magazine that hit our letterbox a few days ago is the January, 2023, issue (maybe even February, 2023?!); by contrast, the November release of the VS Code Python and Jupyter extensions should probably hit any time now (second week of December)).

The extension is a “data wrangler” extension that looks like it will provide a lot of OpenRefine style functionality for cleaning and manipulating data in the browser. In OpenRefine, a browser-based GUI can be used to wrangle a dataset and also generate a replayable history file. The data wrangler extension also provides a GUI, but rather than a history file it generates pandas Python code to replay the manipulation steps.
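As a rough illustration of the idea of replaying wrangling steps as code, here is a minimal sketch. It is not actual output from the extension: the `clean_data` function name, the `name` and `score` columns and the particular steps are all made up for the example, and the real generated code may well look quite different.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Replay a sequence of (hypothetical) wrangling steps on a copy of df."""
    df = df.copy()
    # Step 1: strip stray whitespace from the made-up "name" column
    df["name"] = df["name"].str.strip()
    # Step 2: normalise the case of the names
    df["name"] = df["name"].str.title()
    # Step 3: drop rows where the made-up "score" column is missing
    df = df.dropna(subset=["score"])
    # Step 4: drop exact duplicate rows
    df = df.drop_duplicates()
    return df.reset_index(drop=True)


# Toy data standing in for a messy dataset
raw = pd.DataFrame(
    {
        "name": [" alice ", "BOB", "bob", "carol"],
        "score": [10.0, 7.5, 7.5, None],
    }
)
cleaned = clean_data(raw)
print(cleaned)
```

Because each step is an ordinary line of pandas, the same function can be rerun against a fresh copy of the raw data, which is roughly the role the history file plays in OpenRefine.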

I first caught sight of it mentioned in a GitHub Universe conference session (GitHub and VS Code tricks for data scientists – Universe 2022):

It was also demoed at EuroPython, 2022 (Python & Visual Studio Code – Revolutionizing the way you do data science – presented by Jeffrey Mew):

I’m wondering whether we should switch to this from OpenRefine. The issue then would be whether we should also switch to VS Code notebooks rather than our planned move to JupyterLab.

My gut feeling is that the JupyterLab environment is preferable for presentational, rather than technical, reasons: specifically, we can brand it and we can customise the notebook rendering. The branding means that we can give students a sense of place when working in the computational environment we provide them with. They are not in a workplace coding environment, they are in a teaching and learning environment, and the sort of code we might expect them to work with, and how we want them to work with it, may be slightly different from the sort of code they would be expected to work with in a working environment.

The presentational tweaks are, I think, also useful, because we can use them as prompts to particular sorts of action, or ways of framing how we expect students to interact and work with particular content elements. The visual cues also set up expectations regarding how much time a particular content section might take (20 lines of activity is likely to take longer to work through than it takes to read 20 lines of text), and whether you are likely to be able to do it from a print out on a bus or whether you are likely to need access to a code execution environment. The colour theming also matches that used in the VLE, at least in the colouring of activities, though we also provide additional colour prompts for areas where students are expected to write things down, or to highlight feedback from tutors, for example.

Note that the rationales I claim for the benefits of branding and colour theming are gut, rather than evidence, based. I haven’t done an internal Esteem research project to justify them, and no-one from any of the educational research units that exist in the university has ever expressed interest in evaluating my claims. Whilst at least two other modules have adopted the original colour theming extension that can be used in classic notebooks, I don’t think other modules use the branding hack, not least because to date it has required a manual hack for customising local installs, which other modules have tended to opt for. (I have started exploring a classic notebook branding extension that will attempt to deploy the branding hack locally…) So maybe they aren’t value adding anyway…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
