Onboarding Into A New Github Repo – Initial Commit Actioned PRs

One of the blockers to folk using Github, I think, is the perception that it’s hard and requires lots of command line fu. That is, and isn’t, true: you can get started very easily just by adding and editing files via the web (eg Easy Web Publishing With Github).

A new template repo, fastai/fastpages, from the fastai folk uses Github Actions to manage an automated publishing workflow that generates a blog site from notebooks and markdown files, using Github Pages.

Actually, fastai/fast_template is maybe a better repo to use? It could still do with some more baby-steps explanation of what the _config.yml file is and does, for example.

The automation requires a secret to be set as part of the repo settings to allow Github Actions to push generated files to the branch that’s actually used to publish the resulting website.

Trying to write instructions that a novice can follow to set up the repo can be a bit fiddly, but the fastai folk have an interesting take on that: the template also includes a “first run” / “initial commit” action that makes a PR into your cloned repository telling you how to proceed, and providing direct links to the pages on which you need to edit your own settings.

A few moments after your cloned repo is loaded, refresh it, and you should see at least one pull request (PR).

My screenshot actually shows two automated PRs have been made: as well as the fastai on-boarding PR, a Github bot has spotted a vulnerability in a dependency and made its own PR to fix it.

The fastai PR provides instructions, and direct links to pages you need to access, to set up the repo:

There’s still the issue of directing the novice user to the PR page (the repo home page, as created, will show 0 PRs initially: it doesn’t refresh itself to show the updated PR count automatically, I don’t think?) and then explaining how to merge the PR. [Downes also commented on the instructions and tried to make them even more baby-steps followable here.]

But the use of the initial commit triggered PR to stage the presentation of instructions is an interesting onboarding mechanic, I think, and one I might add to some of my own template repos.

PS I also started riffing around how we could use actions to allow us to use things like issues as a CLI onto a Github Actions command executor: PoC: Using Git Commit Messages As a Github Actions CLI.

Easy Web Publishing With Github

A quick note I’ve been meaning to post for ages… If you need some simple web hosting, you can use Github: simply create a top level docs folder in your repo and pop the files you want to serve in that directory.

And the easiest way to create that folder? In a repo web page, click the Create New File button, and use the filename docs/index.md to create the docs folder and add a web homepage to it as a markdown file.

(And to create a Github repo, if it’s your first time? Get a Github account, click on the “Create New Repository” or “Add New Repository” or a big “+” button somewhere and create a new repo, checking the box to automatically create an empty README file to get the repo going. If you forget to do that, don’t panic – you should still be able to create a new file somewhere from the repo webpage to get the repo going…)

To serve the contents of the docs folder as a mini-website, click on the repo Settings button:

then scroll down to the Github Pages area, and select master branch /docs folder as the source.

When you save it, wait a minute or two for it to spin up, and then you should be able to see your website published at https://YOUR_GITHUB_USERNAME.github.io/YOUR_REPO_NAME.

For example, the files in the docs folder of the default master branch of my Github psychemedia/parlihacks repository are rendered here: https://psychemedia.github.io/parlihacks/.

If you have files on your desktop you want to publish to the web, click on the docs folder on the Github repo webpage to list the contents of that directory, and simply drag the files from your desktop onto the page, then click the Commit changes button to upload them. If there are any files with the same name in the directory, the new file will be checked in as an updated version of the file, and you should be able to compare the differences to the previous version…

PS Github can be scary at times, but the web UI makes a lot of simple git interactions easy. See here for some examples – A Quick Look at Github Classroom and a Note on How Easy Github on the Web Is To Use… – such as editing files directly in Github viewed in your browser, or dragging and dropping files from your desktop onto a Github repo web page to check in an update to a file.

A Quick Look at Github Classroom and a Note on How Easy Github on the Web Is To Use…

With the OU VC announcing a future vision for the OU as a “university of the cloud” (quiet, at the back there…), I thought I’d have a look at Github Classroom, prompted by a post I was alerted to by @biztechpm on How to grade programming assignments on GitHub.

Getting started meant creating a Github organisation (I created innovationOUtside), logging in to Github Classroom (about), and requesting some private repositories (I asked for 25 and got unlimited) that could be used for setting Github-mediated assignments as well as receiving student submissions back.

The Github Classroom model is based around Classrooms that contain Assignments, which can be specified as individual assignments or group assignments.

When students accept an invitation to an assignment (or at least, a private individual assignment, which is all I’ve had a chance to look at so far), this creates a repository viewable within the classroom on the group account. This repo is viewable by the student and the Classroom moderators.

If the assignment has been seeded with starter code, such as a statement of the assignment, or files associated with the assignment, these will be used to seed the student’s repository. (For convenience, I created a private repo on the group account to act as the repo for seed files in a particular assignment.) If the seed files are updated, the student can update their own repository and then mark changes against that, but this needs to be done under git in a synched repo on the command line:-(

# add the original assignment repo as an upstream remote
git remote add upstream https://github.com/innovationOUtside/ASSIGNMENT_BASE_REPO.git

# pull in the updated seed files and rebase the student's work on top of them
git fetch upstream
git rebase upstream/master

git can be a bit of a pain to work with on the command line and via the desktop client, but it doesn’t have to be that hard. If the seed files aren’t updated once student repos are created, the student can operate completely via the Github website. For example, to add or update files contained within a particular directory in the repository, users can simply drag a file from their desktop and drop it onto the appropriate repo and directory listing webpage.

Additional files can then be uploaded and committed, along with corresponding commit messages:

Files – and directories – can also be created and edited directly on the Github website. Simply click the Create new file button:

and then enter the file name required, along with a new subdirectory path if you want to place the file in a newly created subdirectory.

You can then edit the file (and if it’s markdown, preview a rendered version of it) in the browser.

Of course, working in the Github file editor means students can’t execute and test their code [update: though I guess they could by using continuous integration and test tools, such as Circle CI or Travis; CircleCI is free for public repos but probably not private ones?]; but it’s easy enough to just download a zip file of the files contained in a repo, work on those, and then drag the completed files back onto the online repo to upload them and check them in.
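
For what it’s worth, the zip download doesn’t even need the web UI. As a minimal sketch (assuming the requests package, a personal access token that can see the private classroom repo, and made-up owner/repo values), the Github API zipball endpoint can be used to pull down an archive of a repo programmatically:

import requests

# Hypothetical values - substitute your own organisation, repo and token
OWNER = "innovationOUtside"
REPO = "student-assignment-repo"
TOKEN = "YOUR_PERSONAL_ACCESS_TOKEN"

# The Github API zipball endpoint returns a zip archive of the default branch
url = f"https://api.github.com/repos/{OWNER}/{REPO}/zipball"
resp = requests.get(url, headers={"Authorization": f"token {TOKEN}"})
resp.raise_for_status()

# Save the archive locally so the files can be worked on offline
with open(f"{REPO}.zip", "wb") as f:
    f.write(resp.content)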


If a student checks in their updated script and gives it an obvious commit message, a moderator with privileges over the classroom the student’s repository is associated with can click through on that check-in to review it. (If the student has made multiple check-ins, it would be useful if they rebased against the upstream master; I’m not sure if this can be done in the web client?)

The moderator can then see what changes the student made to the original script as their submitted work, and comment on the student’s script on a line-by-line basis.

Unfortunately, if checked in documents are Jupyter notebooks, the simple github differencer isn’t as helpful as it could be:

Given that Github added a notebook previewer for notebook .ipynb files, it’s possible that they may add in support for the nbdime notebook differ. However, in the meantime, it would be possible to download student repos (possibly using a tool like Gitomator?) and then check them using nbdime. It may also be possible to create a moderator/marker extension to help with this? The problem then is getting annotated notebooks back to the student. In which respect, I’m not sure if a moderator can check a file back in to a student’s private assignment repository, although I suppose a group assignment could perhaps be used in this respect, with two-member groups: the student and the moderator.
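
By way of a sketch of what that checking step might look like (assuming nbdime and nbformat are installed, and using made-up filenames for the seed notebook and the student submission), the nbdime Python API can be used to generate a structured diff between two notebook files:

import nbformat
import nbdime

# Hypothetical filenames: the original seed notebook and the student's submission
original = nbformat.read("assignment_seed.ipynb", as_version=4)
submitted = nbformat.read("student_submission.ipynb", as_version=4)

# diff_notebooks returns a structured, JSON-style description of the changes
diff = nbdime.diff_notebooks(original, submitted)
print(diff)

(The same package also provides nbdiff and nbdiff-web command line tools that render the diff in the terminal or the browser.)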

One final thought in terms of OU workflow: assignments are marked by students’ tutors, which makes for a management issue when allocating moderator permissions. One solution here may be to create a classroom for each moderator, with duplicate assignments seeded from the same original repo then allocated into each classroom.

PS one way of testing student code may be to autograde it using CI tools; for example, this thread on Github Classroom versus nbgrader.

PPS diffing notebooks is made easier if you don’t commit ipynb files, but instead at least commit the py or md adjuncts to notebooks generated using something like Jupytext. See a review of my initial tinkering w/ Jupytext here: Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI.
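
As a minimal sketch of that pairing idea (assuming jupytext is installed, and using a made-up notebook filename), the Jupytext Python API can be used to write a py:percent adjunct alongside a notebook; it’s that text file, rather than the .ipynb JSON, that then gets committed and diffed:

import jupytext

# Hypothetical filename for the notebook we want a diff-friendly adjunct to
nb = jupytext.read("analysis.ipynb")

# Write a py:percent scripted version of the notebook; committing this
# text file gives much cleaner diffs than committing the .ipynb JSON
jupytext.write(nb, "analysis.py", fmt="py:percent")

(The jupytext command line tool can also pair the formats directly, with something like jupytext --set-formats ipynb,py:percent analysis.ipynb.)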

Diff or Chop? Github, CSV data files and OpenRefine

A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:

[Image: CSV diff visualisation]

With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such a view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.

* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.

So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.
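
As a crude illustration of the sort of comparison involved (a minimal sketch using Python’s csv module and made-up filenames, rather than any of the tools mentioned above), we might simply report rows that appear in one file but not the other:

import csv

def load_rows(filename):
    # Read a CSV file as a list of row tuples
    with open(filename, newline="") as f:
        return [tuple(row) for row in csv.reader(f)]

# Hypothetical filenames: an "original" dataset and a republished copy of it
original = load_rows("original.csv")
copy = load_rows("republished_copy.csv")

# Rows dropped from, or added to, the copy relative to the original
removed = [row for row in original if row not in copy]
added = [row for row in copy if row not in original]

print("Rows missing from the copy:", removed)
print("Rows added in the copy:", added)

A proper tabular differ, like the one described in the OKFNLabs post, also tracks row order and cell-level edits, but even a row-level comparison like this can flag whether a republished copy has drifted from the original.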

One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:

[Image: OpenRefine change history]

Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in this case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.

A more elaborate approach may allow us to view diffs between the data file at the current step and the previous step, or between the current data file and an arbitrary previous step.

Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:

[Image: OpenRefine operation history extract]

This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
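
As a toy illustration of that idea (this is not OpenRefine’s actual JSON operation format, just a made-up minimal one), the changed file can be regenerated on demand from the original data plus an ordered list of change operations:

import csv

# A made-up, minimal "chops" history: an ordered list of change operations
chops = [
    {"op": "rename-column", "from": "Surname", "to": "Last Name"},
    {"op": "replace-value", "column": "Region", "find": "N/A", "value": ""},
]

def apply_chops(rows, operations):
    # Apply each change operation in turn to a list of dict-style rows
    for op in operations:
        if op["op"] == "rename-column":
            rows = [{(op["to"] if k == op["from"] else k): v for k, v in row.items()}
                    for row in rows]
        elif op["op"] == "replace-value":
            for row in rows:
                if row.get(op["column"]) == op["find"]:
                    row[op["column"]] = op["value"]
    return rows

# Hypothetical original data file; the changed version is rebuilt by replaying the chops
with open("original.csv", newline="") as f:
    derived = apply_chops(list(csv.DictReader(f)), chops)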

For example, OpenRefine already integrates with Google Docs to allow users to import and export files from that service.

[Image: OpenRefine export to Google Docs]

So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file)? Note that we might also have to extend the JSON representation, or add another file fragment to the check-in, that associates a particular chops file with a particular checked-out version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files and their associated chops files?)

For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosystem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)
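
By way of a sketch of that discovery step (assuming a datapackage.json file in the root of the repo’s master branch, and using made-up owner/repo values), an application could pull the data package descriptor from Github and list the data file resources it describes:

import json
from urllib.request import urlopen

# Hypothetical repo details - substitute a repo that carries a datapackage.json
OWNER = "example-user"
REPO = "example-data-repo"
BRANCH = "master"

# Fetch the data package descriptor from the raw file view of the repo
url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/datapackage.json"
datapackage = json.load(urlopen(url))

# List the data files (resources) the package says the repo contains,
# so a user could choose which one to pull into OpenRefine
for resource in datapackage.get("resources", []):
    print(resource.get("name"), "->", resource.get("path"))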

And as far as ecosystems go, there are other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.