Fragment – Searchable SQLite Database of Andrew Lang Fairy Stories

A week or two ago, I bought a couple of ready-made print-on-demand public domain volumes that collected together all of Andrew Lang’s coloured Fairy Books. There are no contents lists and no index, but the volumes didn’t cost that much more than printing them on demand myself, and they saved me the immediate hassle of compiling my own PDFs.

But… there’s too much to skim through if you’re trying to find a particular story. So I started to wonder about creating a simple full-text search tool to search through the stories. A first attempt, that scrapes the story texts from, can be accessed here but it’s in pretty raw form – a SQL query interface essentially published via GitHub Pages and running against a db in the repo. (The query interface is powered via SQLite compiled to WASM and running in the browser, a trick I discovered several years ago… I’m still waiting for datasette in the browser! ;-))

Anyway… code for the scraper and the db constructor is in the repo, with an earlier version available as a gist. And of course, the query UI is available here. The scraper and sample db queries took maybe a couple of hours to pull together in all. And then another half hour today to set the repo up with the SQL GUI and write this blog post…


Note to self – the db is intended to run as a full-text searchable db via a neater user interface, ideally with some sensible facet-based search options (with facet elements identified using entity extraction etc.). I think I need to start working on my own “fairy story entities” model too…
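As a sketch of what the full-text search side might look like, here is a minimal example using SQLite's FTS5 extension from Python. It assumes the local sqlite3 build includes FTS5 (a compile-time option, though commonly enabled). The table name, column names and placeholder story text are invented for illustration and are not taken from the repo.

```python
import sqlite3


def build_story_index(stories):
    """Create an in-memory FTS5 index from (book, title, text) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one full-text indexed row per story
    conn.execute("CREATE VIRTUAL TABLE stories USING fts5(book, title, text)")
    conn.executemany("INSERT INTO stories VALUES (?, ?, ?)", stories)
    conn.commit()
    return conn


def search_stories(conn, query, limit=10):
    """Return (book, title, snippet) rows, best matches first."""
    sql = (
        "SELECT book, title, snippet(stories, 2, '[', ']', '...', 8) "
        "FROM stories WHERE stories MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


# Placeholder text only: NOT the actual story text
sample = [
    ("Blue Fairy Book", "The Bronze Ring",
     "placeholder text about a gardener and a bronze ring"),
    ("Red Fairy Book", "The Twelve Dancing Princesses",
     "placeholder text about worn out dancing shoes"),
]

conn = build_story_index(sample)
for book, title, snip in search_stories(conn, "ring"):
    print(book, "|", title, "|", snip)
```

A facet such as the book title could then be handled with an ordinary WHERE clause alongside the MATCH.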

Also on the to do list:

  • annotate stories with Aarne-Thompson or Aarne-Thompson-Uther (ATU) story classification codes (is there a dataset I can pull in to do this keyed on story title? There is one for Grimm here. There’s a motif index here.)
  • pull out first lines as a separate item;
  • explore generating book indexes based on a hacky pagination estimate;
  • put together a fairy story entity model.

It’d also be really interesting to come up with a way of tagging each record with a story structure / morphology (e.g. the Propp morphology of each story), so I could easily search for stories with different structure types.

PS (suggested via @TarkabarkaHolgy / @OnlineCrsLady) also add link to Wikipedia page for each story (thinks: that should be easy enough to at least partially automate…)

Fragment: AutoBackup to git

Noting, as ever, that sometimes students either don’t save, or accidentally delete, all their work (I take it as read that most folk don’t back-up: I certainly don’t (“one copy, no back-ups”)) I started pondering whether we could create local git repos in student working directories in provided Docker containers with a background process running to commit changed files every 5 minutes or so.

Note that I haven’t had a chance to test that any of this works yet!

The inotify-tools package provides a handy command-line wrapper around the Linux inotify API, which supports monitoring the state of the file system (for example, Monitor file system activity with inotify). The gitwatch package uses inotify-tools as part of “a bash script to watch a file or folder and commit changes to a git repo” (you can also run it as a service).

So I wonder: can we use gitwatch to back-up (ish) files in a student’s working directory in a provided container into a local persistent git repo mounted into the container?

# With git installed, we need to create a nominal user
git config
git init

# Example of adding a file...
git add
# And committing it...
git commit -m "first commit"
# But students won't remember to do that very often...

# So we need to schedule something
# inotify-tools gives us access to a directory monitor...
apt-get update -y && apt-get install -y inotify-tools

# The gitwatch package uses inotify-tools to trigger
# automatic git add/commits for changed files
# It's most conveniently installed using bpkg package manager
curl -Lo- "" | bash
bpkg install  gitwatch/gitwatch

# Note that it requires that USER is set
# The following is probably not recommended...
#export USER=root

# Run as a service

# Example of viewing state of github repo at a particular time #

It strikes me that if students are working on notebooks, we really want to commit the notebooks cleared of cell outputs. One way of doing this would be to tweak gitwatch to spot changed files and, if they are .ipynb notebook files, back up a cleared copy of them (perhaps as .ipynb files via a hidden directory, perhaps via hidden Jupytext paired markdown files) rather than the notebook itself; pre-commit actions might also be useful here…
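As a rough, untested-in-anger sketch of the "cleared copy" idea, the following uses only the standard library and treats a notebook as the JSON document it is. The function names and the idea of writing the copy to a separate path are my own illustration; this is not something gitwatch provides.

```python
import copy
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def write_cleared_copy(src_path, dest_path):
    """Write an output-free copy of the notebook at src_path to dest_path."""
    with open(src_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(dest_path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(notebook), f, indent=1)


# Tiny hand-made notebook structure for demonstration
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["1 + 1"],
         "execution_count": 3, "outputs": [{"output_type": "execute_result"}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}
print(clear_outputs(demo)["cells"][1]["outputs"])
```

A background committer could call write_cleared_copy() on each changed .ipynb file and commit the copy rather than the original.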

Having got files backed up, -ish, into a git repo, the next issue is how students can revisit a point back in time. If the repo is mirrored in GitHub, it looks like you can revisit the state of a repo if you don’t want to go too far back in time (“simply add the date in the following format – HEAD@{2019-04-29} – in the place of the URL where the commit would usually belong”).

On the command line, it seems we can get a commit immediately prior to a particular date: git rev-list master --max-count=1 --before=2014-10-01 (can we also do that relative to a datetime?). Then it seems that git branch NEW_BRANCH_NAME COMMIT_HASH will create a branch based around the state of the files at the time of that commit, or git checkout -b NEW_BRANCH_NAME COMMIT_HASH will create the branch and check it out (just make sure you have all your current files checked into the repo so you can revert back to them…)

Ah… this looks handy:

# To keep your current changes

You can keep your work stashed away, without committing it, with `git stash`. You would then use `git stash pop` to get it back. Or you can `git commit` it to a separate branch.

# Checkout by date using `rev-parse`

You can checkout a commit by a specific date using rev-parse like this:

`git checkout 'master@{1979-02-26 18:30:00}'`

More details on the available options can be found in the `git-rev-parse`.

As noted in the comments this method uses the reflog to find the commit in your history. By default these entries expire after 90 days. Although the syntax for using the reflog is less verbose you can only go back 90 days.

# Checkout by date using `rev-list`

The other option, which doesn't use the reflog, is to use `rev-list` to get the commit at a particular point in time with:

git checkout `git rev-list -n 1 --first-parent --before="2009-07-27 13:37" master`

Note the `--first-parent` if you want only your history and not versions brought in by a merge. That's what you usually want.
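If the recovery step were to be wrapped up for students, something like the following Python sketch might do. It simply assembles the same `git rev-list` call quoted above and runs it via subprocess; the function names are made up for illustration, and it assumes git is on the path.

```python
import subprocess


def rev_list_before_args(branch, before):
    """Build the git command that finds the last commit before a date."""
    return ["git", "rev-list", "-n", "1", "--first-parent",
            f"--before={before}", branch]


def commit_before(repo_dir, branch, before):
    """Return the hash of the last commit on branch before the given
    date string, or None if there is no such commit."""
    result = subprocess.run(
        rev_list_before_args(branch, before),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() or None


print(" ".join(rev_list_before_args("master", "2009-07-27 13:37")))
```

The returned hash could then be passed to `git checkout -b NEW_BRANCH_NAME COMMIT_HASH` as described earlier.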

PS seems like @choldgraf got somewhere close to this previously… choldgraf/gitautopush: watch a local git repository for any changes and automatically push them to GitHub.

Fragment – Embedding srcdoc IFrames in Jupyter Notebooks

Whilst trying to create IFrame generating magics to embed content in Jupyter Book output, I noticed that the IPython.display.IFrame element only appears to let you refer to external src linked HTML content and not inline/embedded srcdoc content. This has the downside that you need to find a way of copying any generated src-linked HTML page into the Jupyter Book / Sphinx generated distribution directory (Sphinx/Jupyter Book doesn’t seem to copy linked local pages over; I think the bookdown publishing workflow does?).

Noting that folium maps render okay in Jupyter Book without manual copying of the map containing HTML file, I had a peek at the source code and noticed it was using embedded srcdoc content.

Cribbing the mechanics, the following approach can be used to create an object with a _repr_html_ method that returns an IFrame with embedded srcdoc content that will render the content in Jupyter Book output without the need for an externally linked src file. The HTML is generated from a template page (template) populated using named template attributes passed via a Python dict (data). Once the object is created, when used as the last item in a notebook code cell it will return the inlined IFrame as the display object.

from html import escape
from IPython.display import IFrame

class myDisplayObject:
    def __init__(self, data, template, width="100%", height=None, ratio=1):
        self.width = width
        self.height = height
        self.ratio = ratio
        self.html = self.js_html(data, template)

    def js_html(self, data, template):
        """Generate the HTML for the js diagram."""
        return template.format(**data)

    # cribbed from branca Py package
    def _repr_html_(self, **kwargs):
        """Displays the Diagram in a Jupyter notebook."""
        html = escape(self.html)
        if self.height is None:
            # No explicit height: size the frame from the width and ratio
            iframe = (
                '<div style="width:{width};">'
                '<div style="position:relative;width:100%;height:0;padding-bottom:{ratio};">'  # noqa
                '<span style="color:#565656">Make this Notebook Trusted to load map: File -> Trust Notebook</span>'  # noqa
                '<iframe srcdoc="{html}" style="position:absolute;width:100%;height:100%;left:0;top:0;'  # noqa
                'border:none !important;" '
                'allowfullscreen webkitallowfullscreen mozallowfullscreen>'
                '</iframe>'
                '</div></div>'
            ).format(html=html, width=self.width, ratio=self.ratio)
        else:
            # Explicit height: use a plain fixed-size iframe
            iframe = (
                '<iframe srcdoc="{html}" width="{width}" height="{height}" '
                'style="border:none !important;" '
                'allowfullscreen webkitallowfullscreen mozallowfullscreen>'
                '</iframe>'
            ).format(html=html, width=self.width, height=self.height)
        return iframe

For an example of how this is used, see innovationOUtside/nb_js_diagrammers (functionality added via this commit).
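Stripped right back, the trick is just to HTML-escape the page and drop it into the srcdoc attribute. The following is a minimal sketch of that core step only; the function and class names are my own and the fixed default height is an arbitrary choice.

```python
from html import escape


def srcdoc_iframe(page_html, width="100%", height="300"):
    """Wrap an HTML page in an iframe that carries it inline via srcdoc."""
    return (
        '<iframe srcdoc="{html}" width="{width}" height="{height}" '
        'style="border:none;"></iframe>'
    ).format(html=escape(page_html), width=width, height=height)


class InlineFrame:
    """Display object: renders as an inline iframe in a notebook."""

    def __init__(self, page_html, width="100%", height="300"):
        self.page_html = page_html
        self.width = width
        self.height = height

    def _repr_html_(self):
        return srcdoc_iframe(self.page_html, self.width, self.height)


print(srcdoc_iframe('<p class="x">Hi</p>'))
```

Escaping matters because the embedded page almost certainly contains double quotes, which would otherwise terminate the srcdoc attribute early.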

More Scripted Diagram Extensions For Jupyter Notebook, Sphinx and Jupyter Book

Following on from Previewing Sphinx and Jupyter Book Rendered Mermaid and Wavedrom Diagrams in VS Code, I note several more sphinx extensions for rendering diagrams from source script in appropriately tagged code fenced blocks:

  • blockdiag/sphinxcontrib-blockdiag: a rather dated, but still working, extension, that generates png images from source scripts. (The resolution of the text in the image is very poor. It would perhaps be useful to be able to specify outputting SVG?) See also this Jupyter notebook renderer extension: innovationOUtside/ipython_magic_blockdiag. I haven’t spotted a VS Code preview extension for blockdiag yet. Maybe this is something I should try to build for myself? Maybe a strike day activity for me when the strikes return…
  • sphinx-contrib/plantuml: I haven’t really looked at PlantUML before, but it looks like it can generate a whole host of diagram types, including sequence diagrams, activity diagrams, state diagrams, deployment diagrams, timing diagrams, network diagrams, wireframes and more.
PlantUML Activity Diagram
PlantUML Deployment Diagram
PlantUML Timing Diagram
PlantUML Wireframe (1)
PlantUML Wireframe (2)

The jbn/IPlantUML IPython extension and the markdown-preview-enhanced VS Code extension will also preview PlantUML diagrams in Jupyter notebooks and VS Code respectively. For example, in a Jupyter notebook we can render a PlantUML sequence diagram via a block magicked code cell.

Simple Jupytext Github Action to Update Jupyter .ipynb Notebooks From Markdown

In passing, a simple Github Action that will look for updates to markdown files in a GitHub push or pull request and if it finds any, will run jupytext --sync over them to update any paired files found in markdown metadata (and/or via jupytext config settings?)

Such files might have been modified, for example, by an editor proof reading the markdown materials in a text editor.

If I read the docs right, the --use-source-timestamp will set the notebook timestamp to the same as the modified markdown file(s)?

The modified markdown files themselves are identified using the dorny/paths-filter action. Any updated .ipynb files are then auto-committed to the repo using the stefanzweifel/git-auto-commit-action action.

name: jupytext-changes

# Trigger events (adjust as required)
on:
  push:
  pull_request:

jobs:
  # The job id is arbitrary
  sync-jupytext:
    runs-on: ubuntu-latest
    steps:

    # Checkout
    - uses: actions/checkout@v2

    # Test for markdown
    - uses: dorny/paths-filter@v2
      id: filter
      with:
        # Enable listing of files matching each filter.
        # Paths to files will be available in `${FILTER_NAME}_files` output variable.
        # Paths will be escaped and space-delimited.
        # Output is usable as command-line argument list in Linux shell
        list-files: shell

        # In this example changed markdown will be spellchecked using aspell
        # If we specify we are only interested in added or modified files, deleted files are ignored
        filters: |
          notebooks:
            - added|modified: '**.md'
        # Should we also identify deleted md files
        # and then try to identify (and delete) .ipynb docs otherwise paired to them?
        # For example, remove .ipynb file on same path ($FILEPATH is a file with .md suffix)
        # rm ${}.ipynb

    - name: Install Packages if changed files
      if: ${{ steps.filter.outputs.notebooks == 'true' }}
      run: |
        pip install jupytext

    - name: Synch changed files
      if: ${{ steps.filter.outputs.notebooks == 'true' }}
      run: |
        # If a command accepts a list of files,
        # we can pass them directly
        # This will only synch files if the md doc include jupytext metadata
        # and has one or more paired docs defined
        # The timestamp on the synched ipynb file will be set to the
        # same time as the changed markdown file
        jupytext --use-source-timestamp  --sync ${{ steps.filter.outputs.notebooks_files }}

    # Auto commit any updated notebook files
    - uses: stefanzweifel/git-auto-commit-action@v4
      with:
        # This would be more useful if the git hash were referenced?
        commit_message: Jupytext synch - modified, paired .md files

Note that the action does not execute the notebook code cells (adding --execute to the jupytext command would fix that, although steps would also need to be taken to ensure that an appropriate code execution environment is available): for the use case I’m looking at, the assumption is that edits to the markdown do not include making changes to code.
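For trying the same thing locally, outside of GitHub, the filter-then-sync logic can be approximated in a few lines of Python. This sketch only builds the jupytext command line from a list of changed paths (the helper names are invented); actually running it assumes jupytext is installed and on the path.

```python
from pathlib import PurePosixPath


def changed_markdown(paths):
    """Keep only the markdown files from a list of changed paths."""
    return [p for p in paths if PurePosixPath(p).suffix == ".md"]


def jupytext_sync_args(paths):
    """Build the jupytext command for the changed markdown files,
    or return None if there is nothing to sync."""
    md_files = changed_markdown(paths)
    if not md_files:
        return None
    return ["jupytext", "--use-source-timestamp", "--sync", *md_files]


changed = ["README.md", "notebooks/intro.md", "notebooks/intro.ipynb", "setup.py"]
print(jupytext_sync_args(changed))
# To run it for real: subprocess.run(jupytext_sync_args(changed), check=True)
```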

Supporting Playful Exploration of Data Clustering and Classification Using drawdata

One of the most powerful learning techniques I know that works for me is play: the freedom to explore an idea or concept or principle in an open-ended, personally directed way, trying things out, testing them, making up “what if?” scenarios, and so on.

Playing takes time of course, and the way we construct courses means that we don’t give students time to play, preferring to overload them with lots of stuff to read, presumably on the basis that stuff = value.

If I were to produce a 5 hour chunk of learning material that was little more than three or four pages of text, defining various bits of playful activity, I suspect that questions would be asked on the basis that 5 hours of teaching should include lots more words… I also suspect that the majority of students would not know how to play constructively within the prescribed bounds for that length of time.


In passing, I note this rather neat Python package, drawdata, that plays nice with Jupyter notebooks:

Example of use drawdata.draw_scatter() mode

Select a group (a, b, or c), draw a buffered line, and it will be filled (ish) with random dots. Click the copy csv button to grab the data into the clipboard, and then you can retrieve it from there into a pandas dataframe:

Retrieve data from clipboard into pandas dataframe

At the risk of complicating the UI, I wonder about adding a couple more controls, one to tune the width of the buffered line (and also ensure that points are only generated inside the line envelope), another to set the density of the points.

Another tool allows you to generate randomly sampled points along a line:

I note this could be a limiting case of a zero-width line in a draw-data() widget with a controllable buffer size.
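To make the buffer width and density idea concrete, here is a small standard-library sketch of sampling points inside the envelope of a straight line segment. It is purely illustrative and is not how drawdata works internally; the function name and parameters are my own. Setting buffer=0 gives the zero-width limiting case.

```python
import random


def sample_along_line(start, end, n=50, buffer=0.0, seed=None):
    """Sample n points uniformly along a segment, each displaced
    perpendicular to the line by at most `buffer`."""
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = (dx * dx + dy * dy) ** 0.5
    if length == 0:
        raise ValueError("start and end must be different points")
    # Unit vector perpendicular to the line
    nx, ny = -dy / length, dx / length
    points = []
    for _ in range(n):
        t = rng.random()                      # position along the line
        offset = rng.uniform(-buffer, buffer)  # distance off the line
        points.append((x0 + t * dx + offset * nx, y0 + t * dy + offset * ny))
    return points


pts = sample_along_line((0, 0), (10, 0), n=5, buffer=1.0, seed=42)
print(len(pts))
```

Here n plays the role of the density control and buffer the role of the line width control.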

Could using such a widget in a learning activity provide an example of technology enhanced learning, I wonder?! (I still don’t know what that phrase is supposed to mean…)

For example, I can easily imagine creating a simple activity where students get to draw different distributions and then run their own simple classifiers over them. The playfulness aspect would come in when students start wondering about how different data groups might interact, or how linear classifiers might struggle with particular multigroup distributions.
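As an example of the sort of "simple classifier" a student might write for themselves, here is a nearest-centroid classifier in plain Python. It is a sketch under the assumption that the drawn data arrives as ((x, y), label) pairs; that data shape and the function names are my own choice for illustration.

```python
def centroids(labelled_points):
    """Compute the mean (x, y) position of each labelled group."""
    sums = {}
    for (x, y), label in labelled_points:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def classify(point, cents):
    """Assign a point to the group with the nearest centroid."""
    px, py = point
    return min(
        cents,
        key=lambda label: (px - cents[label][0]) ** 2 + (py - cents[label][1]) ** 2,
    )


drawn = [
    ((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
    ((10, 10), "b"), ((11, 10), "b"), ((10, 11), "b"),
]
cents = centroids(drawn)
print(classify((2, 2), cents), classify((9, 8), cents))
```

Drawing two interleaved or concentric groups and watching this classifier fail is exactly the kind of "what if?" play I have in mind.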

As a related example of supporting such playfulness, the tensorflow playground provides several different test distributions with different interesting properties:

Distributions in tensorflow playground

To run your own local version of tensorflow playground via a jupyter-server-proxy, see innovationOUtside/nb_tensorflow_playground_serverproxy.

With drawdata, students could quite easily create their own test cases to test their own understanding of how a particular classifier works. To my mind, developing such an understanding is supported if we can also visualise the evolution of a classifier over time. For example, the following animation (taken from some material I developed for a first year module that never made it past the “optional content” stage) shows the result of training a simple classifier over a small dataset with four groups of points.

Evolution of a classifier

See also: How to Generate Test Datasets in Python with scikit-learn, a post on the Machine Learning Mastery blog, and Generating Fake Data – Quick Roundup, which summarises various other takes on generating synthetic data.

PS This also reminds me a little bit of Google Correlate (for example, Google Correlate: What Search Terms Does Your Time Series Data Correlate With?), where you could draw a simple timeseries and then try to find search terms on Google Trends with the same timeseries behaviour. On a quick look, none of the original URLs I had for that seem to work anymore. I’m not sure if it’s still available via Google Trends, for example?

PPS Here’s another nice animation from Doug Blank demonstrating a PCA based classification:

30 Second Bookmarklet for Saving a Web Location to the Wayback Machine

In passing, I just referenced a web page in another post, the content of which I really don’t want to lose access to if the page disappears. A quick fix is to submit the page to the Internet Archive Wayback Machine, so that at least I know a copy of the page will be available there.

From the Internet Archive homepage, you can paste in a URL and the Archive will check to see if it has a copy of the page. In many cases, the page will have been grabbed multiple times over the years, which also means you can track a page’s evolution over time.

Also on the homepage is a link that allows you to submit a URL to request that that page is also saved to the archive:

Here’s the actual save page:

When you save the page, a snapshot is grabbed:

Saving a page to the Wayback Machine

Checking the URL for that page, it looks like we can grab a snapshot by passing the URL followed by the URL of the page we want to save…

Hmmm… 30s bookmarklet time. Many years ago, I spent some of the happiest and most productive months (for me) doing an Arcadia Fellowship with the University Library in Cambridge, tinkering with toys and exploring that incredible place.

During my time there, I posted to the Arcadia Mashups Blog, which still exists as a web fossil. One of the posts there, The ‘Get Current URL’ Bookmarklet Pattern, is a blog post and single page web app in one, that lets you generate simple redirection bookmarklets:

Window location bookmarklet form, Arcadia Mashups Blog

If you haven’t come across bookmarklets before, you could think of them as automation web links that run a bit of Javascript to do something useful for you, either by modifying the current web page, or doing something with its web location / URL.

When you save a bookmarklet, you should really check that the bookmarklet Javascript code isn’t doing anything naughty, or make sure you only install bookmarklets from trusted locations.

In the above Archive-it example, the code grabs the current page location and passes it to . If you drag the bookmarklet to your browser toolbar, open a web page, and click the bookmarklet, the page is archived:

Oh, happy days…

So, a 30s hack and I have built myself a tool to quickly archive a web URL. (Writing this blog post took much longer than remembering that post existed and generating the bookmarklet.)
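For the record, the redirection pattern is simple enough to generate from a couple of lines of Python. This is a sketch of the general "prefix plus current URL" pattern rather than the exact code the Arcadia form produces; the function name is made up, and the Wayback Machine save prefix used here (https://web.archive.org/save/) is my assumption about the endpoint.

```python
def redirect_bookmarklet(prefix):
    """Generate a bookmarklet that redirects the browser to
    prefix + the URL of the page currently being viewed."""
    if '"' in prefix or "\\" in prefix:
        raise ValueError("prefix must not contain quotes or backslashes")
    return 'javascript:window.location="{}"+window.location.href;'.format(prefix)


# Assumed save endpoint for the Wayback Machine
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"

print(redirect_bookmarklet(WAYBACK_SAVE_PREFIX))
```

Paste the printed string into the URL field of a new browser bookmark to install it.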

There are of course other tools for doing similar things, not least, but it was as quick to create my own as to try to re-find that…

See also: Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs and Name (Date) Title, Available at: URL (Accessed: DATE): So What?

Previewing Sphinx and Jupyter Book Rendered Mermaid and Wavedrom Diagrams in VS Code

In VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation, I linked to a rather magical VS Code extension (shd101wyy.markdown-preview-enhanced) that lets you preview diagrams rendered from various diagram scripts, such as documents defined using Mermaid markdown script or wavedrom.

The diagram script is incorporated in code fenced block qualified by the scripting language type, such as ```mermaid or ```wavedrom.

Pondering whether this was also a route to live previews of documents rendered from the original markdown using Sphinx (the publishing engine used in Jupyter Book workflows, for example), I had a poke around for related extensions and found a couple of likely candidates: the sphinxcontrib.mermaid and sphinxcontrib.wavedrom extensions.

After installing the packages from PyPi, these extensions are enabled in a Jupyter Book workflow by adding the following to the _config.yml file:

sphinx:
  extra_extensions:
    - sphinxcontrib.mermaid
    - sphinxcontrib.wavedrom

Building a Sphinx generated book from a set of markdown files using Jupyter Book (e.g. by running jupyter-book build .) does not render the diagrams… Boo…

However, changing the code fence label to a MyST style label (as suggested here), does render the diagrams in the Sphinx generated Jupyter Book output, albeit at the cost of not now being able to preview the diagram directly in the VS Code editor.

It’s not so much of an overhead to flip between the two, and an automation step could probably be set up quite straightforwardly to convert between the forms as part of a publishing workflow, but I’ve raised an issue anyway suggesting it might be nice if the shd101wyy.markdown-preview-enhanced extension also supported the MyST flavoured syntax…
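The conversion between the two forms might be automated along the following lines. This sketch assumes the MyST style label is the language name wrapped in braces on the opening fence, e.g. {mermaid}; the function names are invented for illustration. The fence string is built programmatically simply to avoid writing literal backtick fences inside the example.

```python
import re

DIAGRAM_LANGS = ("mermaid", "wavedrom")
_LANGS = "|".join(DIAGRAM_LANGS)

# Opening fence labelled with a bare language name, e.g. mermaid
_PLAIN = re.compile(r"^([ \t]*`{3,})(%s)[ \t]*$" % _LANGS, re.MULTILINE)
# Opening fence labelled MyST style, e.g. {mermaid}
_MYST = re.compile(r"^([ \t]*`{3,})\{(%s)\}[ \t]*$" % _LANGS, re.MULTILINE)


def to_myst(markdown):
    """Rewrite plain diagram fence labels as MyST directive labels."""
    return _PLAIN.sub(r"\1{\2}", markdown)


def to_plain(markdown):
    """Rewrite MyST directive labels back to plain fence labels."""
    return _MYST.sub(r"\1\2", markdown)


fence = "`" * 3
doc = fence + "mermaid\ngraph TD;\n  A-->B;\n" + fence + "\n"
print(to_myst(doc))
```

Running to_myst() as a pre-build step and authoring with the plain labels would keep the VS Code preview working.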

See also: A Simple Pattern for Embedding Third Party Javascript Generated Graphics in Jupyter Notebooks which shows a simple recipe for adding js diagram generation support to classic Jupyter notebooks, at least, using simple magics. A simple transformation script should be able to map between the magic cells and an appropriately fenced code block that can render the diagram in a Sphinx/Jupyter Book workflow.

Another Automation Step: Spell-Checking Jupyter Notebooks

Another simple automation step, this time to try to add spell checking of notebooks.

I really need to find a robust and useful way of doing this. I’ve previously explored using pyspelling, and started tinkering with a thing to generate tidier reports from it, and have also found codespell to be both quick and effective.

Anyway, here’s a hacky Github Action for spellchecking notebook files in the last commit of a push onto a Github repo (see also this issue regarding getting changed files from all commits in the push that raised the action, which seems to have been addressed by this PR):

name: spelling-partial-test

on:
  push:

jobs:
  changednb:
    runs-on: ubuntu-latest
    # Set job outputs to values from filter step
    outputs:
      notebooks: ${{ steps.filter.outputs.notebooks }}
    steps:
    # (For pull requests it's not necessary to checkout the code)
    - uses: actions/checkout@v2
      with:
        fetch-depth: 0
    - uses: dorny/paths-filter@v2
      id: filter
      with:
        filters: |
          notebooks:
            - '**.ipynb'

  # The job id is arbitrary
  spellcheck:
    needs: changednb
    if: ${{ needs.changednb.outputs.notebooks == 'true' }}
    runs-on: ubuntu-latest
    steps:

    - uses: actions/checkout@master
      with:
        fetch-depth: 0 # or 2?
        #ref: nbval-test-tags

    - id: changed-files
      uses: tj-actions/changed-files@v11.4
      with:
        since_last_remote_commit: 'true'
        separator: ','
        files: |

    - name: Install spelling packages
      run: |
        sudo apt-get update && sudo apt-get install -y aspell aspell-en
        python3 -m pip install --upgrade
        python3 -m pip install --upgrade
        python3 -m pip install --upgrade
        python3 -m pip install --upgrade codespell

    - name: Codespell
      # Codespell is a really quick and effective spellchecker
      run: |
        touch codespell.txt
        IFS="," read -a added_modified_files <<< "${{ steps.changed-files.outputs.all_modified_files }}"
        # This only seems to find files from the last commit in the push?
        for added_modified_file in "${added_modified_files[@]}"; do
          codespell "${added_modified_file}" | tee -a codespell.txt
        done

    - name: pyspelling test of changed files
      # This runs over changed files one at a time, though we could add multiple -S in one call...
      run: |
        touch typos.txt
        touch .wordlist.txt
        IFS="," read -a added_modified_files <<< "${{ steps.changed-files.outputs.all_modified_files }}"
        # This only seems to find files from the last commit in the push?
        for added_modified_file in "${added_modified_files[@]}"; do
          pyspelling -c .ipyspell.yml -n Markdown -S "${added_modified_file}" | tee -a typos.txt || continue
          pyspelling -c .ipyspell.yml -n Python -S "${added_modified_file}" | tee -a typos.txt || continue
        done
        cat typos.txt
        touch summary_report.txt
        nb_spellchecker reporter -r summary_report.txt
        cat summary_report.txt
      shell: bash
      # We could let the action fail on errors
      continue-on-error: true

    - name: Upload partial typos
      # Create a downloadable bundle of zipped typo reports
      uses: actions/upload-artifact@v2
      with:
        name: typos
        path: |

Typos are displayed inline in the action run:

And a zipped file of spellcheck reports is also available for download:
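As a very rough local analogue of what the notebook filters are doing, the following standard-library sketch pulls the words out of a notebook's markdown cells and reports any that are not in a supplied wordlist. It is a toy to show the shape of the problem, not how codespell or pyspelling work internally, and the function names are my own.

```python
import re


def markdown_words(notebook):
    """Collect the words found in the markdown cells of a notebook dict."""
    words = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # Notebook cell source may be a string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        words.extend(re.findall(r"[A-Za-z']+", source))
    return words


def unknown_words(notebook, wordlist):
    """Return the sorted set of markdown words not in the wordlist."""
    known = {w.lower() for w in wordlist}
    return sorted({w for w in markdown_words(notebook) if w.lower() not in known})


nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["Thiss is a test\n", "of a notebook"]},
        {"cell_type": "code", "source": ["prnt(1)"]},
    ]
}
print(unknown_words(nb, ["this", "is", "a", "test", "of", "notebook"]))
```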

More Automation Sketches – Creating Student Notebook Releases

Tinkering a bit more with Github Actions, I’ve hacked together some sort of workflow for testing notebooks in a set of specified directories and then clearing the notebook output cells, zipping the notebooks into a release zip file, and then making the release zip file available via a github release page.

The test and release action is triggered by making a release with a body that contains a list of comma separated directory paths identifying the directories we want in the release. For example:

The following action is triggered by a release creation event:

name: example-release

on:
  release:
    types:
      - created

jobs:
  # The job id is arbitrary
  release-notebooks:
    runs-on: ubuntu-latest
    container:
      image: ouvocl/vce-tm351-monolith
    env:
      RELEASE_DIRS: ${{ github.event.release.body }}
    steps:
    - uses: actions/checkout@master
    - name: Install nbval (TH edition) and workflow tools
      run: |
        python3 -m pip install --upgrade
        python3 -m pip install
    - name: Restart postgres
      run: |
        sudo service postgresql restart
    - name: Start mongo
      run: |
        sudo mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}
    - name: Get directories
      run: |
        #IFS=$"\n" read -a file_paths <<< "${{ github.event.head_commit }}"
        IFS="," read -a file_paths <<< "${{ github.event.release.body }}"
        # Test all directories
        for file_path in "${file_paths[@]}"; do
          py.test  --nbval "$file_path" || continue
        done
      shell: bash
      # For testing...
      continue-on-error: true
    - name: Create zipped files
      run: |
        IFS="," read -a file_paths <<< "${{ github.event.release.body }}"
        for file_path in "${file_paths[@]}"; do
          tm351zip -r clearOutput -a "$file_path"
        done
        echo "Release paths: $RELEASE_DIRS" > release-files.txt
        # printf, unlike plain echo in bash, interprets the \n escapes
        printf "\n\nRelease zip contents:\n" >> release-files.txt
        tm351zipview >> release-files.txt
      shell: bash
    - name: Create Release
      id: create_release
      uses: softprops/action-gh-release@v1
      # The commit must be tagged for a release to happen
      # Tags can be added via Github Desktop app
      with:
        tag_name: ${{ github.ref }}-files
        name: ${{ }} files
        #body: "Release files/directories: ${RELEASE_DIRS}"
        body_path: release-files.txt
        files: |

It then runs the tests and then generates another release that includes the cleaned and zipped release files:

Ideally, we’d just add the zip file to the original release but I couldn’t spot a way to do that.

At the moment the action will publish the file release even if some notebook tests fail. A production action should fail if a test fails, or perhaps parse the release name and ignore the fails if the original release name contains a particular flag (for example, --force).
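The gating logic for a production version might look something like the following sketch, which parses the comma separated release body into directory paths and only allows publication if nothing failed or the release name carries the override flag. The function names are invented, and the --force flag is the hypothetical flag suggested above rather than anything the action currently supports.

```python
def release_dirs(body):
    """Split a comma separated release body into clean directory paths."""
    return [part.strip() for part in body.split(",") if part.strip()]


def should_publish(release_name, failed_dirs, force_flag="--force"):
    """Publish if no directories failed their tests, or if the release
    name contains the override flag."""
    return not failed_dirs or force_flag in release_name


dirs = release_dirs("Part 01 Notebooks, Part 02 Notebooks,")
print(dirs)
print(should_publish("v1.0", failed_dirs=["Part 02 Notebooks"]))
print(should_publish("v1.0 --force", failed_dirs=["Part 02 Notebooks"]))
```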

The idea of using the release form to create the release was to try to simplify the workflow and allow a release to be generated quite straightforwardly from a repository on the Github website.