Supporting Playful Exploration of Data Clustering and Classification Using datadraw

One of the most powerful learning techniques I know that works for me is play, the freedom to explore an idea or concept or principle in an open-ended, personally directed way, trying things out, test them, making up “what if?” scenarios, and so on.

Playing takes time of course, and the way we construst courses means that we donlt give students time to play, preferring to overload them with lots of stuff read, presumably on the basis that stuff = value.

If I were to produce a 5 hour chunk of learning material that was little more three or four pages of text, defining various bits of playful activity, I suspect that questions would be asked on the basis that 5 hours of teaching should include lots more words… I also suspect that the majority of students would not know how to play consructively within the prescribed bounds for that length of time.

Whatever.

In passing, I note this rather neat Python package, drawdata, that plays nice with Jupyter notebooks:

Example of use drawdata.draw_scatter() mode

Select a group (a, b, or c), draw a buffered line, and it will be filled (ish) with random dots. Click the copy csv button to grab the data into the clipboard, and then you can retireve it from there into a pandas dataframe:

Retrieve data from clipboard into pandas dataframe

At the risk of complicating the UI, I wonder about adding a couple more controls, one to tune with width of buffered line (and also ensure that points are only generated inside the line envelope), another to set the density of the points.

Another tool allows you to generate randonly sampled points along a line:

I note this could be a limiting case of a zero-width line in a draw-data() widget with a controllable buffer size.

Could using such a widget in a learning activity provide an example of technology enhanced learning, I wonder?! (I still don’t know what that phrase is supposed to mean…)

For example, I can easily imagine creating a simple activity where students get to draw different distributions and then run their own simple classifiers over them. The playfulness aspect would come in when students starting wondering about how different datas groups might interact, or how linear classifiers might struggle with particular multigroup distributions.

As a related example of supporting such palyfulness, the tensorflow playground provides several different test distributions with different interesing properties:

Distributions in tensorflow playground

To run your own local version of tensflow playground via a jupyter-server-proxy, see innovationOUtside/nb_tensorflow_playground_serverproxy.

With datadraw, students could quite easily create their own test cases to test their own understanding of how a particular classifier works. To my mind, developing such an understanding is supported if we can also visualise the evolution of a classifier over time. For example, the following animation (taken from some material I developed for a first year module that never made it past the “optional content” stage) shows the result of training a simple classifier over a small dataset with four groups of points.

Evolution of a classifier

See also: How to Generate Test Datasets in Python with scikit-learn, a post on the Machine Learning Mastery blog, and Generating Fake Data – Quick Roundup, which summarises various other takes on generating synthetic data.

PS This also reminds me a little bit of Google Correlate (for example,
Google Correlate: What Search Terms Does Your Time Series Data Correlate With?), where you could draw a simple timeseries and then try to find search terms on Google Trends with the same timeseries behaviour. On a quick look, none of the original URLs I had for that seem to work anymore. I’m not sure if it’s still available via Google Trends, for example?

PPS Here’s another nice animation from Doug Blank demonstrating a PCA based classification: https://nbviewer.org/github/Calysto/conx-notebooks/blob/60106453bdb66a83da7c2741d7644b7f8ee94517/PCA.ipynb

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: