One of the ways our Data management and analysis (TM351) course differs from the traditional databases module it replaced is that we designed it to cover a full data pipeline: from data acquisition, through cleaning, management (including legal issues) and analysis, to visualisation and reporting. The database content takes up about a third of the course and covers key elements of relational and NoSQL databases. Half the course content is reading, half is practical. To help frame the way we created the course, we imagined a particular persona: a postdoc researcher who had to manage everything data-related in a research project.
Jupyter notebooks, delivered with PostgreSQL and MongoDB databases, along with OpenRefine, inside a VirtualBox virtual machine, provide the practical environment. Through the assessment model (two tutor-marked assessments as continuous assessment, and a final project-style assessment), we try to develop the idea that notebooks can be used to document small data investigations in a reproducible way.
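By way of illustration, the sort of cell a student might run against the VM's PostgreSQL database looks something like the following. This is a minimal sketch assuming pandas and SQLAlchemy are available in the environment; the connection string and table name are illustrative placeholders, not the module's actual configuration.

```python
# Minimal sketch: querying a PostgreSQL database from a notebook cell
# using pandas and SQLAlchemy. Connection details and the table name
# are hypothetical, not taken from the TM351 materials.
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to a local PostgreSQL instance inside the VM
engine = create_engine("postgresql://tm351:tm351@localhost:5432/tm351db")

# Pull a small sample into a DataFrame and inspect it inline
df = pd.read_sql("SELECT * FROM accidents LIMIT 10;", engine)
df.head()
```

Keeping the query, the result and the commentary together in one document is a large part of what makes the notebook useful as a record of a small investigation.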
The course has been running for several years now (the notebooks were still called IPython notebooks when we started), but literature we can use to post hoc rationalise the approach we’ve taken is now starting to appear…
For example, a preprint that recently appeared on arXiv, Three principles of data science: predictability, computability, and stability (PCS) (Bin Yu, Karl Kumbier, arXiv:1901.08152), provides a description of how to use notebooks in a way that resembles the model we use in TM351:
We propose the following steps in a notebook
1. Domain problem formulation (narrative). Clearly state the real-world question one would like to answer and describe prior work related to this question.
2. Data collection and relevance to problem (narrative). Describe how the data were generated, including experimental design principles, and reasons why data is relevant to answer the domain question.
3. Data storage (narrative). Describe where data is stored and how it can be accessed by others.
4. Data cleaning and preprocessing (narrative, code, visualization). Describe steps taken to convert raw data into data used for analysis, and why these preprocessing steps are justified. Ask whether more than one preprocessing method should be used and examine their impacts on the final data results.
5. PCS inference (narrative, code, visualization). Carry out PCS inference in the context of the domain question. Specify appropriate model and data perturbations. If necessary, specify null hypothesis and associated perturbations (if applicable). Report and post-hoc analysis of data results.
6. Draw conclusions and/or make recommendations (narrative and visualization) in the context of domain problem.
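Step 4 in particular maps directly onto the sort of activity we already ask students to carry out in the notebooks. As a minimal sketch of what "examine the impact of more than one preprocessing method" might look like in practice (synthetic data, pandas and NumPy assumed; nothing here is taken from the paper or the course materials):

```python
# Minimal sketch of step 4: compare the impact of two candidate
# preprocessing choices on a downstream summary.
# Synthetic data; the column name and imputation strategies are
# illustrative assumptions, not the paper's or the module's own.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
raw = pd.DataFrame({"reading": rng.normal(20, 5, 200)})
# Inject some missing values to clean up
raw.loc[rng.choice(200, 30, replace=False), "reading"] = np.nan

# Strategy A: drop rows with missing readings
cleaned_drop = raw.dropna()

# Strategy B: impute missing readings with the median
cleaned_impute = raw.fillna(raw["reading"].median())

# Examine how the preprocessing choice affects the summary statistics
summary = pd.DataFrame({
    "drop_missing": cleaned_drop["reading"].describe(),
    "impute_median": cleaned_impute["reading"].describe(),
})
summary
```

The narrative cells around a fragment like this would record why one strategy is preferred and what the differences mean for the domain question, which is exactly the documentation habit the PCS steps are trying to encourage.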
As early work on a new data science course kicks off, a module I’m hoping will be notebook-mediated, I’m thinking now would be a good time for us to revisit how we use the notebooks for teaching and how we expect the students to use them for practice and assessment, and to pull together some comprehensive notes on emerging notebook pedagogy from a distance HE perspective.