Open Research Data Processes: KMi Crunch – Hosted RStudio Analytics Environment
One of the possible barriers to widespread adoption of open notebook science is knowing where to start. Video reports of lab experiments hosted on YouTube can be easily embedded in a hosted WordPress blog; a MediaWiki wiki can be used to provide one page per experiment, with change tracking/history on each page and a shadow page for commentary and discussion; GitHub can be used to provide a version control environment for software code, results data, project pages and documentation. For tabulated data, Google Spreadsheets provides a hosting environment and an API that lets you treat the data as a database, and also lets you explore it dashboard style via a range of interactive visual filtering and charting components. Alternatively, a CKAN instance (such as is used to run thedatahub.org) offers data management and preview tools.
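As a rough illustration of what "treating a spreadsheet as a database" can look like, here is a minimal Python sketch that builds a query URL for a published Google spreadsheet. It assumes the sheet is publicly readable and that the Google Visualization query endpoint (gviz/tq) is available for it; the sheet key, the column letters and the helper name sheet_query_url are hypothetical placeholders, not anything from a real sheet.

```python
from urllib.parse import urlencode


def sheet_query_url(sheet_key: str, query: str) -> str:
    """Build a URL that asks a public Google spreadsheet for CSV rows.

    Assumption: the sheet is shared publicly and answers SQL-like
    queries through the Google Visualization (gviz/tq) endpoint.
    """
    # tqx=out:csv requests CSV output; tq carries the query itself.
    params = urlencode({"tqx": "out:csv", "tq": query})
    return f"https://docs.google.com/spreadsheets/d/{sheet_key}/gviz/tq?{params}"


# Hypothetical sheet key and columns, for illustration only.
url = sheet_query_url("HYPOTHETICAL_SHEET_KEY", "select A, B where B > 10")
print(url)
```

The resulting URL could then be handed to whatever CSV reader you prefer, so the spreadsheet behaves a little like a read-only query service.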
Keeping track of data analysis in an open way is also getting easier. In An R-chitecture for Reproducible Research/Reporting/Data Journalism, I briefly mentioned RPubs.com, a site that can be used to 1-click publish HTML reports of statistical analyses executed within the RStudio environment (I really need to do a proper post about this). But now there’s an example of another hosted solution from Fridolin Wild of the OU’s KMi: Crunch.
Crunch offers a hosted RStudio environment (so you can access RStudio via a browser) with public and private areas. The public areas allow you to post datasets, run scripts as a service, or publish results (Sweave-generated PDFs, or knitr-generated HTML reports, for example).
Crunch also incorporates a MySQL database for each user. (Scheduling and pipelining are also on the cards…)
Whilst developed as an application to support learning analytics (I think?), Crunch also provides a great demonstration of a more general open research data workbench. You can store, and publish, data sets, along with analysis scripts and reports generated by executing those scripts over your data set. Version control isn’t available at the moment (I think?) but RStudio does have Git/GitHub support, so that may be coming. The provision of a MySQL database means that data collections can be managed within a database environment. (From a data journalism, rather than an open/reproducible research, perspective, I did wonder whether it would be possible to situate something like Scraperwiki on the same platform and replace its SQLite support with MySQL support, so a Scraperwiki scraper could be used to scrape data into a MySQL database that was then accessed from RStudio? Being able to wire MySQL read/write access into Google Refine on the same platform could also be interesting..;-)
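To make the "scraper writes to a database, analysis reads from it" idea a bit more concrete, here is a minimal Python sketch of that pattern. It is not how Crunch or Scraperwiki actually work: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a per-user MySQL database, and the table name, column names, helper functions and sample rows are all made up for illustration.

```python
import sqlite3


def store_rows(conn, rows):
    """Scraper side: write (site, value) rows into a shared table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (site TEXT, value REAL)")
    conn.executemany("INSERT INTO readings (site, value) VALUES (?, ?)", rows)
    conn.commit()


def mean_by_site(conn):
    """Analysis side: read the same table back and summarise it."""
    cur = conn.execute(
        "SELECT site, AVG(value) FROM readings GROUP BY site ORDER BY site"
    )
    return dict(cur.fetchall())


# In-memory SQLite stands in for the MySQL database (an assumption).
conn = sqlite3.connect(":memory:")

# Made-up rows standing in for scraped data.
store_rows(conn, [("a", 1.0), ("a", 3.0), ("b", 10.0)])

summary = mean_by_site(conn)
print(summary)
```

The point of the design is simply that the scraper and the analysis script only need to agree on the table layout; swapping the stand-in for a real MySQL connection would mean changing the connection line and the parameter placeholder style, not the overall shape.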
I’m not sure about the extent to which the OU Library is taking an interest in the development of Crunch, but providing best practice support and advice in the orchestration of information and data handling tools seems to me to be in-scope for the academic research librarian, in much the same way as advising on the use of bibliography data management tools used to be…? (For a recent take on this, see Dorothea Salo’s Ariadne article Retooling Libraries for the Data Challenge.)