Prompted by a conversation with Rufus Pollock over lunch today, in part about data containerisation and the notion of “frictionless” data that can be easily discovered and is packaged along with metadata that helps you import it into other tools or applications (such as a database), I’ve been musing about what a frictionless data analysis working environment might look like: somewhere I could write something like `fda --datapackage http://example.com/DATAPACKAGE --db postgres --client rstudio ipynb` and that would then:
- generate a fig script that wires the necessary containers together (e.g. as per something like Using Docker to Build Linked Container Course VMs; there’s a sketch of the sort of thing this might produce after the list);
- download the data package from the specified URL, unbundle it, generate an appropriate SQL init script for the specified database, then fire up the database and run the script against it to create any necessary tables and load the data in (again, sketched after the list);
- fire up any specified client applications (IPython notebook and RStudio server in this example) and ideally seed them with SQL magic or database connection statements that automatically define an appropriate connection to the database that’s just been configured (also sketched below);
- launch browser tabs that contain the clients;
- it might also be handy to be able to mount local directories against directory paths in the client applications, so I could keep my R scripts in one directory on my own desktop and my IPython notebooks in another, and have those analysis scripts visible from within the actual client applications.
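To make that a bit more concrete, here’s a minimal sketch (in Python, for want of a better glue language) of the sort of thing the `fda` command might generate and run: a fig.yml wiring a Postgres container to IPython notebook and RStudio Server containers, followed by `fig up` and a couple of browser tabs. The image names, ports, passwords and directory paths are all placeholders of my own rather than anything that currently exists.

```python
# Sketch only: write a fig.yml that links a Postgres container to
# IPython notebook and RStudio Server containers, bring the stack up,
# then open browser tabs onto the clients. All names/paths are placeholders.
import subprocess
import webbrowser

FIG_YML = """db:
  image: postgres
  environment:
    - POSTGRES_PASSWORD=frictionless

notebook:
  image: ipython/notebook
  links:
    - db
  ports:
    - "8888:8888"
  volumes:
    - ./notebooks:/notebooks

rstudio:
  image: rocker/rstudio
  links:
    - db
  ports:
    - "8787:8787"
  volumes:
    - ./rscripts:/home/rstudio
"""

with open("fig.yml", "w") as f:
    f.write(FIG_YML)

# fig (the docker-compose predecessor) reads fig.yml from the current directory
subprocess.check_call(["fig", "up", "-d"])

# launch browser tabs onto the freshly started clients
webbrowser.open("http://localhost:8888")  # IPython notebook
webbrowser.open("http://localhost:8787")  # RStudio Server
```

The `volumes` entries are what would give the clients sight of local script directories, as per the last bullet above.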
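The database seeding step could, in principle, be driven off the datapackage.json metadata itself. The sketch below assumes a tabular data package whose resources carry a `schema.fields` block and a CSV `path`; the type mapping is deliberately crude, and the generated file is intended to be run through psql (the `\copy` lines are psql meta-commands).

```python
# Sketch: turn a (tabular) data package's schema into a Postgres init script.
# Assumes datapackage.json lists CSV resources with a schema.fields block;
# the type mapping here is deliberately simplistic.
import json

TYPE_MAP = {
    "integer": "INTEGER",
    "number": "NUMERIC",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "datetime": "TIMESTAMP",
}

def sql_for_package(path="datapackage.json"):
    with open(path) as f:
        pkg = json.load(f)
    statements = []
    for resource in pkg.get("resources", []):
        table = resource["name"]
        cols = ", ".join(
            '"{0}" {1}'.format(field["name"],
                               TYPE_MAP.get(field.get("type"), "TEXT"))
            for field in resource["schema"]["fields"]
        )
        statements.append('CREATE TABLE "{0}" ({1});'.format(table, cols))
        # load the accompanying CSV file into the new table (psql meta-command)
        statements.append(
            "\\copy \"{0}\" FROM '{1}' CSV HEADER".format(table, resource["path"])
        )
    return "\n".join(statements)

if __name__ == "__main__":
    with open("init.sql", "w") as f:
        f.write(sql_for_package())
```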
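As for seeding the clients, one option would be to drop a pre-populated starter notebook into the mounted notebook directory. The sketch assumes the ipython-sql extension is available in the notebook container and reuses the placeholder credentials and `db` link name from the fig sketch above.

```python
# Sketch: write a starter notebook whose first cell opens a connection
# to the linked Postgres container via the ipython-sql magic.
# Hostname, credentials and database name are placeholders.
import nbformat

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell(
    "%load_ext sql\n"
    "%sql postgresql://postgres:frictionless@db:5432/postgres"
))
nb.cells.append(nbformat.v4.new_code_cell(
    "# example query against the freshly loaded data package tables\n"
    "# %sql SELECT * FROM some_table LIMIT 10"
))
nbformat.write(nb, "notebooks/starter.ipynb")
```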
The idea is that from a single command I can pull down a datafile, ingest it into a database, fire up one or more clients that are connected to that database, and start working with the data immediately. It’s not so different to double clicking on a file on your desktop and launching it into an application to start working on it, right?!
Can’t be that hard to wire up, surely?!;-) But would it be useful?
PS See also a further riff on this idea: Data Analysis Packages…?
I would say that before you pull data down from anywhere and reconstitute it in a local database, you should already have done an initial analysis with that data so you know what it represents. The more desirable workflow IMHO would be for a data provider to offer an API to manipulate and access the data, preferably with data quality assurances, so that you can use it as a mashup in the first order analysis. Only once you are satisfied with the depth and quality of the data, and when you need to seed a more defined or articulated model, would you go to the trouble of sucking the data out of the remote repo and reconstituting it in your own world. Otherwise stated, most analysis should happen on the data provider’s side, through open APIs and a wide area network SOA software architecture that brings multiple sources together.
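For what it’s worth, the sort of first order, provider-side query I have in mind might look something like the fragment below; the endpoint, parameters and response shape are entirely hypothetical, standing in for whatever query API a given provider exposes.

```python
# Sketch: first-order analysis against a (hypothetical) provider query API,
# rather than pulling the full data set down and reconstituting it locally.
import requests

resp = requests.get(
    "http://example.com/api/query",       # hypothetical provider endpoint
    params={"dataset": "DATAPACKAGE",     # placeholder dataset identifier
            "filter": "year>=2010",
            "limit": 100},
)
resp.raise_for_status()
records = resp.json()["records"]          # assumed response shape

# eyeball the shape and quality of the data before deciding whether
# it is worth reconstituting it in a local database at all
print(len(records), records[:3])
```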
I think you will come round to the above architecture if you project out into the future. Take for example the cancer genome database, or Facebook’s social graph: both are mammoth data sets that are immobilized by their sheer size. Mashing these up with other data sets requires an Internet-distributed SOA. As data sets are growing rapidly, the future will shape up like these early movers.