Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy-looking Apache tool: Apache Drill. A Java-powered service, Apache Drill lets you query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)
To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.
Having installed Apache Drill and run it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):
SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;
By default, queries over a CSV file ignore the header row and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.
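For example, a numbered-column query over the Hansard file might look something like the following (the column indices here are purely illustrative – I haven’t checked which columns they actually pick out):

SELECT COLUMNS[0], COLUMNS[1] FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;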
The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:
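As a rough sketch, the amended csv block in the dfs storage plugin configuration ends up looking something like this (the other settings shown are the plugin defaults):

"csv": {
  "type": "text",
  "extensions": ["csv"],
  "delimiter": ",",
  "skipFirstLine": true,
  "extractHeader": true
}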
Updating the configuration means we can start to run named column queries:
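For example, using the gender column from the Hansard dataset:

SELECT gender FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;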
The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file is updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).
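That is, the drill-override.conf file ends up containing something along these lines:

drill.exec: {
  sys.store.provider.local.path = "/my/configpath"
}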
Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:
CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;
This creates a new parquet data folder, /tmp/senti_post_v2.parquet, which can then be queried in the same way as the CSV file:
SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;
…but with a significant speed up, on some queries at least:
To quit, run !exit.
And finally, to make the Apache Drill service easier to use from code, wrapper libraries are available for R – the sergeant package – and for Python – the pydrill package.