Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy-looking Apache tool: Apache Drill. A Java-powered service, Apache Drill lets you query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)
To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.
Having installed Apache Drill and run it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):
SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;
By default, queries over a CSV file ignore the header row and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.
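For example, a numbered-column query over the Hansard file might look something like the following (the column indices here are purely illustrative – I haven’t checked which columns they actually pick out):

SELECT COLUMNS[0], COLUMNS[1] FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;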
The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:
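As a rough sketch, the amended csv block in the dfs storage plugin configuration ends up looking something like this (the other settings shown are the plugin defaults):

"csv": {
  "type": "text",
  "extensions": ["csv"],
  "delimiter": ",",
  "skipFirstLine": true,
  "extractHeader": true
}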
Updating the configuration means we can start to run named column queries:
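For example, using the gender column from the Hansard dataset:

SELECT gender FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;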
The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file is updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).
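That is, the drill-override.conf file ends up containing something along these lines:

drill.exec: {
  sys.store.provider.local.path = "/my/configpath"
}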
Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:
CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;
This creates a new parquet data folder, /tmp/senti_post_v2.parquet, which can then be queried in the same way as the CSV file:
SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;
…but with a significant speed up, on some queries at least:
To quit, run !exit.
And finally, to make the Apache Drill service easier to use from code, wrapper libraries are available for R – the sergeant package – and for Python – the pydrill package.