SQLite is a really handy file based database engine. When you put data into a database, it can make it easier to search; it also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.
Querying SQLite Databases Using Rich User Interfaces
Although SQLite comes with its own command line client, many people will find that the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.
franchise is a browser based UI that you can access via the cloud or run locally (https://github.com/hvf/franchise). If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database.) Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.
If you prefer something on the desktop, the cross-platform sqlitebrowser might suit your needs.
Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now or Heroku. For example, I’ve popped up three databases I scrape together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!).
datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map plugin will detect latitude and longitude columns in a data table and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.
Finding Data: SQLite Databases in the Wild
Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?
Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)
As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then simply query them using datasette. On a Mac, I can simply run:
datasette ~/Library/Application\ Support/Google/Chrome/Default/History
and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…
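If you find a likely looking file and want to check what is inside it before pointing datasette at it, a few lines of Python will list its tables. This is a minimal sketch using the standard library sqlite3 module; the list_tables helper and the throwaway demo database (with its visits table) are my own illustrative names, not part of any of the tools mentioned above.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the table names found in the SQLite file at db_path."""
    # Build a read-only URI so that a mistyped path raises an error
    # instead of silently creating a new, empty database file.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Demo on a throwaway database (hypothetical table) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.sqlite"
    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE visits (url TEXT, title TEXT)")
    conn.commit()
    conn.close()
    demo_tables = list_tables(demo_path)

print(demo_tables)
```

To try it on a real file, pass the path to the database you found, for example list_tables("~/path/to/some.sqlite").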
Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQL. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.
Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:
csvs-to-sqlite myfile.csv mydatabase.db
Whereas csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google Sheets to a SQLite database file using a similar command format.
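To get a feel for what this kind of conversion involves, here is a minimal sketch that loads a CSV file into a SQLite table using only the Python standard library. It is not how csvs-to-sqlite or sqlitebiter work internally; the csv_to_table helper and the sample data are made up for illustration, and it makes no attempt to guess column types, so every value is stored as text.

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so that spaces and quotes are safe."""
    return '"{}"'.format(name.replace('"', '""'))


def csv_to_table(conn, table, csv_file):
    """Load an open CSV file object into a new table; return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first row supplies the column names
    columns = ", ".join(quote_identifier(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {quote_identifier(table)} ({columns})")
    conn.executemany(
        f"INSERT INTO {quote_identifier(table)} VALUES ({placeholders})", reader
    )
    conn.commit()
    count_sql = f"SELECT COUNT(*) FROM {quote_identifier(table)}"
    return conn.execute(count_sql).fetchone()[0]


# Hypothetical sample data standing in for a CSV file on disk.
sample = io.StringIO("species,petal_length\nsetosa,1.4\nvirginica,5.5\n")
conn = sqlite3.connect(":memory:")
n_rows = csv_to_table(conn, "iris", sample)

# Values arrive as text, so cast before making numeric comparisons.
long_petals = conn.execute(
    "SELECT species FROM iris WHERE CAST(petal_length AS REAL) > 2"
).fetchall()
print(n_rows, long_petals)
```

Swapping sqlite3.connect(":memory:") for sqlite3.connect("mydatabase.db"), and the StringIO object for open("myfile.csv", newline=""), would write the table to a database file instead.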
Using SQLite from the Command Line
Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.
As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you to convert a data file to an in-memory SQLite database and query it directly.
For example, using the csvsql command from csvkit:
csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv
Or the simply named q:
q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"
Querying SQLite Databases from Programming Languages Such as Python and R
If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.
There are two main ways of using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 library.
The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:
library(sqldf)
sqldf("select * from iris limit 5")
In Python/pandas, you can use the pandasql package:
from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())
In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.
pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading the results of queries run against the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
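As a rough illustration of that pandas round trip, the following sketch writes two dataframes to an in-memory SQLite database with DataFrame.to_sql, then uses a JOINed query and pandas.read_sql_query to do the sort of lookup you might otherwise attempt with VLOOKUP. The table names, column names and values are all made up for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical data: some values keyed by area code, plus a code-to-name lookup.
readings = pd.DataFrame({"code": ["A1", "B2", "A1"], "value": [3, 7, 2]})
lookup = pd.DataFrame({"code": ["A1", "B2"], "name": ["Alpha", "Beta"]})

conn = sqlite3.connect(":memory:")

# Add each dataframe to the database as a table.
readings.to_sql("readings", conn, index=False)
lookup.to_sql("lookup", conn, index=False)

# Run a JOINed query and read the result back into a dataframe.
totals = pd.read_sql_query(
    """
    SELECT l.name AS name, SUM(r.value) AS total
    FROM readings AS r
    JOIN lookup AS l ON r.code = l.code
    GROUP BY l.name
    ORDER BY l.name
    """,
    conn,
)
print(totals)
```

Connecting to a file path rather than ":memory:" would persist the tables so that they could also be explored with the other tools described in this post.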
Using SQLite in Your Own Applications
If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.
As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host so that other computers can query it programmatically.
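As a sketch of what querying a datasette programmatically might look like from Python, the following builds a JSON API URL for a SQL query. The URL pattern (a .json suffix on the database name, with sql and _shape query parameters) reflects my understanding of the datasette JSON API and should be checked against the datasette documentation for the version you are running; the host and database names are placeholders, and the helper function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def datasette_query_url(base_url, database, sql):
    """Build a JSON API URL that runs a SQL query against a datasette database."""
    # _shape=array asks for the rows as a plain list of objects (assumed option).
    params = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{params}"


def run_query(base_url, database, sql):
    """Fetch the query results (this needs a live datasette instance)."""
    with urlopen(datasette_query_url(base_url, database, sql)) as response:
        return json.load(response)


# Placeholder host and database name; no network request is made here.
example_url = datasette_query_url(
    "https://example.com", "mydatabase", "select 1 as one"
)
print(example_url)
```

Calling run_query with the address of a running datasette instance would then return the rows as Python data structures, ready for further processing.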
If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.
Another way of using SQLite databases in a Jupyter environment is to use ScriptedForms. For example, here’s one of my own recipes for Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.
SQLite Database Admin Tools
As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.
The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.
SQLite is a lightweight, in-memory and file based database that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.
PS Here are some more ways of using SQLite:
- as a spatial database using [
Notebook style interfaces, in which content blocks are organised in separate cells that can be moved up or down a document, are starting to look as if their time may have come. Over the last week, I’ve come across two examples.
The first, an early preview of the OU’s under development OpenCreate authoring environment, uses an outliner style editor to support the creation of a weekly study planner and topics within each week, and a notebook style interface for editing the topical content pages. I would show screenshots but I’ve pre-emptively been told not to post videos or screenshots…
The second is an open project – a live demo and the code repository are available – and it comes in the form of Franchise, a simple app service that delivers a rich, browser based SQL query engine for querying simple data files (read the strapline and the name makes punful sense!).
Launching the service provides you with an interface that lets you connect to a database, or load in a data file, either by selecting it from a file browser or just dragging it onto the launch page.
Uploading a CSV document creates a SQLite3 database containing the data in a single table.
Selecting the data table generates a SQL query that reveals the column names. Running the query generates a preview of the data table and also makes the output queryable as a “tagged” table.
The resulting table can then be queried using the tag name:
You can also use the download button to download the results table in a variety of formats:
If the resulting table has numerical columns, you can display the table using a range of charts, such as a bar chart.
For now, it looks as if the charts are quite simplistic – for example, we can’t stack or group the bars:
Other charts are available in a context sensitive way. For example, if there are two numerical columns we can plot a scatter chart. Line charts are also available.
If the dataset contains latitude and longitude data, we can use the data to plot points on a map.
For those new to SQL, there’s a handy cribsheet at the bottom of the page:
(If SQL is new to you, you might also find things like this handy: Asking Questions of Data – Garment Factories Data Expedition.)
We can also add textual commentary to the notebook in the form of markdown cells.
The markdown is styled responsively – but I couldn’t see how to go to “preview” mode where the styling is applied but the markdown modifiers are hidden?
Cells are archived rather than deleted:
Although they can be deleted, as well as restored, from the archive.
Cells can also be reordered – click on the right hand sidebar of a cell to drag it into a slot above or below another cell, or alongside one.
Cells can also be duplicated, in which case they appear alongside the cloned cell.
The side by side view allows you to look at the effect of changing a query compared to its original form.
I was quite excited by the idea that you could download the notebook:
and export it as an HTML file:
I had expected this to generate a standalone HTML file, but that appears not to be the case, at least for now. Instead, the cell data is packed into a JSON object:
and then passed to either a local Franchise server, or the web based one.
As a quick tool for querying data, Franchise looks to be pretty handy, although you soon realise how lacking in control it is over chart styles and axis labelling, for example (at least in its early form). If you could export standalone HTML, it would also make it more useful as an asset generating tool, but I guess it’s still early days.
According to a release thread – Franchise – An Open-Source SQL Notebook (franchise.cloud) – it looks as if a standalone electron app version is on the way. (In the meantime, I followed the build instructions from the repo README to produce a quick docker container: psychemedia/franchise.)
The ability to get started querying data using SQL without the need to install anything offers a way of having a quick chat with a file based dataset. (I couldn’t get it to work with Excel or JSON files, and didn’t try a SQL file or connecting to a separate database server.)
At the moment, I don’t think you can connect to a Google spreadsheet, so you have to download one, although a SQL like API is available for Google Sheets (eg I used it for this simple SQL query interface to Google spreadsheets way back when).
From a getting started with data conversations perspective, though, this offers quite a nice on ramp to a SQL query environment without the need to worry about the DBA (database admin) chores of setting up a database, defining tables, importing the data and so on.
I also wonder if it might act as a gateway to more aggressive and powerful query engines that are capable of querying over large and multiple datasets contained in local files? Things like Apache Drill, for example?
- Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language
- Querying Large CSV Files With Apache Drill
- Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files
- Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column), and just return those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here. The support that appears in “where support in” is a column, as is candinit; the support that follows each “from” is the table the columns are being pulled from.
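To see how the subquery behaves, here is a minimal sketch that runs the same query against a tiny, made-up support table in an in-memory SQLite database using Python. The names in the rows are invented for illustration and are not taken from the Scraperwiki data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table called support that also has a column called support, as in the query.
conn.execute("CREATE TABLE support (candinit TEXT, support TEXT)")
conn.executemany(
    "INSERT INTO support VALUES (?, ?)",
    [
        ("A Smith", "B Jones"),  # B Jones is also a candidate (see last row)
        ("A Smith", "C Brown"),  # C Brown is not a candidate
        ("B Jones", "D Green"),  # D Green is not a candidate
    ],
)

# Keep only the rows where the supporter also appears in the candidate column.
rows = conn.execute(
    "SELECT * FROM support WHERE support IN (SELECT candinit FROM support)"
).fetchall()
print(rows)
```

Only the first row is returned, because B Jones is the only supporter who also appears as a candidate.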
Here’s the result of running the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other: