
Seven Ways of Making Use of SQLite

SQLite is a really handy file based database engine. Putting data into a database can make it easier to search, and a database also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.

Querying SQLite Databases Using Rich User Interfaces

Whilst SQLite comes with its own command line client, many people will find the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.

franchise is a browser based UI that you can access via the cloud or run locally (https://github.com/hvf/franchise). If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database.) Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.

If you prefer something on the desktop, the cross-platform sqlitebrowser might suit your needs.

Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now or Heroku. For example, I’ve popped up three databases I scrape together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!).
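For example, publishing a database to Heroku is essentially a one-liner (the database filename here is just a placeholder; datasette also supports other publish targets):

datasette publish heroku mydatabase.db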

datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map plugin will detect latitude and longitude columns in a data table and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.
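Installing the plugin is just a pip install away, after which it is picked up automatically the next time datasette is run (the database filename is a placeholder):

pip install datasette-cluster-map
datasette mydatabase.db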

Finding Data: SQLite Databases in the Wild

Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?

Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)

As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then simply query them using datasette. On a Mac, I can simply run:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…
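For example, Chrome’s History database includes a urls table, so a query along the following lines (a sketch; the exact schema may vary between browser versions) pulls out your most visited pages:

SELECT url, title, visit_count FROM urls ORDER BY visit_count DESC LIMIT 10;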

Ingesting Data

Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQL. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.

Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:

csvs-to-sqlite myfile.csv mydatabase.db

Whilst csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google Sheets to a SQLite database file using a similar command format.
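The invocation is along the following lines (a sketch: the filenames are placeholders and the output option placement may differ, so check the sqlitebiter docs):

sqlitebiter -o mydatabase.db file myfile.csv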

Using SQLite from the Commandline

Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.

As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you to convert a data file to an in-memory SQLite database and query it directly.

For example, using the csvsql command from csvkit:

csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv

Or textql:
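A representative invocation looks something like this (a sketch based on my reading of the textql README; the flags and implicit table naming may differ across versions, and the file is a placeholder):

textql -header -sql "select * from myfile" myfile.csv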

Or the simply named q:

q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"

Querying SQLite Databases from Programming Languages Such as Python and R

If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.

There are two main ways for using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 package.
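For example, a minimal Python sketch (the database filename is a placeholder) might look like this:

import sqlite3

#Connect to (or create) a file based SQLite database
conn = sqlite3.connect("mydatabase.db")

#List the tables in the database
for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(row)

conn.close()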

The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:

library(sqldf)
sqldf("select * from iris limit 5")

In Python/pandas, you can use the pandasql package:

from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())

In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as csvsql, textql and q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.

pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading results from queries onto the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
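As a minimal sketch of the pandas route (the table name, column values and database filename are all placeholders), something like the following round-trips a dataframe through a SQLite database:

import sqlite3
import pandas as pd

conn = sqlite3.connect("lookup.db")

#Write a dataframe into a database table, replacing it if it already exists
df = pd.DataFrame({"code": ["E92000001"], "name": ["England"]})
df.to_sql("geocodes", conn, if_exists="replace", index=False)

#Read the results of a query back into a dataframe
print(pd.read_sql_query("SELECT * FROM geocodes", conn))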

Using SQLite in Your Own Applications

If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.

One way to start is to use SQLite completely within the browser. Accessing desktop applications from a webpage is typically a no-no because of browser security restrictions, but SQLite is quite a light application, so it can – and has been – compiled to Javascript so that it can be imported as a Javascript library and run from within a webpage: sql.js. You can see an example of how it can be used to provide a simple browser based, SQLite powered data explorer, running solely within a browser, here: official demo or SQLite Viewer.

As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host and other computers can then query it programmatically.
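For example, a database published with datasette can be queried programmatically by passing SQL to its JSON endpoint. Here’s a rough sketch (the host, database and table names are all placeholders):

import requests

#Hypothetical datasette host and database name
url = "https://mydatasette.example.com/mydatabase.json"
params = {"sql": "SELECT * FROM mytable LIMIT 5"}

print(requests.get(url, params=params).json())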

If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.
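The gist of the kernel gateway approach is that annotated notebook cells become API endpoints. Here’s a rough sketch of what such a cell might look like (the annotation comment and REQUEST variable follow the kernel gateway’s notebook-http mode; the database file, table and column names are hypothetical):

# GET /code/:code
import json
import sqlite3

conn = sqlite3.connect("ons_codes.db")

#REQUEST is injected by the kernel gateway and contains the request details
req = json.loads(REQUEST)
code = req["path"]["code"]

row = conn.execute("SELECT * FROM codes WHERE code = ?", (code,)).fetchone()
print(json.dumps({"result": row}))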

Another way of using SQLite databases in a Jupyter environment is to use Scripted Forms. For example, here’s one of my own recipes, Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter, that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.

SQLite Database Admin Tools

As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.

The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.

Summary

SQLite is a lightweight database, file based or in-memory, that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.

PS Here are some more ways of using SQLite:

Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate a variety of texts from Python pandas data frames.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions I’ve been doodling with to pytracery, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

```{python}
import sys
print(sys.executable)
```

but R insisted on using the base Python path on my Mac that was not the path I wanted to use… The fix turned out to be setting the engine…

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys
print(sys.executable)
```

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put Python code into a file, call that into R, then pass an R dataframe as an argument to a called Python function and get a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

def pandas_row_mapper(row, rules, root, modifiers=base_english):
    ''' Function to parse single row of dataframe '''
    row = row.to_dict()
    rules = rules.copy()

    #Add each column value in the row as a flat rule keyed by the column name
    for k in row:
        rules[k] = str(row[k])

    #Build the grammar once, after all the row-based rules have been added
    grammar = tracery.Grammar(rules)
    if modifiers is not None:
        if isinstance(modifiers, list):
            for modifier in modifiers:
                grammar.add_modifiers(modifier)
        else:
            grammar.add_modifiers(modifiers)

    return grammar.flatten(root)

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):
  return(df)

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}
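As a quick Python-side sanity check of the demo rules (the dataframe values here are just placeholders, and this assumes the file above has been imported as-is), something like the following should spiel one sentence per row:

#A minimal sketch using the demo rules defined above
df_test = pd.DataFrame({'code': ['Jo', 'Sam'], 'pos': ['first', 'second']})
print(pdt_demo(df_test).tolist())
#e.g. ['Jo was placed FIRST!', 'Sam was placed SECOND!']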

We can access a python environment using reticulate:

library(reticulate)

#Show conda environments
conda_list("auto")

#Use a particular, named conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment
py_module_available("tracery")

Now we can load in the python file – and the functions it defines – and then call one of the  loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically? (Perhaps there is a setting somewhere?)

```{r}
source_python("pd_tracery_demo.py")
df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))
df1
```

#Displays:
Jo	first	Jo was placed FIRST!
Sam	Second	Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R column name called name… )

So now I can start looking at converting sports reporting tropes like these:

into tracery story models I can call using my pandas/pytracery hacks:-)

PS here’s a quick demo of inlining Python code:

library(reticulate)
rvar=1

#Go into python shell - this persists
repl_python()
#Access R variables with r.
pyvar=r.rvar+1

#Return to R shell
exit

#Access Python variable with py$
py$pyvar

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generate trivia / facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

---
output:
  md_document:
    variant: gfm
---

The following recipe then takes results from the trivia queries and spiels the output:

if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Some More Rally Result Chart Sketches

Some more sketches, developing / updating one of the charts I first played with last year (the stage chart) and tinkering with something new.

First the stage chart – I’ve started pondering a couple of things with this chart to try to get the information density up a bit.

At a first attempt at updating the chart, I’ve started to look at adding additional marginal layers. In the example above:

  • vertical dashed lines separate out the different legs. As soon as I get the data to hand, I think it could make sense to use something like a solid line to show service, maybe a double solid line to show *parc fermé*; I’m not sure about additionally separating the days? (They’re perhaps implied by *parc fermé*? I need to check that…)
  • I added stage names *above* the chart  – this has the benefit of identifying stages that are repeated;
  • stage distances are added *below* the chart. I’ve also been wondering about adding the transit distances in *between* the stages;
  • driver labels – and positions – are shown to the left and the right.

As a second attempt, I started zooming in to just the stages associated with a particular leg. This encouraged me to start adding more detailed layers. These can be applied to the whole chart, but it may start to get a bit cluttered.

Here’s an example of a chart that shows three stages that make up a notional leg:

You’ll notice several additions to the chart:

  • the labels to the left identify the driver associated with each line. The number is currently the overall position of the driver at the end of the first stage in the leg, but I’m not sure if it should be the position at the end of the previous stage so it carries more information. The time is the gap to the overall leading driver at the end of the first stage;
  • the labels to the right show the overall positions and gap to overall leader at the end of the leg. The position label is in bold font if the driver position has improved over the leg (a switch lets you select whether this is a class rank improvement or an overall position improvement). Thinking about it, I could use italics for class improvement and bold for overall improvement to carry both pieces of information in the same label. The position is actually redundant (you can count…) so maybe it’d make more sense to give a position delta from the start of the leg (that is, the position at the end of the stage prior to the first stage shown in the current leg). The time delta is given in bold if it is better than at the start of the leg.
  • the red dots depict that the gap to the overall leader had *increased* for a driver by the end of the stage compared to the end of the previous stage. So a red dot means the car is further behind the leader at the end of the stage than they were at the end of the previous stage; this indicator could be rebased to show deltas between a target (“hero”) car and the other cars on the stage. The green dot shows that the time to the leader did not increase;
  • the grey labels at the top are a running count of the number of “wins in a row” a driver has had. There are options to choose other running counts (eg stage wins so far), and flags available for colouring things like “took lead”, “retained lead”, “lost lead”.

As well as the stage chart, I started wondering about an “ultimate stage report” for each stage, showing the delta between each driver and the best time achieved in a sector (that is, the time spent between two splits).

Here’s what I came up with at a first attempt. Time delta is on the bottom. The lower level grey bar indicates the time a driver lost relative to the “ultimate” stage. (The bar maxes out at the upper limit of the chart to indicate “more than” – I maybe need to indicate this visually eg with a dashed / broken line at the end of a maxed out bar.)

Within each driver area is a series of lollipop style charts. These indicate the gap between a driver and the best time achieved on the sector (first sector at the top of the group, last at the bottom). The driver label indicates the driver who achieved the best sector time. This chart could be rebased to show other gaps, but I need to think about that… The labels are coloured to indicate sector, and transparent to cope with some of the overlapping issues.

It’s also possible to plot this chart using a log scale:

This makes it easier to see the small gaps, as well as giving a far range on the delta. However, the log scale is harder to read for folk not familiar with them. It might be handy to put in a vertical dashed line for each power of 10 time (so a dashed line at 1s and 10s; the limit is 100s). It might also make sense to add a label to the right of the total delta bar to show what the actual delta time is.

So… tinkering… I was hoping to start to pull all the chart types I’ve been playing with together in a Leanpub book, but Leanpub is not free to play anymore unless you have generated over $10k of royalties (which I haven’t…). I’ve started looking at gitbook, but that’s new to me so I need to spend some time getting a feel for how to use it and to come up with a workflow /toolchain around it.

Tilted Axes and a Reminder About “Glanceability”

Via the Flowing Data blog (Roger Federer career in rankings and wins) comes this pretty chart from an SRF story, 20 Jahre, 20 Titel.

The rotated axes balance the chart and make the achievements of Connors, Federer and Lendl more “central” to the story. (SRF are a Swiss news organisation…)

I quickly copied and rotated the chart, and here’s how it looks with the axes arranged more traditionally:

The composition feels unbalanced, and at a glance feels like there is less of a story in the chart. (Glanceability… that takes me back…. I first heard the phrase from James Cridland – Learnin’ from Virgin – then, as I noted in Powerpoint Presentations That Support Glanceability, Lorcan Dempsey dug into it a little bit more: Glanceability.)

It also reminds me of “banking to 45 degrees”, “the idea that the average line slope in a line chart should be 45º. This has been dubbed banking to 45º and has turned into one of the bits of common wisdom in visualization as determining the ideal aspect ratio [although it more specifically relates to] the comparison between the slopes of two lines, and the slope is the average between those two lines. So if the goal is to be able to compare the rates of change between lines, the 45º average slope makes sense as a rule” Robert Kosara blog post on Aspect Ratio and Banking to 45 Degrees.

The original statement comes from Cleveland, W.S., McGill, M.E. and McGill, R., (1988) The shape parameter of a two-variable graph, Journal of the American Statistical Association, 83(402), pp.289-300 [JSTOR],  and I think I should probably read it again…

By the by, SRF are one of the best news orgs I know for sharing their working – the recipes for the above story can be found on their Github repo at srfdata/2018-01-roger-federer – so I could probably have recreated the unrotated chart directly from that source, if I’d had the time to play.

PS see also: Five ways to read a scatterplot on the Datawrapper blog.

Reproducible Notebooks Help You Track Down Errors…

A couple of days ago on the Spreadsheet Journalism blog, I came across a post on UK Immigration Raids.

The post described a spreadsheet process for reshaping a couple of wide format sheets in a spreadsheet into a single long format dataset to make the data easier to work with.

One of my New Year resolutions was to try to look out for opportunities to rework and compare spreadsheet vs. notebook data treatments, so here’s a Python pandas reworking of the example linked to above in a Jupyter notebook: UK Immigration Raids.ipynb.

You can run the notebook “live” using Binder/Binderhub:

Look in the notebooks folder for the UK Immigration Raids.ipynb notebook.

A few notes on the process:

  • there was no link to the original source data in the original post, although there was a link to a local copy of it;
  • the original post had a few useful cribs that I could use to cross check my working with, such as the number of incidents in Bristol;
  • the mention of dirty data made me put in a step to find mismatched data;
  • my original notebook working contained an error – which I left in the notebook to show how it might propagate through and how we might then try to figure out what the error was having detected it.

As an example, it could probably do with another iteration to demonstrate a more robust process with additional checks that transformations have been correctly applied to data along the way.

Anyway, that’s another of my New Year resolutions half implemented: *share your working, share your mistakes*.

Fragment – Carillion-ish

A quick sketch of some companies that are linked by common directors based on a list of directors seeded from Carillion PLC.

The data was obtained from the Companies House API using the Python chwrapper package and some old code of my own that I’ll share once I get a chance to strip the extraneous cruft out of the notebook it’s in.

The essence of the approach / recipe is an old one that I used to use with OpenCorporates data, as described here: Mapping corporate networks-with Opencorporates.

Note the sketch doesn’t make claims about anything much. The edges just show that companies are linked by the same directors.

A better approach may be to generate a network based on control / ownership registration data but I didn’t have any code to hand to do that (it’s on my to do list for my next company data playtime!).

One way of starting to use this sort of structure is to match companies that appear in the network with payments data to see the actual extent of public body contracting with Carillion group companies. For other articles on Carillion contracts, see eg Here’s the data that reveals the scale of Carillion’s big-money government deals.