
Scraping ASP Web Pages

For a couple of years now, I’ve been using a Python based web scraper that runs once a day on morph.io to scrape planning applications from the Isle of Wight website into a simple SQLite database. (It actually feeds a WordPress plugin I started tinkering with to display currently open applications in a standing test blog post. I really should tidy that extension up and blog it one day…)

In many cases you can get a copy of the HTML content of the page you want to scrape simply by making an HTTP GET request to the page. Some pages, however, display content at a particular URL as the result of a form submission on that page, which makes an HTTP POST request to the same URL and gets content back depending on the POSTed form variables.

In some cases, such as the Isle of Wight Council planning applications page, the form post is masked as a link that fires off a Javascript function, which posts the form content in order to obtain a set of query results:

The Javascript function draws on state baked into the page to make the form request. This state is required in order to get a valid response – and the list of current applications:

We can automate the grabbing of this state as part of our scraper by loading the page, grabbing the state data, mimicking the form content and making the POST request that would otherwise be triggered by the Javascript function run as a result of clicking the “get all applications” link:

import requests
from bs4 import BeautifulSoup

#url is the address of the planning applications page

#Get the original page
response = requests.get(url)

#Scrape the state data we need to validate the form request
soup = BeautifulSoup(response.content, 'html.parser')
viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']

params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '', '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation, 'q': 'Search the site...'}

#Use the validation data when making the request for all current applications
r = requests.post(url, data=params)

In the last couple of weeks, I’ve noticed daily errors from morph.io trying to run this scraper. Sometimes errors come and go, perhaps as a result of the server on the other end being slow to respond, or maybe an edge case in the scraped data causing an error in the scraper, but this time the error seemed to persist, so I revisited the scraper.

Running the scraper script locally, it seemed that my form request wasn’t returning the list of applications: it was just returning the original planning applications page. So why had my script stopped working?

Scanning the planning applications page HTML, all looked to be much as it was before, so I clicked through on the all applications link and looked at the data now being posted to the server by the official page using my Chrome browser’s Developer Tools (which can be found from the browser View menu):

Inspecting the form data, everything looked much as it had done before, perhaps except for the blank txt* arguments:

Adding those in to the form didn’t fix the problem, so I wondered if the page was now responding to cookies, or was perhaps sensitive to the user agent?

We can handle that easily enough in the scraper script:

import requests
from bs4 import BeautifulSoup

#Use a requests session rather than making simple requests - this should allow the setting and preservation of cookies
# (Running in do not track mode can help limit the cookies that are set to essential ones)
session = requests.Session()

#We can also add a user agent string so the scraper script looks like a real browser...
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
session.headers.update(headers)

response = session.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']

params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '', '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation, 'q': 'Search the site...'}

#Get all current applications using the same session
r = session.post(url, headers=headers, data=params)

But still no joy… so what headers were being used in the actual request on the live website?

Hmmm… maybe the server is now checking that requests are being made from the planning applications webpage, using the Host, Origin and/or Referer headers? That is, maybe the server is only responding to requests it thinks are being made from its own web pages on its own site?

Let’s add some similar data to the headers in our scripted request:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
session.headers.update(headers)

response = session.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']

params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '', '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation, 'q': 'Search the site...'}

#Add in some more header data...
#Populate the referrer from the original request URL
headers['Referer'] = response.request.url
#We could (should) extract this info by parsing the Referer; hard code for now...
headers['Origin'] = 'https://www.iow.gov.uk'
headers['Host'] = 'www.iow.gov.uk'

#Get all current applications
r = session.post(url, headers=headers, data=params)

And… success:-)

*Hopefully by posting this recipe, the page isn’t locked down further… In mitigation, I haven’t described how to pull the actual planning applications data off the page…

Seven Ways of Making Use of SQLite

SQLite is a really handy file based database engine. Putting data into a database can make it easier to search; a database also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.

Querying SQLite Databases Using Rich User Interfaces

Whilst SQLite comes with its own command line client, many people will find the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.

franchise is a browser based UI that you can access via the cloud or run locally (https://github.com/hvf/franchise). If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database.) Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.

If you prefer something on the desktop, the cross-platform sqlitebrowser might suit your needs.

Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now or Heroku. For example, I’ve popped up three databases I scrape together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!).

datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map plugin will detect latitude and longitude columns in a data table and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.

Finding Data: SQLite Databases in the Wild

Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?

Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)

As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then simply query them using datasette. On a Mac, for example, I can run:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…
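If you’d rather poke at the same database from Python, here’s a minimal sketch using the standard sqlite3 module; it assumes the default Chrome History location on a Mac and the urls table Chrome uses to record visits, and copies the file first because Chrome locks the live database while it’s running:

import sqlite3
import shutil
from pathlib import Path

#Default Chrome history database location on a Mac (assumption - adjust for your own setup)
history = Path.home() / "Library/Application Support/Google/Chrome/Default/History"

#Chrome locks the live file while the browser is running, so work on a copy
shutil.copy(history, "History_copy")
conn = sqlite3.connect("History_copy")

#The urls table records each page visited along with a visit count
for url, title, visits in conn.execute(
        "SELECT url, title, visit_count FROM urls ORDER BY visit_count DESC LIMIT 10"):
    print(visits, title, url)

conn.close()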

Ingesting Data

Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQLite. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.

Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:

csvs-to-sqlite myfile.csv mydatabase.db

Whilst csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google-Sheets to a SQLite database file using a similar command format.

Using SQLite from the Commandline

Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.

As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you to convert a data file to an in-memory SQLite database and query it directly.

For example, using the csvsql command from csvkit:

csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv

Or textql:

Or the simply named q:

q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"

Querying SQLite Databases from Programming Languages Such as Python and R

If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.

There are two main ways for using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 package.
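For example, here’s a minimal Python sketch using sqlite3 (the database filename and table name are just placeholders for illustration):

import sqlite3

#Connect to a (hypothetical) SQLite database file
conn = sqlite3.connect("mydatabase.db")

#Run a query and iterate over the result rows
for row in conn.execute("SELECT * FROM mytable LIMIT 5"):
    print(row)

conn.close()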

The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:

library(sqldf)
sqldf("select * from iris limit 5")

In Python/pandas, you can use the pandasql package:

from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())

In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as csvsql, textql and q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.

pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading the results of queries on the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
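As a rough sketch of that pandas round trip (the table and file names here are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect("lookup.db")

#Add a dataframe to the database as a table
df = pd.DataFrame({"code": ["E06000046"], "name": ["Isle of Wight"]})
df.to_sql("codes", conn, if_exists="replace", index=False)

#Read the results of a query on the database back into a dataframe
print(pd.read_sql("SELECT * FROM codes WHERE name LIKE '%Wight%'", conn))

conn.close()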

Using SQLite in Your Own Applications

If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.

One way to start is to use SQLite completely within the browser. Accessing desktop applications from a webpage is typically a no-no because of browser security restrictions, but SQLite is quite a light application, so it can – and has been – compiled to Javascript so that it can be imported as a Javascript library and run from within a webpage: sql.js. You can see examples of how it can be used to provide a simple browser based, SQLite powered data explorer, running solely within a browser, here: official demo or SQLite Viewer.

As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host and other computers can then query it programmatically.
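As a hedged sketch of what that programmatic access might look like (the host and database names are made up; datasette exposes JSON versions of its pages at .json URLs and accepts a sql parameter):

import requests

#Hypothetical datasette instance and database name - swap in your own
url = "https://example-datasette.herokuapp.com/mydb.json"

params = {
    "sql": "SELECT * FROM mytable LIMIT 5",
    "_shape": "array"  #return the rows as a simple list of dicts
}

print(requests.get(url, params=params).json())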

If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.

Another way of using SQLite databases in a Jupyter environment is to use Scripted Forms. For example, here’s one of my own recipes, Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter, that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.

SQLite Database Admin Tools

As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.

The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.

Summary

SQLite is a lightweight, file based (or in-memory) database that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.

PS Here are some more ways of using SQLite:

Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate a variety of texts from Python pandas data frames.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions to pytracery I’ve been doodling with, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

```{python}
import sys
print(sys.executable)
```

but R insisted on using the base Python path on my Mac, which was not the path I wanted to use… The fix turned out to be setting the engine path…

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys
print(sys.executable)
```

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put Python code into a file, call that into R, then pass an R dataframe as an argument to a called Python function and get a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

def pandas_row_mapper(row, rules, root,  modifiers=base_english):
    ''' Function to parse single row of dataframe '''
    row=row.to_dict()
    rules=rules.copy()

    for k in row:
        rules[k] = str(row[k])

    grammar = tracery.Grammar(rules)
    if modifiers is not None:
        if isinstance(modifiers, list):
            for modifier in modifiers:
                grammar.add_modifiers(modifier)
        else:
            grammar.add_modifiers(modifiers)

    return grammar.flatten(root)

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):
  return(df)

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}
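Before wiring this into R, we can check the functions behave as intended directly from Python (the dataframe values are just test data):

import pandas as pd
from pd_tracery_demo import pdt_demo

#Simple test dataframe with columns matching the rule keys
df = pd.DataFrame({'code': ['Jo', 'Sam'], 'pos': ['first', 'second']})

#Apply the demo grammar to each row
print(pdt_demo(df))
#e.g. "Jo was placed FIRST!" / "Sam was placed SECOND!"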

We can access a python environment using reticulate:

library(reticulate)

#Show conda environments
conda_list("auto")

#Use a particular, named conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment
py_module_available("tracery")

Now we can load in the python file – and the functions it defines – and then call one of the loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically? (Perhaps there is a setting somewhere?)

```{r}
source_python("pd_tracery_demo.py")
df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))
df1
```

#Displays:
Jo	first	Jo was placed FIRST!
Sam	Second	Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R column name called name… )

So now I can start looking at converting sports reporting tropes like these:

into tracery story models I can call using my pandas/pytracery hacks:-)

PS here’s a quick demo of inlining Python code:

library(reticulate)
rvar=1

#Go into python shell - this persists
repl_python()
#Access R variables with r.
pyvar=r.rvar+1

#Return to R shell
exit

#Access Python variable with py$
py$pyvar

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generate trivia / facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

---
output:
  md_document:
    variant: gfm
---

The following recipe then takes results from the trivia queries and spiels the output:

if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Some More Rally Result Chart Sketches

Some more sketches, developing / updating one of the charts I first played with last year (the stage chart) and tinkering with something new.

First the stage chart – I’ve started pondering a couple of things with this chart to try to get the information density up a bit.

At a first attempt at updating the chart, I’ve started to look at adding additional marginal layers. In the example above:

  • vertical dashed lines separate out the different legs. As soon as I get the data to hand, I think it could make sense to use something like a solid line to show service, maybe a double solid line to show *parc fermé*; I’m not sure about additionally separating the days? (They’re perhaps implied by *parc fermé*? I need to check that…)
  • I added stage names *above* the chart  – this has the benefit of identifying stages that are repeated;
  • stage distances are added *below* the chart. I’ve also been wondering about adding the transit distances in *between* the stages;
  • driver labels – and positions – are shown to the left and the right.

As a second attempt, I started zooming in to just the stages associated with a particular leg. This encouraged me to start adding more detailed layers. These can be applied to the whole chart, but it may start to get a bit cluttered.

Here’s an example of a chart that shows three stages that make up a notional leg:

You’ll notice several additions to the chart:

  • the labels to the left identify the driver associated with each line. The number is currently the overall position of the driver at the end of the first stage in the leg, but I’m not sure if it should be the position at the end of the previous stage so it carries more information. The time is the gap to the overall leading driver at the end of the first stage;
  • the labels to the right show the overall positions and gap to overall leader at the end of the leg. The position label is in bold font if the driver position has improved over the leg (a switch lets you select whether this is a class rank improvement or an overall position improvement). Thinking about it, I could use italics for class improvement and bold for overall improvement to carry both pieces of information in the same label. The position is actually redundant (you can count…) so maybe it’d make more sense to give a position delta from the start of the leg (that is, the position at the end of the stage prior to the first stage shown in the current leg). The time delta is given in bold if it is better than at the start of the leg.
  • the red dots depict that the gap to the overall leader had *increased* for a driver by the end of the stage compared to the end of the previous stage. So a red dot means the car is further behind the leader at the end of the stage than they were at the end of the previous stage; this indicator could be rebased to show deltas between a target (“hero”) car and the other cars on the stage. The green dot shows that the time to the leader did not increase;
  • the grey labels at the top are a running count of the number of “wins in a row” a driver has had. There are options to choose other running counts (eg stage wins so far), and flags available for colouring things like “took lead”, “retained lead”, “lost lead”.

As well as the stage chart, I started wondering about an “ultimate stage report” for each stage, showing the delta between each driver and the best time achieved in a sector (that is, the time spent between two splits).

Here’s what I came up with at a first attempt. Time delta is on the bottom. The lower level grey bar indicates the time a driver lost relative to the “ultimate” stage. (The bar maxes out at the upper limit of the chart to indicate “more than” – I maybe need to indicate this visually eg with a dashed / broken line at the end of a maxed out bar.)

Within each driver area is a series of lollipop style charts. These indicate the gap between a driver and the best time achieved on the sector (first sector at the top of the group, last at the bottom). The driver label indicates the driver who achieved the best sector time. This chart could be rebased to show other gaps, but I need to think about that… The labels are coloured to indicate sector, and transparent to cope with some of the overlapping issues.

It’s also possible to plot this chart using a log scale:

This makes it easier to see the small gaps, as well as giving a far range on the delta. However, the log scale is harder to read for folk not familiar with them. It might be handy to put in a vertical dashed line for each power of 10 time (so a dashed line at 1s and 10s; the limit is 100s). It might also make sense to add a label to the right of the total delta bar to show what the actual delta time is.

So… tinkering… I was hoping to start to pull all the chart types I’ve been playing with together in a Leanpub book, but Leanpub is not free to play anymore unless you have generated over $10k of royalties (which I haven’t…). I’ve started looking at gitbook, but that’s new to me so I need to spend some time getting a feel for how to use it and to come up with a workflow /toolchain around it.

Tilted Axes and a Reminder About “Glanceability”

Via the Flowing Data blog (Roger Federer career in rankings and wins) comes this pretty chart from an SRF story, 20 Jahre, 20 Titel.

The rotated axes balance the chart and make the achievements of Connors, Federer and Lendl more “central” to the story. (SRF are a Swiss news organisation…)

I quickly copied and rotated the chart, and here’s how it looks with the axes arranged more traditionally:

The composition feels unbalanced, and at a glance feels like there is less of a story in the chart. (Glanceability… that takes me back…. I first heard the phrase from James Cridland – Learnin’ from Virgin – then, as I noted in Powerpoint Presentations That Support Glanceability, Lorcan Dempsey dug into it a little bit more: Glanceability.)

It also reminds me of “banking to 45 degrees”, “the idea that the average line slope in a line chart should be 45º. This has been dubbed banking to 45º and has turned into one of the bits of common wisdom in visualization as determining the ideal aspect ratio [although it more specifically relates to] the comparison between the slopes of two lines, and the slope is the average between those two lines. So if the goal is to be able to compare the rates of change between lines, the 45º average slope makes sense as a rule” Robert Kosara blog post on Aspect Ratio and Banking to 45 Degrees.

The original statement comes from Cleveland, W.S., McGill, M.E. and McGill, R., (1988) The shape parameter of a two-variable graph, Journal of the American Statistical Association, 83(402), pp.289-300 [JSTOR],  and I think I should probably read it again…

By the by, SRF are one of the best news orgs I know for sharing their working – the recipes for the above story can be found on their Github repo at srfdata/2018-01-roger-federer – so I could probably have recreated the unrotated chart directly from that source, if I’d had the time to play.

PS see also: Five ways to read a scatterplot on the Datawrapper blog.

Reproducible Notebooks Help You Track Down Errors…

A couple of days ago on the Spreadsheet Journalism blog, I came across a post on UK Immigration Raids.

The post described a spreadsheet process for reshaping a couple of wide format sheets in a spreadsheet into a single long format dataset to make the data easier to work with.

One of my New Year resolutions was to try to look out for opportunities to rework and compare spreadsheet vs. notebook data treatments, so here’s a Python pandas reworking of the example linked to above in a Jupyter notebook: UK Immigration Raids.ipynb.
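The core of the pandas reworking is a wide-to-long reshape using melt(); here’s a minimal sketch with made-up column names (the notebook itself works with the actual sheet layout):

import pandas as pd

#Made-up wide format data: one row per area, one column per month
wide = pd.DataFrame({'Area': ['Bristol', 'Cardiff'],
                     'Jan': [3, 1],
                     'Feb': [5, 2]})

#Reshape to long format: one row per (area, month) pair
long_df = wide.melt(id_vars='Area', var_name='Month', value_name='Raids')
print(long_df)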

You can run the notebook “live” using Binder/Binderhub:

Look in the notebooks folder for the UK Immigration Raids.ipynb notebook.

A few notes on the process:

  • there was no link to the original source data in the original post, although there was a link to a local copy of it;
  • the original post had a few useful cribs that I could use to cross check my working with, such as the number of incidents in Bristol;
  • the mention of dirty data made me put in a step to find mismatched data;
  • my original notebook working contained an error – which I left in the notebook to show how it might propagate through and how we might then try to figure out what the error was having detected it.

As an example, it could probably do with another iteration to demonstrate a more robust process with additional checks that transformations have been correctly applied to data along the way.

Anyway, that’s another of my New Year resolutions half implemented: *share your working, share your mistakes*.

Fragment – Carillion-ish

A quick sketch of some companies that are linked by common directors based on a list of directors seeded from Carillion PLC.

The data was obtained from the Companies House API using the Python chwrapper package and some old code of my own that I’ll share once I get a chance to strip the extraneous cruft out of the notebook it’s in.

The essence of the approach / recipe is an old one that I used to use with OpenCorporates data, as described here: Mapping corporate networks with Opencorporates.

Note the sketch doesn’t make claims about anything much. The edges just show that companies are linked by the same directors.
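For anyone wanting to try something similar before I share the notebook, here’s a rough sketch of the idea using plain requests against the Companies House REST API rather than chwrapper; it uses the documented company officers and officer appointments endpoints, you need to register for your own API key, and the seed company number is a placeholder:

import requests

API_KEY = 'YOUR_COMPANIES_HOUSE_API_KEY'  #placeholder - register for a key with Companies House
BASE = 'https://api.company-information.service.gov.uk'

def get_officers(company_number):
    '''Return the list of officers for a company.'''
    url = '{}/company/{}/officers'.format(BASE, company_number)
    return requests.get(url, auth=(API_KEY, '')).json().get('items', [])

def get_appointments(appointments_link):
    '''Return the companies an officer is appointed to, given the officer appointments link.'''
    return requests.get(BASE + appointments_link, auth=(API_KEY, '')).json().get('items', [])

#Build a simple edge list linking officers of a seed company to the other companies they direct
edges = []
for officer in get_officers('SEED_COMPANY_NUMBER'):  #placeholder company number
    link = officer['links']['officer']['appointments']
    for appointment in get_appointments(link):
        appointed_to = appointment.get('appointed_to', {})
        edges.append((officer['name'], appointed_to.get('company_name')))

print(edges[:10])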

A better approach may be to generate a network based on control / ownership registration data but I didn’t have any code to hand to do that (it’s on my to do list for my next company data playtime!).

One way of starting to use this sort of structure is to match companies that appear in the network with payments data to see the actual extent of public body contracting with Carillion group companies. For other articles on Carillion contracts, see eg Here’s the data that reveals the scale of Carillion’s big-money government deals.

Asking Questions of CSV Data, Using SQL In the Browser, With Franchise

Notebook style interfaces, in which content blocks are organised in separate cells that can be moved up or down a document, are starting to look as if their time may have come. Over the last week, I’ve come across two examples.

The first, an early preview of the OU’s under development OpenCreate authoring environment, uses an outliner style editor to support the creation of a weekly study planner and topics within each week, and a notebook style interface for editing the topical content pages. I would show screenshots but I’ve pre-emptively been told not to post videos or screenshots…

The second is an open project – a live demo and the code repository are available – and it comes in the form of Franchise, a simple app service that delivers a rich, browser based SQL query engine for querying simple data files (read the strapline and the name makes punful sense!).

Launching the service provides you with an interface that lets you connect to a database, or load in a data file, either by selecting it from a file browser or just dragging it onto the launch page.

Uploading a CSV document creates a SQLite3 database containing the data in a single table.

Selecting the data table generates a SQL query that reveals the column names. Running the query generates a preview of the data table and also makes the output queryable as a “tagged” table.

The resulting table can then be queried using the tag name:

You can also use the download button to download the results table in a variety of formats:

If the resulting table has numerical columns, you can display the table using a range of charts, such as a bar chart.

For now, it looks as if the charts are quite simplistic – for example, we can’t stack or group the bars:

Other charts are available in a context sensitive way, depending on the data. For example, if there are two numerical columns we can plot a scatter chart. Line charts are also available.

If the dataset contains latitude and longitude data, we can use the data to plot points on a map.

For those new to SQL, there’s a handy cribsheet at the bottom of the page:


(If SQL is new to you, you might also find things like this handy: Asking Questions of Data – Garment Factories Data Expedition.)

We can also add textual commentary to the notebook in the form of markdown cells.

The markdown is styled responsively – but I couldn’t see how to go to a “preview” mode where the styling is applied but the markdown modifiers are hidden?

Cells are archived rather than deleted:

Although they can be deleted, as well as restored, from the archive.

Cells can also be reordered – click on the right hand sidebar of a cell to drag it into a slot above or below another cell, or alongside one.

Cells can also be duplicated, in which case they appear alongside the cloned cell.

The side by side view allows you to look at the effect of a changing a query compared to its original form.

I was quite excited by the idea that you could download the notebook:

and export it as an HTML file:

I had expected this to generate a standalone HTML file, but that appears not to be the case, at least for now. Instead, the cell data is packed into a JSON object:

and then passed to either a local Franchise server, or the web based one.

As a quick tool for querying data, Franchise looks to be pretty handy, although you soon realise how lacking in control it is over chart styles and axis labelling, for example (at least in its early form). If you could export standalone HTML, it would also make it more useful as an asset generating tool, but I guess it’s still early days.

According to a release thread – Franchise – An Open-Source SQL Notebook (franchise.cloud) – it looks as if a standalone electron app version is on the way. (In the meantime, I followed the build instructions from the repo README to produce a quick docker container: psychemedia/franchise.)

The ability to get started querying data using SQL without the need to install anything offers a way of having a quick chat with a file based dataset. (I couldn’t get it to work with Excel or JSON files, and didn’t try a SQL file or connecting to a separate database server.)

At the moment, I don’t think you can connect to a Google spreadsheet, so you have to download one, although a SQL like API is available for Google Sheets (eg I used it for this simple SQL query interface to Google spreadsheets way back when).
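For reference, that SQL-like API is the Google Visualization query endpoint; here’s a minimal sketch (the spreadsheet key is a placeholder, the sheet needs to be publicly readable, and columns are addressed as A, B, C and so on):

import requests

#Placeholder spreadsheet key - use the key of a public / published sheet
key = 'YOUR_SPREADSHEET_KEY'

#The gviz endpoint accepts a SQL-like query and can return the result as CSV
url = 'https://docs.google.com/spreadsheets/d/{}/gviz/tq'.format(key)
params = {'tqx': 'out:csv', 'tq': 'SELECT A, B WHERE C > 100'}

print(requests.get(url, params=params).text[:500])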

From a getting started with data conversations perspective, though, this offers quite a nice on ramp to a SQL query environment without the need to worry about the DBA (database admin) chores of setting up a database, defining tables, importing the data and so on.

I also wonder if it might act as a gateway to more aggressive and powerful query engines that are capable of querying over large and multiple datasets contained in local files? Things like Apache Drill, for example?

See also:

 

Responsibilities and Required Skills in Recent “Data Journalist” Job ads…

A quick peek at a couple of recent job ads that describe what you might be expected to do…

From the BBC – Senior Data Journalist. In part, the role responsibilities include:

  • generating ideas for data-driven stories and for how they might be developed and visualized
  • exploring those ideas using statistical tools – and presenting them to wider stakeholders from a non-statistical background
  • reporting on and analysing data in a way that contributes to telling compelling stories on an array of news platforms
  • collaborating with reporters, editors, designers and developers to bring those stories to publication
  • exploring and summarizing data using relational database software
  • visualizing spatial data and finding patterns in it using GIS software
  • using statistical tools to identify significant data trends
  • representing the data team and the Visual Journalism team at editorial meetings
  • leading editorially on data projects as required and overseeing the work of other data team colleagues
  • using skills and experience to advise on best approaches to data-led storytelling and the development and publication of data-led projects

Required technical skills include “a good understanding of statistics and statistical analysis; a strong grasp of how to clean, parse and query data; a good knowledge of some of the following: spreadsheet software, SQL, Python and R; demonstrable experience of visualising data, using either tools or scripts; experience of GIS or other mapping software; experience of gathering information from Freedom of Information requests”.

Desirable skills include “knowledge of basic scripting and HTML, as it might pertain to data visualization or data analysis and knowledge of several of the following; Carto, D3, QGIS, Tableau”.

And over at Trinity Mirror, there’s an open position for a data journalist, where role responsibilities include:

  • Having ideas for data-based stories and analysis, on a range of topics, which would be suitable for visualisation in regional newspapers across the group.
  • Working with a designer and the head of data journalism to come up with original and engaging ways of visualising this content.
  • Writing copy, as required, to accompany these visualisations.
  • Working on the data unit’s wider output, for print and web, as required by the head of data journalism.

As far as technical skills go, these “should include a broad knowledge of UK data sources, an ability to quickly and effectively interrogate data using spreadsheets, and an understanding of the pros and cons of different methods of visualising data”.  In addition, “[a]n ability to use scraping software, and some proficiency in using R, would be an advantage”.