Generating and Visualising JSON Schemas

If you’re presented with a 2D tabular dataset (eg a spreadsheet or CSV file), it;s quite easy to get a sense of the data by checking the column names and looking at a few of the values in each column. You can also use a variety of tools that will “profile” or summarise the data in each column for you. For example, a column of numeric values might be summarised with by the mean and standard deviation of the values, or a histogram, etc. Geographic co-ordinates might be “summarised” by plotting them onto a map. And so on.

If you’re presented with the data in a JSON file, particularly a large data file, getting a sense of the structure of the dataset can be much harder, particularly if the JSON data is “irregular” (that is, if the records differ in some way; for example, some records might have more fields than other records).

A good way of summarising the structure of a JSON file is via its schema. This is an abstracted representation of the tree structure of the JSON object that extracts the unique keys in the object and the data type of any associated literal values.

One JSON dataset I have played with over the years is rally timing and results data from the WRC live timing service. The structure of some of the returned data objects can be quite complicated, so how can we get a handle on them?

One way is to automatically extract the schema and then visualise it, so here’s a recipe for doing that.

For example:

#%pip install genson
from genson import SchemaBuilder

# Get some data
import requests

season_url = "https://api.wrc.com/contel-page/83388/calendar/active-season/"
jdata = requests.get(season_url).json()

# Create a schema object
builder = SchemaBuilder()
builder.add_object(jdata )

# Generate the schema
schema = builder.to_schema()

# Write the schema to a file
with open("season_schema.json", "w") as f:
    json.dump(schema, f)

The schema itself can be quite long and hard to read…

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'seasonYear': {'type': 'integer'},
  'seasonImages': {'type': 'object',
   'properties': {'format16x9': {'type': 'string'},
    'format16x9special': {'type': 'string'},
    'timekeeperLogo': {'type': 'string'},
    'timekeeperLogoDark': {'type': 'string'}},
   'required': ['format16x9',
    'format16x9special',
    'timekeeperLogo',
    'timekeeperLogoDark']},
  'rallyEvents': {'type': 'object',
   'properties': {'total': {'type': 'integer'},
    'items': {'type': 'array',
     'items': {'type': 'object',
      'properties': {'id': {'type': 'integer'},
       'name': {'type': 'string'},

...

To get a sense of the schema, we can visualise it interactively using the json_schema_for_humans visualiser.

from json_schema_for_humans.generate import generate_from_filename

generate_from_filename("season_schema.json",
                      "season_schema.html")

We can open the generated HTML in a web browser (note, this doesn’t seem to render correctly in JupyterLab via IPython.display.HTML; however, we can open the HTML file from the JupyterLab file browser, and as long as we trust it, it will render correctly:

With the HTML trusted, we can then explore the schema;

Wondering: are there other JSON-schema visualisers out there, particularly that work either as a JupyterLab extension, IPython magic or via an appropriate __repr__ display method?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: