Automatically generating SQL equivalents to dataframe method chains

Back to the “genAI is not necessarily the best alternative” thing again, this time in the context of generating SQL code that performs a particular sort of query. Natural language may be one starting point for this, but code might also provide the entry point: for example, if you have prototyped something you know works using in-memory pandas or R dataframes, and you now want to move it over to a set of operations performed on corresponding relational tables inside a SQL database.

And if you have code that works, why would you want a statistical genAI process to have a guess at some code that might work for you, rather than using a deterministic process and mechanically translating something that works into something that is logically equivalent (even if it’s not the most idiomatic or optimal equivalent)?

In the R world, dbplyr is a handy utility that will convert your dplyr code (transformations applied over a dataframe using dplyr verbs) to equivalent, though possibly not optimal or idiomatic, SQL code. The dbplyr package works with several different database backends, including MariaDB, PostgreSQL and SQLite, so presumably any dialect differences are accommodated if you provide an appropriate connection or database description?

From the docs:

# set up an in-memory SQLite database and a lazy table reference
# (as per the dbplyr docs)
library(dplyr, warn.conflicts = FALSE)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars2 <- copy_to(con, mtcars)

# lazily generates query
summary <- mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))

# see query
summary %>% show_query()
#> <SQL>
#> SELECT `cyl`, AVG(`mpg`) AS `mpg`
#> FROM `mtcars`
#> GROUP BY `cyl`
#> ORDER BY `mpg` DESC

If required, you can also explicitly inject SQL into a dplyr expression (example; I don’t think dbplyr will also try to execute the injected SQL on the R side, e.g. by dumping the query into SQLite…).

On the question of producing appropriate SQL dialects, I note the SQLGlot Python package, a “no-dependency SQL parser, transpiler, optimizer, and engine” that “can be used to format SQL or translate between 21 different dialects”. As far as the optimiser goes, I assume this means it could accept suboptimal SQL code generated using dbplyr and then return something more efficient?
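By way of a minimal sketch of the sort of thing I mean (the SQL fragment echoes the dbplyr output above; the dialect names and toy schema are purely illustrative):

import sqlglot
from sqlglot.optimizer import optimize

sql = "SELECT `cyl`, AVG(`mpg`) AS `mpg` FROM `mtcars` GROUP BY `cyl` ORDER BY `mpg` DESC"

# translate from one dialect to another (here, MySQL-style quoting to PostgreSQL)
print(sqlglot.transpile(sql, read="mysql", write="postgres")[0])

# ask the optimizer to rewrite the query; providing a schema helps it qualify columns
print(optimize(sqlglot.parse_one(sql, read="mysql"),
               schema={"mtcars": {"cyl": "INT", "mpg": "DOUBLE"}}).sql())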

Whilst the Python pandas package is happy to connect to a database backend, and can read and write tables from and to a connected database, it doesn’t try to generate SQL queries equivalent to a chained set of pandas dataframe-manipulating methods.
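So you can push a dataframe into a database and pull the results of a hand-written query back out easily enough (a minimal sketch using an in-memory SQLite database; the table and data are made up for illustration), but the SQL itself is yours to write:

import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")

# push a dataframe into the database as a table...
df = pd.DataFrame({"cyl": [4, 6, 8, 4], "mpg": [30.1, 21.0, 15.2, 33.9]})
df.to_sql("mtcars", con, index=False)

# ...and read the result of a hand-written SQL query back as a dataframe
summary = pd.read_sql("SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl", con)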

Although pandas was the original widely used dataframe package on the Python block, several alternatives have appeared over the years, with improved efficiency and with a syntax that resembles, though may not exactly match, the pandas syntax. If you know the various operations that SQL supports, you’ll have a fair idea of what verbs are available for manipulating dataframes in any dataframe package; so if you’ve worked with dplyr, pandas, dask (built with parallelism in mind), or polars (a Python package powered by Rust), there’s a good chance you’ll be able to make sense of any of the others.
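For example, the same group-and-aggregate step looks much the same in pandas and in recent versions of polars (a quick sketch; the toy data just mirrors the mtcars example above):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"cyl": [4, 6, 8, 4], "mpg": [30.1, 21.0, 15.2, 33.9]})

# pandas: groupby + aggregate
pdf.groupby("cyl", as_index=False)["mpg"].mean()

# polars: the same verbs, slightly different spelling (group_by / agg)
pl.from_pandas(pdf).group_by("cyl").agg(pl.col("mpg").mean())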

At least one of the pandas alternative dataframes-in-Python packages does seem to have given the “conversion-to-SQL” thing some serious consideration: ibis (repo). From the docs:

import ibis

# connect to an in-memory DuckDB backend
con = ibis.connect("duckdb://")

# register a parquet file as a table expression
t = con.read_parquet("penguins.parquet")

# build the query lazily, then ask ibis for the SQL it would run
g = t.group_by(["species", "island"]).agg(count=t.count()).order_by("count")
ibis.to_sql(g)
which gives:

SELECT
  `t1`.`species`,
  `t1`.`island`,
  `t1`.`count`
FROM (
  SELECT
    `t0`.`species`,
    `t0`.`island`,
    COUNT(*) AS `count`
  FROM `ibis_read_parquet_t2ab23nqsnfydeuy5zpg4yg2im` AS `t0`
  GROUP BY
    1,
    2
) AS `t1`
ORDER BY
  `t1`.`count` ASC NULLS LAST

Actually, that looks a bit cryptic and shorthand (the t0 and t1 subquery aliases, and the positional GROUP BY 1, 2 column references?) and could perhaps have been made to look a bit more friendly?

There’s also the rather nifty option of combining method chains with SQL code:

sql = """
SELECT
  species,
  island,
  COUNT(*) AS count
FROM penguins
GROUP BY species, island
""".strip()

If you sometimes think in pandas and sometimes in SQL, then this sort of approach might be really handy… I note that the DuckDB docs give ibis a nod, possibly because it uses DuckDB as its default backend? But I don’t have much sense of the trajectory it’s on in terms of development (first, second, and/or third party), adoption, and support.

The ibis machinery is built using Substrait, a logical framework for supporting interoperability across different structured data processing platforms. I wonder if that sort of conceptualisation might be a useful framing in an educational context? Here’s their primer document. The repo shows some level of ongoing activity and engagement, but I’m not sure where the buy-in comes from or how committed it is. In terms of process, it seems the project is looking to get its foundations right, as Apache Arrow did; and that really seems to have taken off as a low-level columnar in-memory and inter-process communication/serialisation format, with excellent support for things like parquet file reading and writing.

PS hmm… so I wonder… might we expect something like a substrait sql2pandas utility? A pandoc for sql2sql, sql2dfFramework, dfFramework2sql and dfFramework2dfFramework conversions? Maybe I need to keep tabs on this PyPI package: substrait (repo). I note other packages in the substrait-io organisation for other languages…

