sql notebooks with quarto

SQL Notebooks with Quarto

19 June 2024

Notebooks are a cornerstone of exploratory data analysis in data science (and sometimes also in production for some godforgiven reason?). Notebooks get a fair share of (justified) criticism, but I believe they are useful if used right, particularly in the education sphere where you can get immediate feedback on the code you wrote. The best-known notebook platforms are Jupyter for Python (although it can actually R and Julia too) and R Markdown. However, since 2022 a natural successor to R Markdown was released called Quarto that natively supports Python, R, and Julia. I’ve become quite an advocate for Quarto since it was released since it works well for both Python and R out-of-the-box without any extensions necessary. It’s also Git compatible, which is one of the other major drawbacks from Jupyter in my opinion. One of the main languages that currently does not have a widely used notebook platform is SQL. Although some languages and companies have made attempts. For example Snowflake and SQL Server have a (pseudo)notebook support in their IDEs and Meta uses an internal tool that supports analyses in SQL notebooks. That’s why I want to show you today how you can turn Quarto into a notebook for SQL analyses that can be rendered into webpages, PDFs, markdown documents, or online books and documentation.

Background

I use Quarto documents for all of my blogposts to render Markdown files that Hugo builds into webpages

As mentioned, Quarto supports both R and Python, although it is developed by Posit (formerly RStudio) so it’s more popularl with R users since Jupyter remains the gold standard for Python. However, Posit aims to branch out into other languages and has massively improved support in their IDE for Python and Julia. I’ve used Quarto for Python a fair bit and I think it ticks some boxes that Jupyter does not, while also having some quirks that Jupyter avoids. So while I was learning about how Quarto supports both the R and Python kernels, I learned that the list of officially supported engines (R, Python, Julia, and Javascript) and unofficially supported engines are different.

Yihui Xie, the main developer of knitr deserves to get more praise than he does for his contributions

Quarto inherits the supported language engines from it’s predecessor R Markdown, which relies on the knitr package to handle different languages. knitr supports a host of engines including Octave, Fortran, C, Stan, and SQL. So since the knitr engine is also used to do the rendering of R code in Quarto, these other languages kinda come for free even though the Quarto development team currently has not built in support for these other languages directly (although people have asked on GitHub about this). But SQL code chunks will render nicely in Quarto documents if you set it up correctly. That’s what I will show here now: how to use Quarto for creating SQL notebooks.

Setup

I found that the version on Homebrew might not be the latest version from the Quarto website, so you might want to install using the GUI

So how do you get this up. So obviously, you first need to install Quarto. If you have a Mac (with Homebrew installed), you can install it directly from there with the following command:

brew install quarto

You’ll also need to install the R package DBI which will handle the database interfacing and the driver for whatever database you’re using. In my case this is duckdb (more about duckdb below). The driver for duckdb is easily accessed through the duckdb package. Drivers for other database management systems are of course available in their respective packages. Creating these SQL notebooks with just Python directly is not possibly as of now as far as I know since Quarto uses the Jupyter engine to handle Python code chunks and as mentioned we need the knitr engine for SQL support.

RDBMS: relational database management system

We also need a database to connect to for demonstration purposes. I’ve become a fervent advocate for duckdb for smaller projects (although support for collaborative projects seem to be coming). Duckdb is an open-source RDBMS that is incredibly fast and memory efficient. I’ve been using it for a while, but I hope it’ll become more widely popular now that the v1.0.0 version has been released in June 2024. It is lightweight and fairly easy to install and works on all major systems. Again with Homebrew you can install it like this:

brew install duckdb

To use as examples in this post we’ll need some data. In this case I used a dataset I found on GitHub that contains data on fictional superheroes from a variety of publishers (see the metadata here). I imported this data to duckdb using its import functionality. For the second example I’ll the example dataset included in the dbplyr R package that contains data from the nycflights13 package on all flights that departed from New York City in 2013 that I also imported to duckdb.

Creating the SQL notebook

Let me first show how to set up this SQL notebook using Quarto and then I’ll show some examples of what a SQL notebook would look like when rendered. First, in order to create a SQL code chunk we need to create a connection the way we would normally in R through the dbConnect() function from the DBI package. This connection object we will then provide as an argument to the code chunk. It’s possible to use multiple connections at the same time, but (as far as I know), it’s only possible to run queries on a single connection at the time without connecting to the table from R directly.

So here’s how to connect to the different databases using the relevent driver and the connection URL to the database, which could be either local or hosted elsewhere (e.g. AWS RDS, PlanetScale, Supabase, or Motherduck). In this case it’s two local files for simplicity. I’ll also connect to to them as read-only to avoid accidentally overwriting anything.

con_superheroes <- DBI::dbConnect(
  drv = duckdb::duckdb(),
  dbdir = "./data/superheroes.db",
  read_only = TRUE
)

con_flights <- DBI::dbConnect(
  drv = duckdb::duckdb(),
  dbdir = "./data/flights.db",
  read_only = TRUE
)

You can then provide this connection object directly to the connection argument in the SQL code chunk. You can also provide other arguments (like label and echo and some others) to configure the functionality and output of each chunk, but note that not all options work in SQL chunks. You’ll have experiment a bit to see which ones work and which ones don’t. A typical SQL code chunk in Quarto might look something like this:

Note that the semicolon is not strictly necessary for functionality, but it is good practice to include it

```{sql}
#| label: get-carriers
#| connection: con_flights

SELECT name, carrier FROM airlines LIMIT 10;
```

Which will provide the following output:

SELECT name, carrier FROM airlines LIMIT 10;

name	carrier
Endeavor Air Inc.	9E
American Airlines Inc.	AA
Alaska Airlines Inc.	AS
JetBlue Airways	B6
Delta Air Lines Inc.	DL
ExpressJet Airlines Inc.	EV
Frontier Airlines Inc.	F9
AirTran Airways Corporation	FL
Hawaiian Airlines Inc.	HA
Envoy Air	MQ

Displaying records 1 - 10

Unfortunately dot commands are not supported at the moment

Obviously, you can run any SQL commands since the code chunk is performing the command as if it were done in the SQL prompt directly. The output is generated as set up. In this example the code is presented in a Markdown table which is then styled in the same format as the rest of the tables on this website using the CSS style everything else uses. If I were to render to HTML, it would come out as an HTML table, the same for rendering to PDF and outputting LaTeX tables. Also, if you write multiple queries inside a single chunk (separated by a semicolon), only the last one will be rendered. For example, if you wanted to show all tables in the superheroes database the code chunk would look like this:

Let’s also hide the code chunk itself and only show the output with the echo option

```{sql}
#| label: get-tables-in-superheroes-db
#| echo: false
#| connection: con_superheroes

SHOW TABLES;
```

And the output looks like this:

name
alignment
attribute
colour
gender
hero_attribute
hero_power
publisher
race
superhero
superpower

Displaying records 1 - 10

Exploratory data analysis example

From here, you’re only limited by your imaginations. So go wild to create queries and output as simple or complex as your analysis demands. I’ll include some examples with the output below to show the duckdb functionality and how different queries might look when rendered, starting with a query in the flights database.

SELECT
  airlines.name,
  ROUND(MEAN(flights.dep_delay), 2) AS mean_delay,
  ROUND(STDDEV(flights.dep_delay) / SQRT(COUNT(1)), 2) AS sem_delay
FROM flights
INNER JOIN airlines ON flights.carrier = airlines.carrier
GROUP BY airlines.name
ORDER BY mean_delay DESC;

name	mean_delay	sem_delay
Frontier Airlines Inc.	20.22	2.23
ExpressJet Airlines Inc.	19.96	0.20
Mesa Airlines Inc.	19.00	2.01
AirTran Airways Corporation	18.73	0.92
Southwest Airlines Co.	17.71	0.39
Endeavor Air Inc.	16.73	0.34
JetBlue Airways	13.02	0.16
Virgin America	12.87	0.62
SkyWest Airlines Inc.	12.59	7.61
United Air Lines Inc.	12.11	0.15

Displaying records 1 - 10

And then some CTEs using the superheroes database. For example looking at the gender of the superheroes and the proportion for each publisher.

WITH pub_count AS (
  SELECT
    publisher_id,
    COUNT(publisher_id) AS total_sh
  FROM superhero
  GROUP BY publisher_id
),

gender_count AS (
  SELECT 
    publisher_id,
    gender_id,
    COUNT(gender_id) AS number
  FROM superhero
  GROUP BY publisher_id, gender_id
)

SELECT
  pub.publisher_name AS publisher,
  gen.gender AS gender,
  gc.number,
  ROUND(100 * gc.number / total_sh, 2) AS proportion
FROM gender_count AS gc
INNER JOIN pub_count AS pc ON gc.publisher_id = pc.publisher_id
INNER JOIN publisher AS pub ON gc.publisher_id = pub.id
INNER JOIN gender AS gen ON gc.gender_id = gen.id
WHERE publisher IN ('Marvel Comics', 'DC Comics')
ORDER BY publisher, gender;

publisher	gender	number	proportion
DC Comics	Female	63	28.13
DC Comics	Male	160	71.43
DC Comics	N/A	1	0.45
Marvel Comics	Female	111	28.68
Marvel Comics	Male	252	65.12
Marvel Comics	N/A	24	6.20

6 records

Which made me curious which superheroes don’t have a gender or are non-binary.

SELECT
  sh.superhero_name AS 'superhero name',
  pub.publisher_name AS 'publisher'
FROM
  superhero AS sh
INNER JOIN gender AS gen ON sh.gender_id = gen.id
INNER JOIN publisher AS pub ON sh.publisher_id = pub.id
WHERE gen.gender = 'N/A'
ORDER BY pub.publisher_name;

superhero name	publisher
Darkside
Godzilla
Parademon	DC Comics
Man of Miracles	Image Comics
Bird-Brain	Marvel Comics
Captain Universe	Marvel Comics
Cecilia Reyes	Marvel Comics
Clea	Marvel Comics
Cypher	Marvel Comics
Ego	Marvel Comics

Displaying records 1 - 10

Concluding remarks

I believe the ability to create SQL notebooks this way offers some serious advantages by allowing analysts to annotate their queries using Markdown and comment on the results and describe their methodology and interpretations. This would introduce the same functionality to SQL that already exists in Jupyter Notebooks and R Markdown/Quarto notebooks that have aided their popularity. I really do hope RStudio has plans to further develop this SQL integration into Quarto, because if they do I believe they may have a good shot at creating the ultimate dialect-agnostic SQL notebook available open source. If we were allowed to dream I’d love for them to support linting with for example sqlfluff for each chunk as well same as it does with lintr for R code. In addition, it would make Quarto perhaps the ultimate notebook software since it would then fully support R, Python, SQL, and Julia which would make it incredibly powerful and versatile, offering the ability to switch between languages within a single notebook. I hope to have inspired you have instilled some enthusiasm for Quarto and their concept of SQL notebooks and I hope you give this implementation of a SQL notebook a try!