Why I built the python-bigquery-validator package

Conall Daly
3 min read · Jan 8, 2022

When I started this project, my goal was to quickly and easily validate Jinja-templated SQL queries that I was writing for Apache Airflow.

For those who are not aware, a Jinja-templated query is a SQL query containing placeholders in double curly brackets, which Apache Airflow renders at execution time when running a BigQuery Operator.

# Templated Query
SELECT count(*) AS nrows
FROM `{{ params.project }}.samples.github_timeline`
# Rendered Query
SELECT count(*) AS nrows
FROM `bigquery-public-data.samples.github_timeline`

My work process for developing BigQuery queries in Airflow is as follows:

  • create regular SQL query in the BigQuery console
  • replace hardcoded variables with Jinja params in IDE
  • get rendered template of query from Airflow UI
  • validate rendered template query in BigQuery console

This process has frustrated me on more than one occasion. The lack of automation and number of steps required to verify work can easily turn a small piece of work into a task that eats up a considerable amount of time.

The starting point for this project was to build a package that allows the validation to be completed in a single step.

This became the first of two main modules within the package, the BigQueryValidator. This module focuses on testing the syntax and cost of queries, and it can be run without incurring any cost from BigQuery.

The second module, BigQueryResult, returns the result of a query along with additional metadata that lets you easily check the total number of rows and columns returned by a query, the column names, the null values in each column, and the value counts of each column.

The BigQueryValidator is used to check that the syntax of a query is correct and will run successfully when executed by the BigQuery engine.

This includes support for Jinja-templated queries used by Airflow. A JSON file containing the expected values for the templated parameters must be passed to the module so that templated queries are rendered correctly.
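To illustrate the idea, the sketch below loads parameter values from JSON and substitutes them into a templated query. The file schema shown is an assumption (check the package's README for the exact format it expects), and the regex substitution is a minimal stand-in for the full Jinja rendering that Airflow performs:

```python
import json
import re

# Hypothetical params file contents -- the exact schema the package
# expects may differ; this just illustrates the idea.
params_json = '{"params": {"project": "bigquery-public-data"}}'
params = json.loads(params_json)["params"]

templated = "SELECT count(*) AS nrows FROM `{{ params.project }}.samples.github_timeline`"

# Replace each {{ params.<name> }} placeholder with its value,
# mimicking what Airflow's Jinja rendering produces.
rendered = re.sub(
    r"\{\{\s*params\.(\w+)\s*\}\}",
    lambda m: params[m.group(1)],
    templated,
)
print(rendered)
```

The rendered string matches what the Airflow UI would show for the templated query at the top of this article.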

Source code for BigQueryValidator

These queries do not incur a cost because the dry run flag is set to true: a dry run returns the estimated cost of a query instead of its result.

This is similar to when you update a query in the BigQuery console and the validity of your query and estimated cost are displayed in the UI.
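A dry run reports how many bytes the query would scan (`total_bytes_processed` on the query job in the google-cloud-bigquery client), and the dollar estimate follows from the on-demand rate. The sketch below assumes a rate of $5 per TiB, roughly BigQuery's on-demand price when this article was written; check current pricing before relying on the figure:

```python
# Estimate the cost of a query from a dry run's byte count.
# The $5-per-TiB rate is an assumption based on BigQuery's on-demand
# pricing at the time of writing -- check current pricing.
PRICE_PER_TIB_USD = 5.0
TIB = 2 ** 40

def estimated_cost_usd(total_bytes_processed: int) -> float:
    """Convert a dry run's total_bytes_processed into a dollar estimate."""
    return total_bytes_processed / TIB * PRICE_PER_TIB_USD

# e.g. a dry run reporting ~3.1 GB scanned:
print(round(estimated_cost_usd(3_100_000_000), 4))
```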

The python-bigquery-validator package includes support for running the module using the Python CLI.

Example of how to run python-bigquery-validator using the CLI

By running the package from the command line and selecting the auto_validate_query_from_file function, you can replicate the behaviour of the BigQuery console while still using Jinja-templated variables in your query.

The expected behaviour is that each update overrides the existing text in the console. I discovered while writing this article that it no longer works…

The package can also be run as part of unit tests and triggered by CI/CD pipelines so that queries are automatically validated on each commit.

Example of how to run the python-bigquery-validator using Python Unit Tests
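The pattern looks something like the sketch below. The `BigQueryValidator` class name and `query_is_valid` method are assumptions about the package's API (check its README for the exact names), and the validator is stubbed out here so the sketch runs without BigQuery credentials:

```python
import unittest

class StubBigQueryValidator:
    """Stand-in for the package's validator (class and method names are
    assumptions) so this sketch runs without BigQuery credentials."""
    def query_is_valid(self, query: str) -> bool:
        # The real validator would dry-run the rendered query in BigQuery.
        return query.strip().upper().startswith("SELECT")

class TestQueryValidity(unittest.TestCase):
    def setUp(self):
        # In a real suite you would instantiate the package's validator here.
        self.validator = StubBigQueryValidator()

    def test_query_is_valid(self):
        query = "SELECT count(*) AS nrows FROM `{{ params.project }}.samples.github_timeline`"
        self.assertTrue(self.validator.query_is_valid(query))

suite = unittest.TestLoader().loadTestsFromTestCase(TestQueryValidity)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Wiring a test like this into a CI/CD pipeline means a broken query fails the build instead of failing in production.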

The BigQueryResult class is used to check that the output of a query matches what the user expects.

The result of a query can be returned exactly as BigQuery returns it or as a Pandas DataFrame. Additional data about the result set (unique columns, null values per column, value counts per column, etc.) is also available so that you can check that your data contains the expected values.

Some of the most common checks I use are that:

  • there is a unique user id associated with each row of data
  • the query cost does not exceed a certain limit
  • all or a certain amount of column values are not null
  • the dataset contains the expected columns
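Most of those checks can be written against a plain list-of-row-dicts result, which is how BigQuery's Python client effectively returns rows. The sketch below is illustrative rather than the package's own API; the sample data is made up:

```python
# A query result as a list of row dicts (made-up sample data).
rows = [
    {"user_id": "u1", "country": "IE"},
    {"user_id": "u2", "country": None},
    {"user_id": "u3", "country": "FR"},
]

columns = list(rows[0].keys())

# The dataset contains the expected columns.
assert columns == ["user_id", "country"]

# There is a unique user id associated with each row.
assert len({row["user_id"] for row in rows}) == len(rows)

# No more than a certain share of values in a column are null.
null_ratio = sum(row["country"] is None for row in rows) / len(rows)
assert null_ratio <= 0.5
print("all checks passed")
```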

An additional benefit of python-bigquery-validator is that it allows us to easily and explicitly define rules that we expect data to match.

For example, when a new engineer asks what the maximum cost of a query should be, the answer is usually “it depends”.

A small change like using the wrong date filter on a partitioned table can easily be overlooked and greatly increase the cost of a query. The price of a query can be checked to ensure this does not go unnoticed.

Creating a test case to ensure query costs don’t exceed a limit is an easy way of defining the expected cost of a query for team members that are not familiar with a query or the datasets being used.
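Such a test case can be as simple as the sketch below. The byte figure is a stand-in (in practice it would come from a dry run's `total_bytes_processed`), and the $5-per-TiB rate is an assumption about BigQuery's on-demand pricing:

```python
# Fail the build if a query's estimated cost exceeds an agreed ceiling.
PRICE_PER_TIB_USD = 5.0          # assumed on-demand rate -- check current pricing
MAX_COST_USD = 0.10              # the team's agreed ceiling for this query

dry_run_bytes = 3_100_000_000    # stand-in for query_job.total_bytes_processed
estimated_cost = dry_run_bytes / 2 ** 40 * PRICE_PER_TIB_USD

assert estimated_cost <= MAX_COST_USD, (
    f"query would cost ~${estimated_cost:.4f}, over the ${MAX_COST_USD} limit"
)
print(f"estimated cost ${estimated_cost:.4f} is within the limit")
```

When the ceiling is encoded in a test, a forgotten partition filter shows up as a failing build rather than a surprise on the bill.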
