Recently in Measure Slack a question was asked about how to send notifications from a data quality check in BigQuery for a job that was being executed by a service account. This question – or variants of it – seem to get asked every now and then so I thought I’d write up a blog post of how this can be achieved in a reasonably simple manner using built in GCP services. Some options go straight to using Cloud Functions – but we don’t really need that in this case as we can use Cloud Logging and Monitoring to achieve the same outcome.
In this post we’ll aim to:
- Write a query to perform a basic data quality check in BigQuery
- If the check succeeds, do nothing (other than execute the query)
- If the check fails then send an email
To get set up ensure you have a GCP project provisioned, a BigQuery table you’d like to query and ideally have billing enabled / configured.
Writing a query
In this query we are going to check that the total number of rows in a table (example_table) is not zero.
BEGIN DECLARE row_count INT64; SET row_count = ( SELECT COUNT(*) FROM example_table ); IF row_count = 0 THEN SELECT 1/0; -- force an error to be raised END IF; -- force an error to be raised EXCEPTION WHEN ERROR THEN RAISE USING MESSAGE = 'This table has 0 rows!'; -- custom message to overwrite stack trace END;
You’ll notice that rather than just writing the query straight out we’re wrapping it with some stored procedure functions. A variable for storing our number of rows and an EXCEPTION to allow us to reraise an exception (which will automatically be logged in Cloud Logging). If our row_count is as expected the query completes, if it is zero we instead divide by zero (forcing an error to occur) and then re-raise using a custom message we can filter on later. Unfortunately we can’t raise an exception directly here which is why we force an error to occur.
Filtering out logs
As part of BigQuery audit logging each query is logged into the Cloud Logging service. It’s reasonably straightforward to find these logs and we’re going to apply a simple filter to find these instances
resource.type="bigquery_resource" protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.message=~"This table has 0 rows"
This log expression filters to logs with a bigquery_resource type and an error message containing that string (we avoid an exact quality match as you can see that line numbers are included in the stack trace).
Creating an alert
Next we can create an alert every time we see an instance that matches this line. You can do this by selecting ‘Create alert’ in the user interface above.
I’ve set a reasonably short time between notifications (in case you want to check the table frequently) and an autoclose policy (3o minutes) but you can tweak this as you see fit.
Finally under ‘Who should be notified’ you can set a Notification channel (either an existing one or create a new one). You will have various options here including email, SMS, Slack, PagerDuty and webhooks (for other services).
It’s as simple as that! You can now run your query as a one off, as a scheduled query or triggered externally i.e., from Airflow or composer.
Why should I use this approach?
I’m a big fan of this approach and it’s simplicity due to:
- We can compose our data quality checks entirely in SQL, which gives us a great deal of expressiveness
- We don’t require the setup of additional infrastructure or code. No cloud functions or scheduling is required.
- This takes advantage of pre-existing low cost services already present in GCP – cloud logging and a logs-based alerting policy.
- Cloud Monitoring gives us the ability to notify via multiple channels with minimal setup – these include Slack, email, SMS and webhooks (for other systems not supported out of the box)
- Although I’ll be working through the UI all of these commands can be performed using the gcloud CLI or an IaaC provider like Terraform
More complex alerts can be created using logs-based metrics which can count the number of alerts over a certain period or extract specific values from a log entry and create them as labels.
If you are looking at more complex data quality checks I’d highly encourage having a look at dedicated data quality frameworks like Monte Carlo, Great Expectations or Elementary Data (dbt specific). These tools can be particularly useful if you are performing more complex checks – like testing statistical distributions, regular expression testing or different notification levels (information, warning, error etc) as well as recording the historical state of tests.
Not using BigQuery?
You can achieve something similar in Snowflake using SYSTEM$SEND_EMAIL() if running on Snowflake in AWS (in specific regions)