Measuring and validating Core Web Vitals using Snowplow and Great Expectations in GCP

In this post we’ll look at how to collect, enrich, model and finally validate some core web vital metrics that are critical in measuring the performance of pages on your site.

Why does this matter?

These measures effectively produce a rating for user experience on your website based on how fast, interactive, and visually stable the site is. Performance matters for a number of reasons – it’s been demonstrated to increase conversion rates (particularly in ecommerce), it results in more favourable SEO rankings and, critically, it makes the end user feel like they are using a modern website – not a Geocities page over a 56K modem.

Core web vitals are currently a set of three measures designed to represent end user experience on the site and can be collected from visitors to evaluate loading (Largest Contentful Paint), interactivity (First Input Delay) and visual stability (Cumulative Layout Shift).

To do this we’ll be using Google Tag Manager (GTM), Snowplow, and Great Expectations.

Collection

In Google Tag Manager we’ll be using a template from Simo Ahava for capturing these metrics – however you can also use unpkg directly, or package this code with your own deployment process.

1. Templates > Tag templates > Search Gallery > Search for ‘Core Web Vitals’ and hit ‘Add to Workspace’
2. This template pulls down from unpkg (a CDN hosting the Javascript that emits these metrics) so will use the latest version – if you’d like greater control over this consider deploying this JS to your own site using NPM.
3. Next let’s add a trigger (All Pages) to the tag (you can find ‘Core Web Vitals’ under Custom tags in the UI)
4. This tag will push a ‘coreWebVitals’ event into the dataLayer each time a metric becomes available (this may happen more than once on the same page for the same metric). The event will look similar to the following:

window.dataLayer.push({
  event: 'coreWebVitals',
  webVitalsMeasurement: {
    name: 'FID',
    id: 'v1-123-123',
    value: 1200.00,
    delta: 1200.00,
    valueRounded: 1200,
    deltaRounded: 1200
  }
});
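
As an aside – if you’d rather self-host the library than use the GTM template (as mentioned in step 2), a minimal sketch of producing the same dataLayer push with the web-vitals package (v1/v2-style API, assuming your own bundling and deployment) looks like this:

// minimal sketch – assumes `npm install web-vitals` and your own build/deploy process
import { getCLS, getFID, getLCP } from 'web-vitals';

function pushToDataLayer(metric) {
  window.dataLayer = window.dataLayer || [];
  window.dataLayer.push({
    event: 'coreWebVitals',
    webVitalsMeasurement: {
      name: metric.name,
      id: metric.id,
      value: metric.value,
      delta: metric.delta
    }
  });
}

getCLS(pushToDataLayer);
getFID(pushToDataLayer);
getLCP(pushToDataLayer);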

5. Now we can create a few dataLayer variables so that we can later reference them from our analytics tag.
6. I’ll create one for vital_name, vital_value, and vital_delta. For the moment we are not going to use id (we can use the Snowplow page view id for this) and we don’t need the rounded values – as there’s no issue here in sending them through as floats.
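
Each of these is a Data Layer Variable that reads from the object pushed above – the display names are up to you (these are the names used in the tag further down), but the paths need to match the dataLayer push:

DLV - vitalName – Data Layer Variable Name: webVitalsMeasurement.name
DLV - vitalValue – Data Layer Variable Name: webVitalsMeasurement.value
DLV - vitalDelta – Data Layer Variable Name: webVitalsMeasurement.delta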

7. Now we can set up the corresponding trigger, which will be invoked for custom events with the event name ‘coreWebVitals’.

8. And finally our tag – referencing our global Javascript variable (window.snowplow) to send the event.

<script type="text/javascript">
  window.snowplow('trackSelfDescribingEvent', {
    schema: 'iglu:dev.web/core_web_vitals/jsonschema/1-0-0',
    data: {
      name: {{DLV - vitalName}},
      value: {{DLV - vitalValue}},
      delta: {{DLV - vitalDelta}}
    }
  });
</script>

To summarise, we should now have the following resources in our Workspace:

– Variables – three dataLayer variables for name, value and delta
– Triggers – one trigger for the ‘coreWebVitals’ custom event
– Tags – two tags: the Core Web Vitals tag triggering on all page views, and our Snowplow self-describing event tag triggering on the ‘coreWebVitals’ dataLayer events

You can test this in preview mode before you publish – you may need to wait a few seconds and interact with the page if you’d like all 3 different metrics to fire (try loading the page, clicking / scrolling, and changing the tab to background / foreground).

Now that our work in GTM is complete, let’s set up a schema in Snowplow to capture this data – which will be validated, enriched and sent to BigQuery in real time.

Schemas

I’ve opted for a single schema (JSON schema) to encompass all three metrics. An approach that has one schema per metric is equally valid, but for our data model it’s easier having all of this information within a single column. A single schema also allows us to evolve this over time to collect additional web vitals as the recommendations evolve, such as time to first byte and first contentful paint.

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for core web vitals measurements",
    "self": {
        "vendor": "dev.web",
        "name": "core_web_vitals",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Name of the core web vital metric - largest contentful paint, first input delay, and cumulative layout shift",
            "enum": ["LCP", "FID", "CLS"]
        },
        "value": {
            "type": "number",
            "description": "Value of the metric, milliseconds for LCP and FID, a calculated score for CLS"
        },
        "delta": {
            "type": "number",
            "description": "Delta from the last recorded value"
        }
    },
    "additionalProperties": false
}

This is a reasonably simple schema that accepts the name of the metric (one of three values defined in an enum) as well as the value and delta as numbers – which allows us to store a fractional component.

Once you have deployed this schema to your pipeline and published your GTM changes you should see a new column in your BigQuery events table containing this data.
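
As a quick sanity check, a query along the following lines (the table and column names here match the data model query used later in this post) will show the most recent core web vitals events as they arrive:

-- confirm the self-describing event column is being populated
SELECT
    derived_tstamp,
    unstruct_event_dev_web_core_web_vitals_1_0_0 AS cwv
FROM
    rt_pipeline_prod1.events
WHERE
    event_name = 'core_web_vitals'
ORDER BY
    derived_tstamp DESC
LIMIT 10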

Data modelling

We’ve validated that we are now collecting this data. However, to make it more useful, let’s create a simple data model that distills it down into the dimensions and aggregations that we require.

In order to do so let’s first recall what the thresholds are for good page performance:

  • LCP should occur within 2.5 seconds (2500 ms) of when the page first starts loading
  • Pages should have a FID of less than 100 milliseconds
  • For a good user experience CLS should be maintained at a value of less than 0.1

“For each of the above metrics, … a good threshold to measure is the 75th percentile of page loads, segmented across mobile and desktop devices … consider a page passing if it meets the recommended targets at the 75th percentile for all of the above three metrics.”

Given these requirements we should construct a data model that:

– excludes any bot or spider traffic
– aggregates each page view together with a column for each of three metrics
– adds a column for device type (mobile, desktop) and page URL – which will allow for identification of poorly performing pages by device and specific URL

Knowing this we can put together a data model that combines these requirements into a relatively simple statement as follows (below, in BigQuery)

WITH events AS (
    SELECT
        derived_tstamp,
        page_urlhost || page_urlpath AS url,
        contexts_nl_basjes_yauaa_context_1_0_1[SAFE_OFFSET(0)].device_class AS device_type,
        contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[SAFE_OFFSET(0)].id AS web_page_id,
        unstruct_event_dev_web_core_web_vitals_1_0_0 AS cwv
    FROM
        rt_pipeline_prod1.events
    WHERE
        collector_tstamp >= '2021-02-01' -- because who doesn't want to forget 2020?
        AND event_name = 'core_web_vitals' -- just the events we care about
        AND contexts_com_iab_snowplow_spiders_and_robots_1_0_0[SAFE_OFFSET(0)].spider_or_robot = False -- exclude any bot traffic
        AND contexts_nl_basjes_yauaa_context_1_0_1[SAFE_OFFSET(0)].device_class IN ('Desktop', 'Phone') -- limit to desktop and mobile devices only
)
SELECT * FROM (
    SELECT
        url,
        device_type,
        web_page_id,
        MIN(CASE WHEN cwv.name = 'LCP' THEN cwv.value END) OVER (
            PARTITION BY web_page_id ORDER BY derived_tstamp ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS lcp,
        MIN(CASE WHEN cwv.name = 'CLS' THEN cwv.value END) OVER (
            PARTITION BY web_page_id ORDER BY derived_tstamp ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS cls,
        MIN(CASE WHEN cwv.name = 'FID' THEN cwv.value END) OVER (
            PARTITION BY web_page_id ORDER BY derived_tstamp ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS fid
    FROM
        events
)
GROUP BY 1, 2, 3, 4, 5, 6

A single page may generate the same metric more than once (e.g., when losing and regaining focus) – so we’re taking the lowest value we receive for each metric, which for these metrics is also the first value reported. It’s easy enough to change this logic for whatever your use case is – particularly if you want to include subsequent measurements or deltas.
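
For example, if you wanted the last reported CLS for each page view instead (a hypothetical variation, not part of the model above), you could replace the corresponding expression in the inner SELECT with something like:

-- take the last reported CLS for the page view rather than the first
LAST_VALUE(CASE WHEN cwv.name = 'CLS' THEN cwv.value END IGNORE NULLS) OVER (
    PARTITION BY web_page_id ORDER BY derived_tstamp ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS cls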

Our data model now gives us one row per page view for recent views, with spiders and robots deliberately excluded – as these may impact our core web vitals, and Great Expectations does not yet feature a Voight-Kampff test (however, pull requests are welcome).

Now that we have a statement for a data model (and hypothetically we’ve automated it using something like dbt) we need to start thinking about how to make some assertions about the data within our newly created model.
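
If you do go down the dbt route, a minimal sketch of what that model file might look like is below – the source name and materialisation here are assumptions rather than anything from the original setup:

-- models/core_web_vitals.sql
{{ config(materialized='table') }}

-- the body is the data model statement above, with the hard-coded table swapped for a dbt source
WITH events AS (
    SELECT
        derived_tstamp,
        page_urlhost || page_urlpath AS url,
        contexts_nl_basjes_yauaa_context_1_0_1[SAFE_OFFSET(0)].device_class AS device_type,
        contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[SAFE_OFFSET(0)].id AS web_page_id,
        unstruct_event_dev_web_core_web_vitals_1_0_0 AS cwv
    FROM {{ source('rt_pipeline_prod1', 'events') }}
    WHERE collector_tstamp >= '2021-02-01'
        AND event_name = 'core_web_vitals'
)
-- ...remainder of the statement above, unchanged
SELECT * FROM events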

Recall our guidelines for good performance:

– Largest contentful paint (LCP) should occur within 2500 milliseconds
– First input delay (FID) should be less than 100 milliseconds
– Cumulative layout shift (CLS) should be less than 0.1

In addition, the documentation suggests using the 75th percentile, segmented by device type (mobile or desktop), and considering a page to have a passing grade if it meets all three of these criteria at that percentile.

Testing data

All the courage in the world cannot alter fact. For this reason alone it’s a good idea to make assertions about our data and check that our expectations of the world conform to reality – thankfully Great Expectations helps us to do so.

First, we’ll create a new folder and initialise a project which will allow us to configure a data context.

mkdir expect
cd expect
great_expectations init

You will now be launched into an interactive prompt to set up your data source – in this case we’ll opt for BigQuery.

Select option 2 to connect to a relational (SQL) database, and option 5 for BigQuery.

Specify a BigQuery project with the connection string

bigquery://gcp-project-id

If everything goes well at this stage you’ll see

Great Expectations connected to your database!

Now that we are set up with a connection to our dataset we can start creating our expectations.

great_expectations suite scaffold core_web_vitals

Select option 2 to enter a custom query and we’ll use the query below.

Recall that our data model has one row per page_id, whereas we need to now aggregate this information by url and device type and return the 75th percentile for each group. We can do that with the following query:

SELECT * FROM (
    SELECT
        url,
        device_type,
        percentile_cont(cls, 0.75) OVER (PARTITION BY url, device_type) AS cls_75p,
        percentile_cont(fid, 0.75) OVER (PARTITION BY url, device_type) AS fid_75p,
        percentile_cont(lcp, 0.75) OVER (PARTITION BY url, device_type) AS lcp_75p
    FROM
        derived.core_web_vitals
)
GROUP BY 1, 2, 3, 4, 5

This tells Great Expectations to define a view from the query, which we can then reference in our subsequent expectations.

This view produces one row for each distinct url and device type and their corresponding 75th percentiles for each of our three core metrics.

Great Expectations will create a new Expectation Suite core_web_vitals and store it – as well as open up a new Jupyter notebook in which we can create and test our expectations.

Running the first cell will create a dataset that samples our BigQuery view and outputs the first 5 rows by calling the head method.
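
For reference, that generated cell does something roughly like the following – the details and the datasource name vary by Great Expectations version and your configuration, so treat this as a sketch:

# roughly what the scaffold's first cell does (datasource name is an assumption)
import great_expectations as ge

context = ge.data_context.DataContext()
batch_kwargs = {'query': '...the percentile query above...', 'datasource': 'your_bigquery_datasource'}
batch = context.get_batch(batch_kwargs, 'core_web_vitals')
batch.head()  # preview the first few rows of the view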

Let’s insert a new cell and create a simple test expectation. We know that our cumulative layout shift value should always be between 0 and 1 (inclusive) so let’s assert that this is the case.

To do so we can use expect_column_values_to_be_between and write the following expectation

batch.expect_column_values_to_be_between('cls_75p', 0, 1)

We can run this cell which will yield an output indicating whether the test has passed or not.

Although this is a simple test, it’s a good sanity check to ensure that we’re dealing with a percentile value in the expected range – something outside of these bounds may indicate a problem with data collection or our model.

With that out of the way we can write some expectations that evaluate our aggregated metrics.

# a value outside these ranges indicates a page / device_type combination
# that fails the recommended threshold at the 75th percentile
# LCP 75th percentile between 0 and 2500 milliseconds
batch.expect_column_values_to_be_between('lcp_75p', 0, 2500)
# FID 75th percentile between 0 and 100 milliseconds
batch.expect_column_values_to_be_between('fid_75p', 0, 100)
# CLS 75th percentile between 0 and 0.1 (no unit)
batch.expect_column_values_to_be_between('cls_75p', 0, 0.1)

Each expectation will assert that our three core web metrics fall within the acceptable bounds at the 75th percentile (for more advanced users: you can also achieve the same thing using the mostly argument in Great Expectations, without calculating the percentiles ahead of time).
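
As a rough illustration of that mostly approach, the sketch below applies the thresholds directly to the per-page-view model – the column names assume the data model above, and note that this evaluates the batch as a whole rather than per url / device_type group:

# at least 75% of page views should meet each threshold, which mirrors
# the 75th percentile requirement (columns assume the per-page-view model)
batch.expect_column_values_to_be_between('lcp', 0, 2500, mostly=0.75)
batch.expect_column_values_to_be_between('fid', 0, 100, mostly=0.75)
batch.expect_column_values_to_be_between('cls', 0, 0.1, mostly=0.75)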

Now that we’ve written our expectations we can rerun them at regular intervals, save the results of each run using checkpoints, and publish our validation results (HTML output) to something like Google Cloud Storage. At the moment I run expectations on a regular basis using Cloud Run – which is well suited to short-lived tasks like this (where Cloud Functions may time out).
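
For example, once a checkpoint exists for the suite, a scheduled job can simply invoke the CLI – the checkpoint name below is an assumption, and the exact commands differ between Great Expectations versions:

# run the suite's checkpoint on a schedule (create it first with
# `great_expectations checkpoint new`; arguments vary by version)
great_expectations checkpoint run core_web_vitals_checkpoint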

Finally – there’s not much use in having these expectations if we don’t know when they fail. Great Expectations offers a number of validation actions that enable notifications for failures that you can customise including Slack and OpsGenie amongst others.

We’ve only just scratched the surface of what is possible with Great Expectations, so go forth and test – you can’t hold the tide back with a broom.

Published by Mike Robins

CTO at Poplin Data
