
In the first four parts of this series, we modelled out:
- Pageviews: accurate time spent incorporating page pings and maximum scroll depth
- Sessions: traffic sources and initial landing details
- Users: filtering internal traffic and looking up things we know about the users
Now it’s time to move the model into production and start using it. To do this we’ll use Snowplow’s SQL Runner tool at the end of each Snowplow batch load into Redshift.
Schema for models
In my previous posts I’ve created a series of lookup_* tables with my modelled data. To keep this clean, you probably want to create a new schema specifically for your modelled data. This separates the raw, unopinionated event-level data from the modelled, opinionated data that applies your business rules. Once everyone has got used to it, you can remove access to the raw data for most of your users, ensuring everyone is working from the same business rules.
By convention, Snowplow uses scratch and derived schemas for data modelling steps. scratch is for ephemeral data used in intermediate processing steps but not actually part of the output; it should be discarded as a final processing step. derived is for your model output data.
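For example, a minimal Redshift sketch might look like the following. The datamodeling user matches the playbook further down; the analysts group is a hypothetical stand-in for the rest of your users.
-- Sketch only: create the modelling schemas and restrict general users to the derived output
CREATE SCHEMA IF NOT EXISTS scratch;
CREATE SCHEMA IF NOT EXISTS derived;

-- The data modelling user writes to both schemas
GRANT ALL ON SCHEMA scratch TO datamodeling;
GRANT ALL ON SCHEMA derived TO datamodeling;

-- Analysts (hypothetical group) read only the modelled output, not the raw events
GRANT USAGE ON SCHEMA derived TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA derived TO GROUP analysts;
REVOKE ALL ON SCHEMA atomic FROM GROUP analysts;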
SQL Runner
SQL Runner is a pretty simple tool that takes a series of SQL scripts and executes them on command against a specified database with specified credentials. The important piece of information from the documentation is the difference between :queries: and :steps:.
Queries within a Step execute in parallel, while individual Steps execute in the order in which they appear. If your models build on one another, you’ll need to separate them into Steps so they run in sequence. The model I’ve built out in this series doesn’t have any dependencies between its scripts, so you can easily have them all run in parallel within the same Step.
Configuring your scripts
The syntax for SQL Runner’s playbooks is straightforward. For our model playbook we’re just executing the following:
:targets:
  - :name: "Poplin data model"
    :type: redshift
    :host: redshift-endpoint # The endpoint as shown in the Redshift console
    :database: snowplow # Name of database
    :port: 5439 # Default Redshift port
    :username: datamodeling
    :password: {{secret "redshift-datamodeling-password"}}
    :ssl: true # SSL disabled by default
:steps:
  - :name: "Rebuild pageview, session and user models in parallel"
    :queries:
      - :name: pageview
        :file: model-pageview.sql
      - :name: session
        :file: model-session.sql
      - :name: user
        :file: model-user.sql
Automation
Managed Service customers
If you’re a Snowplow Managed Service customer, SQL Runner is handled for you. The process is quite simple:
- Clone your pipeline’s snowplow-proservices pipeline repository
- Create an sql-runner directory in jobs/main
- Place the playbook YAML file in a directory called playbooks
- Place your SQL scripts in a directory called sql
- Commit your changes to a named branch on the repository and push back upstream
- Raise a support ticket with Snowplow advising of the branch you’ve created and the schedule on which you’d like the new data model to run, for example “daily at UTC midnight” or “after every batch”
- Snowplow will implement your new data model scripts
From then on, any changes you make on the master branch for existing playbooks will take effect within about 45 minutes. You’ll need to follow the process above again if you create a new playbook running on a different schedule.
Open Source users
If you’re using an Open Source Snowplow pipeline, you’ll have to set up and manage scheduling yourself, probably within the same script you currently kick off from cron to run enrichment and load data into the database. A good place to start is the Guide for devops.
Optimisations
Scope down to a smaller time range
My scripts execute across the entire events dataset we have. Our site isn’t high traffic, so this isn’t a huge imposition, executing in about a minute on our Redshift database. If you have decent traffic volumes, you’ll want to scope down the reprocessing of models somewhat. If your queries involve any kind of lookback, that will take some careful tuning of your query windows. Be particularly careful if you’re calculating sessions in some way different to COUNT(DISTINCT domain_sessionid), as a session can easily cross a batch boundary.
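As a rough sketch of what that scoping could look like for the session model (table and column names follow the earlier parts of this series; the seven-day window is arbitrary), you could collect the session IDs seen in recent batches and rebuild only those, so a session that crosses a batch boundary is always recomputed in full:
-- Sketch only: rebuild just the sessions seen in the last 7 days
DROP TABLE IF EXISTS scratch.recent_sessions;
CREATE TABLE scratch.recent_sessions AS
SELECT DISTINCT domain_sessionid
FROM atomic.events
WHERE collector_tstamp > SYSDATE - INTERVAL '7 days';

-- Remove the existing rows for those sessions, then re-run the same logic
-- as model-session.sql restricted to scratch.recent_sessions
DELETE FROM derived.session
USING scratch.recent_sessions
WHERE derived.session.domain_sessionid = scratch.recent_sessions.domain_sessionid;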
Creating useful SQL views
The joins you need to make for this to be useful are a bit complicated to do routinely, so it makes sense to create a nice simple view for your pageview
events that incorporates these elements on every row. These views become something you can expose to a visualisation tool like Tableau to enable easy exploration while incorporating engagement metrics, your business rules for traffic source attribution and information about users.
Below is the view I’ve created for our data set. For convenience I’m also including the timezone conversion we routinely use to convert timestamps to Sydney time. For performance reasons, you might end up writing the result of this view out to a table as an additional step; that costs some storage space but should be faster to query.
CREATE VIEW derived.view_pageviews AS (
SELECT
CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp) AS derived_tstamp_sydney,
ev.*,
pv.pageview_id,
pv.time_spent_seconds,
pv.max_pp_yoffset_max,
pv.max_doc_height,
pv.scroll_depth_percent,
se.first_refr_urlhost,
se.first_page_referrer,
se.first_refr_urlscheme,
se.first_refr_urlport,
se.first_refr_urlpath,
se.first_refr_urlfragment,
se.first_refr_medium,
se.first_refr_source,
se.first_refr_term,
se.first_mkt_campaign,
se.first_mkt_medium,
se.first_mkt_source,
se.first_page_urlpath,
se.last_page_urlpath,
se.channel,
us.first_session,
us.last_session,
us.last_domain,
us.last_name,
us.last_legal_name,
us.last_description,
us.last_type,
us.last_linkedin_handle,
us.last_logo,
us.last_metrics_alexa_global_rank,
us.last_metrics_annual_revenue,
us.last_metrics_employees,
us.last_metrics_market_cap,
us.last_crunchbase_handle,
us.user_internal
FROM
atomic.events ev,
atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
derived.pageview pv,
derived.session se,
derived.user us
WHERE
ev.app_id = 'poplindata.com'
AND ev.event = 'page_view'
AND ev.event_id = pg.root_id
AND pg.id = pv.pageview_id
AND ev.domain_sessionid = se.domain_sessionid
AND ev.domain_userid = us.domain_userid
) WITH NO SCHEMA BINDING
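If you do materialise the view, the additional step can be as simple as the sketch below; the table name derived.pageviews_materialised is just an example.
-- Example only: write the view out to a table as an additional modelling step
DROP TABLE IF EXISTS derived.pageviews_materialised;
CREATE TABLE derived.pageviews_materialised AS
SELECT * FROM derived.view_pageviews;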
Querying the data
Now let’s use the models and views we’ve built and start analysing some content.
Engagement metrics
This query gives you some handy engagement metrics including bucketing pageviews into scroll depth groups.
SELECT
page_urlpath,
SUM(CASE WHEN scroll_depth_percent = 0 THEN 1 END) AS "0%",
SUM(CASE WHEN scroll_depth_percent > 0 AND scroll_depth_percent <= 0.25 THEN 1 END) AS "25%",
SUM(CASE WHEN scroll_depth_percent > 0.25 AND scroll_depth_percent <= 0.5 THEN 1 END) AS "50%",
SUM(CASE WHEN scroll_depth_percent > 0.5 AND scroll_depth_percent <= 0.75 THEN 1 END) AS "75%",
SUM(CASE WHEN scroll_depth_percent > 0.75 THEN 1 END) AS "100%",
ROUND(MEDIAN(scroll_depth_percent)*100) AS "median scroll depth percent",
AVG(time_spent_seconds) AS "average time spent seconds",
SUM(time_spent_seconds) AS "total time spent"
FROM derived.view_pageviews
WHERE
page_urlpath LIKE '/blog/data-modeling%'
AND user_internal IS NOT TRUE
GROUP BY 1
ORDER BY COUNT(*) DESC


Looking at identified companies
As our site is aimed at a B2B audience, identifying the companies coming to it is an interesting exercise. I’m excluding traffic from ourselves and our friends at Snowplow, and might add more exclusions as we keep exploring the data.
SELECT
COALESCE(last_domain, user_ipaddress) AS last_domain,
last_crunchbase_handle,
last_linkedin_handle,
last_name,
SUM(time_spent_seconds) AS time_spent_seconds,
COUNT(*) AS pageviews,
COUNT(DISTINCT domain_sessionid) AS sessions,
COUNT(DISTINCT domain_userid) AS unique_browsers,
ROUND(AVG(scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
derived.view_pageviews
WHERE
user_internal IS NOT TRUE
AND COALESCE(last_domain, '') NOT IN ('snowplowanalytics.com')
AND derived_tstamp_sydney > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '90 days'
GROUP BY 1, 2, 3, 4
ORDER BY 8 DESC, 6 DESC, 5 DESC
Next steps?
So that’s the end of our giant odyssey into Snowplow data models. We haven’t covered everything that can go into a data model here. In particular, one of the best uses of data models is to map identities across all instances where you see a particular domain_userid back in history. To do this you’d use the same technique as I’ve used here for companies, using whatever identification point (or points!) you have. We, unfortunately, don’t have an easy way to reliably identify people.
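As a purely hypothetical sketch, if you did capture a login identifier in user_id, you could map each domain_userid to the most recent identifier seen for it along these lines (all names illustrative):
-- Hypothetical: map each domain_userid to the latest user_id observed for it
CREATE TABLE scratch.user_id_map AS
SELECT domain_userid, user_id
FROM (
  SELECT
    domain_userid,
    user_id,
    ROW_NUMBER() OVER (PARTITION BY domain_userid ORDER BY collector_tstamp DESC) AS rn
  FROM atomic.events
  WHERE user_id IS NOT NULL
) t
WHERE t.rn = 1;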
You might now like to have a read of the Snowplow web data model, which can be a bit daunting at first but probably makes a whole lot more sense now that you’ve looked through a simpler model.