
In the first part of this series on data modelling we went through the background for building a data model. In this edition we’ll go through the steps to create a basic pageview model that incorporates page pings so we can see accurate time spent and scroll depth for each pageview.
Using this model you can then aggregate the pageview rows and build reports that explore the traffic to your site, like the one shown above, which gives us Pageviews, Sessions, Unique Browsers, Average Time Spent and Average Scroll Depth metrics for each page.
Create a lookup table on pageview_id
This model takes advantage of the web page context configuration option of the JavaScript tracker. This automatically creates a UUID for each pageview, attaches it to every event fired within that pageview, and surfaces it in the table atomic.com_snowplowanalytics_snowplow_web_page_1. We’ll create a table that aggregates all the page_ping data against this pageview_id.
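Before building it, it’s easy to sanity check that the context is being populated by peeking at a few recent events; a minimal sketch using the standard Snowplow column names described above:

-- Peek at the web page context: every event within a pageview
-- carries the same UUID in the context table's id column.
SELECT
  ev.event,
  ev.collector_tstamp,
  wp.id AS pageview_id
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON ev.event_id = wp.root_id
 AND ev.collector_tstamp = wp.root_tstamp
ORDER BY ev.collector_tstamp DESC
LIMIT 20;

Now for the lookup table itself.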
-- DROP only needed when you recreate after the first time
DROP TABLE IF EXISTS lookup_pageview;

CREATE TABLE lookup_pageview
  DISTKEY(2)
  SORTKEY(2, 1)
AS (
  SELECT
    atomic.com_snowplowanalytics_snowplow_web_page_1.id AS pageview_id,
    (COUNT(*) * 5) - 5 AS time_spent_seconds,
    MAX(pp_yoffset_max) AS max_pp_yoffset_max,
    MAX(doc_height) AS max_doc_height,
    CASE WHEN MAX(pp_yoffset_max) = 0 THEN 0
         ELSE MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float))
    END AS scroll_depth_percent
  FROM atomic.events
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1
    ON atomic.events.event_id = atomic.com_snowplowanalytics_snowplow_web_page_1.root_id
   AND atomic.events.collector_tstamp = atomic.com_snowplowanalytics_snowplow_web_page_1.root_tstamp
  WHERE
    app_id = 'poplindata.com'
    AND event IN ('page_view', 'page_ping')
  GROUP BY 1
);
Let’s go through what’s happening here.
We create a table lookup_pageview using a query that joins atomic.com_snowplowanalytics_snowplow_web_page_1 to atomic.events keyed on event_id and collector_tstamp, the usual way of joining context tables to the core event table.
The output table has a time_spent_seconds column, calculated from the page ping setting we use, window.snowplow('enableActivityTracking', 5, 5);. With that configuration the tracker fires a page ping every five seconds (starting five seconds after the page view), so each ping row represents roughly five seconds of activity: we multiply the total row count by five and subtract five to discount the page_view row itself, which means a pageview with no pings counts as zero seconds.
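As a sanity check on that arithmetic, the same figure can be expressed by counting only the page ping rows; a minimal equivalent sketch, assuming the five-second heartbeat above and exactly one page_view row per pageview_id:

-- Equivalent formulation: count only page_ping rows and multiply by
-- the 5-second heartbeat. A pageview with no pings yields 0 seconds,
-- matching (COUNT(*) * 5) - 5 in the model above.
SELECT
  wp.id AS pageview_id,
  COUNT(CASE WHEN ev.event = 'page_ping' THEN 1 END) * 5 AS time_spent_seconds
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON ev.event_id = wp.root_id
 AND ev.collector_tstamp = wp.root_tstamp
WHERE ev.app_id = 'poplindata.com'
  AND ev.event IN ('page_view', 'page_ping')
GROUP BY 1;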
Next we get the maximum scroll position and document height (in pixels) for the pageview, then calculate scroll depth as the percentage of the document that was scrolled. That’s really it, quite simple.
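Before reporting off it, it’s worth spot checking a few rows of the new table, for example:

-- Spot check: the longest pageviews with their scroll depth.
SELECT
  pageview_id,
  time_spent_seconds,
  ROUND(scroll_depth_percent * 100) AS scroll_depth_pct
FROM lookup_pageview
ORDER BY time_spent_seconds DESC
LIMIT 10;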
Reporting
To build the report output shown above, you can then run a query with a three-way join between atomic.events, atomic.com_snowplowanalytics_snowplow_web_page_1 and our just-created lookup_pageview (aliased ev, pg and pv below).
SELECT
  ev.page_urlpath,
  SUM(CASE WHEN ev.event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS unique_browsers,
  AVG(pv.time_spent_seconds) AS time_spent_average,
  SUM(pv.time_spent_seconds) AS time_spent_total,
  ROUND(AVG(pv.scroll_depth_percent) * 100) AS scroll_depth_percent_average
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
  ON ev.event_id = pg.root_id
 AND ev.collector_tstamp = pg.root_tstamp
JOIN lookup_pageview pv
  ON pg.id = pv.pageview_id
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.app_id = 'poplindata.com'
GROUP BY 1
ORDER BY 2 DESC;
Eventually you would write out new data tables that remove the need to ever hit atomic.events directly, containing the columns you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings, because everything you need from them is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.
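As an illustration, such a view might look something like this; the view name and column selection are just a sketch of the idea, not a prescription:

-- One row per pageview, with the event-level fields an analyst
-- typically wants plus the summarised ping data.
CREATE VIEW pageviews AS
SELECT
  pg.id AS pageview_id,
  ev.derived_tstamp,
  ev.page_urlpath,
  ev.refr_urlhost,
  ev.domain_userid,
  ev.domain_sessionid,
  pv.time_spent_seconds,
  pv.scroll_depth_percent
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
  ON ev.event_id = pg.root_id
 AND ev.collector_tstamp = pg.root_tstamp
JOIN lookup_pageview pv
  ON pg.id = pv.pageview_id
WHERE ev.event = 'page_view';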
Below is a query to bucket page views by scroll depth quartile, handy for getting a broader view of user interactions:
SELECT
  ev.page_urlpath,
  SUM(CASE WHEN pv.scroll_depth_percent = 0 THEN 1 ELSE 0 END) AS "0%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0 AND pv.scroll_depth_percent <= 0.25 THEN 1 ELSE 0 END) AS "25%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.25 AND pv.scroll_depth_percent <= 0.5 THEN 1 ELSE 0 END) AS "50%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.5 AND pv.scroll_depth_percent <= 0.75 THEN 1 ELSE 0 END) AS "75%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.75 THEN 1 ELSE 0 END) AS "100%"
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
  ON ev.event_id = pg.root_id
 AND ev.collector_tstamp = pg.root_tstamp
JOIN lookup_pageview pv
  ON pg.id = pv.pageview_id
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.app_id = 'poplindata.com'
  AND ev.event = 'page_view'
GROUP BY 1
ORDER BY COUNT(*) DESC;
Automation and moving to production
You probably want to create a schema especially for data models instead of dropping this table into atomic; we use a models schema for this.
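Setting that up is a one-off, after which the lookup table is created as models.lookup_pageview rather than sitting in atomic (the schema name is just our convention):

-- One-off setup for a dedicated data-model schema.
CREATE SCHEMA IF NOT EXISTS models;

-- Then create the lookup table there instead of in atomic:
-- CREATE TABLE models.lookup_pageview ... (as above)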
The script above recreates the table for all pageviews every time it runs. To automate this, you’ll want to trigger it with SQL Runner after each batch completes. We don’t have enormous traffic volumes, so recreating every time isn’t a big deal; if you do, you might want to process a smaller window and keep the older data around.
To do this you’ll need a look-back window, something like 24 hours, to account for pageviews that are still happening across the batch boundary, and only update the rows within that scope. This has the helpful side effect of doing less crunching on every batch, which is probably preferable anyway; I wanted to keep the example simple.
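A rough sketch of that incremental approach, assuming the lookup table lives in the models schema described above: identify the pageviews touched in the last 24 hours, delete their rows, then re-aggregate those pageviews from all of their events (not only the recent ones) so the figures stay correct across the batch boundary.

-- Incremental rebuild sketch: only pageviews with activity inside
-- the look-back window are deleted and re-aggregated.
BEGIN;

DELETE FROM models.lookup_pageview
WHERE pageview_id IN (
  SELECT wp.id
  FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
    ON ev.event_id = wp.root_id
   AND ev.collector_tstamp = wp.root_tstamp
  WHERE ev.collector_tstamp > SYSDATE - INTERVAL '24 hours'
);

INSERT INTO models.lookup_pageview
SELECT
  wp.id AS pageview_id,
  (COUNT(*) * 5) - 5 AS time_spent_seconds,
  MAX(pp_yoffset_max) AS max_pp_yoffset_max,
  MAX(doc_height) AS max_doc_height,
  CASE WHEN MAX(pp_yoffset_max) = 0 THEN 0
       ELSE MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float))
  END AS scroll_depth_percent
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON ev.event_id = wp.root_id
 AND ev.collector_tstamp = wp.root_tstamp
WHERE ev.app_id = 'poplindata.com'
  AND ev.event IN ('page_view', 'page_ping')
  AND wp.id IN (
    SELECT wp2.id
    FROM atomic.events ev2
    JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp2
      ON ev2.event_id = wp2.root_id
     AND ev2.collector_tstamp = wp2.root_tstamp
    WHERE ev2.collector_tstamp > SYSDATE - INTERVAL '24 hours'
  )
GROUP BY 1;

COMMIT;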
Next steps
In the next chapter of this series we’ll look at a session model. Sessions are interesting because they have a traffic source: how the user got to your site. The classifications of these sources are very company-specific, so you often end up with a massive CASE statement to put them in appropriate buckets, much like Google Analytics has its Channel Groupings.
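To make that concrete, below is a purely illustrative sketch of such a classification against Snowplow’s referrer and campaign fields; the bucket names and rules are invented for the example and a real mapping would run far longer. In the session model you would typically take the classification from the first event of each session.

-- Illustrative channel classification only; real rules are
-- company-specific and usually run to dozens of branches.
SELECT
  domain_sessionid,
  CASE
    WHEN mkt_medium = 'cpc'       THEN 'Paid Search'
    WHEN mkt_medium = 'email'     THEN 'Email'
    WHEN refr_medium = 'search'   THEN 'Organic Search'
    WHEN refr_medium = 'social'   THEN 'Social'
    WHEN refr_medium = 'internal' THEN 'Internal'
    WHEN refr_urlhost IS NULL     THEN 'Direct'
    ELSE 'Referral'
  END AS channel
FROM atomic.events
WHERE event = 'page_view';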
Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.