In the first part of this series on data modelling we went through the background for building a data model. In this edition we’ll go through the steps to create a basic pageview model that incorporates page pings so we can see accurate time spent and scroll depth for each pageview.
Using this model you can then aggregate the pageview rows into reports exploring the traffic to your site, like the one shown above, which gives us Pageviews, Sessions, Unique Browsers, Average Time Spent and Average Scroll Depth metrics for each page.
Create a lookup table
We’ll create a table that aggregates all the page_ping data against the pageview ID in atomic.com_snowplowanalytics_snowplow_web_page_1.
-- DROP only needed when you recreate after the first time
DROP TABLE lookup_pageview;

CREATE TABLE lookup_pageview
  DISTKEY(pageview_id)
  SORTKEY(pageview_id)
AS (
  SELECT
    atomic.com_snowplowanalytics_snowplow_web_page_1.id AS pageview_id,
    (COUNT(*) * 5) - 5 AS time_spent_seconds,
    MAX(pp_yoffset_max) AS max_pp_yoffset_max,
    MAX(doc_height) AS max_doc_height,
    CASE
      WHEN MAX(pp_yoffset_max) = 0 THEN 0
      ELSE MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float))
    END AS scroll_depth_percent
  FROM atomic.events
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1
    ON atomic.events.event_id = atomic.com_snowplowanalytics_snowplow_web_page_1.root_id
   AND atomic.events.collector_tstamp = atomic.com_snowplowanalytics_snowplow_web_page_1.root_tstamp
  WHERE app_id = 'poplindata.com'
    AND event IN ('page_view', 'page_ping')
  GROUP BY 1
);
Let’s go through what’s happening here.
We create a table lookup_pageview using a query that joins atomic.events to atomic.com_snowplowanalytics_snowplow_web_page_1 on event_id = root_id and collector_tstamp = root_tstamp, the usual way of joining context tables to the core event table.
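If you want to sanity-check the aggregation logic before running it against your warehouse, here is a minimal sketch using Python’s built-in SQLite. The table names and sample rows are invented, and Redshift-specific clauses (DISTKEY/SORTKEY) are omitted; it only illustrates the join-and-aggregate shape of the query above.

```python
import sqlite3

# Invented mini-schema standing in for atomic.events and the web page context table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, event TEXT, pp_yoffset_max INT, doc_height INT);
CREATE TABLE web_page (root_id TEXT, id TEXT);

-- one pageview followed by three page pings, scrolled 1500px down a 2000px page
INSERT INTO events VALUES
  ('e1', 'page_view', 0,    2000),
  ('e2', 'page_ping', 500,  2000),
  ('e3', 'page_ping', 1200, 2000),
  ('e4', 'page_ping', 1500, 2000);
INSERT INTO web_page VALUES ('e1', 'pv-1'), ('e2', 'pv-1'), ('e3', 'pv-1'), ('e4', 'pv-1');
""")

# Same aggregation as the lookup table: event count -> seconds, max scroll / max height.
row = conn.execute("""
SELECT wp.id               AS pageview_id,
       (COUNT(*) * 5) - 5  AS time_spent_seconds,
       CASE WHEN MAX(ev.pp_yoffset_max) = 0 THEN 0
            ELSE MAX(CAST(ev.pp_yoffset_max AS REAL)) / MAX(CAST(ev.doc_height AS REAL))
       END                 AS scroll_depth_percent
FROM events ev
JOIN web_page wp ON ev.event_id = wp.root_id
WHERE ev.event IN ('page_view', 'page_ping')
GROUP BY wp.id
""").fetchone()

print(row)  # ('pv-1', 15, 0.75)
```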
The output table has a time_spent_seconds column with a time spent calculation based on the page ping setting we use of
window.snowplow('enableActivityTracking', 5, 5);
With that setting a ping fires every 5 seconds after the pageview, so (COUNT(*) * 5) - 5 converts the number of event rows for a pageview (the pageview itself plus its pings) into seconds on page.
Next we get the maximum scroll depth and document height for the pageview, both in pixels, then calculate scroll depth as a percentage of the page.
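For clarity, the two per-pageview calculations can be written out as plain functions. This is an illustrative Python re-expression with hypothetical helper names; the model itself does this in SQL.

```python
def time_spent_seconds(event_count: int, ping_interval: int = 5) -> int:
    """(COUNT(*) * 5) - 5: the pageview row itself contributes no time,
    each subsequent page ping adds one interval."""
    return event_count * ping_interval - ping_interval

def scroll_depth_percent(max_yoffset: int, doc_height: int) -> float:
    """Deepest scroll position as a fraction of document height,
    guarding against pageviews with no recorded scrolling."""
    if max_yoffset == 0:
        return 0.0
    return max_yoffset / doc_height

# a pageview plus 3 pings, scrolled 1500px down a 2000px page
print(time_spent_seconds(4))             # 15
print(scroll_depth_percent(1500, 2000))  # 0.75
```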
That’s really it! Quite simple.
To build the report output shown above, you can then run a query with a three-way join between atomic.events, atomic.com_snowplowanalytics_snowplow_web_page_1 and our just-created lookup_pageview table:
SELECT
  ev.page_urlpath,
  SUM(CASE WHEN ev.event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS UBs,
  AVG(pv.time_spent_seconds) AS time_spent_average,
  SUM(pv.time_spent_seconds) AS time_spent_total,
  ROUND(AVG(pv.scroll_depth_percent) * 100) AS scroll_depth_percent_average
FROM atomic.events ev,
     atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
     lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp)
      > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'poplindata.com'
GROUP BY 1
ORDER BY 2 DESC;
Eventually you would write out new data tables so you never need to hit atomic.events directly; they would contain the columns you want from atomic.events plus the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings, because everything you need from them is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.
Below is a query to bucket page views by scroll depth quartile, handy for getting a broader view of user interactions:
SELECT
  ev.page_urlpath,
  SUM(CASE WHEN scroll_depth_percent = 0 THEN 1 ELSE 0 END) AS "0%",
  SUM(CASE WHEN scroll_depth_percent > 0 AND scroll_depth_percent <= 0.25 THEN 1 ELSE 0 END) AS "25%",
  SUM(CASE WHEN scroll_depth_percent > 0.25 AND scroll_depth_percent <= 0.5 THEN 1 ELSE 0 END) AS "50%",
  SUM(CASE WHEN scroll_depth_percent > 0.5 AND scroll_depth_percent <= 0.75 THEN 1 ELSE 0 END) AS "75%",
  SUM(CASE WHEN scroll_depth_percent > 0.75 THEN 1 ELSE 0 END) AS "100%"
FROM atomic.events ev,
     atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
     lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp)
      > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'poplindata.com'
  AND event = 'page_view'
GROUP BY 1
ORDER BY COUNT(*) DESC;
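The bucketing logic in those CASE expressions can be sketched as a small function. This is an illustrative Python equivalent with a hypothetical helper name and made-up sample depths, not part of the model itself.

```python
from collections import Counter

def scroll_bucket(p: float) -> str:
    """Map a scroll-depth fraction to the same quartile buckets
    as the SQL CASE expressions above."""
    if p == 0:
        return "0%"
    if p <= 0.25:
        return "25%"
    if p <= 0.5:
        return "50%"
    if p <= 0.75:
        return "75%"
    return "100%"

# count pageviews per bucket for one page (made-up sample depths)
depths = [0, 0.1, 0.3, 0.6, 0.8, 0.95]
print(Counter(scroll_bucket(p) for p in depths))
```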
Automation and moving to production
You probably want to create a schema specifically for data models instead of dropping this table into atomic. We use a models schema for this.
The script above recreates the table for all pageviews every time it runs. To automate this, you’ll want to trigger it via SQL Runner after each batch completes. We don’t have enormous traffic volumes, so recreating every time isn’t a big deal. If you have high traffic volumes, you might want to use a smaller window and keep old data around.
To do this you’ll need a look-back window, something like 24 hours, to account for pageviews that are still happening across the batch boundary, and only update the rows within that scope. This has the helpful side effect of doing less crunching on each batch, which is probably preferable; I wanted to keep the example simple here.
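The look-back approach boils down to deleting and re-aggregating only the pageviews with recent events. Here is a sketch of that delete-then-reinsert step using an invented mini-schema in SQLite; in production this would be a SQL Runner step against Redshift with real timestamps.

```python
import sqlite3

# Invented mini-schema: raw events plus the derived lookup table being refreshed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (pageview_id TEXT, collector_tstamp TEXT);
CREATE TABLE lookup_pageview (pageview_id TEXT, event_count INT);
INSERT INTO events VALUES ('old', '2021-01-01 00:00'),
                          ('new', '2021-01-02 12:00'),
                          ('new', '2021-01-02 12:01');
INSERT INTO lookup_pageview VALUES ('old', 1), ('new', 1);  -- 'new' row is stale
""")

cutoff = '2021-01-02 00:00'  # 24-hour look-back boundary

# 1. drop derived rows for any pageview that has events inside the window
conn.execute("""
DELETE FROM lookup_pageview WHERE pageview_id IN
  (SELECT DISTINCT pageview_id FROM events WHERE collector_tstamp >= ?)
""", (cutoff,))

# 2. re-aggregate all events for just those pageviews
conn.execute("""
INSERT INTO lookup_pageview
SELECT pageview_id, COUNT(*) FROM events
WHERE pageview_id IN (SELECT DISTINCT pageview_id FROM events WHERE collector_tstamp >= ?)
GROUP BY pageview_id
""", (cutoff,))

result = sorted(conn.execute("SELECT * FROM lookup_pageview").fetchall())
print(result)  # [('new', 2), ('old', 1)]
```

Untouched pageviews ('old') keep their existing rows, so each batch only crunches recent activity.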
In the next chapter of this series we’ll look at a session model. Sessions are interesting because they have a traffic source: how the user got to your site. The classifications of these sources are very company-specific, so you often end up with a massive CASE statement to put them in appropriate buckets, much like Google Analytics has its Channel Groupings.
Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.