
In the first three parts of this series, we looked at modelling out pageview events to include accurate time spent and scroll depth, then classifying sessions based on what we know about where the user came from. Now we’re going to look at what we know about users.
One of the best things about having raw behavioural event data in an SQL database is that you have the ability to update your knowledge about a users’ behaviour after the behaviour is recorded. The classic case for this is when a user logs in or otherwise identifies themselves: now know the identity behind the previous behaviour. If your application has some way of identifying users–such as forcing login or email click-throughs–you can merge user behaviour across multiple devices, such as a desktop browser and a mobile browser.
In our company’s case we don’t have a good way to identify individual users. At some point we’ll start a newsletter and drip feed nurturing process which will allow us to pass identifiers into the click-through URLs. Until then, most of the information we can get is based on Clearbit data which is created through a lookup of the IP address and takes a stab at guessing company behind the browser.
Bug in sessions post
A quick aside: while writing this component on users I discovered a bug in the sessions piece. The part of my big CASE
statement dealing with sessions that don’t have a referrer should be at the bottom of the big statement, otherwise the clause looking for utm
querystring components (mtk_*
columns) won’t ever execute in cases without a referrer. Using UTM strings is specifically designed to deal with lack of a referrer situations.
I’ve updated the post so if you want to look for it, look at the location of this piece of code:
WHEN first_refr_urlhost IS NULL
THEN 'Unknown'
Create a lookup table on domain_userid
For this installment we create a lookup_users
table keyed off the first-party cookie which is stored indomain_userid
.
Labelling internal traffic
The first component looks for any users who’ve behaved in a manner that might be considered “internal” traffic. For us that means they’ve sent in data with a host of localhost
or they’ve appeared from one of our internal IP addresses.
In larger companies you might also put a data collection pixel on your intranet homepage, internal systems and the like. Any browser that’s ever visited those locations is considered “internal”. You want to tackle labelling internal browsers from multiple angles to be sure you capture everything. It’s entirely likely that developers will use a browser that never sees the intranet or internal systems to develop and test, so you also want to look at IP ranges and any other opportunities too. If you use them, think about any automated tests running and how you can label those.
Clearbit company details
The second component looks for the latest Clearbit data available for the user. This isn’t going to be perfect. I looked up my own domain_userid
and saw a bunch of different places I’d been including client sites and hotels but it will give you some idea. More complex logic might score users with multiple Clearbit identifiers. Potentially you could just give a count of unique Clearbit companies for the user to drill down when there’s uncertainty.
DROP TABLE IF EXISTS lookup_users;
CREATE TABLE lookup_users
DISTKEY(1)
SORTKEY(1)
AS (
WITH internal_users AS (
SELECT DISTINCT
domain_userid,
TRUE user_internal
FROM atomic.events
WHERE page_urlhost IN ('127.0.0.1', 'localhost')
OR
user_ipaddress IN ('52.123.123.123', '150.123.123.123')
),
clearbit_data AS (
SELECT
DISTINCT
domain_userid,
FIRST_VALUE(domain_sessionid)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) first_session,
LAST_VALUE(domain_sessionid)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_session,
LAST_VALUE(domain IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_domain,
LAST_VALUE(name IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_name,
LAST_VALUE(legal_name IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_legal_name,
LAST_VALUE(description IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_description,
LAST_VALUE(type IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_type,
LAST_VALUE("linkedin.handle" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_linkedin_handle,
LAST_VALUE(logo IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_logo,
LAST_VALUE("metrics.alexa_global_rank" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_alexa_global_rank,
LAST_VALUE("metrics.annual_revenue" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_annual_revenue,
LAST_VALUE("metrics.employees" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_employees,
LAST_VALUE("metrics.market_cap" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_market_cap,
LAST_VALUE("crunchbase.handle" IGNORE NULLS)
OVER (
PARTITION BY domain_userid
ORDER BY collector_tstamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_crunchbase_handle
FROM atomic.events
LEFT JOIN atomic.com_clearbit_enrichment_company_1
ON event_id = root_id
AND collector_tstamp = root_tstamp
WHERE app_id = 'poplindata.com'
ORDER BY collector_tstamp
)
SELECT *
FROM clearbit_data
LEFT JOIN internal_users
USING (domain_userid)
)
Pulling it all together
Here’s a simple query that shows recent site visitors ranked by uniques and time spent on the site, traffic source categories and any Clearbit identification information we’ve been able to get for the IP address. You can see the output here with the company details blurred out to protect the innocent. This query pulls together data from all the lookup tables we’ve created in this series: pageviews, sessions and users.

SELECT ev.geo_country, ev.geo_city, NVL(us.last_domain, ev.user_ipaddress) AS last_domain, us.last_name, se.channel, SUM(pv.time_spent_seconds) AS time_spent_seconds, COUNT(*) AS pageviews, COUNT(DISTINCT ev.domain_sessionid) AS sessions, COUNT(DISTINCT ev.domain_userid) AS unique_browsers, ROUND(AVG(pv.scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
atomic.events ev, atomic.com_snowplowanalytics_snowplow_web_page_1 pg, lookup_pageview pv, lookup_sessions se, lookup_users us
WHERE
ev.app_id = 'poplindata.com'
AND ev.event = 'page_view'
AND CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
AND ev.event_id = pg.root_id
AND pg.id = pv.pageview_id
AND ev.domain_sessionid = se.domain_sessionid
AND ev.domain_userid = us.domain_userid
AND us.user_internal IS NOT TRUE
GROUP BY 1, 2, 3, 4, 5
ORDER BY 9 DESC, 6 DESC
Next steps
So we’ve now got techniques to build up a custom view of our audience behaviours using the three different scopes of interest. In the next post we’ll look at automating this so that it updates every time a new batch of data is added to Redshift, then some convenience views to make querying simple for everyday use and to expose to your end users.