Quantifying content velocity in Snowplow

Adam Greco is something of a legend in the Adobe Analytics space. I’ve been reading his blog posts and learning from him since I first started using Omniture back in 2007 or so. He literally wrote the book on Omniture and then Adobe SiteCatalyst. The reason his blog was so useful is that very few people were writing about advanced digital analytics at the time. Between Adam and Ben Gaines, I learnt much of what I know about the then-emerging discipline.

Adam has also recently come on board as a strategic advisor for Snowplow. That’s super exciting because now there’s someone with extensive experience helping the Snowplow team refine their offering.

Content velocity

A few weeks ago Adam posted a description of an interesting report he’d built using Adobe Analytics data. Quantifying Content Velocity is a great approach to understanding how quickly different types of content get in front of an audience.

I love this view because you can clearly see how quickly your content finds an audience. Some content will be taken up really quickly, spreading around the Internet like a virus: “news” type content, or something that manages to hit the Zeitgeist and become a viral hit. Other content takes time to build an audience and become an established resource for whatever it’s talking about: long-tail SEO content that gets more traffic over time.

Adam’s approach is a bit clunky in a few ways, so it got me thinking about how we’d build something similar using Snowplow data.

Adobe Analytics approach

There are a couple of problems I have with the Adobe approach that I’d like to improve upon.

Client-side clock

The first is a long-standing problem I’ve had with Adobe Analytics: the time a particular event is recorded isn’t available as an explicit dimension. It has been this way forever, in Adobe Analytics and in Omniture before it. Apparently these dimensions are available in Analysis Workspace, but not in the rest of the platform.

There are a bunch of ways you can use a timestamp dimension. You can build a time-of-day report to look at when users are visiting: hour of day, day of week and so on. The only way to do this in Adobe is an old hack (I’m not sure if it was Adam or Ben who originally came up with it) that pushes the day part you’re interested in into a dimension for later use. There are two problems with this: every day part you want consumes a custom dimension (a prop or an eVar), and it relies on clients having an accurate clock. I’m sorry, but clients don’t have accurate clocks in the real world!

I’ve seen this at scale with large companies, and client clocks can be miles out. A quick query of our own site shows a few offenders, but I know from querying much larger sites that this is very common: in real-world data I’ve seen clocks that are years wrong, in both directions.

SELECT DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) AS days_skew,
       COUNT(DISTINCT domain_userid) AS users
FROM atomic.events
WHERE app_id = 'poplindata.com'
  AND (DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) > 1
   OR DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) < -1)
GROUP BY 1
ORDER BY 1

You can see in this query that we’re taking advantage of some of the many timestamp fields Snowplow records, which also handles events that were queued on the device and sent later. See this post for more information about the timestamp fields.
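If you’re curious how that correction works, my understanding is that derived_tstamp is calculated during enrichment as the collector time minus the time the event spent queued on the device, so the device’s absolute clock cancels out. Here’s a rough illustration; the approx_derived_tstamp column is just my reconstruction, not a Snowplow field:

SELECT collector_tstamp,
       dvce_created_tstamp,
       dvce_sent_tstamp,
       -- reconstruct the enrichment logic: collector time minus on-device queue time
       DATEADD(ms, -DATEDIFF(ms, dvce_created_tstamp, dvce_sent_tstamp), collector_tstamp) AS approx_derived_tstamp,
       derived_tstamp
FROM atomic.events
WHERE app_id = 'poplindata.com'
LIMIT 10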

For the record, Google Analytics has a pretty good set of time-based dimensions that you can use for this without consuming custom dimensions.

Automation

The other problem with Adam’s approach is that it’s very manual. I’m sure that with a bit of work you could automate it and make it neater, but out of the box it involves a lot of hand-crafting. Snowplow has the advantage of letting you express the whole report in SQL and then push the result into whatever visualisation tool you want to use.

With a somewhat complex SQL query we can work out the publish date for a piece of content automatically by looking for the first time we see Snowplow data for it. To make this truly accurate, though, we need to filter out any hits that come in while the author is working on the piece prior to publishing. We can take advantage of the user_internal column I created in the blog post on data modelling users, as sketched below.
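In isolation, the publish-date part is just a MIN() over the events table with internal users excluded. Something like this sketch, assuming the derived.user table from the data modelling series exposes the user_internal flag:

SELECT page_urlpath,
       -- first time we ever saw this URL from a non-internal user
       CONVERT_TIMEZONE('Australia/Sydney', MIN(derived_tstamp)) AS published_date
FROM atomic.events
       LEFT JOIN derived.user USING (domain_userid)
WHERE app_id = 'poplindata.com'
  AND event = 'page_view'
  AND user_internal IS NOT TRUE
GROUP BY 1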

Snowplow approach

Time parting

Here’s an example of a simple time parting report showing the day of the week people consume our content. You can see we’re pretty popular on weekdays and less so on weekends.

SELECT DATE_PART(dayofweek, CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp)) AS day_of_week,
       COUNT(*) AS page_views
FROM atomic.events
WHERE app_id = 'poplindata.com'
  AND event = 'page_view'
GROUP BY 1
ORDER BY 1
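The same pattern works for any day part without consuming a custom dimension; for example, swapping the DATE_PART gives an hour-of-day cut:

SELECT DATE_PART(hour, CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp)) AS hour_of_day,
       COUNT(*) AS page_views
FROM atomic.events
WHERE app_id = 'poplindata.com'
  AND event = 'page_view'
GROUP BY 1
ORDER BY 1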

Content velocity

I was a bit daunted by the query I needed to write here, so I turned to the ever-reliable Jethro to recreate Adam’s report in SQL. Working together we came up with the query below. It contains two nested SELECT statements: the inner one finds the first time a particular URL is seen, which we treat as the publish date, and the outer one calculates how many days after that publish date each individual pageview occurred.

SELECT page_urlpath,
       velocity_bucket,
       page_views
FROM (SELECT page_urlpath,
             CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp)::date - published_date::date AS velocity_bucket,
             COUNT(*) AS page_views
      FROM atomic.events
             LEFT JOIN derived.user USING (domain_userid)
             LEFT JOIN (SELECT page_urlpath,
                               CONVERT_TIMEZONE('Australia/Sydney', MIN(derived_tstamp)) AS published_date
                        FROM atomic.events
                               LEFT JOIN derived.user USING (domain_userid)
                        WHERE user_internal IS NOT TRUE
                        GROUP BY 1) published USING (page_urlpath)
      WHERE app_id = 'poplindata.com'
        AND event = 'page_view'
        AND page_urlpath LIKE '/blog/data-modeling%'
        AND user_internal IS NOT TRUE
      GROUP BY 1, 2) velocity
GROUP BY page_urlpath, velocity_bucket, page_views
ORDER BY page_urlpath, velocity_bucket

From here I was able to create a nice line chart in Tableau by using velocity_bucket as the X axis.

Optimisation

This is a very expensive query: it scans every pageview row in the database to find the first instance of each URL. Rather than calculating this every time you run the report, you could merge it into the data model for each pageview. Using the first pageview to signify publishing is also fragile unless you very carefully exclude all pageviews that happen while the content is being written and edited. A better approach would be to derive the publish date from the CMS metadata, along with other metadata about each article.

You’d then make this available within the pageviews view I created in the final part of the data modelling blog series.
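Purely as a sketch, and with hypothetical names (a cms.articles table loaded from the CMS with url_path and published_at columns, and a derived.page_views model that keeps each pageview’s derived_tstamp and page_urlpath), that could look something like:

-- Hypothetical names throughout; adjust to whatever your CMS export and
-- pageviews model actually provide.
CREATE VIEW derived.page_views_with_publish AS
SELECT pv.*,
       a.published_at,
       -- how many days after publishing this pageview happened
       DATEDIFF('day', a.published_at, pv.derived_tstamp) AS velocity_bucket
FROM derived.page_views AS pv
       LEFT JOIN cms.articles AS a
                 ON a.url_path = pv.page_urlpath;

With the publish date baked into the model, the velocity report becomes a simple GROUP BY over this view rather than a full scan of atomic.events.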
