- Browser classifications in atomic.events are deprecated (there’s now a better way)
- Use the UA Parser enrichment to classify browsers
- Use the IAB Bots & Spiders enrichment to find bad bots
- You’ll still need to look at your own data carefully to remove bots that aren’t detected
Finding the robots
The Internet has a lot of robots crawling around. Some are well behaved: they follow the robots.txt rules for whether or not to crawl your site and identify themselves as robots in the useragent header sent with their requests.
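For example, a well-behaved crawler fetches robots.txt before crawling and honours rules like these (a hypothetical example — the paths and bot name are placeholders):

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Crawl-delay: 10

# Block one badly behaved (but honest) bot entirely
User-agent: ExampleBot
Disallow: /
```

Crawlers that ignore these rules, or lie about who they are, are the ones you need other mechanisms to catch.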
Others are not so well behaved.
Most site owners want a reliable way to exclude robot traffic from analytics data.
atomic.events browser data
The atomic.events table has a number of browser classification fields that are populated by the User Agent Utils enrichment. That enrichment is now deprecated and unmaintained, so it’s out of date for the latest browsers. You should stop using the fields in atomic.events to classify browsers.
The replacement, the UA Parser enrichment, uses the open source ua-parser library, which is actively maintained. It works a little differently from the old model: the results land in a separate table that you join to the core events table, like any other custom data model.
Using this enrichment, you might want to exclude events with a device_family value of “Spider”, and perhaps look deeper and also exclude useragent_family values of “HeadlessChrome” and “PhantomJS”.
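As a sketch, the exclusion logic above could look like this in Python. The field names match the UA Parser context columns; the value lists are assumptions you would tune against your own data:

```python
# Values that indicate self-identified bot traffic in UA Parser output.
# These sets are starting points, not an exhaustive list.
BOT_DEVICE_FAMILIES = {"Spider"}
BOT_USERAGENT_FAMILIES = {"HeadlessChrome", "PhantomJS"}

def is_bot(device_family: str, useragent_family: str) -> bool:
    """Return True if the UA Parser fields look like a self-identified bot."""
    return (device_family in BOT_DEVICE_FAMILIES
            or useragent_family in BOT_USERAGENT_FAMILIES)
```

You could apply the same conditions directly in SQL as a WHERE clause over the UA Parser context table.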
Use the IAB Bots & Spiders list
Of course, not all bots and spiders are well behaved enough to announce their presence. Snowplow has another enrichment for this purpose: the IAB/ABC International Bots & Spiders list client. This enrichment uses the list curated by the IAB and UK ABC organisations, which is used extensively in the ad tech space. As well as user-agent strings, the list also covers known bot IP address ranges.
Subscribing to the list on your own can be quite expensive, but it is available at a discount to our Snowplow Managed Service customers. Just create a ticket for it and we’ll get it set up for you.
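The IP-range side of that check can be sketched with Python’s standard ipaddress module. The CIDR blocks below are illustrative placeholders, not real IAB data:

```python
import ipaddress

# Placeholder ranges standing in for the IAB list's IP data.
# 203.0.113.0/24 is a reserved documentation range, used here as a dummy.
KNOWN_BOT_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def ip_is_known_bot(ip: str) -> bool:
    """Return True if the IP falls inside any known bot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in KNOWN_BOT_RANGES)
```

The real enrichment does this lookup (plus the user-agent match) at enrichment time and writes the result into the spiders_and_robots context for each event.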
Querying the data
Bringing these two data sources together we can create a query that shows the different classifications of the browser and user. Below is an example query you can use to amend your data model.
SELECT
  derived_tstamp,
  page_urlpath,
  useragent,
  category,
  primary_impact,
  reason,
  spider_or_robot,
  useragent_family,
  device_family
FROM atomic.events
LEFT JOIN atomic.com_iab_snowplow_spiders_and_robots_1
  ON atomic.events.event_id = atomic.com_iab_snowplow_spiders_and_robots_1.root_id
  AND atomic.events.collector_tstamp = atomic.com_iab_snowplow_spiders_and_robots_1.root_tstamp
LEFT JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1
  ON atomic.events.event_id = atomic.com_snowplowanalytics_snowplow_ua_parser_context_1.root_id
  AND atomic.events.collector_tstamp = atomic.com_snowplowanalytics_snowplow_ua_parser_context_1.root_tstamp
WHERE app_id = 'poplindata.com'
  AND derived_tstamp > '2018-10-01'
  AND spider_or_robot = true
ORDER BY 1
You’ll notice there’s no dvce_type field in the UA Parser table. This is because the ua-parser project deems only information inside the User-Agent header to be relevant. To break events out into device classes such as desktop, mobile, tablet, TV and games console, you’d need to create your own lookup table using something like the device_family column. Unfortunately, that column isn’t enormously consistent, so you’d need to keep monitoring which values pop up there to be sure you don’t miss anything. Tablet detection might also need a screen size dimension thrown in.
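Such a lookup table could start as simply as the sketch below. The mapping entries are illustrative assumptions; in practice you would keep extending it as new device_family values show up in your data:

```python
# Illustrative mapping from ua-parser device_family values to broad
# device types. "Other" is what ua-parser reports for most desktops.
DEVICE_TYPE_LOOKUP = {
    "iPhone": "mobile",
    "iPad": "tablet",
    "Spider": "bot",
    "Other": "desktop",
}

def device_type(device_family: str) -> str:
    """Classify a device_family value, defaulting to 'unknown'."""
    return DEVICE_TYPE_LOOKUP.get(device_family, "unknown")
```

The "unknown" default is the monitoring hook: anything landing there is a device_family value you haven’t classified yet.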
Bots that aren’t captured
There are plenty of bots that won’t be identified by these mechanisms. Bad actors, particularly in ad fraud, have strong financial incentives to evade blocking mechanisms and actively work around the IAB list. And some perfectly legitimate bots and spiders operate in narrow niches or regions and might not have made it onto these lists, which are built primarily to serve online advertising.
You’ll need to dig through your own data and explore to find bot-like patterns of behaviour and make your own classifications to catch these.
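One home-grown heuristic of this kind is flagging “users” whose event rate is implausibly high for a human. This is a minimal sketch; the threshold and the per-minute bucketing are assumptions you would tune against your own data:

```python
from collections import Counter

# Assumption: no human fires more than this many events in one minute.
MAX_EVENTS_PER_MINUTE = 60

def suspicious_users(events):
    """events: iterable of (user_id, minute_bucket) tuples.

    Returns the set of user_ids that exceed the per-minute threshold
    in any single minute bucket.
    """
    per_minute = Counter(events)  # counts events per (user, minute)
    return {user for (user, _), n in per_minute.items()
            if n > MAX_EVENTS_PER_MINUTE}
```

Other signals worth exploring include zero-duration sessions, perfectly regular inter-event timing, and crawling page paths in alphabetical order.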
Device categorisation in Snowplow (and other platforms) isn’t perfect, but it has improved. Snowplow are working on new enrichments all the time and have plans to use some commercially-available classification systems to get even better at classification.
If there’s anything else you’d like us to help with, let us know.