Better bot detection in analytics data using reCAPTCHA

Snowplow already provides a few tools for identifying events from non-human devices (“crawlers”, “bots”, etc.). However, most of these rely on metadata collected with events, like the User Agent string and IP address (e.g., the IAB Bots and Spiders enrichment), an approach that is also used to flag bot traffic in Google Analytics.

As User Agents are progressively frozen to include less descriptive information, and as bots become more sophisticated, these methods are becoming less effective over time.

Breaking existing CAPTCHAs is now a well-understood and inexpensive task, and more exotic methods, like generating convincing mouse trajectories to fool newer CAPTCHAs, are starting to appear in the wild.

This new plugin for v3 of the Snowplow JavaScript Trackers adds a new level of non-human traffic detection for your Snowplow data, leveraging the risk-analysis powers of Google’s reCAPTCHA project. The plugin uses the “passive” or “invisible” CAPTCHA techniques to score web users on their probability of being human on each page view. These scores can then be used to segment your event data for cleaner, more accurate datasets or to identify bad actors and potential fraud.

Unlike more traditional CAPTCHAs, which may require the user to interactively identify text, traffic lights, boats or decode The Da Vinci Code, the passive CAPTCHA uses background signals from the user’s browser, including typing speed and mouse movements, to determine just how human a user is. As reCAPTCHA has a wide deployment (several million sites), Google combines these signals with machine learning to return a score from 0.0 to 1.0 that can now be attached to your Snowplow events.
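To make the segmentation idea concrete, here is a minimal sketch of splitting events by score. It assumes the reCAPTCHA v3 convention that scores near 1.0 indicate a likely human and scores near 0.0 a likely bot. The ScoredEvent shape, the segment names and the 0.5 threshold are all assumptions for illustration, not part of the plugin; pick a threshold that suits your own traffic.

```typescript
// Hypothetical shape of an event after a score has been attached.
interface ScoredEvent {
  eventId: string;
  score: number | null; // null when no assessment is available
}

type Segment = "likely_human" | "likely_bot" | "unscored";

// Assumed starting threshold; tune it against your own data.
const DEFAULT_THRESHOLD = 0.5;

// Classify a single score. Anything outside 0.0-1.0 is treated as unscored.
function classify(
  score: number | null,
  threshold: number = DEFAULT_THRESHOLD
): Segment {
  if (score === null || Number.isNaN(score) || score < 0 || score > 1) {
    return "unscored";
  }
  return score >= threshold ? "likely_human" : "likely_bot";
}

// Group events into the three segments, preserving input order.
function segmentEvents(
  events: ScoredEvent[],
  threshold: number = DEFAULT_THRESHOLD
): Record<Segment, ScoredEvent[]> {
  const out: Record<Segment, ScoredEvent[]> = {
    likely_human: [],
    likely_bot: [],
    unscored: [],
  };
  for (const e of events) {
    out[classify(e.score, threshold)].push(e);
  }
  return out;
}
```

Keeping a separate "unscored" segment avoids silently treating events without an assessment as either human or bot.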

From a technical perspective, the plugin calls reCAPTCHA v3 in the background to request a short-lived (2 minute) token, which is then attached as an entity to a Snowplow event. The token can be attached to all events, or only to a subset of events if required.
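The client-side step can be sketched roughly as follows. This is not the plugin's actual code: the schema URI, the entity fields and the function names are hypothetical, and the token source is injected as a function so the sketch does not depend on a browser.

```typescript
// Hypothetical Iglu schema URI, for illustration only; the real schema
// is in the plugin's repository.
const TOKEN_SCHEMA = "iglu:com.example/recaptcha_token/jsonschema/1-0-0";

// reCAPTCHA v3 tokens are short-lived (about two minutes).
const TOKEN_TTL_MS = 2 * 60 * 1000;

interface TokenEntity {
  schema: string;
  data: { token: string; issuedAt: number };
}

// Wrap a token in a self-describing entity that can travel with an event.
function buildTokenEntity(token: string, issuedAtMs: number): TokenEntity {
  return { schema: TOKEN_SCHEMA, data: { token, issuedAt: issuedAtMs } };
}

// True while the token is still inside its validity window.
function isTokenFresh(entity: TokenEntity, nowMs: number): boolean {
  const age = nowMs - entity.data.issuedAt;
  return age >= 0 && age < TOKEN_TTL_MS;
}

// Request a token in the background and wrap it as an entity.
// `getToken` is injected so callers decide how the token is obtained.
async function requestTokenEntity(
  getToken: () => Promise<string>,
  now: () => number = Date.now
): Promise<TokenEntity> {
  const token = await getToken();
  return buildTokenEntity(token, now());
}
```

In a browser with the reCAPTCHA v3 script loaded, getToken could wrap grecaptcha.execute(siteKey, { action: "page_view" }), which returns a promise of the token. Because the token expires quickly, a check like isTokenFresh helps decide whether to reuse a token or request a new one before attaching it to later events.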

Once the event reaches the enrichment process, the Snowplow JavaScript enrichment calls the Google reCAPTCHA API and exchanges the token for a score (along with additional information, including reasons), which is returned and attached to the event as an assessment.
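The exchange itself can be sketched as below, using the standard (non-Enterprise) reCAPTCHA siteverify endpoint, which takes secret and response form parameters and returns JSON with fields such as success, score, action and error-codes. This is an illustration of the logic only, not the enrichment's code: the Assessment shape and function names are assumptions, the HTTP call is left to the caller, and the Enterprise API (which returns reasons) has a different request and response format that is not covered here.

```typescript
const SITEVERIFY_URL = "https://www.google.com/recaptcha/api/siteverify";

// Fields returned by the standard siteverify endpoint for v3 tokens.
interface SiteVerifyResponse {
  success: boolean;
  score?: number;
  action?: string;
  challenge_ts?: string;
  hostname?: string;
  "error-codes"?: string[];
}

// Hypothetical assessment shape to attach to the enriched event.
interface Assessment {
  valid: boolean;
  score: number | null;
  action: string | null;
  errors: string[];
}

// Build the form-encoded POST body for the verification request.
function buildVerifyBody(secret: string, token: string): string {
  return (
    "secret=" + encodeURIComponent(secret) +
    "&response=" + encodeURIComponent(token)
  );
}

// Turn the API response into an assessment. A failed verification
// (for example an expired or reused token) yields a null score.
function toAssessment(res: SiteVerifyResponse): Assessment {
  const score =
    res.success && typeof res.score === "number" ? res.score : null;
  return {
    valid: score !== null,
    score,
    action: res.action ?? null,
    errors: res["error-codes"] ?? [],
  };
}
```

The body from buildVerifyBody would be sent as a POST to SITEVERIFY_URL with a content type of application/x-www-form-urlencoded, and the parsed JSON response passed to toAssessment. Since tokens are single-use and expire after two minutes, events that are enriched late will produce an assessment with no score.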

This plugin supports both the free and Enterprise versions of reCAPTCHA v3, with the Enterprise version returning more granular scores as well as the reason a score was assigned.

As with many methods, this detection is probabilistic, so false positives and false negatives are always possible. We recommend combining this technique with additional detection methods such as intrusion detection systems and UI honeypots.

You can find the Iglu schemas and the code for this enrichment in our GitHub repository. As always, if you have any questions or comments, please feel free to reach out to us on our contact page.

Published by Mike Robins

CTO at Poplin Data
