Read Our Blog


The 2021 database showdown: BigQuery vs Redshift vs Snowflake

Last updated: 16th June 2021. For updates, changes, and errata, please see the Changelog at the end of this post. What is this guide? - Designed as a fair feature comparison between the different products - An up-to-date...

Setting up AWS Athena datasource in JetBrains DataGrip

Download the JDBC driver from AWS and place it in the DataGrip JDBC driver directory. On Linux this was ~/.DataGrip2018.1/config/jdbc-drivers/. Go to File > Data Sources to open the Data Sources panel and click ‘+ > Driver’. Name it AWS Athena. Here’s the confusing bit: skip...
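If you'd rather script the copy step, here's a minimal Python sketch — the jar filename and both paths are assumptions that will differ with your DataGrip version and the driver release:

```python
from pathlib import Path
import shutil

# Both paths are assumptions: adjust the jar name to the driver you
# downloaded from AWS, and the directory to your DataGrip version.
jar = Path.home() / "Downloads" / "AthenaJDBC42.jar"
driver_dir = Path.home() / ".DataGrip2018.1" / "config" / "jdbc-drivers"

driver_dir.mkdir(parents=True, exist_ok=True)
shutil.copy2(jar, driver_dir / jar.name)
print(f"Copied {jar.name} -> {driver_dir}")
```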

JetBrains DataGrip + Google BigQuery

Setting up a BigQuery datasource in JetBrains DataGrip

DataGrip is one of the most valuable tools our engineers use for exploring and querying a myriad of different database technologies. DataGrip doesn’t yet come bundled with a BigQuery driver, so in this post we’ll explore how to set up a...

The Digital Analytics Hierarchy of Needs

A few weeks ago I discovered Monica Rogati’s fantastic Data Science Hierarchy of Needs. It’s a data science-centric riff on Maslow’s Hierarchy of Needs, a classic concept in psychology. I’ve found myself using Rogati’s diagram and the concept in conversations with colleagues,...

A quick script to ZSTD all your shredded tables

Mike’s recent post about compressing Snowplow tables works great for atomic.events, with clients seeing compression down to 30% of the original size or so. But what about all your shredded tables? For now you have to manually convert the output from igluctl while we wait for our pull...
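For a flavour of what such a script can look like, here's a minimal Python sketch that rewrites the column encodings in igluctl's DDL output to ZSTD — the sql/ directory layout and the ENCODE tokens are assumptions, so review the patched files before loading them into Redshift:

```python
import re
from pathlib import Path

# Assumption: igluctl wrote its CREATE TABLE files under ./sql.
for ddl in Path("sql").rglob("*.sql"):
    text = ddl.read_text()
    # Swap whatever encoding was emitted (RAW, LZO, ...) for ZSTD.
    patched = re.sub(r"ENCODE\s+\w+", "ENCODE ZSTD", text, flags=re.IGNORECASE)
    ddl.write_text(patched)
    print(f"patched {ddl}")
```

Note that you may want to leave sort-key columns uncompressed (ENCODE RAW), which a blanket substitution like this won't do for you.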

Make big data small again with Redshift ZSTD compression

A new compression option in Redshift lets you make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to set it up. In late 2016 Facebook...
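To see what ZSTD would buy you on an existing table before converting anything, Redshift's built-in ANALYZE COMPRESSION command gives per-column estimates. A minimal sketch, assuming psycopg2 and placeholder connection details:

```python
import psycopg2

# Placeholder connection details; Redshift speaks the Postgres protocol.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="snowplow", user="admin", password="...",
)
conn.autocommit = True  # ANALYZE COMPRESSION can't run inside a transaction

with conn.cursor() as cur:
    cur.execute("ANALYZE COMPRESSION atomic.events;")
    # Each row: table, column, suggested encoding, estimated reduction (%).
    for table, column, encoding, est_reduction in cur.fetchall():
        print(f"{column}: {encoding} (~{est_reduction}% smaller)")
```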

Decoding Snowplow real-time bad rows (Thrift)

In this tutorial we’ll look at decoding the bad rows data that comes out of the Snowplow real-time pipeline. In the real-time pipeline, bad rows that are inserted into Elasticsearch (and S3) are stored as base64-encoded, binary-serialized Thrift records....
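To give a feel for the decoding step, here's a minimal Python sketch. It assumes the Apache Thrift library plus classes generated from Snowplow's collector payload schema with `thrift --gen py`; the SnowplowRawEvent class name and module path are assumptions:

```python
import base64

from thrift.protocol.TBinaryProtocol import TBinaryProtocol
from thrift.transport.TTransport import TMemoryBuffer

# Assumption: generated with `thrift --gen py` from the Snowplow schema;
# your module path and class name may differ.
from snowplow_raw_event.ttypes import SnowplowRawEvent

def decode_bad_row(line: str) -> SnowplowRawEvent:
    raw = base64.b64decode(line)                    # undo the base64 layer
    protocol = TBinaryProtocol(TMemoryBuffer(raw))  # records are binary Thrift
    event = SnowplowRawEvent()
    event.read(protocol)                            # deserialize into the object
    return event
```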

Monitoring Snowplow bad rows using Lambda and Cloudwatch

In this tutorial we’ll use AWS Lambda and Amazon CloudWatch to set up monitoring for the number of bad rows that are inserted into Elasticsearch over a period of time. This allows us to set an alert for the threshold...
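For a sense of what the Lambda function can look like, here's a minimal Python sketch — the Elasticsearch endpoint, index, field, and metric names are all placeholders, and requests would need to be bundled into the deployment package:

```python
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

# Placeholders: swap in your own Elasticsearch endpoint, index, and field.
ES_COUNT_URL = "https://search-example.us-east-1.es.amazonaws.com/bad-rows/_count"

def handler(event, context):
    # Count bad rows ingested in the last five minutes.
    resp = requests.get(
        ES_COUNT_URL,
        json={"query": {"range": {"failure_tstamp": {"gte": "now-5m"}}}},
        timeout=10,
    )
    count = resp.json()["count"]

    # Publish the count as a custom metric; a CloudWatch alarm on it
    # gives you the threshold alerting described above.
    cloudwatch.put_metric_data(
        Namespace="Snowplow",
        MetricData=[{"MetricName": "BadRowCount", "Value": count, "Unit": "Count"}],
    )
    return {"bad_rows": count}
```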