
The 2020 database showdown: BigQuery vs Redshift vs Snowflake

Last updated: 23rd September 2020

For updates, changes and errata please see the Changelog at the end of this post.

What is this guide?

– Designed as a fair feature comparison between the different products

– An up-to-date guide (hopefully with regular updates as new features are released or changed)

– I’ve attempted to make the information as accurate as possible, but some details may be condensed for simplicity. Where applicable I’ve noted if a feature or functionality is in alpha / beta / in-preview. I can’t mention any features that are currently under NDA but will update as soon as they become available.

What is it not?

– A recommendation to pick a specific product or a comprehensive evaluation of all features. All 3 products have unique feature sets, but we don’t have enough space to cover everything (unfortunately).

– This doesn’t cover every possible database you can (or could) use for analytics. That’s a mammoth effort but is something I’d love to do at some point.

– A performance benchmark. Each provider publishes their own TPC-H / TPC-DS benchmarks – have a look for these. Fivetran has also posted a performance comparison (with some caveats that they note). Spoiler alert: the performance differences are minimal.

– A database battle royale

 

If you feel strongly that I’ve missed or misrepresented something, have made a grievous error or would just like to have a chat – please get in touch.

Let’s get started!

Compute layer

BigQuery: Runs on distributed compute orchestrated by Borg, a cybernetic life-form that Google has imprisoned inside conveniently located data centers in various regions.

Redshift: Proprietary fork of ParAccel (which was partly forked from Postgres) running on AWS virtual machines. Don’t let the proximity to Postgres fool you; it’s more of a distant second cousin.

Snowflake: Proprietary compute engine with intelligent predicate pushdown + smart caching running on commodity virtual machines (AWS, GCP or Azure) depending on your cloud choice.

 

Storage layer

All 3 databases have implementations of hot / warm / cold storage.

BigQuery: Proprietary, stored on the Colossus filesystem using ColumnIO as a storage format. Complete separation of distributed compute and storage.

Redshift: Proprietary, generally SSD (dc1, dc2) / HDD (ds1, ds2) or mixed, including S3-based storage (for RA3), using a proprietary columnar format. RA3 separates compute and storage, whilst all other node types colocate your compute and storage.

Snowflake: Proprietary columnar format, in-memory / SSD / object store running on compute / object storage in your cloud of choice.

Compression

BigQuery: proprietary compression that is opaque to the user, handled by the ColumnIO columnar format (with Colossus underneath). BigQuery recompresses data for you on an ongoing basis, but your queries are still billed as if you are scanning uncompressed bytes.

Redshift: Redshift achieves transparent compression by implementing open algorithms (e.g., LZO, ZStandard). It has recently released its own proprietary compression algorithm (AZ64), but your choice of data types there is a little more limited for the moment. You can select which columns are compressed and how. ANALYZE COMPRESSION will get you some of the way with sane recommendations, but we’ve found it’s best to select compression according to your query patterns and the data you intend to store in a column.
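
As a rough sketch of what this looks like in practice (table and column names below are hypothetical), you declare encodings per column and can ask Redshift for its own recommendations:

```sql
-- Hypothetical table: choose an encoding per column at creation time.
CREATE TABLE events (
    event_id   BIGINT        ENCODE az64,   -- AZ64 suits numeric / date / timestamp types
    event_type VARCHAR(64)   ENCODE zstd,   -- ZStandard works well for free-text columns
    event_ts   TIMESTAMP     ENCODE az64,
    payload    VARCHAR(4096) ENCODE zstd
);

-- Ask Redshift to sample the table and suggest encodings per column.
ANALYZE COMPRESSION events;
```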

Snowflake: Snowflake also provides its own compression layer that is opaque to the user. Unlike BQ you aren’t billed for bytes scanned, and the query planner can take advantage of the compression and table statistics to scan less data (and reduce compute cost). There’s little you can do to tune this externally – but that’s a pro, not a con.

 

Deployment

All of these databases follow cloud-only deployments. If you have to run on-premise you’re going to be a little stuck for the moment unless you plan to migrate that data into the cloud.

BigQuery: Cloud only – within Google Cloud Platform (Anthos *will* now help you here – see BigQuery Omni). Each dataset can be stored in a multi-region location (US or EU) or in a specific region picked from a list.

Redshift: Cloud only – within Amazon Web Services. You can pick pretty much any AWS region (around 25 at the time of writing) at a cluster level.

Snowflake: Cloud only – within Amazon Web Services / Google Cloud Platform / Azure depending on your preferred flavour of cloud at an account level. Note: Snowflake supports a subset of the regions of each of the public clouds, not all regions. You can check what regions are currently supported on their pricing page.

 

Pricing

I’ve included the factors that should typically be considered when evaluating pricing for each database below. Most of the products have a fixed cost component, a variable cost component, or both, depending on which option or plan you choose.

BigQuery: storage (~2 cents per GB / month for warm data, 1 cent per GB / month for colder data), bytes scanned (on demand), slots (fixed / flex) and streaming inserts. The on-demand pricing model is based on bytes scanned, whilst purchasing fixed slots (reservations) has a defined cost per 500 / 1000 slots. Flex slots introduce shorter-term slots that can be purchased as a sort of burst compute capacity. Costs vary by region. It’s very difficult to know in advance what on-demand will cost; however, you can dry-run queries (no cost incurred, but returns bytes scanned) to estimate what certain queries or workloads will cost.

Redshift: node type (ds2 / dc2 / RA3, avoid d*1 node types), number of nodes, reservations (if you purchased / plan on purchasing any). The use of certain features (Redshift Spectrum, concurrency scaling) may incur additional costs. Node cost will vary by region.

Snowflake: the Snowflake plan (which corresponds to certain product features), the number of warehouses and the size of those warehouses. From these you get a cost per credit (an abstracted unit of compute). The cost per credit will vary based on plan, cloud provider and geographic location of the warehouse(s). It is also possible to pre-purchase capacity upfront if you’re after more predictable long-term pricing. Certain optional features (e.g., Snowpipe) are priced separately. Cost will vary by cloud and region.

 

Enterprise pricing

For all three providers it is possible to get pricing that is lower than the publicly advertised list prices. You aren’t really able to ask for it; enterprise pricing is typically bestowed upon you if the vendor deems you worthy of discounts / credits. Your mileage will vary depending on the size of your OPEX budget and the length of the vendor lock-in commitment.

 

Scalability

Yes*, with an emphasis on the *.

BigQuery: You are only charged for the queries you run (or for whoever has set up a 15-minute refreshing dashboard that uses SELECT *). There’s also the fixed / flex slot model mentioned in the section above, which gives you more of a provisioned compute capability rather than a purely on-demand one. You can expect most of the queries you run to return within 30 seconds, if not faster. You can set cost / quota levels in BQ, but this feature suffers from a lack of granularity (e.g., only being able to set quotas globally at a user / service account / project level).

 

Redshift: The recently introduced RA3 node type allows you to more easily decouple compute from storage workloads, but most customers are still on ds2 (dense storage) / dc2 (dense compute) node types. These node types offer both elastic resize and classic resize. Elastic resizes generally complete quite quickly – around the time it takes light to travel from Mars to Earth (3 minutes) – with the caveat that you can only scale up or down by a factor of 2. Classic resizes, which are required if you want to resize outside of that scaling factor, have a much higher variance – hours to days. Redshift introduced pause / resume semantics in March 2020, letting you ‘shut down’ a cluster for a period of time and reduce cost. This can be done manually through the API or console, or scheduled at certain times. At the moment there is no smart sleep / smart resume functionality based on workloads. Pause / resume typically completes in the amount of time it takes to make a cup of tea (4-10 minutes, from our testing), depending on your node type and snapshot size. Most node types will scale up to 100 nodes – anything beyond that is asking for trouble. Most queries will complete in anywhere from milliseconds to hours, and this is getting better with features like workload management, concurrency scaling, short query acceleration (SQA) and the advanced query accelerator (on RA3 nodes only). There’s a whole blog post in this alone.

 

Snowflake: Pause / resume semantics (both manual and automated based on workload). Pause / resume typically completes in less than 60 seconds and can be tuned to wake or sleep based on workload activity. Queries typically complete within seconds to < 5 minutes. It’s fast to change warehouse sizes (e.g., small (S) => medium (M)), which roughly correlate to the amount of vCPU / memory you get, and to adjust cluster size manually if required. Scaling policies can adjust the number of clusters automagically according to running workloads in either standard or economy mode. It’s a particularly elegant implementation of a scaling mechanism.
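
A minimal sketch of what this looks like in DDL (the warehouse name and thresholds are made up, and multi-cluster settings assume a plan that supports them):

```sql
-- Hypothetical warehouse: suspends itself after 60 seconds idle, resumes on demand,
-- and scales between 1 and 3 clusters under the standard scaling policy.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE    = 'SMALL'
  AUTO_SUSPEND      = 60
  AUTO_RESUME       = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';

-- Resizing later is a one-liner.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```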

 

Note – regardless of the scaling of these databases you’ll still want to run sensible queries. Certain query patterns that push compute to a single node / leader risk failing regardless of scale with the dreaded resources exceeded error.

 

Data access

Data warehouses aren’t much use unless you can access the data easily. Typically databases have relied on ODBC / JDBC since the dawn of time but increasingly we’re starting to see the proliferation of access through first class user interfaces as well as additional APIs that are often suited to newer forms of data retrieval and manipulation (e.g., DAG runners, serverless functions).

BigQuery:

  • ODBC / JDBC access via Simba drivers
  • BigQuery user interface
  • BigQuery connections API (for federated sources)
  • BigQuery Jobs API (asynchronous)
  • BigQuery Storage API
  • bq CLI (command line tool)

 

Snowflake:

  • ODBC / JDBC access via drivers
  • Access via Spark plugin (spark-snowflake)
  • Access via Kafka (both Confluent and open source)
  • Python / Node.js / Go / .NET drivers for specific languages
  • SnowSQL (command line tool)
  • Snowsight (some features are in-preview)

 

Redshift:

  • ODBC / JDBC via AWS provided drivers
  • Redshift user interface in the AWS console for some node types (ra3.*, ds2.8xlarge, dc2.large, dc2.8xlarge, dc1.8xlarge only)
  • Asynchronously via the new (September 2020) Redshift Data API
  • Access via the AWS CLI (aws redshift-data – uses the Data API under the hood)
  • psql (Postgres command line tool)

Encryption

All products offer some method of encryption at rest and in transit (depending on ingestion method and source).

 

BigQuery: encrypted using Google-managed keys via the Cloud Key Management Service (KMS), or customer-managed encryption keys (CMEK, also via KMS).

Snowflake: encryption is end to end by default (in transit, at rest). Virtual Private Snowflake (VPS) offers the ability to store an encryption key in memory on dedicated servers. Certain features (e.g., periodic rekeying, customer managed keys) are only available on higher tier plans.

Redshift: encrypted at rest using the AWS Key Management Service. This can use either an Amazon managed key or an AWS customer managed key (CMK).

 

Support for third party tools (visualisation, data modelling)

Most good third party tools have support for all databases. The quality of support may vary but for the most part you can expect your visualisation / data modelling toolkit to work here given all three players have been in the market for a few years.

AWS and Google Cloud Platform both have their own general access visualisation / data modelling software (QuickSight, Looker / Data Studio respectively).

Snowflake, in their tradition of putting ‘Snow’ in front of every feature – has Snowsight (in-preview, June 2020) which can perform some basic data visualisations. I suspect some of the functionality we’ll see here will be heavily influenced by their acquisition of Numeracy last year (March 2019). This is a welcome addition and follows a continuing trend of bringing query editors and visualisation into the same cohesive interface.

 

Query language

You can’t have a great database without a great query language (this is mostly true).

BigQuery: BigQuery offers two main dialects, Legacy (aka BigQuery) SQL and Standard SQL. GCP would prefer you forget Legacy SQL exists entirely – but there are still some useful functions there that don’t yet have an equivalent in Standard SQL. Standard SQL is ANSI 2011 compliant and has a complement of additional features to support the BQ nested data model, as well as some neat machine learning (BQ ML) functionality that is unique.

Redshift: This SQL syntax is also ANSI compliant and will feel familiar if you have experience with Postgres. Beyond some basic JSON parsing functions, Redshift struggles with nested data due to a lack of native types. Redshift Spectrum tries to address some of these issues but offers little consolation for data stored natively on disk. Query compilation has gotten faster recently (August 2020) thanks to offloading compilation to a serverless compilation farm and an increased (unlimited) cache for compiled objects.

Snowflake: Also ANSI compliant, the Snowflake query language is pretty simple to work with. The language has clearly been designed from the ground up, and working with nested data is a breeze for the most part. You won’t generally find any weird implementation quirks here – but the notion of warehouses, stages and pipes introduces some non-traditional (but not unwelcome) concepts to the database lexicon.
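
To give a flavour of the nested data model, here’s a hedged sketch in BigQuery Standard SQL (the orders table and its columns are hypothetical):

```sql
-- Flatten an ARRAY of STRUCTs with UNNEST; Snowflake's LATERAL FLATTEN is the rough equivalent.
SELECT
  o.order_id,
  item.sku,
  item.quantity
FROM `my_project.shop.orders` AS o,
     UNNEST(o.line_items) AS item
WHERE o.order_date >= '2020-01-01';
```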

 

BigQuery and Redshift (additional geospatial functions added in September 2020) both have good support for querying geospatial data. Snowflake currently has in-preview support for geospatial data (June 2020). All implementations use native types (GEOGRAPHY in the case of BigQuery and Snowflake, GEOMETRY in the case of Redshift), which allows you to either use the native type or convert between text representations (such as GeoJSON).
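
A quick example of the native GEOGRAPHY type in BigQuery (the coordinates are just illustrative):

```sql
-- Distance in metres between two points; ST_GEOGPOINT takes (longitude, latitude).
SELECT ST_DISTANCE(
  ST_GEOGPOINT(151.2093, -33.8688),   -- Sydney
  ST_GEOGPOINT(144.9631, -37.8136)    -- Melbourne
) AS distance_metres;
```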

The analytics SQL functions are generally pretty similar between databases, but you will find irritating edge cases that require creative workarounds – for example, many BigQuery window functions (e.g., LAG, LEAD) don’t support IGNORE NULLS as part of the window (why?).
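
One common workaround, sketched below against a hypothetical ticks table, is to lean on LAST_VALUE, which does support IGNORE NULLS in BigQuery:

```sql
-- Emulate LAG(price IGNORE NULLS): take the last non-null price before the current row.
SELECT
  ts,
  price,
  LAST_VALUE(price IGNORE NULLS) OVER (
    ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
  ) AS previous_price
FROM `my_project.market.ticks`;
```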

 

User defined functions (UDFs)

UDFs are a great way to introduce additional flexibility in the operations you can perform in a database. Who needs Docker when I can just deploy applications in my database? UDFs may be a saviour.

BigQuery: BQ supports writing user defined functions in SQL and Javascript. For Javascript functions you can include external libraries from Google Cloud Storage, which gives you a lot of flexibility when writing more complex UDFs. Under the hood these functions run on the V8 Javascript engine, so they are pretty snappy, and there’s a reasonably simple type mapping between BigQuery and Javascript types.
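
A minimal sketch (the dataset and function name are hypothetical):

```sql
-- Persistent Javascript UDF: extract the domain from a URL.
CREATE OR REPLACE FUNCTION my_dataset.url_domain(url STRING)
RETURNS STRING
LANGUAGE js AS r"""
  if (url === null) return null;
  var match = url.match(/^https?:\/\/([^\/]+)/);
  return match ? match[1] : null;
""";

SELECT my_dataset.url_domain('https://example.com/some/path');  -- example.com
```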

Redshift: You can write UDFs in either SQL or Python. Similar to BQ, you can easily package your function with external libraries and dependencies and host the bundled asset on S3. There’s no support for Python 3 (it’s been almost 12 years, folks), which pains me given that Python 2 was sunsetted on January 1 this year.
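
For comparison, a hedged Redshift sketch (the function name is made up, and yes, that body is Python 2):

```sql
-- Scalar Python UDF: normalise an email address.
CREATE OR REPLACE FUNCTION f_normalise_email(email VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
    if email is None:
        return None
    return email.strip().lower()
$$ LANGUAGE plpythonu;
```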

Snowflake: Support for functions in both SQL and Javascript. However, you aren’t currently able to import external libraries into your UDFs, so if you have external dependencies you’ll need to consider bundling them into the function definition itself.

Snowflake (June 2020) has external functions in preview, which offer the capability to write a function (in any language you choose) external to the Snowflake database itself. This opens up a far wider range of possibilities compared to traditional UDFs, which are typically sandboxed to the environment you are in. Of course there are some limitations (i.e., the endpoint must currently be an AWS API Gateway), but the feature lends itself to some particularly creative use cases (e.g., enrichment, data matching).
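
Roughly, the setup looks like the sketch below (the role ARN, gateway URL and function name are all placeholders, and exact options may shift while the feature is in preview):

```sql
-- 1. Tell Snowflake how to reach your AWS API Gateway endpoint.
CREATE OR REPLACE API INTEGRATION enrich_api
  API_PROVIDER         = aws_api_gateway
  API_AWS_ROLE_ARN     = 'arn:aws:iam::123456789012:role/snowflake-external-fn'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod')
  ENABLED              = TRUE;

-- 2. Expose the remote endpoint as a callable function.
CREATE OR REPLACE EXTERNAL FUNCTION enrich_company(domain VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = enrich_api
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/enrich';
```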

 

Federated queries

Federated queries are a reasonably “new” feature designed to bridge the gap between traditional analytics (OLAP) style databases and your typical row-based, transactional databases (Postgres, MySQL) – the ones more likely to contain customer records than behavioural data, where access by index matters more than aggregation by column.

BigQuery: supports federated queries via Cloud SQL which includes support for Postgres and MySQL databases. Other sources include Google Sheets and Google Cloud Storage.

Redshift: Started supporting federated queries (April 2020) for Amazon RDS (Postgres, MySQL and Aurora). Redshift Spectrum (S3) is covered later in data sources.

Snowflake: currently no support for federated queries.
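
For illustration, here’s roughly what federated access looks like on the two that do support it (connection names, endpoints, roles and secrets below are all placeholders):

```sql
-- BigQuery: run a query against a Cloud SQL Postgres instance via a connection resource.
SELECT *
FROM EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT id, email, created_at FROM customers;'
);

-- Redshift: expose an RDS Postgres schema, then query it like any other schema.
CREATE EXTERNAL SCHEMA pg_live
FROM POSTGRES
DATABASE 'appdb' SCHEMA 'public'
URI 'appdb.abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-federated'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:appdb-creds';
```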

Materialised views

Materialised views are often useful for tables that need to be built incrementally: querying the materialised view can save a lot of time (and compute) compared to running the equivalent query from scratch. Increasingly the providers are starting to support materialised views on external tables, which means data sources with an external dependency can be refreshed gracefully.

BigQuery: Currently in GA as of August 2020. There are some limitations (e.g., no joins currently) but it’s a good starting point for keeping data fresh. Expect to see additional functionality for materialised views over the next couple of months.

Redshift: Has good support for materialised views. Support for external tables (via Spectrum) was added in June 2020.

Snowflake: Full support for materialised views; however, you’ll need to be on the Enterprise Edition.
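
The DDL is near identical across the three. A hedged sketch of a daily rollup, written here in BigQuery flavour with hypothetical table names (and staying within BQ’s current aggregate-only, no-join limitations):

```sql
-- Aggregate-only materialised view over a single table.
CREATE MATERIALIZED VIEW `my_project.analytics.daily_events_mv` AS
SELECT
  DATE(event_ts) AS event_date,
  event_type,
  COUNT(*)       AS event_count
FROM `my_project.analytics.events`
GROUP BY event_date, event_type;
```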

Caching

BigQuery: caches query results and has an adjustable intermediate cache (BI Engine) that provides a hot in-memory cache for querying from Data Studio. Cached queries do not incur any cost. As of July 2020, BI Engine is available not just for Data Studio but for BigQuery itself – a smart intermediate cache between your queries and BQ, regardless of query source.

Redshift: caches queries and results (depending on node type and available storage in memory / on disk).

Snowflake: contains hot and warm query caches in intermediate storage that are separated from cold data storage.

 

Streaming

Inserting data quickly and reliably (within seconds) into an analytics database is a hard problem and it shows.

BigQuery: native streaming, streaming quotas vary if you are using an insert id but they are pretty generous. There’s some magic under the hood to reconcile the streaming buffer with partitioned data in a given table to look like a single table – but this is mostly abstracted from the user.

Redshift: you can microbatch into Redshift (from S3) but there’s no native streaming functionality. You can technically run small INSERT statements directly, but please don’t do this. The easiest way to start microbatching is via Kinesis Firehose directly into Redshift. Now that Firehose supports the Parquet format, this can also help with a data lake strategy (archival data on S3 and warmer data in Redshift).

Snowflake: you can microbatch via Snowpipe (from data sitting on Amazon S3 / Google Cloud Storage / Azure Blob Storage) at relatively high frequency but there’s no native streaming functionality. Microbatching via Snowpipe for GCS and cross-cloud loads is in preview as of August 2020.
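
A hedged Snowpipe sketch (the stage, table and pipe names are hypothetical) that auto-ingests JSON files as they land in object storage:

```sql
CREATE OR REPLACE PIPE analytics.public.events_pipe
  AUTO_INGEST = TRUE     -- load on object-store event notifications rather than manual REST calls
AS
COPY INTO analytics.public.events_raw
FROM @analytics.public.events_stage
FILE_FORMAT = (TYPE = 'JSON');
```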

 

Data sources

Some products allow you to query data without it being stored in the database itself. This can be useful, but doesn’t tend to be nearly as performant as querying native storage, since it can’t take advantage of data locality and optimised data structures.

BigQuery: you can set up connections to some external data sources including Cloud Storage, Google Drive, Bigtable and Cloud SQL (through federated queries).

Redshift: you can connect to data sitting on S3 via Redshift Spectrum – which acts as an intermediate compute layer between S3 and your Redshift cluster. You can also query RDS (Postgres, MySQL or Aurora) if you have federated queries setup.
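
A hedged example of wiring up Spectrum (the Glue Data Catalog database, IAM role and external table are placeholders):

```sql
-- Map a Glue Data Catalog database into Redshift as an external schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query Parquet files on S3 as if they were a local table.
SELECT event_type, COUNT(*) AS events
FROM spectrum.events
GROUP BY event_type;
```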

Snowflake: Like BigQuery and Redshift, for best performance you’d ideally have the data within Snowflake (stage / database / warehouse) in order to query it. However Snowflake provides external table functionality (in preview) which allows for connections out to object storage on the major clouds (Amazon S3, Google Cloud Storage and Azure Blob Storage) without having to import data into Snowflake.

You can also query other users’ (and companies’) data sources through the Snowflake Data Marketplace if they are part of the program.

 

Maintenance

BigQuery: You’ll likely spend more time optimising your queries (for cost / bytes scanned) than any sort of traditional database maintenance. Some time should be spent designing tables to use the correct partition keys / clustering / range partitioning to minimise any excessive scans based on your query patterns. BigQuery partitions and sorts data in the background – there’s no manual operations required here.
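
For example, a partitioned and clustered table might be declared like this (the table and columns are hypothetical):

```sql
-- Partition by day and cluster by a high-cardinality column to prune bytes scanned.
CREATE TABLE `my_project.analytics.events`
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT *
FROM `my_project.analytics.events_raw`;
```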

Redshift: Some operations that used to be manual (VACUUM DELETE, VACUUM SORT, ANALYZE) are now conditionally run in the background (2018, 2019). This is a vast improvement over needing to schedule these actions to run regularly. However, it doesn’t address the elephant in the room: if your data is too large to vacuum-sort on disk, you are stuck either resizing the cluster or living with unsorted data. The query planner still relies heavily on table statistics, so make sure these stats are updated regularly – though this should now happen in the background.

Snowflake: Other than choosing the size of your warehouse and setting up some scaling and auto-suspend policies, there’s little to maintain here, which appears to be a very deliberate choice. Snowflake supports automatic clustering, but you can define your own clustering keys on larger tables – not a bad idea if you are trying to optimise which micro-partitions are scanned. Snowflake deprecated manual reclustering in May 2020.
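
Defining a clustering key is a one-liner (hypothetical table and columns):

```sql
-- Hint to the automatic clustering service which columns micro-partitions should stay sorted by.
ALTER TABLE analytics.events CLUSTER BY (event_date, customer_id);
```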

 

Summary

If you’ve gotten this far hopefully you’ve found this information useful or learned something new. This is by no means a comprehensive play-by-play of every feature from each product but may aid you in making a decision.

There’s a lot I haven’t had a chance to cover including:

  • access control
  • data loss prevention
  • compliance
  • machine learning (matrix factorization, time series models)
  • query language in detail (transactions, stored procedures etc.) – you can now find materialised views covered above!
  • time travel (BigQuery and Snowflake)

However if there’s enough interest I’d love to put them in another post.

If you’d like to have a chat to us (we’re an AWS, GCP and Snowflake partner) we’d love to hear from you.

 

Changelog:

23rd September 2020 – Updated with Fivetran data warehouse performance comparison, Redshift Geospatial updates

15th September 2020 – New section on data access for all 3 data warehouses

7th September 2020 – Updates on Redshift query compilation, microbatching. No release notes yet for Snowflake / Redshift for September.

13th August 2020 – Updates on minimum BQ slots, BQ Omni, Snowflake Snowpipe, other general details.

11th July 2020 – Updated with information about materialised views

17th June 2020 – Updated additional information for in preview functions for Snowflake (Geospatial, external functions, external tables, Snowsight).

16th June 2020 – first published.