Last updated: 3rd February 2021
For updates, changes and errata please see the Changelog at the end of this post.
What is this guide?
– Designed as a fair feature comparison between the different products
– An up to date guide (hopefully with regular updates as new features are released or changed)
– I’ve attempted to make the information as accurate as possible, but some details may be condensed for simplicity. Where applicable I’ve noted if a feature or functionality is in alpha / beta / in-preview. I can’t mention any features that are currently under NDA but will update as soon as they become either public.
- This guide covers: compute, storage, compression, deployment, pricing, scalability, data access, encryption, support for third party tools, query language, user defined functions, federated queries, materialised views, caching, streaming, data sources, maintenance and scheduling. I’m in the process of adding more sections over time!
What is it not?
– A recommendation to pick a specific product or a comprehensive evaluation of all features. All 3 products have unique feature sets but we don’t have enough space to cover all of it (unfortunately).
– This doesn’t cover every possible database you can (or could) use for analytics. That’s a mammoth effort but is something I’d love to do at some point.
– A performance benchmark. Each provider publishes their own TPC-H / encryption, support for third party tools and TPC-DS benchmarks – have a look for these. Fivetran has also posted a performance comparison (with some caveats that they note). Spoiler alert: the performance differences are marginal between the different providers.
– A database battle royale
If you feel strongly that I’ve missed or misrepresented something, have made a grievous error or would just like to have a chat – please get in touch.
Let’s get started!
BigQuery: Runs on distributed compute. This runs on Borg, a cybernetic life-form that Google has imprisoned inside conveniently located data centers in various regions.
Redshift: Proprietary fork of ParAccel (which was partly forked from Postgres) running on AWS virtual machines. Don’t let the proximity to Postgres fool you, it’s more of a distant second cousin.
Snowflake: Proprietary compute engine with intelligent predicate pushdown + smart caching running on commodity virtual machines (AWS, GCP or Azure) depending on your cloud choice. Hybrid columnar system inspired by C-Store, MonetDB among others.
All 3 databases have implementations of hot / warm / cold storage – this pertains to internal storage rather than external storage (e.g., external tables, federated sources).
BigQuery: Proprietary, stored on the Colossus filesystem using ColumnIO as a storage format. Complete separation of distributed compute and storage.
Redshift: Proprietary but is generally SSD (dc1, dc2) / HDD (ds1, ds2) or mixed including S3 based (for RA3) using a proprietary columnar format. RA3 separates compute and storage, whilst all other node types colocalise your compute and storage together. RA3 nodes include both a hot query cache and an extensive metadata cache.
Snowflake: Proprietary columnar format, in-memory / SSD / object store running on compute / object storage in your cloud of choice. Data is stored in a hybrid columnar format (PAX) with aggressive metadata caching.
BigQuery: proprietary compression that is opaque to the user and handled by the ColumnIO columnar format (Colossus under the hood). BigQuery compresses data under the hood for you on an ongoing basis but your queries are still billed as if you are scanning uncompressed bytes.
Redshift: Redshift achieves transparent compression by implementing open algorithms e.g., LZO, ZStandard. It has recently released its own proprietary compression algorithm (AZ64) but your choice of data types here is a little more limited at the moment. You can select which and how you would like columns to be compressed. ANALYZE COMPRESSION will get you some of the way in generating sane recommendations but we’ve found it’s best to select compression according to your query patterns and the data you intend to store in a column. It’s now possible to dynamically change the compression of a column without requiring a copy in-place.
Snowflake: Snowflake also provides its own compression layer that is opaque to the user. Unlike BQ you aren’t billed for bytes scanned but it does mean the query planner can take advantage of the compression and table statistics to scan less data (and reduce compute cost). There’s little you can do to tune this externally – however that’s a pro and not a con.
All of these databases follow cloud-only deployments. If you have to run on-premise you’re going to be a little stuck for the moment unless you plan to migrate that data into the cloud.
Redshift: Cloud only – within Amazon Web services. You can pick from pretty much every AWS region in a list of 25 regions at a cluster level.
Snowflake: Cloud only – within Amazon Web Services / Google Cloud Platform / Azure depending on your preferred flavour of cloud at an account level. Note: Snowflake supports a subset of the regions of each of the public clouds, not all regions. You can check what regions are currently supported on their pricing page.
I’ve included the factors that should typically be considered when evaluating pricing for each database below. Most of the products have either a fixed cost component or a variable cost component depending on which option or plan you go down.
BigQuery: storage (~2 cents per GB / month for warm data, 1 cent per GB / month for colder data), bytes scanned (on demand), slots (fixed / flex) and streaming inserts. The on-demand pricing model is based on bytes scanned whilst purchasing fixed slots (reservations) will have a defined cost per 100 / 500 / 1000 slots. Flex slots introduces the concept of shorter term slots that can be purchased as a sort of burst compute capacity. Costs vary by region. It’s very difficult to know how much on-demand will cost however you can dry run queries (no cost incurred but returns bytes scanned) to estimate what certain queries or workloads will cost.
Redshift: node type (ds2 / dc2 / RA3, avoid d*1 node types), number of nodes, reservations (if you purchased / plan on purchasing any). The use of certain features (Redshift Spectrum, concurrency scaling) may incur additional costs. Node cost will vary by region.
Snowflake: the Snowflake plan (which corresponds to certain product features), the number of warehouses and the size of these warehouses. Using these numbers you will get a cost per credit (abstracted unit of compute). The cost per credit will vary based on plan, cloud provider and geographic location of the warehouse(s). It is also possible to purchase via capacity storage upfront if you’re after some more predictable long term pricing. Certain optional features (e.g., Snowpipe) are priced separately. Cost will vary by cloud and region.
For all three providers it is possible to get discounts that are lower than the publicly advertised list prices. You aren’t able to ask for this pricing, enterprise pricing is typically bestowed upon you if the vendor deems you worthy of discounts / credits. Your mileage will vary depending on the size of your OPEX budget and the length of the
vendor lock-in commitment.
Yes*, with an emphasis on the *.
BigQuery: You are only charged for the queries you run (or for whoever has setup a 15 minute refreshing dashboard that uses SELECT *). There’s also the fixed / flexi slot model mentioned in the section above which gives you more of a provisioned compute capability rather. You can expect most of the queries you run to return within 30 seconds, if not faster. You can set cost / quota levels in BQ but this feature suffers from a lack of granularity (e.g., only being able to set quotas globally at a user / service account / project level).
Redshift: The recently introduced RA3 node type allows you to more easily decouple compute from storage workloads but most customers are still on ds2 (dense storage) / dc2 (dense compute) node types. These nodes types offer both elastic resize or classic resize. Elastic resizes generally complete quite quickly – around about the time it takes light to travel from Mars to Earth (3 minutes) with the caveat that you can only scale up or down by a factor of 2. Classic resizes which are required if you want to resize outside of the aforementioned scaling factor have a much higher variance – hours to days. Redshift introduced pause / resume semantics as of March 2020 in which you can ‘shut down’ a cluster for a period of time and reduce cost. This can be done manually through the API or console or scheduled at certain times. At the moment there is no smart sleep / smart resume functionality based on workloads. Pause / resume typically completes in the amount of time it takes to make a cup of tea (4-10 minutes, from our testing) depending on your node type and snapshot size. Most node types will scale up to 100 nodes – anything beyond that is asking for trouble. Most queries will mostly complete within milliseconds – hours, but this is getting better with features like workload management, concurrency scaling, short query acceleration (SQA) and advanced query accelerator (on RA3 nodes only). AQUA – which produces significantly better query performance for RA3 nodes is expected to be GA in January 2021. There’s a whole blog post in this alone.
Snowflake: Pause, resume semantics (both manual and automated based on workload). Pause / resume is typically completed in less than 60 seconds and can be tuned to shut wake or sleep based on workload activity. Queries typically completed within seconds to < 5 minutes. It’s fast to change between warehouse types (e.g., small (S) => medium (M)) which roughly correlate to the number of vCPUs / memory you are getting and cluster size manually if required. Scaling policies can adjust the number of clusters automagically according to running workloads in either standard or economy modes. It’s a particularly elegant implementation for a scaling mechanism.
Note – regardless of the scaling of these databases you’ll still want to run sensible queries. Certain query patterns that push compute to a single node / leader risk failing regardless of scale with the dreaded resources exceeded error.
Data warehouses aren’t much use unless you can access the data easily. Typically databases have relied on ODBC / JDBC since the dawn of time but increasingly we’re starting to see the proliferation of access through first class user interfaces as well as additional APIs that are often suited to newer forms of data retrieval and manipulation (e.g., DAG runners, serverless functions).
- ODBC / JDBC access via Simba drivers
- BigQuery user interface (updated to new UI as of January 2021)
- BigQuery connections API (for federated sources)
- BigQuery Jobs API (asynchronous)
- BigQuery Storage API
- Bq CLI (command line tool)
- ODBC / JDBC via AWS provided drivers
- Redshift user interface in the AWS console for some node types (ra3.*, ds2.8xlarge, dc2.large, dc2.8xlarge, dc1.8xlarge only)
- Asynchronously via the new (September 2020) Data Access API
- Access via the aws cli (aws redshift-data – uses the data access API under the hood)
- psql (Postgres command line tool)
- ODBC / JDBC access via drivers
- Access via Spark plugin (spark-snowflake)
- Access via Kafka (both Confluent and open source)
- Python / Node.js / Go / .NET / PHP PDO drivers for specific languages, Snowflake also offer dedicated connectors e.g, Python, PHP connectors
- SnowSQL (command line tool)
- Snowsight (some features are in-preview)
All products offer some method of encryption at rest and in transit (depending on ingestion method and source).
BigQuery: encrypted using Google managed encryption key management service (KMS) or customer keys using CMEK (also via KMS).
Redshift: encrypted at rest using AWS key management service. This can either use an Amazon managed key or alternately a AWS customer managed key (CMK).
Snowflake: encryption is end to end by default (in transit, at rest). Virtual private Snowflake (VPS) offers the ability to store an encryption key in memory on dedicated servers. Certain features (e.g., periodic rekeying, customer managed keys) are only available on higher tier plans.
Support for third party tools (visualisation, data modelling)
Most good third party tools have support for all databases. The quality of support may vary but for the most part you can expect your visualisation / data modelling toolkit to work here given all three players have been in the market for a few years.
Snowflake, in their tradition of putting ‘Snow’ in front of every feature – has Snowsight (in-preview, June 2020) which can perform some basic data visualisations. I suspect some of the functionality we’ll see here will be heavily influenced by their acquisition of Numeracy in March 2019. This is a welcome addition and follows a continuing trend of bringing query editors and visualisation into the same cohesive interface.
You can’t have a great database without a great query language (this is mostly true).
BigQuery: BigQuery offers two main dialects, Legacy (aka BigQuery) SQL and Standard SQL. GCP would prefer you forget Legacy SQL exists entirely – but there’s still some useful functions there that don’t yet have an equivalent in Standard SQL. Standard SQL is ANSI 2011 compliant and it has a complement of additional features to support the BQ nested data model as well as some neat machine learning (BQ ML) functionality that is unique.
Redshift: This SQL syntax is also ANSI compliant and will feel familiar if you have experience with Postgres. Redshift has historically struggled with JSON data – which has been stored as VARCHARs but this is likely to change with in-preview support for both JSON data and semi-structured data with a new schemaless data type, SUPER. Query compilation speeds have increased recently (August 2020) due to offloading of query compilation to a serverless compilation farm and an increased (unlimited) cache for compiled objects. Support for the open source PartiQL query language is designed to ease the process of querying structured and unstructured data at once. Paired with the SUPER type this is in preview as of January 2021.
Snowflake: Also ANSI compliant, the Snowflake query language is pretty simple to work with. The language has clearly been designed from the ground up and working with nested data is a breeze for the most part. You won’t generally find any weird implementation quirks here – but the notion of warehouses, stages and pipes introduce some non-traditional (but not unwelcome) concepts to the database lexicon. The query language for the most part is extremely mature – you’ll find almost all operations that you’d like to perform (DDL, DML, permission management, stage management etc) within the query language itself rather than buried in a user interface.
BigQuery and Redshift (additional geospatial functions added in September 2020) both have good support for querying geospatial data. Snowflake currently have in-preview support for geospatial data (June 2020). All implementations use native types (GEOGRAPHY in the case of BigQuery and Snowflake, GEOMETRY in the case of Redshift) which allows you to either use a native type or convert between text representations (such as GeoJSON).
The analytics SQL functions are generally pretty similar between databases but you will find irritating edge cases that require creative workarounds – for example many BigQuery window functions (e.g., LAG, LEAD) don’t support IGNORE NULLs as part of the window (why?).
User defined functions (UDFs)
UDFs are a great way to introduce additional flexibility in the operations you can perform in a database. Who needs Docker when I can just deploy applications in my database? UDFs may be a saviour.
Redshift: You can write UDFs in either SQL or Python. Similar to BQ you can easily package your function with external libraries and dependencies and host the bundled asset on S3. There’s no support for Python 3 (it’s been almost 12 years folks) which pains me given that Python 2 was sunsetted on January 1 this year. However there is some light at the end of the tunnel – the October 2020 (1.0.19506) Redshift build introduces Lambda UDFs which introduce the ability to invoke a Lambda function as a UDF. This is a great feature that helps avoid some of the constraints that were present within locally written UDFs (such as a lack of network access, and no real file system).
Snowflake (June 2020) have in preview external functions which offer the capability to write a function (in any language you choose) external to the Snowflake database itself. This opens up a far wider range of possibilities when compared to traditional UDFs which are typically sandboxed to the environment you are in. The features lends itself to some particularly creative use cases (e.g., enrichment, data matching). As of November 2020 external functions are now in preview for Azure, meaning it is now possible to use Azure API Management in addition to AWS API Gateway. As of December 2020 (in preview) you can use asynchronous external functions which are particularly useful for higher latency, or longer running computations and return a future, rather than blocking and waiting for a response.
Federated queries are a reasonably “new” feature designed to bridge the gap between traditional analytics (OLAP) style databases and your typical row-based, transactional style databases (Postgres, MySQL) that are more likely to contain customer records rather than behaviour where access by index matters more than aggregation-by-column.
BigQuery: supports federated queries via Cloud SQL which includes support for Postgres and MySQL databases (but not MSSQL). Other sources include Google Sheets and Google Cloud Storage.
Redshift: Started supporting federated queries (April 2020) for Amazon RDS (Postgres, MySQL and Aurora). Redshift Spectrum (S3) is covered later in data sources.
Snowflake: no support for federated queries against other databases – but you can now run federated queries against external tables (Azure Blob Storage, Amazon S3, Google Cloud Storage) as of January 2021. Streams (notifications for new objects / keys) makes it easier to manipulate and work with these files.
Materialised views are often useful when building tables that need to be built incrementally, particularly when building these data models it can save a lot of query time when running against a materialised view rather than the equivalent query. Increasingly the providers are starting to support materialised views on external tables, which means that data sources that have an external dependency can be gracefully refreshed.
BigQuery: Currently in GA as of August 2020. There’s some limitations (e.g., no joins currently) but it’s a good starting point for keeping data fresh. Expect to see additional functionality for materialised views over the next couple of months.
Redshift: Has good support for materialised views. Support for external tables (via Spectrum) was added in June 2020. Automatic refresh (and query rewrite) of materialised views was added in November 2020 and tables can now be configured to refresh automatically (December 2020). Queries can now be automatically rewritten as materialised views (January 2021).
Snowflake: Full support for materialised views, however you’ll need to be on the Enterprise Edition. As of December 2020 Snowflake now supports automatic query rewrites with materialised views. This aims to ensure that the query optimiser can plan for the view to be materialised in the most efficient way possible.
BigQuery: caches queries and has an adjustable intermediate cache (BI Engine) for querying from Data Studio that provides a hot in-memory cache. Cached queries do not incur any cost. BI Engine (July 2020) is now available not just for Data Studio but also BigQuery – allowing for a smart intermediate cache between your queries and BQ – regardless of query source.
Redshift: caches queries and results (depending on node type and available storage in memory / on disk).
Snowflake: contains hot and warm query caches in intermediate storage that are separated from cold data storage. Query results are cached for 24 hours (both on local and remote “disks”).
Inserting data quickly and reliably (within seconds) into an analytics database is a hard problem and it shows.
BigQuery: native streaming, streaming quotas vary if you are using an insert id but they are pretty generous. There’s some magic under the hood to reconcile the streaming buffer with partitioned data in a given table to look like a single table – but this is mostly abstracted from the user.
Redshift: you can microbatch into Redshift (from S3) but there’s no native streaming functionality. You can technically run small INSERT statements directly but please don’t do this. The easiest method to start microbatching is via Kinesis Firehose directly into Redshift. Now that Firehose supports Parquet format – this can also help with a data lake strategy (for archival data on S3 and warmer data in Redshift).
Snowflake: you can microbatch via Snowpipe (from data sitting on Amazon S3 / Google Cloud Storage / Azure Blob Storage) at relatively high frequency but there’s no native streaming functionality. Microbatching via Snowpipe for GCS and cross-cloud loads is in preview as of August 2020.
Some products allow you to query data without having data stored in the database itself. This can be useful, but doesn’t tend to be nearly as performant as it can’t take advantage of data localisations and optimised data structures.
BigQuery – you can setup connections to some external data sources including Cloud Storage, Google Drive, Bigtable and Cloud SQL (through federated queries).
Redshift: you can connect to data sitting on S3 via Redshift Spectrum – which acts as an intermediate compute layer between S3 and your Redshift cluster. You can also query RDS (Postgres, Aurora Postgres) if you have federated queries setup. Querying RDS MySQL or Aurora MySQL entered preview mode in December 2020.
Snowflake: Like BigQuery and Redshift, for best performance you’d ideally have the data within Snowflake (stage / database / warehouse) in order to query it. However Snowflake provides external table functionality (in preview) which allows for connections out to object storage on the major clouds (Amazon S3, Google Cloud Storage and Azure Blob Storage) without having to import data into Snowflake.
You can also query other users (and companies) data sources through the Snowflake Data Marketplace if they are part of the program.
Access control determines who can access certain data within a data warehouse. Although access control is most commonly thought of at a database / table level all three major players have some support for this so instead I’ll be looking at access control at the row and column level allowing for far more granular access control configurations.
BigQuery: As of November 2020 BigQuery now provides column level security through Data Catalog and IAM permissions in which columns can be tagged according to a hierarchy which then corresponds to permissions. Row level security is still typically approached through authorised views or tables.
Redshift: Column level security can be enforced through the standard GRANT syntax which allows per user / group access on a certain column rather than relying on views. Row level authorisation is still handled by authorised views.
Snowflake: Column level security is in general access (January 2021) in Snowflake and consists of two separate components that are not found in either BigQuery or Redshift yet. Dynamic data masking is capable of masking a returned result at query time – such as a static redacted value for a column, a hashed equivalent (e.g., SHA256) or a partial mask, such as redacting a portion of an email address or IP address for privacy reasons (via regexp_replace). Dynamic data masking uses Snowflakes role based access control (RBAC) so different users may see different results depending on their role and assigned permissions. External tokenisation is a different approach in which the data is loaded tokenized (effectively redacted at rest as opposed to redacted at query time) into a database and is subsequently ‘detokenized’ at query time. As data is stored in it’s tokenised form at rest this does not require any data masking other than to detokenise the data, which again uses RBAC and external functions to do so (Protegrity is currently the only supported provider).
These features were in preview mode from November 2020 and entered GA in January 2021. Column level security current requires the Enterprise edition of Snowflake.
BigQuery: You’ll likely spend more time optimising your queries (for cost / bytes scanned) than any sort of traditional database maintenance. Some time should be spent designing tables to use the correct partition keys / clustering / range partitioning to minimise any excessive scans based on your query patterns. BigQuery partitions and sorts data in the background – there’s no manual operations required here.
Redshift: Some operations that used to be manual (VACUUM DELETE, VACUUM SORT, ANALYZE) are now conditionally run in the background (2018, 2019). Automatic table optimisation (in-preview, December 2020) is designed to alleviate some of the manual tuning pain by using machine learning to predict and apply the most suitable sort and distribution keys. This is a vast improvement from needing to schedule these actions to run regularly. However this doesn’t address the elephant in the room of if your data is too large to vacuum sort on disk then you are stuck with having to resize the cluster or live with unsorted data (where a deep copy is still likely to exceed available disk space). The query planner still relies on table statistics heavily so make sure these stats are updated on a regular basis – though this should now happen in the background.
Snowflake: Other than choosing the size of your warehouse and setting up some scaling and auto-suspend policies there’s little to maintain here which appears to be a very deliberate choice. Snowflake supports automatic clustering but you can define your own clustering keys on larger tables – which tends not to be a bad idea if you are trying to optimise what micro-partitions are being scanned. Snowflake has deprecated manual clustering in May 2020.
Scheduling queries can be done through automation, for example using Airflow, dbt, Dataform etc or alternately if you are looking for something quick and simple (that isn’t part of a larger DAG process) then scheduling queries within the user interface can be a nice no/low-code way of achieving this.
BigQuery: Queries can be scheduled using the query scheduler which is part of Data Transfer Service. Scheduling is time based (rather than trigger based). As of December 2020 you can also now use Dataform (at no cost) for running data models on BigQuery.
Redshift: Queries can be scheduled using the query scheduler (either in the UI or via API which works via AWS EventBridge) using cron or the run frequency format.
Snowflake: Queries are scheduled using tasks and are based on a fixed time schedule. Tasks can be scheduled according to a cron expression that includes a timezone component. This makes it a little awkward to schedule when dealing with timezones that have offsets that change (DST) but this behaviour is well documented at least.
If you’ve gotten this far hopefully you’ve found this information useful or learned something new. This is by no means a comprehensive play-by-play of every feature from each product but may aid you in making a decision.
There’s a lot I haven’t had a chance to cover including:
- data loss prevention
- query language in details (transactions, stored procedures,
materialized viewsetc). You can now find materialised views above!
- time travel (BigQuery and Snowflake)
However if there’s enough interest I’d love to put them in another post.
If you’d like to have a chat to us (we’re a AWS, GCP and Snowflake partner) we’d love to hear from you.
3rd February 2021 – Updates to Snowflake column level security (GA)
27th January 2021 – Additional information on Snowflake external tables, streams
25th January 2021 – Changes to Redshift materialised views, Snowflake external tables GA, new BigQuery UI rollout
21st December 2020 – Updated to include new PHP PDO driver for Snowflake
16th December 2020 – Updated to include Redshift federated queries (preview) for RDS and Aurora MySQL
14th December 2020 – Updated to include new Redshift features from re:invent, GCP Dataform acquisition
9th December 2020 – Updated to include Snowflake asynchronous external functions, Dataform updates.
2nd December 2020 – Updated to include Snowflake external functions for Azure, new section on Access Control.
13th November 2020 – Updated to include automatic refresh of materialised views for Redshift
12th November 2020 – Updates to a new ‘Scheduling’ section
7th October 2020 – Updates for BigQuery and Redshift user defined functions.
6th October 2020 – Extra information about Snowflake query engine + storage
23rd September 2020 – Updated with Fivetran data warehouse performance comparison, Redshift Geospatial updates
15th September 2020 – New section on data access for all 3 data warehouses
7th September 2020 – Updates on Redshift query compilation, microbatching. No release notes yet for Snowflake / Redshift for September.
13th August 2020 – Updates on minimum BQ slots, BQ Omni, Snowflake Snowpipe, other general details.
11th July 2020 – Updated with information about materialised views
17th June 2020 – Updated additional information for in preview functions for Snowflake (Geospatial, external functions, external tables, Snowsight).
16th June 2020 – Post first published.