A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to get it happening.
In late 2016 Facebook open sourced a compression algorithm known as Zstandard that combines Lempel Ziv and tANS to achieve a compression ratio better than many algorithms with a slight tradeoff in speed.
In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for large JSON strings that regularly appear in shredded tables.
We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally across datasets ranging between 10 million and 5 billion rows we’ve achieved a mean compression ratio of ~3 meaning that the newly compressed table takes up approximately a third of the original table on disk when compared to the compression defaults in atomic.events 0.8.0.
In particular we’ve found:
- Negligible impact on speed of queries for
atomic.events(note this will be highly workload dependent so benchmarks here are less useful)
- So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommend against compressing SORTKEYs (so use ENCODE RAW for
- ZSTD in almost all instances replaces LZO as the default compression method suggested by
Things to take into account
- Depending on your workload you may want to run
ANALYZE COMPRESSIONon your table which will provide some recommendations by Redshift as to what the suggested column encodings are. One caveat of this approach is that you are limited to sampling 1 billion rows so if possible choose a sample data set that contains representative variability within columns.
- If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
- You cannot modify compression on existing columns in a table so consider deep copying the data particularly if a large region of your table is unsorted1 as this will outperform a
VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
- If you decide to drop the original atomic.events table ensure that you either
DROP CASCADEor individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables.
- If you’ve dropped any foreign keys ensure that you recreate them and they reference the new atomic.events object. Although Redshift does not enforce these constraints they are used by planner to generate optimised query plans.
- As this is a new object you may also need to regrant permissions to users/groups as well as ensure the table owner is identical to the original table(s) so that Snowplow can continue to load data.
- Analyze compression on your existing table by using
- If Redshift recommends using ZSTD for columns consider benchmarking a sample of data (e.g., 3 months) with the original column compression and ZSTD column compression
- If performance is better or equivalent deep copy data from the original atomic.events table to a new atomic.events_new table. You may need to resize Redshift or free up additional disk space before performing this action.
- Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
- Drop the original table as well as any references2.
ALTERthe new table to atomic events3. Ensure that this new table has an identical owner.
- Sit back and enjoy that free disk space.
There’s a few gotchas and edge cases involved with doing this. If you’d like help or advice on how to do it feel free to post in the comments below or get in touch.
SELECT "database", "table", diststyle, sortkey1, sortkey1_enc, size, pct_used, unsorted, stats_off, tbl_rows FROM svv_table_info WHERE schema = 'atomic' ORDER BY size DESC
SELECT DISTINCT c_p.oid AS tbloid ,n_p.nspname AS schemaname ,c_p.relname AS NAME ,n_c.nspname AS refbyschemaname ,c_c.relname AS refbyname ,c_c.oid AS viewoid FROM pg_class c_p JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid JOIN pg_depend d_c ON d_p.objid = d_c.objid JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid WHERE schemaname = 'atomic'
ALTER TABLE atomic.events_new RENAME TO atomic.events;