I am sorry, I encountered an error while trying to access the URL. I will try again.

Amazon Redshift Cheat Sheet

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.

Core Architecture

An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.

Leader Node: The leader node manages communications with client programs and compiles code to execute on the compute nodes. It does not store user data. It is responsible for creating query execution plans and aggregating results from the compute nodes.
Compute Nodes: These nodes store data and execute the compiled query plan segments. The results are sent back to the leader node for aggregation before being returned to the client.
Node Slices: Each compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

Key Concepts for Performance

1. Columnar Storage

Unlike traditional row-based databases, Redshift stores data by column rather than by row. This has significant advantages for data warehousing workloads:

Improved I/O: Queries only need to read the columns relevant to the query, drastically reducing the amount of data read from disk.
Better Compression: Columns typically contain similar data, which allows for much higher compression ratios. Redshift automatically selects the optimal compression encoding for each column.

2. Massively Parallel Processing (MPP)

Redshift automatically distributes data and query load across all compute nodes. This parallel architecture allows Redshift to execute complex queries on very large datasets with high performance.

3. Data Distribution Styles

When you create a table, you specify a distribution style, which determines how the data is spread across the compute nodes. Choosing the right style is critical for query performance as it minimizes data movement during query execution.

AUTO Distribution: (Default) Redshift automatically assigns the optimal distribution style based on the table size and data. It starts with ALL for small tables and changes to EVEN as the table grows.
EVEN Distribution: The leader node distributes the rows across the slices in a round-robin fashion. This is a good default choice when a table does not participate in joins or when there is no clear choice between the other styles.
KEY Distribution: The rows are distributed according to the values in a specified column (DISTKEY). The leader node places matching values on the same node slice. This is ideal for tables that are frequently joined on the DISTKEY column, as it co-locates the joining rows.
ALL Distribution: A copy of the entire table is distributed to every node. This is best for smaller, slow-moving dimension tables that are frequently joined, as it guarantees the joining data is always available locally.

4. Sort Keys

When you create a table, you can specify one or more columns as SORTKEYs. Redshift stores the data on disk in sorted order according to the sort key columns.

Benefits: The query optimizer can use this sorted order to execute queries more efficiently. It can quickly skip large blocks of data that fall outside the query's range predicate. It also helps with efficient merge joins.
Types of Sort Keys:
- Compound Sort Key: Made up of all the columns listed in the sort key definition, in the order they are listed. It is most useful when a query's filter applies predicates that use a prefix of the sort keys.
- Interleaved Sort Key: Gives equal weight to each column in the sort key. It is most useful when queries use restrictive predicates on secondary sort columns.

Advanced Features

Amazon Redshift Spectrum: A feature that allows you to run SQL queries directly against exabytes of unstructured and structured data in Amazon S3 without needing to load or transform the data. It extends the analytic power of Redshift beyond data stored on local disks in your cluster.
Concurrency Scaling: Redshift automatically adds additional cluster capacity when you need it to handle bursts of concurrent read queries. As concurrency increases, Redshift automatically routes qualifying queries to the new clusters. This happens transparently in seconds.
AQUA (Advanced Query Accelerator): A new distributed and hardware-accelerated cache that enables Redshift to run up to 10x faster than other cloud data warehouses. AQUA automatically boosts certain types of queries by running data-intensive tasks closer to the storage layer, reducing data movement.
Data Sharing: Allows you to securely and easily share live data across Redshift clusters. You can share data at the database, schema, and table level without the complexity and delays associated with data copies and data movement.
Redshift ML: Makes it easy for data analysts and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands.

Security and Management

Workload Management (WLM): Allows you to manage query priorities within your cluster. You can create different query queues and route queries to the appropriate queue based on user groups or query groups to manage performance.
Snapshots: Redshift provides point-in-time backups of your cluster, which are stored in S3. Snapshots can be automated or manual. You can restore a snapshot to a new cluster.
Encryption: Supports encryption at rest (using AWS KMS or HSM) and in transit (using SSL).
VPC: Redshift clusters are provisioned in a VPC, allowing you to isolate your cluster and connect securely from your applications.

Amazon Redshift

📚 Recommended AWS Resources