Core Architecture & Concepts
Redshift achieves its performance through a Massively Parallel Processing (MPP) architecture.
- Cluster: The primary component, which consists of a Leader Node and multiple Compute Nodes.
- Leader Node: Manages client connections and receives queries. It parses the query, develops an execution plan, coordinates the parallel execution of the plan by the compute nodes, and aggregates the final results to send back to the client.
- Compute Nodes: Execute the compiled query plan in parallel. They store the data and perform the actual computations. The work is distributed among the compute nodes to maximize parallelism.
- Columnar Storage: Unlike traditional row-based databases, Redshift stores data in columns. This dramatically reduces the I/O required for analytical queries, which typically read only a few columns from a table at a time.
- Data Compression: Redshift automatically applies compression encodings to columns, which reduces storage footprint and further minimizes I/O.
Data Distribution Styles
The distribution style of a table determines how its data is spread across the compute nodes. Choosing the right style is critical for query performance as it minimizes data movement between nodes.
Style | Description | Use Case |
---|---|---|
EVEN | (Default) The leader node distributes the rows across the slices in a round-robin fashion. | Use when a table does not participate in joins or when there is no clear choice between KEY and ALL styles. |
KEY | The data is distributed across the nodes based on the values in one column (the DISTKEY ). Rows with the same key are guaranteed to be on the same node. |
Use on the joining columns for large tables that are frequently joined together. This co-locates the joining data, avoiding data redistribution during query execution. |
ALL | A full copy of the entire table is stored on every compute node. | Best for smaller dimension tables that are frequently joined. This ensures the joining data is always local, but it increases storage requirements and load times. |
Sort Keys
Sort keys determine the order in which rows in a table are physically stored. Query performance is improved because the query processing engine can read fewer blocks of data by jumping directly to the relevant range.
-
Compound Sort Key:
- The default type, composed of one or more columns listed in order.
- It is most useful when your query's
WHERE
clause uses a prefix of the sort keys. For example, if the sort key is(region, department, date)
, it excels at queries filtering onregion
orregion and department
.
-
Interleaved Sort Key:
- Gives equal weight to each column in the sort key.
- It is most useful when your queries use restrictive predicates on secondary sort columns (columns that are not the first column in the key). It provides better performance for queries with unexpected filter patterns.
Key Features
-
Redshift Spectrum:
- An engine that allows you to run queries directly against exabytes of data stored in Amazon S3, without needing to load the data into your Redshift cluster.
- You define an external schema and external tables that point to your S3 data lake. This is ideal for querying infrequently accessed data or for joining your "hot" data in Redshift with your "cold" data in S3.
-
Concurrency Scaling:
- When enabled, Redshift automatically adds and removes transient cluster capacity to handle bursts of concurrent read queries.
- This provides consistently fast performance, even with thousands of concurrent users and queries. You pay on a per-second basis for the extra clusters only while they are in use.
-
Federated Query:
- This feature enables you to query and analyze data across your operational databases (Amazon RDS for PostgreSQL and Aurora PostgreSQL), data warehouse (Redshift), and data lake (S3).
- You can query live data in RDS without moving it, integrating it directly with your Redshift data.
-
Workload Management (WLM):
- Allows you to define query queues with different priorities. This helps ensure that short, fast-running queries don't get stuck behind long, complex queries.
Data Loading & Management
- COPY Command: The most efficient way to perform a bulk load of data into Redshift tables. The
COPY
command is highly parallelized and can load data from Amazon S3, DynamoDB, or EMR. - Enhanced VPC Routing: When enabled, Redshift forces all
COPY
andUNLOAD
traffic between your cluster and your data repositories to go through your VPC. This allows you to use VPC features like security groups and flow logs to monitor and control this traffic.
Security
- Network Isolation: Clusters are provisioned in a VPC. Security groups act as a firewall to control network access.
- Encryption:
- At Rest: Data can be encrypted using AWS KMS keys. Encryption is a cluster-level setting.
- In Transit: SSL/TLS is used to encrypt data between clients and the Redshift cluster.
- IAM Integration: Access to Redshift resources and API actions is controlled through IAM policies.
Backup & Disaster Recovery
- Automated Snapshots: Redshift automatically takes periodic snapshots of your cluster and stores them in Amazon S3. The retention period is configurable (1-35 days).
- Manual Snapshots: You can take manual snapshots at any time. These are stored in S3 and are kept until you explicitly delete them.
- Cross-Region Snapshots: You can configure Redshift to automatically copy snapshots (automated or manual) to another AWS Region for disaster recovery purposes.
Sources