AWS Analytics Services

In-Place Querying in AWS

In-place querying means analyzing data directly where it lives in Amazon S3, rather than first loading it into a data warehouse. The two workflows compare as follows:

| Traditional Data Warehouse Approach | In-Place Querying Approach |
| :--- | :--- |
| 1. Ingest data into S3 (raw). | 1. Ingest data into S3 (raw). |
| 2. Build an ETL pipeline (e.g., AWS Glue) to transform and load data into Redshift. | 2. (Optional) Use an ETL process (e.g., AWS Glue) to optimize data format (e.g., to Parquet). |
| 3. Query the data inside Redshift. | 3. Query the data directly in S3. |
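
As a rough illustration of the right-hand workflow, the sketch below covers steps 1 and 2: landing a raw CSV file in S3 and optionally rewriting it as Parquet. The bucket and key names are placeholders, and it assumes boto3, pandas, and pyarrow are installed.

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Step 1: ingest the raw file into the data lake (bucket/key names are placeholders).
s3.upload_file("events.csv", "my-data-lake", "raw/events/events.csv")

# Step 2 (optional): rewrite the data as Parquet so query engines scan less data.
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet", compression="snappy")  # requires pyarrow or fastparquet
s3.upload_file("events.parquet", "my-data-lake", "curated/events/events.parquet")

# Step 3 is handled by a query engine such as Athena (see below).
```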

Foundational Components

This architectural pattern relies on two key foundational services:

  1. Amazon S3 (Simple Storage Service): The de facto storage layer for a data lake on AWS. It's scalable, durable, and cost-effective, capable of storing any amount of structured, semi-structured, and unstructured data in its native format.

  2. AWS Glue Data Catalog: The central metadata repository. It stores the schema and location information about your data files in S3 as a set of logical tables. The query engines use this catalog to understand how to interpret the files in your data lake.
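
To make the catalog's role concrete, here is a minimal sketch that registers a table over files in S3 by running a CREATE EXTERNAL TABLE statement through Athena, which stores the resulting schema in the Glue Data Catalog. The database, bucket, and column names are illustrative assumptions.

```python
import boto3

athena = boto3.client("athena")

# DDL describing how to interpret the Parquet files under the given S3 prefix.
# The table definition is stored in the Glue Data Catalog.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    event_id string,
    user_id string,
    event_time timestamp
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},                    # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder results bucket
)
```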

Primary In-Place Query Engines

AWS offers several powerful services that can directly query data stored in Amazon S3.

1. Amazon Athena

  • What it is: A serverless, interactive query service that makes it easy to analyze data in S3 using standard SQL.

  • How it works: You point Athena at your data in S3 (defined as a table in the Glue Data Catalog), and you can start querying. There is no infrastructure to manage.

  • Use Case: The go-to service for ad-hoc, exploratory analysis of your data lake. It's perfect for data scientists and analysts who need to quickly ask questions of the data without setting up a data warehouse.

  • Pricing: Pay-per-query, based on the amount of data scanned from S3.
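
As a sketch of how an ad-hoc query might be issued programmatically (the console and JDBC/ODBC drivers work the same way conceptually), the example below submits a SQL statement against the table defined earlier, polls until it finishes, and prints the rows. Database, table, and bucket names are assumptions.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit an ad-hoc SQL query against data that lives in S3.
run = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) AS events FROM events GROUP BY user_id LIMIT 10",
    QueryExecutionContext={"Database": "analytics"},                    # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```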

2. Amazon Redshift Spectrum

  • What it is: A feature of Amazon Redshift that allows you to run SQL queries against data stored in your S3 data lake.

  • How it works: Redshift Spectrum extends your Redshift data warehouse to S3. You can write a single query that joins data stored locally in your Redshift cluster with "external" data stored in S3.

  • Use Case: Ideal when you have frequently accessed, "hot" data in a high-performance Redshift cluster, but also need to query large volumes of "warm" or "cold" data in S3. It's best for more complex, regular, and performance-sensitive queries that are part of established BI dashboards and reports.

  • Pricing: Pay-per-query, based on the amount of data scanned from S3 (separate from the cost of your running Redshift cluster).
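
The Spectrum pattern looks roughly like the sketch below: an external schema is created once, pointing at a Glue Data Catalog database, after which queries can join local Redshift tables with external S3-backed tables. It uses the Redshift Data API via boto3; the cluster, role, database, and table names are all assumptions.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# One-time setup: expose a Glue Data Catalog database as an external schema.
create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
"""

# A single query that joins a local Redshift table with an external S3 table.
join_query = """
SELECT c.customer_name, COUNT(*) AS events
FROM customers c          -- "hot" data stored in the Redshift cluster
JOIN spectrum.events e    -- "cold" data stored in S3
  ON c.customer_id = e.user_id
GROUP BY c.customer_name
"""

for sql in (create_schema, join_query):
    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",  # placeholder cluster name
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```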

Comparison: Athena vs. Redshift Spectrum

| Feature | Amazon Athena | Amazon Redshift Spectrum |
| :--- | :--- | :--- |
| Primary Purpose | Ad-hoc, interactive querying of S3 data. | Extending a Redshift data warehouse to query S3 data. |
| Infrastructure | Fully serverless. No management required. | Part of a provisioned Redshift cluster. |
| Data Joining | Can join tables within S3. | Can join tables in Redshift with tables in S3. |
| Performance | Good for most queries. | Optimized for complex queries and large joins, leveraging the Redshift query planner. |
| Ideal User | Data analysts, data scientists, anyone needing quick insights from the data lake. | BI engineers, analysts using established dashboards connected to Redshift. |

Other In-Place Query Services

  • Amazon S3 Select & Glacier Select: These are not full query engines, but optimization features. They allow you to use a simple SQL expression to retrieve only a subset of data from a single S3 object or Glacier archive, instead of downloading the entire file. This can drastically reduce the amount of data transferred, improving performance and lowering costs for applications that need to access specific parts of large objects.
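
For completeness, here is a minimal sketch of S3 Select retrieving a filtered subset of a single CSV object without downloading the whole file. The bucket, key, and column names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to evaluate a simple SQL expression server-side and return only matching rows.
response = s3.select_object_content(
    Bucket="my-data-lake",                # placeholder bucket
    Key="raw/events/events.csv",          # placeholder object key
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.event_time FROM s3object s WHERE s.event_type = 'purchase'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; collect the record chunks.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```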

Best Practices for Performance and Cost

To make in-place querying effective, you must optimize the data stored in S3:

  • Use Columnar Formats: Convert data to Apache Parquet or ORC. This is the single most important optimization, because query engines then read only the columns a query references, sharply reducing the data scanned (and therefore the cost).

  • Compress Your Data: Use compression like Snappy or GZIP. Columnar formats often have this built-in.

  • Partition Your Data: Organize your data into logical partitions (e.g., by date) so that query engines can skip irrelevant data.
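
All three practices can often be applied in a single step with an Athena CTAS (CREATE TABLE AS SELECT) statement, which rewrites an existing table as compressed, partitioned Parquet. A minimal sketch follows; the table, database, and bucket names are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Rewrite a raw table as Snappy-compressed Parquet, partitioned by date,
# so future queries scan only the columns and partitions they need.
ctas = """
CREATE TABLE events_optimized
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-data-lake/optimized/events/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT event_id, user_id, event_time, CAST(event_time AS date) AS event_date
FROM events
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},                    # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```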