Amazon Athena Cheat Sheet
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Key Features
- Serverless Architecture: No need to set up or manage any servers.
- Standard SQL: Built on Presto (Trino in newer engine versions), so its standard SQL is familiar to most analysts and developers.
- Pay-per-query: You are charged for the amount of data scanned by each query.
- S3 Integration: Directly queries data stored in your Amazon S3 buckets.
- Multiple Data Formats: Supports a variety of data formats, including CSV, JSON, ORC, Avro, and Parquet.
- Federated Queries: Can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
How Athena Works
- Point to your data in S3: Define a schema for your data and create a table in Athena. The data remains in your S3 bucket.
- Query with standard SQL: Use the Athena console, an API, or a JDBC/ODBC connection to run queries.
- Get results: Athena executes the query and returns results, which can be downloaded or visualized with tools like Amazon QuickSight.
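The three steps above can be sketched as a minimal table definition plus a query. The bucket path, table name, and columns are illustrative, assuming JSON-formatted access logs:

```sql
-- Define a schema over data that stays in S3 (names are illustrative).
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  request_time TIMESTAMP,
  client_ip    STRING,
  status_code  INT,
  url          STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/logs/';

-- Query it with standard SQL; Athena reads the files directly from S3.
SELECT status_code, COUNT(*) AS requests
FROM access_logs
GROUP BY status_code
ORDER BY requests DESC;
```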
Data Types
Athena supports a variety of data types to handle different kinds of data structures.
Simple Data Types
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- DOUBLE
- FLOAT
- DECIMAL
- VARCHAR (strings)
- DATE
- TIMESTAMP
- BINARY
Complex Data Types
- ARRAY: A list of values of the same type.
- MAP: A collection of key-value pairs.
- STRUCT: A collection of elements of different types.
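A minimal sketch of declaring and querying the complex types (table and column names are hypothetical):

```sql
-- Illustrative table mixing complex types.
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id  STRING,
  items     ARRAY<STRING>,
  metadata  MAP<STRING, STRING>,
  customer  STRUCT<name: STRING, city: STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/orders/';

-- Arrays are 1-indexed, maps are accessed by key,
-- and struct fields use dot notation.
SELECT
  items[1]           AS first_item,
  metadata['source'] AS source,
  customer.city      AS city
FROM orders;
```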
Athena Federated Queries
This feature allows Athena to query data sources other than Amazon S3. It uses Data Source Connectors that run as AWS Lambda functions to process the queries.
- You can query relational, non-relational, and other data stores.
- AWS provides pre-built connectors for popular data sources like:
  - Amazon DynamoDB
  - Amazon Redshift
  - Amazon ElastiCache for Redis
  - Amazon OpenSearch Service
  - MySQL
  - PostgreSQL
  - Oracle
  - SQL Server
- You can also build your own custom data source connectors using the Athena Query Federation SDK.
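Once a connector is deployed, the federated source is queried like any other catalog by prefixing the table with the catalog name. A sketch, assuming a DynamoDB connector registered under the catalog name `ddb` (the catalog, schema, and table names are all illustrative):

```sql
-- Query a non-S3 source through its Lambda-backed connector.
SELECT customer_id, last_login
FROM "ddb"."default"."customers"
WHERE last_login > DATE '2024-01-01';
```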
Performance Optimization
To improve query speed and reduce costs, consider the following best practices:
1. Partition Your Data
Partitioning divides your table into parts and keeps the related data together based on column values. Athena can then read only the relevant partitions for a query, drastically reducing the amount of data scanned.
- Common partitioning schemes include date, region, or department.
- Partitions are structured as "folders" in your S3 bucket (e.g., `s3://your-bucket/logs/year=2024/month=06/day=23/`).
- Use `PARTITIONED BY (column_name data_type)` in your `CREATE TABLE` statement.
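A sketch of a partitioned table matching the folder layout above (table and column names are illustrative):

```sql
-- Partition columns mirror the year=/month=/day= "folders" in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  message STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/logs/';

-- Register the partitions found in S3.
MSCK REPAIR TABLE logs;

-- Filtering on partition columns lets Athena scan only the
-- matching "folders" instead of the whole table.
SELECT message
FROM logs
WHERE year = '2024' AND month = '06' AND day = '23';
```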
2. Use Columnar Formats
Convert your data to columnar formats like Apache Parquet or Apache ORC.
- Columnar storage stores data by column instead of by row.
- Athena can selectively read only the columns needed for a query, reducing I/O and the amount of data scanned.
- These formats often provide better compression than row-based formats.
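One common way to convert existing data is a CREATE TABLE AS SELECT (CTAS) query; the source table and output location below are illustrative:

```sql
-- Rewrite an existing table's data as Parquet in a new S3 location.
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/logs-parquet/'
) AS
SELECT * FROM logs;
```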
3. Compress Your Data
Compressing your data files can significantly speed up queries by reducing the amount of data that needs to be read from S3.
- Supported compression formats include `GZIP`, `ZLIB`, `SNAPPY`, and `LZO`.
- When using columnar formats like Parquet or ORC, compression is often built-in.
- Ensure that your files are splittable. Large, unsplittable files can limit parallelism. `GZIP` files are not splittable; `BZIP2` files are, and `LZO` files are splittable only when indexed.
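In CTAS queries the codec can be chosen explicitly via the `write_compression` table property; a sketch with illustrative names:

```sql
-- Write ORC output with ZLIB compression instead of the default codec.
CREATE TABLE logs_orc
WITH (
  format = 'ORC',
  write_compression = 'ZLIB',
  external_location = 's3://your-bucket/logs-orc/'
) AS
SELECT * FROM logs;
```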
4. Optimize File Sizes
Aim for an optimal file size between 200MB and 1GB.
- Too many small files can lead to overhead in file processing.
- A single, very large file cannot be parallelized effectively.
Cost Control
Since Athena pricing is based on the amount of data scanned, controlling data scan size is key to managing costs.
- Use Partitions and Columnar Formats: These are the most effective ways to reduce data scanned.
- Set up Workgroup Limits: Create workgroups to separate users, teams, or applications. You can enforce data usage controls for each workgroup:
  - Per-query limit: Stop a query if it exceeds a defined data scan threshold.
  - Aggregate workgroup limit: Set hourly or daily data usage limits for the entire workgroup.
- Monitor your usage: Use Amazon CloudWatch to track metrics like `DataScannedInBytes` and set up alarms.
Partition Projection
Partition projection is a feature that speeds up query processing of highly partitioned tables. Instead of retrieving partition metadata from a repository like the AWS Glue Data Catalog, it calculates the partition information based on a configuration.
- Use Cases: Ideal for tables with a large number of partitions that follow a predictable pattern (e.g., time-series data).
- Configuration: You define the partition scheme (e.g., ranges, injected values) in the table properties in the AWS Glue Data Catalog.
- Benefits: Reduces the time spent on partition metadata management, resulting in faster query performance for highly partitioned tables. You don't need to run `MSCK REPAIR TABLE` to add new partitions.
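A sketch of a date-projected table, with the bucket, column, and date range illustrative; the `projection.*` and `storage.location.template` properties tell Athena how to compute partition locations instead of looking them up in the catalog:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  payload STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/events/'
TBLPROPERTIES (
  -- Compute partitions from this configuration; no MSCK REPAIR TABLE needed.
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2024/01/01,NOW',
  'projection.dt.format'      = 'yyyy/MM/dd',
  'storage.location.template' = 's3://your-bucket/events/${dt}/'
);
```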