Amazon Athena Cheat Sheet
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Key Features
- Serverless Architecture: No need to set up or manage any servers.
- Standard SQL: Built on Presto (Trino in newer engine versions), so its standard SQL is familiar to most analysts and developers.
- Pay-per-query: You are charged for the amount of data scanned by each query.
- S3 Integration: Directly queries data stored in your Amazon S3 buckets.
- Multiple Data Formats: Supports a variety of data formats, including CSV, JSON, ORC, Avro, and Parquet.
- Federated Queries: Can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
How Athena Works
- Point to your data in S3: Define a schema for your data and create a table in Athena. The data remains in your S3 bucket.
- Query with standard SQL: Use the Athena console, an API, or a JDBC/ODBC connection to run queries.
- Get results: Athena executes the query and returns results, which can be downloaded or visualized with tools like Amazon QuickSight.
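The three steps above can be sketched as a minimal table definition plus a query. The bucket path, table name, and columns are illustrative, assuming JSON-formatted access logs:

```sql
-- Define a schema over data that stays in S3 (names are illustrative).
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  request_time TIMESTAMP,
  client_ip    STRING,
  status_code  INT,
  url          STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/logs/';

-- Query it with standard SQL; Athena reads the files directly from S3.
SELECT status_code, COUNT(*) AS requests
FROM access_logs
GROUP BY status_code
ORDER BY requests DESC;
```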
Data Types
Athena supports a variety of data types to handle different kinds of data structures.
Simple Data Types
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- DOUBLE
- FLOAT
- DECIMAL
- VARCHAR (strings)
- DATE
- TIMESTAMP
- BINARY
Complex Data Types
- ARRAY: A list of values of the same type.
- MAP: A collection of key-value pairs.
- STRUCT: A collection of elements of different types.
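A minimal sketch of declaring and querying the complex types (table and column names are hypothetical):

```sql
-- Illustrative table mixing complex types.
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id  STRING,
  items     ARRAY<STRING>,
  metadata  MAP<STRING, STRING>,
  customer  STRUCT<name: STRING, city: STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/orders/';

-- Arrays are 1-indexed, maps are accessed by key,
-- and struct fields use dot notation.
SELECT
  items[1]           AS first_item,
  metadata['source'] AS source,
  customer.city      AS city
FROM orders;
```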
Athena Federated Queries
This feature allows Athena to query data sources other than Amazon S3. It uses Data Source Connectors that run as AWS Lambda functions to process the queries.
- You can query relational, non-relational, and other data stores.
- AWS provides pre-built connectors for popular data sources like:
  - Amazon DynamoDB
  - Amazon Redshift
  - Amazon ElastiCache for Redis
  - Amazon OpenSearch Service
  - MySQL
  - PostgreSQL
  - Oracle
  - SQL Server
- You can also build your own custom data source connectors using the Athena Query Federation SDK.
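Once a connector is deployed, the federated source is queried like any other catalog by prefixing the table with the catalog name. A sketch, assuming a DynamoDB connector registered under the catalog name `ddb` (the catalog, schema, and table names are all illustrative):

```sql
-- Query a non-S3 source through its Lambda-backed connector.
SELECT customer_id, last_login
FROM "ddb"."default"."customers"
WHERE last_login > DATE '2024-01-01';
```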
Performance Optimization
To improve query speed and reduce costs, consider the following best practices:
1. Partition Your Data
Partitioning divides your table into parts and keeps the related data together based on column values. Athena can then read only the relevant partitions for a query, drastically reducing the amount of data scanned.
- Common partitioning schemes include date, region, or department.
- Partitions are structured as "folders" in your S3 bucket (e.g., `s3://your-bucket/logs/year=2024/month=06/day=23/`).
- Use `PARTITIONED BY (column_name data_type)` in your `CREATE TABLE` statement.
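A sketch of a partitioned table matching the folder layout above (table and column names are illustrative):

```sql
-- Partition columns mirror the year=/month=/day= "folders" in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  message STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/logs/';

-- Register the partitions found in S3.
MSCK REPAIR TABLE logs;

-- Filtering on partition columns lets Athena scan only the
-- matching "folders" instead of the whole table.
SELECT message
FROM logs
WHERE year = '2024' AND month = '06' AND day = '23';
```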
2. Use Columnar Formats
Convert your data to columnar formats like Apache Parquet or Apache ORC.
- Columnar storage stores data by column instead of by row.
- Athena can selectively read only the columns needed for a query, reducing I/O and the amount of data scanned.
- These formats often provide better compression than row-based formats.
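One common way to convert existing data is a CREATE TABLE AS SELECT (CTAS) query; the source table and output location below are illustrative:

```sql
-- Rewrite an existing table's data as Parquet in a new S3 location.
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/logs-parquet/'
) AS
SELECT * FROM logs;
```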
3. Compress Your Data
Compressing your data files can significantly speed up queries by reducing the amount of data that needs to be read from S3.
- Supported compression formats include `GZIP`, `ZLIB`, `SNAPPY`, and `LZO`.
- When using columnar formats like Parquet or ORC, compression is often built-in.
- Ensure that your files are splittable. Large, unsplittable files can limit parallelism. `GZIP` files are not splittable; `BZIP2` files are, and `LZO` files are splittable only when indexed.
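In CTAS queries the codec can be chosen explicitly via the `write_compression` table property; a sketch with illustrative names:

```sql
-- Write ORC output with ZLIB compression instead of the default codec.
CREATE TABLE logs_orc
WITH (
  format = 'ORC',
  write_compression = 'ZLIB',
  external_location = 's3://your-bucket/logs-orc/'
) AS
SELECT * FROM logs;
```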
4. Optimize File Sizes
Aim for an optimal file size between 200MB and 1GB.
- Too many small files can lead to overhead in file processing.
- A single, very large file cannot be parallelized effectively.
Cost Control
Since Athena pricing is based on the amount of data scanned, controlling data scan size is key to managing costs.
- Use Partitions and Columnar Formats: These are the most effective ways to reduce data scanned.
- Set up Workgroup Limits: Create workgroups to separate users, teams, or applications. You can enforce data usage controls for each workgroup:
  - Per-query limit: Stop a query if it exceeds a defined data scan threshold.
  - Aggregate workgroup limit: Set hourly or daily data usage limits for the entire workgroup.
- Monitor your usage: Use Amazon CloudWatch to track metrics like `DataScannedInBytes` and set up alarms.
Partition Projection
Partition projection is a feature that speeds up query processing of highly partitioned tables. Instead of retrieving partition metadata from a repository like the AWS Glue Data Catalog, it calculates the partition information based on a configuration.
- Use Cases: Ideal for tables with a large number of partitions that follow a predictable pattern (e.g., time-series data).
- Configuration: You define the partition scheme (e.g., ranges, injected values) in the table properties in the AWS Glue Data Catalog.
- Benefits: Reduces the time spent on partition metadata management, resulting in faster query performance for highly partitioned tables. You don't need to run `MSCK REPAIR TABLE` to add new partitions.
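A sketch of a date-projected table, with the bucket, column, and date range illustrative; the `projection.*` and `storage.location.template` properties tell Athena how to compute partition locations instead of looking them up in the catalog:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  payload STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/events/'
TBLPROPERTIES (
  -- Compute partitions from this configuration; no MSCK REPAIR TABLE needed.
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2024/01/01,NOW',
  'projection.dt.format'      = 'yyyy/MM/dd',
  'storage.location.template' = 's3://your-bucket/events/${dt}/'
);
```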