AWS Lake Formation Cheat Sheet
AWS Lake Formation is a managed service that makes it easy to set up, secure, and manage a data lake in a matter of days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis.
What does Lake Formation do?
Lake Formation simplifies and automates many of the complex manual steps usually required to create a data lake. Its primary functions are:
- Data Ingestion and Management: Helps ingest data from various sources into an Amazon S3-based data lake.
- Centralized Security & Governance: Provides a single place to define and enforce fine-grained data access policies for all your data lake users and services.
- Auditing: Tracks all data access through AWS CloudTrail for comprehensive auditing.
Core Concepts
1. Building the Data Lake
- Data Lake Locations: You register Amazon S3 paths with Lake Formation to designate them as part of your data lake. Lake Formation then manages permissions for the data within these registered locations.
- Blueprints: Predefined templates for ingesting data from common sources into your data lake.
  - Database Blueprint: Ingests data from relational databases (such as Amazon RDS or on-premises MySQL/PostgreSQL) into S3. It can perform a full initial snapshot and then incremental ingestion for ongoing updates.
  - Log File Blueprint: Ingests data from common log file formats (e.g., Application Load Balancer logs, CloudTrail logs) into S3.
- Workflows: When you use a blueprint, Lake Formation generates an AWS Glue workflow: a collection of crawlers, jobs, and triggers that orchestrates the entire ingestion process.
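Registering a data lake location can also be scripted. The following is a minimal sketch, assuming boto3 and the register_resource call on its lakeformation client; the bucket name and prefix are hypothetical placeholders. The request is built as a plain dictionary so it can be inspected before anything is sent to AWS.

```python
def build_register_request(bucket: str, prefix: str = "") -> dict:
    """Build keyword arguments for lakeformation.register_resource."""
    cleaned = prefix.strip("/")
    path = f"{bucket}/{cleaned}" if cleaned else bucket
    return {
        "ResourceArn": f"arn:aws:s3:::{path}",
        # Let Lake Formation use its service-linked role to reach the location.
        "UseServiceLinkedRole": True,
    }


def register_location(bucket: str, prefix: str = "") -> None:
    """Register the S3 path with Lake Formation (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("lakeformation")
    client.register_resource(**build_register_request(bucket, prefix))


# Hypothetical bucket and prefix, for illustration only.
request = build_register_request("example-data-lake-bucket", "raw/sales/")
print(request["ResourceArn"])
```

Only register_location talks to AWS; the builder can be reviewed or unit-tested offline.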
2. The Data Catalog
- Lake Formation uses the AWS Glue Data Catalog as its central metadata repository.
- When you crawl a data source or ingest data using a blueprint, the metadata (schemas, table definitions, etc.) is stored in the Data Catalog.
- Lake Formation extends the Data Catalog with its own permissions model, turning it into the central point of control for your data lake.
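Because permissions are attached to Data Catalog resources, they can be inspected per table. The sketch below assumes boto3's list_permissions call on the lakeformation client and summarizes its response by principal. The sample response, account ID, and role name are made up for illustration, and only the first page of results is handled.

```python
from collections import defaultdict


def summarize_permissions(response: dict) -> dict:
    """Map each principal to the sorted list of permissions it holds."""
    summary = defaultdict(set)
    for entry in response.get("PrincipalResourcePermissions", []):
        principal = entry["Principal"]["DataLakePrincipalIdentifier"]
        summary[principal].update(entry.get("Permissions", []))
    return {principal: sorted(perms) for principal, perms in summary.items()}


def fetch_table_permissions(database: str, table: str) -> dict:
    """Query Lake Formation for one table (needs AWS credentials)."""
    import boto3  # imported lazily so the summary logic works without boto3

    client = boto3.client("lakeformation")
    # Simplification: ignores NextToken, so only the first page is read.
    response = client.list_permissions(
        Resource={"Table": {"DatabaseName": database, "Name": table}}
    )
    return summarize_permissions(response)


# Hypothetical response shaped like the list_permissions output.
ANALYST = "arn:aws:iam::111122223333:role/analyst"
sample = {
    "PrincipalResourcePermissions": [
        {"Principal": {"DataLakePrincipalIdentifier": ANALYST}, "Permissions": ["SELECT"]},
        {"Principal": {"DataLakePrincipalIdentifier": ANALYST}, "Permissions": ["DESCRIBE"]},
    ]
}
print(summarize_permissions(sample))
```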
3. Securing the Data Lake
This is the most critical feature of Lake Formation. It provides a simple, centralized permission model that works across multiple AWS analytics services.
- Lake Formation Permissions: Instead of managing complex S3 bucket policies and separate IAM policies for each service, you grant and revoke permissions on Data Catalog resources (databases, tables, columns) directly within Lake Formation.
- Fine-Grained Access Control: Lake Formation allows you to control access at the following levels:
  - Database-level
  - Table-level
  - Column-level: Grant a user access to only a subset of columns in a table.
  - Row-level (Data Filtering): Grant access to only specific rows based on filter conditions. For example, a user in the 'Sales-US' group can only see rows where country = 'USA'.
  - Cell-level: Achieved by combining column-level security and row-level filtering.
- Credential Vending: When a user or service queries data, it requests access from Lake Formation. Lake Formation checks its permissions model and, if the user is authorized, provides temporary, short-term credentials that grant access only to the specific S3 data required for that query. Users therefore need no standing S3 permissions of their own, which is what keeps them from bypassing Lake Formation and reading the data directly from S3.
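The column-level and row-level controls above can be expressed as API requests. This is a sketch, assuming boto3's grant_permissions and create_data_cells_filter calls on the lakeformation client; the account ID, role, database, table, and filter names are hypothetical. The builders return plain dictionaries so the payloads can be checked before they are sent.

```python
def column_grant(principal_arn: str, database: str, table: str, columns: list) -> dict:
    """Keyword arguments for grant_permissions: SELECT on selected columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def row_filter(account_id: str, database: str, table: str, name: str,
               expression: str, columns: list = None) -> dict:
    """Keyword arguments for create_data_cells_filter.

    Passing `columns` as well as a row expression gives cell-level control.
    """
    table_data = {
        "TableCatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        "RowFilter": {"FilterExpression": expression},
    }
    if columns:
        table_data["ColumnNames"] = list(columns)
    else:
        table_data["ColumnWildcard"] = {}  # all columns
    return {"TableData": table_data}


def apply(grant: dict, data_filter: dict) -> None:
    """Send both requests to Lake Formation (needs AWS credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("lakeformation")
    client.grant_permissions(**grant)
    client.create_data_cells_filter(**data_filter)


# Hypothetical names, for illustration only.
grant = column_grant("arn:aws:iam::111122223333:role/Sales-US",
                     "sales_db", "customers", ["name", "email"])
us_rows = row_filter("111122223333", "sales_db", "customers",
                     "us_only", "country = 'USA'")
```

Creating a filter does not by itself give anyone access; the filter still has to be granted to a principal in a separate grant.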
How it Works: Access Control Flow
- A user in an integrated AWS service (like Amazon Athena, Redshift Spectrum, or an EMR job) runs a query (e.g., SELECT name, email FROM customers WHERE country = 'CA').
- The service sends the query to Lake Formation for authorization.
- Lake Formation checks its permissions for that user. It verifies:
  - Does the user have SELECT permission on the customers table?
  - Does the user have access to the name and email columns?
  - Are there any row-level filters that apply to this user?
- If the query is authorized, Lake Formation vends temporary credentials back to the service.
- The service uses these temporary credentials to read the necessary data from S3, applies any required filtering, and returns the result to the user.
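The checks in this flow can be illustrated with a toy in-memory model. This is not the Lake Formation API: the Grant class, user names, and sample rows are all hypothetical, credential vending is omitted, and only the projected columns are checked (a real engine would also account for columns used in the WHERE clause).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grant:
    """SELECT grant for one user on one table (toy model)."""
    columns: frozenset
    row_filter: Optional[Callable[[dict], bool]] = None


def authorize(grants: dict, user: str, table: str, columns: list) -> Grant:
    """Check table-level, then column-level, permissions."""
    grant = grants.get((user, table))
    if grant is None:
        raise PermissionError(f"{user} has no SELECT permission on {table}")
    denied = sorted(set(columns) - grant.columns)
    if denied:
        raise PermissionError(f"{user} cannot read columns {denied}")
    return grant


def run_query(grants, tables, user, table, columns, where):
    """Authorize, apply the row-level filter, then evaluate the query."""
    grant = authorize(grants, user, table, columns)
    visible = [row for row in tables[table]
               if grant.row_filter is None or grant.row_filter(row)]
    return [{c: row[c] for c in columns} for row in visible if where(row)]


def in_canada(row: dict) -> bool:
    return row["country"] == "CA"


TABLES = {"customers": [
    {"name": "Ana", "email": "ana@example.com", "country": "CA"},
    {"name": "Bo", "email": "bo@example.com", "country": "USA"},
]}
GRANTS = {
    ("analyst", "customers"): Grant(frozenset({"name", "email", "country"})),
    ("sales_us", "customers"): Grant(frozenset({"name", "country"}),
                                     lambda row: row["country"] == "USA"),
}

# SELECT name, email FROM customers WHERE country = 'CA'
print(run_query(GRANTS, TABLES, "analyst", "customers", ["name", "email"], in_canada))
```

In this model the sales_us user is rejected when asking for email (no column access) and gets an empty result when asking only for name, because the row filter hides the Canadian row before the WHERE clause is evaluated.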
Additional Features
- Governed Tables: A table type in Lake Formation that supports ACID (Atomicity, Consistency, Isolation, Durability) transactions. It brings database-like reliability to your S3 data lake, letting multiple users concurrently and reliably read and write data, and it supports automatic data compaction and time-travel queries. AWS has since announced the end of support for governed tables in favor of open table formats such as Apache Iceberg, so check the current documentation before adopting them.
- Tag-Based Access Control (LF-TBAC): You can assign tags to your Data Catalog resources (databases, tables, columns) and then define policies based on those tags. This simplifies managing permissions at scale. For example, you can grant an IAM principal access to all tables tagged as "PII" (Personally Identifiable Information).
- Cross-Account Access: Securely share data lake resources with other AWS accounts without copying data. You create "resource links" in your account that point to shared tables in another account, and Lake Formation manages the cross-account permissions.
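Tag-based grants and resource links can be sketched the same way. The example assumes boto3's grant_permissions call on the lakeformation client (with an LFTagPolicy resource) and create_table call on the glue client (with a TargetTable pointing at the shared table). LF-Tags are key-value pairs, so the "PII" label is modeled here as a hypothetical tag key "classification" with value "PII"; all account IDs and names are placeholders.

```python
def lf_tag_grant(principal_arn: str, tag_key: str, tag_values: list) -> dict:
    """Keyword arguments for grant_permissions on all tables matching a tag."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def resource_link(local_db: str, link_name: str, owner_account: str,
                  shared_db: str, shared_table: str) -> dict:
    """Keyword arguments for glue.create_table that create a resource link."""
    return {
        "DatabaseName": local_db,
        "TableInput": {
            "Name": link_name,
            "TargetTable": {
                "CatalogId": owner_account,  # account that owns the shared table
                "DatabaseName": shared_db,
                "Name": shared_table,
            },
        },
    }


def apply(grant: dict, link: dict) -> None:
    """Send both requests (needs AWS credentials in the right accounts)."""
    import boto3  # imported lazily so the builders work without boto3

    boto3.client("lakeformation").grant_permissions(**grant)
    boto3.client("glue").create_table(**link)


# Hypothetical names, for illustration only.
pii_grant = lf_tag_grant("arn:aws:iam::111122223333:role/privacy-team",
                         "classification", ["PII"])
link = resource_link("analytics_db", "customers_link",
                     "444455556666", "sales_db", "customers")
```

In practice the grant runs in the account that owns the tagged tables, while the resource link is created in the consuming account after the table has been shared with it.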