AWS Glue Cheat Sheet

AWS Glue is a fully managed, serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development. It is a foundational service for building data lakes and data pipelines on AWS.

Core Components of AWS Glue

1. AWS Glue Data Catalog

  • The Glue Data Catalog is a centralized, persistent metadata repository for all your data assets, regardless of where they are located. It acts as a drop-in replacement for an Apache Hive metastore.

  • Database: A logical grouping of tables in the Data Catalog.

  • Table: The metadata definition that represents your data's schema. It contains the column names and data types, partition information, and the physical location of the data (e.g., an S3 path).

  • Crawler: A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates or updates metadata tables in your Glue Data Catalog. Crawlers can run on a schedule to detect and catalog new data and schema changes automatically; a minimal crawler setup is sketched after this list.

  • Classifier: Determines the schema of your data. Glue provides classifiers for common file types like CSV, JSON, Parquet, and Avro, and you can also write your own custom classifiers.
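
As a rough illustration, the sketch below uses boto3 to create and start a crawler against an S3 prefix. The bucket, database name, and IAM role are placeholder assumptions, not values from this cheat sheet; the role must grant Glue read access to the bucket.

```python
# Minimal sketch: create a database, a crawler over an S3 prefix, and run it.
# All names, the S3 path, and the role ARN are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a database to hold the tables the crawler discovers.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# Point the crawler at an S3 prefix; classifiers infer the schema.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: re-crawl nightly at 02:00 UTC
)

# Run it once on demand; tables appear in the Data Catalog when it finishes.
glue.start_crawler(Name="sales-raw-crawler")
```

Once the crawler finishes, the inferred tables are visible to any service that reads the Data Catalog, such as Athena, Redshift Spectrum, or a Glue ETL job.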

2. AWS Glue ETL (Extract, Transform, and Load)

  • Glue provides a managed environment to create, run, and monitor ETL jobs to transform and move data.

  • ETL Jobs: A job is the business logic that performs the ETL work. It consists of a transformation script, data sources, and data targets.

    • Spark Jobs: For large-scale ETL processing, Glue runs your ETL script in a serverless Apache Spark environment. You can write your script in Python (PySpark) or Scala. Glue automatically provisions, manages, and scales the required resources; a minimal job script is sketched after this list.

    • Python Shell Jobs: For smaller, non-parallel workloads or tasks that don't require Spark, you can run Python shell jobs. These are suitable for simple data transformations or automation scripts.

    • Ray Jobs: For large-scale Python workloads, you can use the Ray engine, an open-source framework for scaling Python.

  • DynamicFrames: A distributed, table-like structure that is an abstraction over Apache Spark DataFrames. DynamicFrames are designed to handle messy or evolving schemas; for example, the resolveChoice transform lets you settle on a single data type for a column that arrives with multiple types (see the job script sketch after this list).

  • Glue Studio: A graphical interface that makes it easy to create, run, and monitor ETL jobs. You can visually compose data transformation workflows and Glue Studio generates the code for you.
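
To make the Spark job and DynamicFrame ideas concrete, here is a minimal job-script sketch, assuming a source table like the one cataloged by the crawler above; the database, table, column, and output path names are all placeholders.

```python
# Minimal Glue Spark job sketch. Assumes a cataloged table "sales_db.sales_raw";
# table, column, and path names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source through the Data Catalog into a DynamicFrame.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="sales_raw"
)

# Resolve a messy column that arrives as both string and int.
resolved = raw.resolveChoice(specs=[("order_id", "cast:long")])

# Keep only the columns downstream consumers need.
curated = resolved.select_fields(["order_id", "customer_id", "amount", "order_date"])

# Write the curated data to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/sales/"},
    format="parquet",
)

job.commit()
```

A script like this could equally be produced visually in Glue Studio; writing it by hand simply makes the DynamicFrame calls explicit.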

3. AWS Glue Workflows

  • Workflows allow you to orchestrate and chain together multiple crawlers, ETL jobs, and triggers to create complex, multi-step ETL pipelines.

  • You can design workflows that run jobs in sequence or in parallel, and you can define conditional logic based on the success or failure of previous steps, as in the sketch below.
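
A hedged sketch of such a pipeline, chaining the crawler from earlier with an ETL job named sales-etl-job (a placeholder), follows. It uses boto3 to create a workflow, an on-demand trigger, and a conditional trigger.

```python
# Minimal workflow sketch: crawl first, then run the ETL job only if the
# crawl succeeds. The crawler, job, and workflow names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="sales-pipeline")

# An on-demand trigger starts the crawler as the first step of the workflow.
glue.create_trigger(
    Name="start-crawl",
    WorkflowName="sales-pipeline",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "sales-raw-crawler"}],
)

# A conditional trigger runs the ETL job only after the crawl succeeds.
glue.create_trigger(
    Name="crawl-then-etl",
    WorkflowName="sales-pipeline",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "sales-raw-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "sales-etl-job"}],
)

# Kick off the whole pipeline.
glue.start_workflow_run(Name="sales-pipeline")
```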

Key Features

  • Serverless: There are no servers to provision or manage. AWS Glue handles the underlying infrastructure, allowing you to focus on your data.

  • Integrated: Tightly integrated with a wide range of AWS services, including S3, Redshift, RDS, and Athena. The Glue Data Catalog is often used as the central metastore for services like Athena, Redshift Spectrum, and EMR.

  • Data Discovery: Crawlers automate the process of discovering datasets and determining their schemas, significantly reducing the manual effort required to build a data catalog.

  • Job Bookmarks: Glue can track data that has already been processed during a previous run of an ETL job. This prevents reprocessing of old data and lets recurring runs pick up only new data efficiently; see the bookmark sketch after this list.

  • Schema Registry: A feature that allows you to centrally discover, control, and evolve data stream schemas. It helps ensure that the data structure produced by data producers matches what consumers expect; a registration sketch follows this list.

  • Data Quality: AWS Glue Data Quality helps you evaluate and monitor the quality of your data. You can define rules to check for things like completeness, accuracy, and timeliness, and then take action based on the results; a ruleset sketch follows this list.

  • Sensitive Data Detection: Glue can automatically identify and mask personally identifiable information (PII) and other sensitive data within your data pipeline.
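
Job bookmarks are enabled per job through the --job-bookmark-option default argument. The sketch below shows this while creating a job with boto3; the job name, role, and script location are placeholders, and the job script itself must call job.init() and job.commit() (as in the ETL sketch above) for the bookmark state to be saved.

```python
# Minimal sketch: create a Spark job with job bookmarks enabled.
# Job name, role ARN, and script location are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="sales-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/sales_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Track processed data between runs so only new files are read.
        "--job-bookmark-option": "job-bookmark-enable",
    },
    GlueVersion="4.0",
    NumberOfWorkers=2,
    WorkerType="G.1X",
)
```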
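For the Schema Registry, a minimal registration sketch follows, assuming a hypothetical registry and an Avro-encoded stream; the registry name, schema name, and schema definition are illustrative.

```python
# Minimal sketch: register an Avro schema in the Glue Schema Registry.
# Registry and schema names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_registry(RegistryName="streaming-schemas")

glue.create_schema(
    RegistryId={"RegistryName": "streaming-schemas"},
    SchemaName="orders-value",
    DataFormat="AVRO",
    Compatibility="BACKWARD",  # consumers keep working as the schema evolves
    SchemaDefinition=(
        '{"type":"record","name":"Order","fields":['
        '{"name":"order_id","type":"long"},'
        '{"name":"amount","type":"double"}]}'
    ),
)
```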
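For Data Quality, rules are written in DQDL (Data Quality Definition Language). The sketch below attaches a small ruleset to a curated table; the table name and the specific rules are illustrative assumptions.

```python
# Minimal sketch: a DQDL ruleset attached to a cataloged table.
# Database, table, and column names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_data_quality_ruleset(
    Name="sales-curated-dq",
    TargetTable={"DatabaseName": "sales_db", "TableName": "sales_curated"},
    Ruleset="""Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "amount" > 0
    ]""",
)
```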

How It Works: A Typical Workflow

  1. Crawl the Data Source: You point an AWS Glue crawler at a data store (e.g., an S3 bucket). The crawler scans the data, infers the schema, and creates a table definition in the AWS Glue Data Catalog.

  2. Author an ETL Job: You use Glue Studio or write a script (Python/Scala) to define your ETL logic. This script reads from a source table in the Data Catalog, applies transformations (e.g., joining, filtering, cleaning), and writes the output to a target.

  3. Run the Job: You run the ETL job on demand or on a schedule. AWS Glue provisions the necessary serverless compute resources, runs your script, and then shuts down the resources once the job is complete. You only pay for the resources used while the job is running.

  4. Query the Data: The transformed data, now in a target location like S3, is also cataloged. You can then use services like Amazon Athena to run interactive SQL queries on this new, curated data (steps 3 and 4 are sketched after this list).

  5. Orchestrate (Optional): You can use AWS Glue Workflows to chain this process together with other jobs or crawlers to create a complete end-to-end data pipeline.
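
A short sketch of steps 3 and 4 follows, assuming the job and curated table names used in the earlier sketches plus a placeholder S3 location for Athena query results.

```python
# Minimal sketch of steps 3 and 4: run the ETL job on demand, then query the
# curated table with Athena. All names and the results bucket are placeholders.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Step 3: start the job; Glue provisions Spark resources for this run only.
run = glue.start_job_run(JobName="sales-etl-job")
print("Started job run:", run["JobRunId"])

# Step 4: once the curated table is cataloged, query it interactively.
query = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM sales_curated GROUP BY customer_id LIMIT 10"
    ),
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print("Athena query id:", query["QueryExecutionId"])
```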