AWS Glue DataBrew Cheat Sheet
AWS Glue DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data without writing any code. It helps to reduce the time it takes to prepare data for analytics and machine learning (ML) by up to 80%.
Who is it for?
While AWS Glue ETL is primarily for data engineers who write code, DataBrew is designed for non-technical users, such as:
- Data Analysts
- Business Analysts
- Data Scientists
- BI Professionals
These users can visually explore, clean, and transform raw data directly from the AWS Management Console.
Core Components
DataBrew's workflow is organized around four main components:
- Dataset:
  - A read-only connection to your source data. DataBrew does not store the data itself.
  - A dataset can be created from a wide variety of sources, including:
    - Amazon S3 (CSV, JSON, Parquet, ORC, etc.)
    - Amazon Redshift
    - Amazon RDS
    - Snowflake
    - Other JDBC-accessible data stores
- Project:
  - A project is your interactive workspace where you profile, clean, and transform your data.
  - You start by selecting a dataset, and DataBrew loads a sample of the data into a visual, spreadsheet-like interface for you to work with.
- Recipe:
  - A recipe is a set of data transformation steps. As you apply transformations (e.g., filtering, merging, pivoting, splitting columns) in your project's visual interface, DataBrew records each action as a step in the recipe.
  - DataBrew offers over 250 pre-built transformations, so you don't need to write code for common data preparation tasks.
  - Published recipe versions are immutable. You can publish multiple versions of a recipe, but once a version is published it cannot be changed.
- Job:
  - A recipe job runs the transformation steps defined in your recipe on your entire source dataset (not just the sample).
  - You can run a recipe job on demand or on a schedule.
  - The output of a job is written to a target location, typically an Amazon S3 bucket, in a format you choose (e.g., CSV, Parquet). A minimal API sketch of the dataset-recipe-job flow follows this list.
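If you prefer to script these components rather than click through the console, the same flow can be expressed with boto3 (the AWS SDK for Python). The sketch below is illustrative only: the dataset, recipe, and job names, the S3 buckets, the IAM role ARN, and the recipe operation shown are placeholder assumptions, not values defined in this sheet.

```python
import boto3

databrew = boto3.client("databrew")

# Placeholder values for illustration only.
ROLE_ARN = "arn:aws:iam::123456789012:role/DataBrewServiceRole"

# Dataset: a read-only pointer to the raw data in S3 (DataBrew does not copy it).
databrew.create_dataset(
    Name="sales-raw",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-raw-bucket", "Key": "sales/2024/orders.csv"}},
)

# Recipe: steps are normally recorded interactively in a project, but they can also
# be registered via the API. The operation name and parameters below are illustrative;
# see the DataBrew recipe actions reference for the exact names.
databrew.create_recipe(
    Name="sales-clean",
    Steps=[
        {
            "Action": {
                "Operation": "REMOVE_DUPLICATES",
                "Parameters": {"sourceColumns": '["order_id"]'},
            }
        }
    ],
)

# Publishing freezes the current working steps as an immutable version (1.0, 2.0, ...).
databrew.publish_recipe(Name="sales-clean")

# Job: applies the published recipe version to the full dataset and writes Parquet to S3.
databrew.create_recipe_job(
    Name="sales-clean-job",
    DatasetName="sales-raw",
    RecipeReference={"Name": "sales-clean", "RecipeVersion": "1.0"},
    RoleArn=ROLE_ARN,
    Outputs=[
        {"Location": {"Bucket": "my-curated-bucket", "Key": "sales/"}, "Format": "PARQUET"}
    ],
)
```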
Key Features
- Visual Interface: An interactive, point-and-click interface that makes data preparation accessible without needing to code.
- Data Profiling: When you create a dataset or a project, you can run a data profile job. This generates a report that provides a deep understanding of your data, including column statistics, data type distribution, cardinality, and potential data quality issues (see the API sketch after this list).
- 250+ Built-in Transformations: A rich library of transformations covers everything from filtering and cleaning data to enriching it with advanced functions like natural language processing (NLP) and computer vision (CV) transformations.
- Data Lineage: DataBrew provides a visual map of your data's journey, showing you the data sources, how the data was transformed, and where the output was sent. This is crucial for traceability and governance.
- Scalability: While you work on a sample of your data in the interactive project view, recipe jobs run on the full dataset in a serverless, distributed environment managed by DataBrew. This means it can process datasets of any size, from kilobytes to terabytes.
- Automation and Scheduling: You can schedule your recipe jobs to run automatically, creating repeatable and automated data preparation pipelines.
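Data profiling and scheduling can likewise be driven through the API. A minimal sketch follows, reusing the hypothetical dataset and job names from the previous example; the bucket, role ARN, and cron expression are assumptions.

```python
import boto3

databrew = boto3.client("databrew")

# Profile job: analyzes the dataset and writes a JSON profile report to S3.
# Dataset name, bucket, and role ARN are the same placeholders used above.
databrew.create_profile_job(
    Name="sales-raw-profile",
    DatasetName="sales-raw",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "my-curated-bucket", "Key": "profiles/sales/"},
)

# Schedule: run the recipe job every day at 06:00 UTC.
databrew.create_schedule(
    Name="daily-sales-clean",
    JobNames=["sales-clean-job"],
    CronExpression="cron(0 6 * * ? *)",
)
```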
How It Works: A Typical Workflow
- Create a Dataset: Connect to your raw data in a source like Amazon S3.
- Create a Project: Start a new project based on your dataset. DataBrew loads a sample of the data into its interactive grid.
- Profile and Clean:
  - Run a data profile to understand your data's characteristics and identify quality issues.
  - Use the visual editor to apply transformations from the built-in library. For example:
    - Remove duplicate rows.
    - Filter out bad data.
    - Split a 'full name' column into 'first name' and 'last name'.
    - Normalize date formats.
    - Merge multiple datasets.
  - As you work, DataBrew automatically records each step in a recipe.
- Create and Run a Recipe Job:
  - Once you are satisfied with your recipe, create a recipe job.
  - Configure the job to run on the entire dataset.
  - Specify the output location (e.g., a specific S3 bucket) and format (e.g., Parquet for analytics). A sketch of starting and monitoring a job run from code follows this list.
- Use the Prepared Data:
  - Once the job is complete, your clean, analysis-ready data is available at the output location.
  - You can then use this prepared data in other AWS services like Amazon QuickSight, SageMaker, or Athena.
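As a final illustrative step, a recipe job run can be started and monitored from code. This sketch assumes the hypothetical job name from the earlier examples; once the run succeeds, the output files in S3 are ready for Athena, QuickSight, or SageMaker.

```python
import time

import boto3

databrew = boto3.client("databrew")

# Start the recipe job (placeholder name from the earlier sketches) on the full dataset.
run_id = databrew.start_job_run(Name="sales-clean-job")["RunId"]

# Poll until the run reaches a terminal state; the prepared output then sits in the
# job's S3 output location, ready for downstream analytics services.
while True:
    state = databrew.describe_job_run(Name="sales-clean-job", RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(f"Job run finished with state: {state}")
        break
    time.sleep(30)
```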