AWS Glue Data Quality Cheat Sheet

AWS Glue Data Quality is a feature within AWS Glue that helps you measure and monitor the quality of your data. It provides the capability to define data quality rules, evaluate them against your datasets, and take action to maintain the reliability and integrity of your data pipelines. It is built on the open-source Deequ library developed by Amazon.

Why Use AWS Glue Data Quality?

Data quality is crucial for analytics and machine learning. Poor data quality can lead to inaccurate analysis and flawed business decisions. AWS Glue Data Quality helps prevent this by:

  • Automating the process of checking data against predefined rules.

  • Stopping "bad data" from propagating through your ETL pipelines.

  • Providing insights and scores to understand the health of your datasets over time.

  • Building trust in your data lake and data warehouse.

Core Concepts

Data Quality Rules

  • A rule is a specific check that you want to perform on your data to validate its quality.

  • AWS Glue provides over 25 pre-built rule types for common data quality checks.

  • Examples of Rules:

    • IsComplete "column_name": Checks that a column contains no NULL values.

    • IsUnique "column_name": Checks if all values in a column are unique.

    • ColumnLength "column_name" > 5: Checks if the string length in a column is greater than 5.

    • ColumnValues "column_name" in ["USA", "CAN", "MEX"]: Checks if column values are within a specified set.

    • RowCount > 1000: Checks if the table has more than 1000 rows.
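
    • Completeness "column_name" > 0.95: Checks that at least 95% of the values in a column are non-null.

    • ColumnValues "column_name" between 1 and 100: Checks that values in a column fall within the given range.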

DQDL (Data Quality Definition Language)

  • DQDL is a simple, domain-specific language used to write sets of data quality rules.

  • You create a ruleset by combining multiple rules using DQDL. This ruleset is then applied to your dataset to evaluate its overall quality.

  • Example of a DQDL Ruleset:

    
    Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "status" in ["SHIPPED", "PENDING", "CANCELLED"],
        RowCount > 0
    ]
    

Evaluation and Scoring

  • When you run a data quality check, Glue evaluates each rule in your ruleset against the data.

  • Each rule either passes or fails.

  • The results are aggregated into a single Data Quality Score, which gives you a quick overview of how well your dataset conforms to your rules. The score is based on the share of rules that passed; for example, if three of the four rules in the ruleset above pass, the score is 75%.

  • You can also see the results for each individual rule, helping you pinpoint specific problems.

Actions and Enforcement

  • You can configure actions to take based on the outcome of the data quality evaluation.

  • Monitoring: You can choose to only monitor the data quality, allowing all data to pass through while logging the quality results to CloudWatch. This is useful for observing data health without stopping pipelines.

  • Enforcement: You can enforce data quality by stopping the pipeline if the quality check fails.

    • Stop the Job: Configure your AWS Glue ETL job to fail if the data quality score falls below a certain threshold (a minimal script sketch follows this list).

    • Isolate Bad Data: You can configure the job to write "good" data to one location and quarantine "bad" data (rows that failed the checks) to another S3 location for further inspection.
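
To make enforcement concrete, here is a minimal sketch of the stop-the-job pattern in a script-based Glue job; it relies on the EvaluateDataQuality transform covered in the next section. The database, table, context label, and 80% threshold are illustrative, and the per-rule Outcome column name is an assumption based on the result frame that Glue Studio-generated scripts read, so verify it against your own job's output. (Isolating bad rows is sketched under "In AWS Glue ETL Jobs" below.)

    from awsglue.context import GlueContext
    from awsgluedq.transforms import EvaluateDataQuality
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Illustrative input; any DynamicFrame produced earlier in the job works here.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )

    ruleset = """
    Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        RowCount > 0
    ]
    """

    # Evaluate the ruleset; the transform returns one result row per rule.
    rule_outcomes = EvaluateDataQuality.apply(
        frame=orders,
        ruleset=ruleset,
        publishing_options={
            "dataQualityEvaluationContext": "orders_enforcement",
            "enableDataQualityCloudWatchMetrics": True,   # monitoring in CloudWatch
            "enableDataQualityResultsPublishing": True,
        },
    )

    # Enforcement: compute the score and stop the job below an 80% threshold.
    rule_df = rule_outcomes.toDF()
    score = rule_df.filter("Outcome = 'Passed'").count() / rule_df.count()
    if score < 0.80:
        raise Exception(f"Stopping job: data quality score {score:.0%} is below threshold")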

How to Use AWS Glue Data Quality

You can apply data quality checks in several places within the AWS Glue ecosystem:

1. In AWS Glue ETL Jobs

  • This is the most common use case.

  • You add an EvaluateDataQuality transform node to your ETL script or visual job in Glue Studio.

  • This transform takes your dataset (as a DynamicFrame) and your DQDL ruleset as input.

  • It outputs the evaluation results and can also split the data into "good" and "bad" sets, as shown in the sketch after this list.

  • This allows you to build robust pipelines that clean data, check its quality, and then load only the high-quality data into your target data lake or data warehouse.
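
Below is a hedged sketch of such a pipeline, following the pattern Glue Studio generates for the Evaluate Data Quality visual transform. The collection key ("rowLevelOutcomes"), the DataQualityEvaluationResult column, and the database, table, and S3 paths are assumptions to adapt to your own job; the script Glue Studio generates is the authoritative reference for the exact names.

    from awsglue.context import GlueContext
    from awsglue.transforms import SelectFromCollection
    from awsgluedq.transforms import EvaluateDataQuality
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Illustrative source table registered in the Glue Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )

    ruleset = """
    Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "status" in ["SHIPPED", "PENDING", "CANCELLED"]
    ]
    """

    # Evaluate the ruleset and request per-row outcomes as well as per-rule results.
    dq_frames = EvaluateDataQuality().process_rows(
        frame=orders,
        ruleset=ruleset,
        publishing_options={
            "dataQualityEvaluationContext": "orders_quality",
            "enableDataQualityResultsPublishing": True,
        },
    )

    # Pick the frame in which every row is tagged with its evaluation result.
    rows = SelectFromCollection.apply(dfc=dq_frames, key="rowLevelOutcomes").toDF()
    good_rows = rows.filter(rows["DataQualityEvaluationResult"] == "Passed")
    bad_rows = rows.filter(rows["DataQualityEvaluationResult"] == "Failed")

    # Load only high-quality rows; quarantine failing rows for later inspection.
    good_rows.write.mode("append").parquet("s3://my-data-lake/orders/clean/")
    bad_rows.write.mode("append").parquet("s3://my-data-lake/orders/quarantine/")

In a real job you would typically drop the extra DataQuality* columns the transform adds before writing the clean output.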

2. With the AWS Glue Data Catalog

  • You can associate a DQDL ruleset directly with a table in your Glue Data Catalog.

  • You can then run data quality evaluations on a schedule or on demand directly against your data lake tables, without needing to set up a full ETL job (a minimal boto3 sketch follows this list).

  • This is useful for continuously monitoring the quality of data at rest in your Amazon S3 data lake.
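
For data at rest, the same DQDL rules can also be managed through the AWS Glue APIs. The sketch below uses boto3 to attach a ruleset to a Catalog table and start an on-demand evaluation run; the database, table, ruleset name, and IAM role ARN are placeholders to replace with your own.

    import boto3

    glue = boto3.client("glue")

    # Associate a DQDL ruleset with an existing Data Catalog table.
    glue.create_data_quality_ruleset(
        Name="orders_ruleset",
        Ruleset='Rules = [ IsComplete "order_id", IsUnique "order_id", RowCount > 0 ]',
        TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
    )

    # Start an on-demand evaluation run; the role needs access to the
    # table's underlying data in Amazon S3.
    run = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
        Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder
        RulesetNames=["orders_ruleset"],
    )
    print("Started evaluation run:", run["RunId"])

Once the run completes, the rule-level outcomes can be retrieved with get_data_quality_result, or reviewed on the table's Data quality tab in the Glue console.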

3. In AWS Glue DataBrew

  • You can also leverage data quality rules within your AWS Glue DataBrew projects to validate your data as part of the visual data preparation process.

Key Benefits

  • Serverless: Like other Glue features, it's fully managed. There's no infrastructure to set up or manage.

  • Pay-as-you-go: You are charged based on the DPU-hours consumed when the data quality tasks run.

  • Easy to Use: DQDL provides a simple and readable way to define complex data quality checks.

  • Integrated: Seamlessly works with Glue ETL, the Glue Data Catalog, and DataBrew, providing a unified data integration and quality experience.

  • Actionable: Provides not just metrics, but the ability to enforce quality and automate responses to bad data.