AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.
Core Components of a Pipeline
A pipeline is defined by its components. The pipeline definition specifies the business logic of your data management; a minimal definition sketch follows the component list below.
- Pipeline: The top-level container for a data processing workflow. It defines the schedule, activities, and data sources for your job.
- Data Nodes: Represent the name and location of your data. A data node can be input data (what an activity processes) or output data (where the results are written). Supported data nodes include:
  - Amazon S3
  - Amazon RDS
  - Amazon Redshift
  - Amazon DynamoDB
  - JDBC-accessible databases
- Activities: A unit of work to be performed. An activity defines the action to take, such as moving data, running a SQL query, or executing a Hive job. Common activities include:
  - CopyActivity: Copies data from one data node to another.
  - SqlActivity: Runs a SQL query against a database.
  - HiveActivity: Runs a Hive script on an Amazon EMR cluster.
  - ShellCommandActivity: Runs a custom shell script.
- Resources: The computational resources that perform the work defined by an activity, typically an Amazon EC2 instance or an Amazon EMR cluster that Data Pipeline launches on your behalf.
- Schedules: Define when and how often the pipeline activities should run. You can set a start time, an end time, and a period (for example, every 15 minutes or every day).
- Preconditions: Conditional checks that must be true before an activity can run. For example, a precondition can check that the source data exists in Amazon S3 before a copy activity starts. If the precondition is not met, the activity waits.
- Actions: Steps triggered when certain events occur, such as the success, failure, or late completion of an activity. A common action is sending an Amazon SNS notification to alert an operator.
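The sketch below shows roughly how these components fit together in a definition. It is written as the pipelineObjects structure accepted by the boto3 put_pipeline_definition call (each object is a list of key/value fields); the S3 paths, instance type, schedule, and IAM role names are illustrative placeholders, not values from a real account, and a production definition would likely need additional settings.

```python
# A minimal sketch, not a complete definition: one schedule, two S3 data
# nodes, a CopyActivity, and an EC2 resource to run it on. All names,
# buckets, and roles below are placeholders.
pipeline_objects = [
    # Default object: settings inherited by every other object in the pipeline.
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "pipelineLogUri", "stringValue": "s3://example-bucket/logs/"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    # Schedule: run once a day, starting when the pipeline is activated.
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    # Data nodes: where the input lives and where the output is written.
    {"id": "InputData", "name": "InputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/input/"},
    ]},
    {"id": "OutputData", "name": "OutputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/output/"},
    ]},
    # Activity: the unit of work, wired to its input, output, and resource.
    {"id": "DailyCopy", "name": "DailyCopy", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputData"},
        {"key": "output", "refValue": "OutputData"},
        {"key": "runsOn", "refValue": "CopyResource"},
    ]},
    # Resource: the EC2 instance Data Pipeline launches to perform the copy.
    {"id": "CopyResource", "name": "CopyResource", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "t3.micro"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ]},
]
```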
How It Works
- Define the Pipeline: You create a pipeline definition using a JSON-based format or the visual editor in the AWS Management Console. The definition outlines all the components: data nodes, activities, resources, schedules, and preconditions. (A boto3 sketch of this lifecycle follows the list.)
- Activate the Pipeline: Once you activate the pipeline, Data Pipeline takes over, managing the scheduling and execution of all the defined tasks.
- Launch Resources: Based on the schedule, Data Pipeline launches the necessary computational resources (for example, an EC2 instance or an EMR cluster) to perform the work.
- Run Activities: A component called Task Runner is installed on the resource. Task Runner polls the Data Pipeline service for scheduled tasks; when a task is ready, it executes the task, reports its status, and handles retries if failures occur.
- Monitor Progress: You can monitor the status of your pipelines and their individual components through the AWS Management Console, the CLI, or the API. Detailed task logs are written to the Amazon S3 location configured for the pipeline.
- Shutdown: Once the work is complete, Data Pipeline can automatically terminate the resources it launched to save costs.
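Putting the lifecycle above into code, the following boto3 sketch creates a pipeline, attaches a definition (it assumes the pipeline_objects list from the earlier sketch), activates it, and polls the pipeline's overall state. The pipeline name, unique ID, and polling interval are arbitrary choices for illustration.

```python
import time
import boto3

# Assumes `pipeline_objects` is a definition like the one sketched earlier,
# and that AWS credentials and a region are configured for the SDK.
dp = boto3.client("datapipeline")

# Define: create an empty pipeline, then attach the definition to it.
created = dp.create_pipeline(name="daily-copy-example",
                             uniqueId="daily-copy-example-001")
pipeline_id = created["pipelineId"]

result = dp.put_pipeline_definition(pipelineId=pipeline_id,
                                    pipelineObjects=pipeline_objects)
if result.get("errored"):
    raise RuntimeError(f"Definition rejected: {result.get('validationErrors')}")

# Activate: from here on, Data Pipeline schedules the work, launches the
# resources, and hands tasks to Task Runner running on those resources.
dp.activate_pipeline(pipelineId=pipeline_id)

# Monitor: poll the pipeline's overall state. Per-object detail is also
# available via query_objects / describe_objects, and task logs land in
# the pipelineLogUri location configured in the definition.
while True:
    desc = dp.describe_pipelines(pipelineIds=[pipeline_id])
    fields = desc["pipelineDescriptionList"][0]["fields"]
    state = next((f["stringValue"] for f in fields
                  if f["key"] == "@pipelineState"), "UNKNOWN")
    print("pipeline state:", state)
    if state in ("FINISHED", "ERROR"):
        break
    time.sleep(60)
```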
Key Concepts
- Managed Service: Data Pipeline is a managed orchestration service. It handles launching resources, dependency management, scheduling, and error handling, allowing you to focus on your business logic.
- Template-Based: Data Pipeline provides preconfigured templates for common use cases, such as processing log files, archiving data to Amazon S3, or running regular jobs against Amazon Redshift.
- Reliability: It is designed for fault tolerance. It automatically retries failed activities and can send notifications on failure, success, or late-running activities.
- On-Premises Integration: Although the service runs in AWS, Task Runner can also be installed on your on-premises hosts, allowing Data Pipeline to orchestrate workflows that span AWS services and your on-premises data stores (see the sketch after this list).
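As a rough illustration of the on-premises case, the object below assumes a hypothetical shell script and worker group name: instead of a runsOn reference to an AWS-managed resource, the activity specifies a workerGroup, and a Task Runner process started on your own host with that same worker group name polls Data Pipeline and executes the command locally.

```python
# Sketch of an activity bound to an on-premises worker group rather than an
# AWS-managed resource. The script path and group name are placeholders, and
# "DailySchedule" refers to the schedule object from the earlier sketch.
on_prem_activity = {
    "id": "LocalExtract",
    "name": "LocalExtract",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "/opt/etl/export_to_s3.sh"},   # hypothetical script on your host
        {"key": "workerGroup", "stringValue": "onprem-etl-hosts"},        # must match the Task Runner's worker group
        {"key": "schedule", "refValue": "DailySchedule"},
    ],
}
```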
Data Pipeline vs. Newer AWS Services
- AWS Glue: A fully managed ETL (extract, transform, and load) service for preparing and loading data for analytics. Glue is generally considered the direct successor for most ETL workloads previously handled by Data Pipeline; it offers a data catalog, automatic schema detection, and serverless Spark-based ETL jobs.
- AWS Step Functions: A serverless workflow orchestrator for sequencing AWS Lambda functions and other AWS services into business-critical applications. Step Functions is better suited for orchestrating complex, event-driven workflows, especially those involving Lambda functions and microservices.