Updated June 23, 2025

Amazon EMR (Elastic MapReduce) Cheat Sheet

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. It is used for big data processing, interactive analytics, machine learning, and more.

Core Concepts

  • Cluster: A collection of Amazon EC2 instances. Each instance in the cluster is called a node.

  • Node Types: There are three types of nodes in an EMR cluster:

    • Master Node: The central component that manages the cluster. It runs software components like YARN ResourceManager and HDFS NameNode to coordinate the distribution of data and tasks among other nodes. Every cluster has one master node.

    • Core Node: Runs tasks and stores data in the Hadoop Distributed File System (HDFS) on your cluster. Core nodes are required and a cluster must have at least one.

    • Task Node: Only runs tasks and does not store data in HDFS. Task nodes are optional and are a cost-effective way to add computational power to a cluster for parallel tasks like Spark jobs.

  • EMRFS (EMR File System): An implementation of the Hadoop file system interface that lets EMR clusters read and write data directly in Amazon S3. This enables you to decouple storage from compute, so you can shut down a cluster to save costs while your data persists in S3.
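The node roles and the storage split above can be sketched as a minimal cluster layout. The instance types, counts, and bucket names below are illustrative assumptions, not recommendations:

```python
# Sketch of a minimal EMR cluster layout: one master node, core nodes
# that run tasks and host HDFS, and optional task nodes for extra
# compute. All sizes and names here are hypothetical.
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",      # runs tasks + stores HDFS
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "Task", "InstanceRole": "TASK",      # compute only, no HDFS
     "InstanceType": "m5.xlarge", "InstanceCount": 4},
]

# With EMRFS, input and output live in S3 rather than on the cluster,
# so the cluster can be terminated without losing data.
input_path = "s3://example-bucket/input/"    # hypothetical bucket
output_path = "s3://example-bucket/output/"
```

Note that only the master node is fixed at one instance; core and task counts are sized to the workload.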

Key Features

  • Easy to Use: EMR simplifies the complexities of setting up, configuring, and managing big data frameworks. You can launch a cluster in minutes.

  • Elastic: You can easily resize your cluster to add or remove capacity. You can add more instances for peak workloads and remove them when they're no longer needed, paying only for what you use.

  • Low Cost: EMR pricing is per-second, with a one-minute minimum. You can significantly reduce costs by using Amazon EC2 Spot Instances for task nodes.

  • Flexible: You have complete control over your cluster. You get root access to every instance and can customize it with a wide range of applications like Spark, Hive, Presto, Flink, and Hudi.

  • Reliable: EMR monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances.

  • Secure: EMR integrates with AWS security services like IAM and VPC, and offers features like Kerberos for authentication and encryption at rest and in transit.
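The per-second pricing model mentioned above is easy to work out by hand. A small sketch, using a hypothetical hourly instance rate:

```python
def billable_seconds(seconds: int) -> int:
    """Billable time under per-second pricing with a one-minute minimum."""
    return max(seconds, 60)

def cluster_cost(seconds: int, hourly_rate: float, instances: int) -> float:
    # hourly_rate is an assumed illustrative price, not a real quote
    return billable_seconds(seconds) / 3600 * hourly_rate * instances
```

For example, a 10-instance cluster at a hypothetical $0.27/hour per instance running for 30 minutes costs 1800 / 3600 × 0.27 × 10 = $1.35.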

How It Works: EMR Architecture

  1. Storage Layer:

    • HDFS: The Hadoop Distributed File System runs on the core nodes, providing ephemeral, distributed storage. Data in HDFS is lost when the cluster is terminated.

    • EMRFS: Allows you to use Amazon S3 as a durable, persistent storage layer for your data. This is the recommended approach for most use cases.

    • Local File System: Locally attached disk on each EC2 instance (instance store or EBS volumes). This is temporary storage that does not survive the instance.

  2. Resource Management Layer:

    • YARN (Yet Another Resource Negotiator): This is the default resource manager for EMR clusters. It is responsible for managing cluster resources and scheduling tasks to be run on the nodes.

  3. Data Processing Frameworks Layer:

    • EMR supports multiple frameworks. The most common are Apache Hadoop (MapReduce) and Apache Spark. You select the frameworks and applications you need when you launch the cluster.

  4. Applications and Interfaces:

    • You can interact with the applications on your cluster using interfaces like Hive (for SQL-based data warehousing), Presto (for interactive SQL queries), and Spark MLlib (for machine learning).

Submitting Work to an EMR Cluster

  • Steps: The most common way to submit work is by defining a sequence of steps when you launch the cluster or by adding them to a running cluster. A step is a unit of work that contains instructions to be executed by the software on the cluster. For example, you can submit a Spark job, a Hive script, or a custom program as a step.

  • Connecting to the Cluster: You can also connect to the master node via SSH to run commands and submit jobs interactively.

  • EMR Notebooks: Provides a managed, serverless Jupyter Notebook environment. You can use EMR Notebooks to run Spark jobs interactively, visualize data, and collaborate. The notebooks are stored in S3, separate from the cluster.
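A step such as a Spark job is typically expressed as a small JSON-like definition. A hedged sketch of the shape the EMR API and CLI accept, with a hypothetical script path:

```python
# Sketch of a Spark step definition in the shape used by
# `aws emr add-steps` and the AddJobFlowSteps API. EMR runs the command
# via command-runner.jar on the cluster; the S3 script path is hypothetical.
spark_step = {
    "Name": "Example Spark job",
    "ActionOnFailure": "CONTINUE",      # keep the cluster running on failure
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://example-bucket/jobs/etl.py",   # hypothetical script
        ],
    },
}
```

Steps submitted this way run in the order they are added, which is what makes them suitable for unattended batch pipelines.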

Scaling and Instance Management

  • Cluster Scaling:

    • Manual Resizing: You can manually add or remove nodes from a running cluster. This is useful for handling predictable variations in workload.

    • Auto Scaling: You can configure EMR to automatically scale the number of instances in your cluster based on CloudWatch metrics. This allows the cluster to adapt to workload changes dynamically. You can define scaling policies for core and task node groups.

  • Instance Fleets: A more advanced and flexible way to provision EC2 instances for your cluster. You specify a variety of instance types and purchasing options (On-Demand and Spot), and EMR provisions the lowest-priced capacity from your selections.

  • Instance Groups: The traditional way to configure clusters. You specify a single instance type and purchasing option for each node type (master, core, task).
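An instance fleet can be sketched as a specification listing candidate instance types and target On-Demand and Spot capacities. The types, weights, and capacities below are illustrative assumptions:

```python
# Sketch of an instance-fleet specification for task nodes: several
# candidate instance types, all capacity targeted at Spot since task
# nodes are stateless. Types and numbers are hypothetical.
task_fleet = {
    "Name": "TaskFleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 8,          # all-Spot: cheap, interruption-tolerant
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge",  "WeightedCapacity": 1},
        {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 2},  # counts double
    ],
}
```

Listing several interchangeable types is the point of fleets: EMR can fall back to whichever type currently has the cheapest available capacity.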

Best Practices for Cost Optimization

  • Use EMRFS and S3: Decouple compute from storage. Store your persistent data in S3 and terminate clusters when they are not actively processing data.

  • Use Spot Instances: Run task nodes on Spot Instances to significantly lower your costs. Task nodes are stateless, making them ideal for handling interruptions.

  • Choose the Right Instance Types: Select instance types that are optimized for your workload (compute-optimized, memory-optimized, etc.).

  • Use Auto Scaling: Match cluster capacity to your workload to avoid paying for idle resources.

  • Use Transient Clusters: For batch processing jobs, launch a cluster, have it execute the steps, and configure it to automatically terminate when the work is complete.
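The transient-cluster pattern above amounts to submitting the steps up front and telling EMR not to keep the cluster alive afterwards. A hedged sketch in the shape of a RunJobFlow request, with hypothetical S3 paths:

```python
# Sketch of the pieces that make a cluster transient in the
# RunJobFlow-style API: steps are declared at launch, and the cluster
# terminates itself when the last step finishes. Paths are hypothetical.
transient_cluster = {
    "Name": "nightly-batch",
    "Instances": {
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate after last step
    },
    "Steps": [
        {"Name": "ETL",
         "ActionOnFailure": "TERMINATE_CLUSTER",  # fail fast, stop paying
         "HadoopJarStep": {"Jar": "command-runner.jar",
                           "Args": ["spark-submit",
                                    "s3://example-bucket/jobs/etl.py"]}},
    ],
    "LogUri": "s3://example-bucket/emr-logs/",  # logs outlive the cluster
}
```

Because the data and logs live in S3 via EMRFS, nothing of value is lost when the cluster tears itself down.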