Updated June 21, 2025

AWS ParallelCluster Cheat Sheet

AWS ParallelCluster is an open-source cluster management tool that helps you deploy and manage High Performance Computing (HPC) clusters on AWS. It simplifies the process of setting up the infrastructure needed for computationally intensive workloads.


Core Concepts

  • What it is: A command-line tool (and API) that automates the creation of a complete HPC environment, including networking, storage, and compute resources.
  • Primary Use Case: Running large-scale, complex computational tasks common in scientific research, engineering simulations, financial modeling, and machine learning.
  • Technology: It uses AWS CloudFormation in the background to provision and configure all the required resources based on a simple text configuration file.
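
To make the configuration file concrete, here is a minimal sketch of a cluster definition for a Slurm-based cluster. The region, OS, subnet IDs, key pair name, and instance types are placeholders you would replace with your own values:

```yaml
Region: us-east-1
Image:
  Os: alinux2                      # Amazon Linux 2 base image
HeadNode:
  InstanceType: t3.medium          # modest size; the head node does not run jobs
  Networking:
    SubnetId: subnet-12345678      # placeholder: your public subnet
  Ssh:
    KeyName: my-ec2-keypair        # placeholder: your EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: general
      ComputeResources:
        - Name: compute
          InstanceType: c5.xlarge
          MinCount: 0              # scale down to zero when the queue is empty
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-12345678        # placeholder: subnet for compute nodes
```

Passing a file like this to the CLI is all that is needed; ParallelCluster turns it into a CloudFormation stack.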

Key Components of a ParallelCluster

When you create a cluster with AWS ParallelCluster, it sets up several key components that work together:

1. Head Node

  • Purpose: The central management point of the cluster.
  • Function: Users log into the head node to submit and manage jobs, compile code, and manage the cluster environment. It is a persistent EC2 instance that runs for the life of the cluster.

2. Compute Fleet

  • Purpose: The worker nodes that execute the computational jobs.
  • Function: This is a fleet of EC2 instances that can automatically scale up and down based on the number of jobs in the queue. When there are no jobs, the compute fleet can scale down to zero to save costs.
  • Flexibility: You can configure the compute fleet to use a mix of On-Demand and Spot Instances across different instance types to optimize for cost and performance.
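
As an illustration of that flexibility, a queue can be declared with Spot capacity and more than one instance type, letting the scheduler draw from whichever pool has capacity. The queue name, instance types, and limits below are illustrative:

```yaml
SlurmQueues:
  - Name: spot-queue
    CapacityType: SPOT             # use Spot pricing for this queue's instances
    ComputeResources:
      - Name: c5-compute
        InstanceType: c5.xlarge
        MinCount: 0                # nothing runs while the queue is idle
        MaxCount: 20
      - Name: c5n-compute
        InstanceType: c5n.xlarge   # second instance type in the same queue
        MinCount: 0
        MaxCount: 20
```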

3. Job Scheduler

  • Purpose: Manages the queue of jobs and allocates compute resources to them.
  • Function: The scheduler is installed on the head node. It decides when and where jobs run, manages job priorities, and orchestrates the scaling of the compute fleet.
  • Supported Schedulers: AWS ParallelCluster supports popular HPC schedulers, with Slurm being the most common. It also supports AWS Batch as a fully managed alternative.
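
With Slurm, work is submitted as a batch script whose `#SBATCH` directives describe the resources the job needs. A minimal example of the kind you might submit from the head node (the job name, node counts, and output path are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=hello-hpc       # job name shown in squeue (illustrative)
#SBATCH --nodes=2                  # request two compute nodes
#SBATCH --ntasks-per-node=1        # one task per node
#SBATCH --output=hello-%j.out      # %j expands to the Slurm job ID

# The #SBATCH lines are directives read by the scheduler; to the shell
# they are ordinary comments, so this file also runs as a plain script.
MSG="Hello from ParallelCluster"
echo "$MSG"
```

Submitting this script (e.g. with sbatch) is what triggers the compute fleet to scale up if no idle nodes are available.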

4. Shared Storage

  • Purpose: Provides a common, high-performance file system accessible by the head node and all compute nodes.
  • Function: This is essential for storing application data, software, and job results that need to be shared across the cluster.
  • Supported Services:
    • Amazon FSx for Lustre: Ideal for high-performance, parallel I/O workloads.
    • Amazon FSx for NetApp ONTAP / Amazon FSx for OpenZFS: Fully managed file systems with rich data-management features such as snapshots.
    • Amazon Elastic File System (EFS): A simple, scalable file system for general-purpose needs.
    • Amazon EBS: You can attach multiple EBS volumes to nodes for persistent block storage.
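
In the cluster configuration, these file systems are declared under a SharedStorage section. A sketch combining EFS and FSx for Lustre (mount points, names, and capacity are illustrative):

```yaml
SharedStorage:
  - MountDir: /shared              # general-purpose shared space on EFS
    Name: shared-efs
    StorageType: Efs
  - MountDir: /fsx                 # high-performance scratch on FSx for Lustre
    Name: scratch-fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200        # capacity in GiB
```

Each entry is mounted at the given path on the head node and every compute node.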

Key Features

  • Automatic Scaling: The compute fleet scales automatically based on the job queue, ensuring you only pay for the resources you need.
  • Simple Configuration: Cluster setup is defined in a single YAML file, making it easy to version control and replicate environments.
  • Multiple Queues and Instance Types: You can define multiple job queues, each with different instance types and priorities, allowing you to tailor the cluster to various workload requirements.
  • Custom AMIs: You can use custom Amazon Machine Images (AMIs) to pre-install software and libraries, speeding up the deployment of compute nodes.
  • Networking: Can automatically create a VPC and subnets, or use an existing VPC when you need more control over your network environment.
  • Cost Management: Integration with Spot Instances and automatic scaling helps to significantly reduce the cost of running HPC workloads.
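
The multiple-queues feature maps directly onto the configuration file: each entry under SlurmQueues becomes a Slurm partition with its own capacity type and instance types. A hedged sketch with one On-Demand and one Spot queue (names and sizes are illustrative):

```yaml
SlurmQueues:
  - Name: ondemand                 # predictable capacity for urgent jobs
    CapacityType: ONDEMAND
    ComputeResources:
      - Name: c5-od
        InstanceType: c5.2xlarge
        MinCount: 0
        MaxCount: 8
  - Name: spot                     # cheaper, interruptible capacity for batch work
    CapacityType: SPOT
    ComputeResources:
      - Name: c5-spot
        InstanceType: c5.xlarge
        MinCount: 0
        MaxCount: 32
```

Jobs can then target a queue at submit time, e.g. with Slurm's partition option (sbatch --partition=spot my_job_script.sh).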

Basic Workflow

  1. Install ParallelCluster: Install the AWS ParallelCluster command-line interface (CLI) on your local machine.
    pip install "aws-parallelcluster>=3.0.0"
  2. Configure: Create a YAML configuration file (cluster-config.yaml) that defines your cluster's architecture (VPC, head node, scheduler, queues, storage, etc.).
  3. Create Cluster: Use the CLI to create the cluster. AWS ParallelCluster uses this file to launch a CloudFormation stack.
    pcluster create-cluster --cluster-name my-hpc-cluster --cluster-configuration cluster-config.yaml
    
  4. Connect and Submit: Connect to the head node via SSH. Submit your computational jobs to the scheduler (e.g., sbatch my_job_script.sh for Slurm).
  5. Monitor: Monitor the status of your cluster and jobs using scheduler commands (e.g., squeue) or the ParallelCluster CLI.
  6. Delete Cluster: Once your work is complete, delete the cluster to stop incurring costs. This terminates all associated resources.
    pcluster delete-cluster --cluster-name my-hpc-cluster