
Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2


This is a guest post co-written with Meta's PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS.


Machine learning (ML) research has proven that large language models (LLMs) trained with significantly large datasets result in better model quality. In the last few years, the size of current-generation models has increased significantly, and they require modern tools and infrastructure to be trained efficiently and at scale. PyTorch Distributed Data Parallel (DDP) helps process data at scale in a simple and robust manner, but it requires the model to fit on one GPU. The PyTorch Fully Sharded Data Parallel (FSDP) library breaks this barrier by enabling model sharding to train large models across data parallel workers.

Distributed model training requires a cluster of worker nodes that can scale. Amazon Elastic Kubernetes Service (Amazon EKS) is a popular Kubernetes-conformant service that greatly simplifies the process of running AI/ML workloads, making it more manageable and less time-consuming.

In this blog post, AWS collaborates with Meta's PyTorch team to discuss how to use the PyTorch FSDP library to achieve linear scaling of deep learning models on AWS seamlessly using Amazon EKS and AWS Deep Learning Containers (DLCs). We demonstrate this through a step-by-step implementation of training 7B, 13B, and 70B Llama2 models using Amazon EKS with 16 Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge instances (each with 8 NVIDIA A100 Tensor Core GPUs and each GPU with 80 GB HBM2e memory) or 16 EC2 p5.48xlarge instances (each with 8 NVIDIA H100 Tensor Core GPUs and each GPU with 80 GB HBM3 memory), achieving near linear scaling in throughput and ultimately enabling faster training time.

The following scaling chart shows that the p5.48xlarge instances offer 87% scaling efficiency with FSDP Llama2 fine-tuning in a 16-node cluster configuration.

Challenges of training LLMs

Businesses are increasingly adopting LLMs for a range of tasks, including virtual assistants, translation, content creation, and computer vision, to enhance efficiency and accuracy in a variety of applications.

However, training or fine-tuning these large models for a custom use case requires a large amount of data and compute power, which adds to the overall engineering complexity of the ML stack. This is also due to the limited memory available on a single GPU, which restricts the size of the model that can be trained and also limits the per-GPU batch size used during training.

To address this challenge, various model parallelism techniques such as DeepSpeed ZeRO and PyTorch FSDP were created to allow you to overcome this barrier of limited GPU memory. This is done by adopting a sharded data parallel approach, where each accelerator holds just a slice (a shard) of a model replica instead of the entire model replica, which dramatically reduces the memory footprint of the training job.

This post demonstrates how you can use PyTorch FSDP to fine-tune the Llama2 model using Amazon EKS. We achieve this by scaling out compute and GPU capacity to address the model requirements.

FSDP overview

In PyTorch DDP training, each GPU (referred to as a worker in the context of PyTorch) holds a complete copy of the model, including the model weights, gradients, and optimizer states. Each worker processes a batch of data and, at the end of the backward pass, uses an all-reduce operation to synchronize gradients across different workers.

Having a replica of the model on each GPU restricts the size of the model that can be accommodated in a DDP workflow. FSDP helps overcome this limitation by sharding model parameters, optimizer states, and gradients across data parallel workers while still preserving the simplicity of data parallelism.

This is demonstrated in the following diagram, where in the case of DDP, each GPU holds a complete copy of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M(OS + G + P). In FSDP, each GPU holds only a slice of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M<partition number>(OS + G + P). Using FSDP results in a significantly smaller GPU memory footprint compared to DDP across all workers, enabling the training of very large models or the use of larger batch sizes for training jobs.

This, however, comes at the cost of increased communication overhead, which is mitigated through FSDP optimizations such as overlapping communication and computation with features like pre-fetching. For more detailed information, refer to Getting Started with Fully Sharded Data Parallel (FSDP).

FSDP offers various parameters that allow you to tune the performance and efficiency of your training jobs. Some of the key features and capabilities of FSDP include the following (see the sketch after this list):

  • Transformer wrapping policy
  • Flexible mixed precision
  • Activation checkpointing
  • Various sharding strategies to suit different network speeds and cluster topologies:
    • FULL_SHARD – Shard model parameters, gradients, and optimizer states
    • HYBRID_SHARD – Full shard within a node, DDP across nodes; supports a flexible sharding group for a full replica of the model (HSDP)
    • SHARD_GRAD_OP – Shard only gradients and optimizer states
    • NO_SHARD – Similar to DDP
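
To see how these options map onto the FSDP API, the following minimal PyTorch sketch wraps a toy transformer with a transformer wrapping policy, bfloat16 mixed precision, and a configurable sharding strategy. It is illustrative only; the actual training in this post is driven by the finetuning.py script invoked through torchrun, and the model size and dtype choices below are assumptions, not the exact configuration used there.

# Minimal FSDP sketch (assumes launch via torchrun on one or more GPU nodes);
# a toy transformer stands in for Llama2 to keep the example self-contained.
import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group("nccl")              # torchrun provides rank and world size
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.TransformerEncoder(               # stand-in for a Llama2 model
    nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=8
)

# Transformer wrapping policy: shard each transformer block as its own FSDP unit
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={nn.TransformerEncoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # or HYBRID_SHARD / SHARD_GRAD_OP / NO_SHARD
    mixed_precision=MixedPrecision(                  # flexible mixed precision
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)

When launched with torchrun, each rank wraps its shard of the model this way and then runs a standard training loop.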

For more information about FSDP, refer to Efficient Large-Scale Training with PyTorch FSDP and AWS.

The following figure shows how FSDP works for two data parallel processes.

Solution overview

In this post, we set up a compute cluster using Amazon EKS, which is a managed service to run Kubernetes in the AWS Cloud and on-premises data centers. Many customers are embracing Amazon EKS to run Kubernetes-based AI/ML workloads, taking advantage of its performance, scalability, reliability, and availability, as well as its integrations with AWS networking, security, and other services.

For our FSDP use case, we use the Kubeflow Training Operator on Amazon EKS, which is a Kubernetes-native project that facilitates fine-tuning and scalable distributed training for ML models. It supports various ML frameworks, including PyTorch, which you can use to deploy and manage PyTorch training jobs at scale.

Using the PyTorchJob custom resource of the Kubeflow Training Operator, we run training jobs on Kubernetes with a configurable number of worker replicas, which allows us to optimize resource utilization.

The following are a few components of the training operator that play a role in our Llama2 fine-tuning use case:

  • A centralized Kubernetes controller that orchestrates distributed training jobs for PyTorch.
  • PyTorchJob, a Kubernetes custom resource for PyTorch, provided by the Kubeflow Training Operator, to define and deploy Llama2 training jobs on Kubernetes.
  • etcd, which is related to the implementation of the rendezvous mechanism for coordinating the distributed training of PyTorch models. This etcd server, as part of the rendezvous process, facilitates the coordination and synchronization of the participating workers during distributed training.

The following diagram illustrates the solution architecture.

Most of the details are abstracted by the automation scripts that we use to run the Llama2 example.

We use the following code references in this use case:

What’s Llama2?

Llama2 is an LLM pre-trained on 2 trillion tokens of text and code. It is one of the largest and most powerful LLMs available today. You can use Llama2 for a variety of tasks, including natural language processing (NLP), text generation, and translation. For more information, refer to Getting started with Llama.

Llama2 is available in three different model sizes:

  • Llama2-70b – This is the largest Llama2 model, with 70 billion parameters. It is the most powerful Llama2 model and can be used for the most demanding tasks.
  • Llama2-13b – This is a medium-sized Llama2 model, with 13 billion parameters. It is a good balance between performance and efficiency, and can be used for a variety of tasks.
  • Llama2-7b – This is the smallest Llama2 model, with 7 billion parameters. It is the most efficient Llama2 model, and can be used for tasks that don't require the highest level of performance.

This post enables you to fine-tune all of these models on Amazon EKS. To provide a simple and reproducible experience of creating an EKS cluster and running FSDP jobs on it, we use the aws-do-eks project. The example will also work with a pre-existing EKS cluster.

A scripted walkthrough is available on GitHub for an out-of-the-box experience. In the following sections, we explain the end-to-end process in more detail.

Provision the solution infrastructure

For the experiments described in this post, we use clusters with p4de (A100 GPU) and p5 (H100 GPU) nodes.

Cluster with p4de.24xlarge nodes

For our cluster with p4de nodes, we use the following eks-gpu-p4de-odcr.yaml script:

export ODCR_ID=<your-capacityreservation-id>

cat > ./eks-gpu-p4de-odcr.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-p4de-odcr
  version: "1.28"
  region: us-east-1
  tags:
    karpenter.sh/discovery: do-eks-yaml-p4de-odcr
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
nodeGroups:
  - name: p4de-odcr
    instanceType: p4de.24xlarge
    instancePrefix: p4de-odcr
    privateNetworking: true
    availabilityZones:
      - us-east-1c
    efaEnabled: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 64
    volumeSize: 500
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: $ODCR_ID
    iam:
      withAddonPolicies:
        cloudWatch: true
        ebs: true
        fsx: true
iam:
  withOIDC: true
EOF

Using eksctl and the preceding cluster manifest, we create a cluster with p4de nodes:

eksctl create cluster -f ./eks-gpu-p4de-odcr.yaml

Cluster with p5.48xlarge nodes

A Terraform template for an EKS cluster with P5 nodes is located in the following GitHub repo.

You can customize the cluster via the variables.tf file and then create it using the Terraform CLI:

terraform init && terraform plan -out tfplan && terraform apply tfplan

You can verify cluster availability by running a simple kubectl command:
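
For example:

kubectl get nodes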

The cluster is healthy if the output of this command shows the expected number of nodes in Ready status.

Deploy prerequisites

To run FSDP on Amazon EKS, we use the PyTorchJob custom resource. It requires etcd and the Kubeflow Training Operator as prerequisites.

Deploy etcd with the following code:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/etcd/etcd-deployment.yaml

Deploy the Kubeflow Training Operator with the following code:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
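
To confirm that the operator is up before submitting jobs, you can list its pods; assuming the standalone overlay installs into the kubeflow namespace, the training-operator pod should show a Running status:

kubectl get pods -n kubeflow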

Build and push an FSDP container image to Amazon ECR

Use the following code to build an FSDP container image and push it to Amazon Elastic Container Registry (Amazon ECR):

# Download Dockerfile
curl -L -o ./Dockerfile.llama2-efa https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp/Dockerfile.llama2-efa

# Build image
AWS_REGION=$(aws configure get region)
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGISTRY=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
IMAGE=fsdp
TAG=":llama2-efa"

docker build --progress=plain -t ${REGISTRY}${IMAGE}${TAG} -f ./Dockerfile.llama2-efa .

# Log in to ECR, create repository, push image
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
aws ecr create-repository --repository-name ${IMAGE}
docker image push ${REGISTRY}${IMAGE}${TAG}

Create the FSDP PyTorchJob manifest

Insert your Hugging Face token in the following snippet prior to running it:

HF_TOKEN="<insert_your_huggingface_token_here>"

Configure your PyTorchJob with a .env file or directly in your environment variables as below:

JOB_NAME=fsdp
RDZV_HOST=etcd
RDZV_PORT=2379
NUM_WORKERS=2
INSTANCE_TYPE=p5.48xlarge
GPU_PER_WORKER=8
EFA_PER_WORKER=32
MODEL_NAME=meta-llama/Llama-2-7b-hf

CMD="huggingface-cli login --token ${HF_TOKEN} && torchrun --nproc_per_node=${GPU_PER_WORKER} --nnodes=${NUM_WORKERS} examples/finetuning.py --num_epochs=5 --batch_size_training=3 --enable_fsdp --model_name $MODEL_NAME --output_dir ."

Generate the PyTorchJob manifest using the fsdp template and generate.sh script, or create it directly using the script below:

cat > ./fsdp.yaml <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: $JOB_NAME
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: $RDZV_HOST
    rdzvPort: $RDZV_PORT
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 90
  pytorchReplicaSpecs:
    Worker:
      replicas: $NUM_WORKERS
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: $JOB_NAME
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
          nodeSelector:
            node.kubernetes.io/instance-type: '${INSTANCE_TYPE}'
          containers:
            - name: pytorch
              image: '${REGISTRY}${IMAGE}${TAG}'
              imagePullPolicy: Always
              resources:
                requests:
                  nvidia.com/gpu: $GPU_PER_WORKER
                  vpc.amazonaws.com/efa: $EFA_PER_WORKER
                limits:
                  nvidia.com/gpu: $GPU_PER_WORKER
                  vpc.amazonaws.com/efa: $EFA_PER_WORKER
              env:
                - name: LOGLEVEL
                  value: DEBUG
                - name: NCCL_DEBUG
                  value: INFO
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: '1'
              command:
                - bash
                - '-c'
                - '${CMD}'
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
EOF

Run the PyTorchJob

Run the PyTorchJob with the following code:

kubectl apply -f ./fsdp.yaml

You will see the required number of FSDP worker pods created and, after pulling the image, they will enter a Running state.
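
To watch the worker pods come up, you can filter on the app label that the manifest above assigns to them:

kubectl get pods -l app=${JOB_NAME} -w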

To see the status of the PyTorchJob, use the following code:

kubectl describe -f ./fsdp.yaml

To stop the PyTorchJob, use the following code:

kubectl delete -f ./fsdp.yaml

After a job is complete, it needs to be deleted before initiating a new run. We have also observed that deleting the etcd pod and letting it restart prior to launching a new job helps avoid a RendezvousClosedError.
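
Assuming etcd was deployed as a Deployment named etcd (as in the manifest referenced earlier), one way to do this is to restart it and wait for the new pod to become ready:

kubectl rollout restart deployment etcd
kubectl rollout status deployment etcd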

Scale the cluster

You can repeat the preceding steps of creating and running jobs while varying the number and instance type of worker nodes in the cluster. This enables you to produce scaling charts like the one shown earlier. In general, you should see a reduction in GPU memory footprint, a reduction in epoch time, and an increase in throughput when more nodes are added to the cluster. The previous chart was produced by conducting several experiments using a p5 node group varying from 1–16 nodes in size.

Observe the FSDP training workload

Observability of generative artificial intelligence workloads is important to allow visibility into your running jobs as well as to help maximize the utilization of your compute resources. In this post, we use a few Kubernetes-native and open source observability tools for this purpose. These tools enable you to track errors, statistics, and model behavior, making AI observability a crucial part of any business use case. In this section, we show various approaches for observing FSDP training jobs.

Worker pod logs

At the most basic level, you need to be able to see the logs of your training pods. This can easily be done by using Kubernetes-native commands.
First, retrieve a list of pods and locate the name of the one that you want to see logs for:
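
For example:

kubectl get pods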

Then view the logs for the selected pod:

kubectl logs -f <pod_name>

Only one worker (elected leader) pod log will list the overall job statistics. The name of the elected leader pod is available at the beginning of each worker pod log, identified by the key master_addr=.
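
For example, to pull that line from a given worker's log:

kubectl logs <pod_name> | grep master_addr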

CPU utilization

Distributed training workloads require both CPU and GPU resources. To optimize these workloads, it is important to understand how these resources are utilized. Fortunately, some great open source utilities are available that help visualize CPU and GPU utilization. For viewing CPU utilization, you can use htop. If your worker pods contain this utility, you can use the command below to open a shell into a pod and then run htop.

kubectl exec -it <pod_name> -- bash

Alternatively, you can deploy an htop daemonset similar to the one provided in the following GitHub repo.

The daemonset will run a lightweight htop pod on each node. You can exec into any of these pods and run the htop command:

kubectl exec -it <htop_pod_name> -- htop

The following screenshot shows the CPU utilization on one of the nodes in the cluster. In this case, we are looking at a P5.48xlarge instance, which has 192 vCPUs. The processor cores are idle while the model weights are downloaded, and we see rising utilization while the model weights are being loaded to GPU memory.

GPU utilization

If the nvtop utility is available in your pod, you can exec into it using the command below and then run nvtop.

kubectl exec -it <pod_name> -- bash

Alternatively, you can deploy an nvtop daemonset similar to the one provided in the following GitHub repo.

This will run an nvtop pod on each node. You can exec into any of those pods and run nvtop:

kubectl exec -it <nvtop_pod_name> -- nvtop

The following screenshot shows the GPU utilization on one of the nodes in the training cluster. In this case, we are looking at a P5.48xlarge instance, which has 8 NVIDIA H100 GPUs. The GPUs are idle while the model weights are downloaded, then GPU memory utilization increases as the model weights are loaded onto the GPU, and GPU utilization spikes to 100% while the training iterations are underway.

Grafana dashboard

Now that you understand how your system works at the pod and node level, it is also important to look at metrics at the cluster level. Aggregated utilization metrics can be collected by NVIDIA DCGM Exporter and Prometheus and visualized in Grafana.

An example Prometheus-Grafana deployment is available in the following GitHub repo.

An example DCGM exporter deployment is available in the following GitHub repo.

A simple Grafana dashboard is shown in the following screenshot. It was built by selecting the following DCGM metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_MEM_COPY_UTIL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_POWER_USAGE. The dashboard can be imported into Grafana from GitHub.

The following dashboard shows one run of a Llama2 7b single-epoch training job. The graphs show that as the streaming multiprocessor (SM) clock increases, the power draw and temperature of the GPUs increase as well, together with GPU and memory utilization. You can also see that there were no XID errors and the GPUs were healthy during this run.

Since March 2024, GPU observability for EKS has been supported natively in CloudWatch Container Insights. To enable this functionality, simply deploy the CloudWatch Observability add-on in your EKS cluster. Then you will be able to browse pod, node, and cluster level metrics through pre-configured and customizable dashboards in Container Insights.
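
One way to deploy the add-on (assuming your cluster name is in $CLUSTER_NAME and the node role has the CloudWatchAgentServerPolicy attached) is through the EKS add-ons API:

aws eks create-addon --cluster-name $CLUSTER_NAME --addon-name amazon-cloudwatch-observability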

Clean up

If you created your cluster using the examples provided in this post, you can run the following code to delete the cluster and any resources associated with it, including the VPC:
For eksctl:

eksctl delete cluster -f ./eks-gpu-p4de-odcr.yaml

For terraform:
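
Assuming the cluster was created from the Terraform template referenced earlier, run the destroy command from the same working directory:

terraform destroy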

Upcoming features

FSDP is expected to include a per-parameter sharding feature, aiming to further improve its memory footprint per GPU. Additionally, the ongoing development of FP8 support aims to improve FSDP performance on H100 GPUs. Finally, when FSDP is integrated with torch.compile, we hope to see additional performance improvements and enablement of features like selective activation checkpointing.

Conclusion

In this post, we discussed how FSDP reduces the memory footprint on each GPU, enabling the training of larger models more efficiently and achieving near linear scaling in throughput. We demonstrated this through a step-by-step implementation of training a Llama2 model using Amazon EKS on P4de and P5 instances and used observability tools like kubectl, htop, nvtop, and dcgm to monitor logs, as well as CPU and GPU utilization.

We encourage you to take advantage of PyTorch FSDP for your own LLM training jobs. Get started at aws-do-fsdp.


About the Authors

Kanwaljit Khurmi is a Principal AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized, distributed computing and deep learning applications.

Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He is a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges.

Ana Simoes is a Principal Machine Learning Specialist, ML Frameworks at AWS. She helps customers deploying AI, ML, and generative AI at a large scale on HPC infrastructure in the cloud. Ana focuses on supporting customers to achieve price-performance for new workloads and use cases for generative AI and machine learning.

Hamid Shojanazeri is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is the co-creator of llama-recipes and a contributor to TorchServe. His main interest is to improve cost-efficiency, making AI more accessible to the broader community.

Less Wright is an AI/Partner Engineer in PyTorch. He works on Triton/CUDA kernels (accelerating dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).


