Scaling Large Language Models with PyTorch 2.0 FSDP on Amazon EKS – Part 2

Training Large Language Models with PyTorch FSDP on AWS

Machine learning (ML) research has shown that large language models (LLMs) trained on extensive datasets achieve higher model quality. However, training these large models efficiently requires modern tools and scalable infrastructure. PyTorch Distributed Data Parallel (DDP) scales training across data parallel workers, but it requires the entire model to fit on a single GPU. The PyTorch Fully Sharded Data Parallel (FSDP) library removes this limitation by sharding the model across data parallel workers, making it possible to train much larger models.

Distributed model training requires a cluster of worker nodes that can scale. Amazon Elastic Kubernetes Service (Amazon EKS) simplifies running AI/ML workloads on such a cluster, making them more manageable and less time-consuming to operate.

In collaboration with Meta’s PyTorch team, AWS demonstrates how to use the PyTorch FSDP library with Amazon EKS and AWS Deep Learning Containers (DLCs) to achieve near-linear scaling of deep learning models on AWS. The blog post showcases training 7B, 13B, and 70B Llama2 models on Amazon EKS with 16 Amazon EC2 p4de.24xlarge instances or 16 EC2 p5.48xlarge instances, achieving near-linear scaling in throughput and shorter training times.

Challenges of Training LLMs

Businesses are increasingly adopting LLMs for a variety of tasks to improve the efficiency and accuracy of their applications. However, training or fine-tuning these large models requires a significant amount of data and compute power, which adds complexity to the ML stack. The limited memory of a single GPU restricts both the size of the model that can be trained and the batch size used during training.

To address this challenge, sharding techniques such as PyTorch FSDP were created to overcome the GPU memory limitation. By adopting a sharded data parallel approach, FSDP reduces the memory footprint of the training job, enabling the training of very large models or the use of larger batch sizes.

FSDP Overview

In PyTorch DDP training, each GPU holds a complete copy of the model, including its weights, gradients, and optimizer states. FSDP removes this redundancy by sharding model parameters, optimizer states, and gradients across data parallel workers while preserving the simplicity of data parallelism.
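
The following is a minimal sketch of how a model can be wrapped with FSDP; the model, hyperparameters, and tensor shapes are illustrative placeholders, and the script assumes it is launched with one process per GPU (for example, via torchrun).

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Illustrative setup: one process per GPU, NCCL backend
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; in practice this would be an LLM such as Llama2
model = torch.nn.Transformer(d_model=512, nhead=8).cuda()

# Wrapping with FSDP shards parameters, gradients, and optimizer states
# across the data parallel workers
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative training step on random data
src = torch.rand(10, 32, 512, device="cuda")
tgt = torch.rand(20, 32, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()       # gradients are reduced and re-sharded across ranks
optimizer.step()      # each rank updates only its shard of the optimizer state
optimizer.zero_grad()
```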

FSDP offers various parameters for tuning performance and efficiency, including transformer wrapping policy, flexible mixed precision, activation checkpointing, and different sharding strategies.
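
As a rough sketch of how these options can be passed to the FSDP constructor (the layer class, dtypes, and sharding strategy below are illustrative assumptions, and a process group is assumed to be initialized as in the previous snippet):

```python
import functools
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.nn import TransformerEncoder, TransformerEncoderLayer

# Transformer wrapping policy: wrap each transformer block as its own FSDP unit
# (TransformerEncoderLayer is a placeholder for the model's block class)
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerEncoderLayer},
)

# Mixed precision: keep parameters, gradient reduction, and buffers in bfloat16
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = TransformerEncoder(TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6).cuda()
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=bf16_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer states
)
# Activation checkpointing can additionally be applied to the wrapped blocks
# to trade recomputation for further activation-memory savings.
```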

Solution Overview

Amazon EKS provides a managed Kubernetes control plane, which simplifies setting up and operating the compute cluster for AI/ML workloads. The Kubeflow Training Operator running on Amazon EKS facilitates fine-tuning and scalable distributed training of ML models, including PyTorch models.

By using the PyTorchJob custom resource of the Kubeflow Training Operator, training jobs can be run on Kubernetes with a configurable number of worker replicas to optimize resource utilization.
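
A sketch of what such a PyTorchJob manifest can look like follows; the job name, container image, replica counts, GPU counts, and training command are placeholders rather than values from the post.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llama2-fsdp                        # placeholder job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>    # e.g., an AWS Deep Learning Container
              command: ["python", "train.py"] # placeholder training entry point
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 15                         # configurable number of worker replicas
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
```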

Conclusion

PyTorch FSDP reduces the memory footprint on each GPU, enabling larger models to be trained more efficiently and achieving near-linear scaling in throughput. Tools such as kubectl, htop, nvtop, and NVIDIA DCGM can be used to observe logs and monitor CPU and GPU utilization during training.

Take advantage of PyTorch FSDP for your LLM training jobs and get started with AWS for efficient and scalable model training.

About the Authors

Kanwaljit Khurmi, Alex Iankoulski, Ana Simoes, Hamid Shojanazeri, and Less Wright are experts in AI/ML solutions, machine learning frameworks, and distributed training, working to improve cost-efficiency and accessibility of AI technologies.
