This example demonstrates how to run distributed training with PyTorch using SkyPilot.
The example is based on PyTorch's official minGPT example.
There are two ways to run distributed training with PyTorch:

- Using normal `torchrun`
- Using the `rdzv` backend
The main difference between the two for fixed-size distributed training is that the `rdzv` backend automatically handles the rank for each node, while plain `torchrun` requires the rank to be set manually.
SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
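As a quick sketch of what these variables provide, the `run` snippet below simply prints them. The variable names are the same ones used in the YAML files later in this example; the exact values depend on your cluster.

```yaml
# Minimal sketch: inspect SkyPilot's built-in distributed-training variables.
run: |
  echo "Node rank:       $SKYPILOT_NODE_RANK"         # 0 on the head node, 1..N-1 on workers
  echo "Number of nodes: $SKYPILOT_NUM_NODES"
  echo "GPUs per node:   $SKYPILOT_NUM_GPUS_PER_NODE"
  echo "Node IPs:"                                    # one IP per line; head node first
  echo "$SKYPILOT_NODE_IPS"
```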
The following command will spawn 2 nodes with 2 L4 GPUs each:
```bash
sky launch -c train train.yaml
```
In `train.yaml`, we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.
```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```
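For context, the surrounding task definition might look roughly like the sketch below. `resources`, `num_nodes`, `workdir`, `setup`, and `run` are standard SkyPilot task fields, but the working directory and setup command here are assumptions; refer to the actual `train.yaml` for the exact contents.

```yaml
# Sketch of the surrounding task definition (not the verbatim example file).
resources:
  accelerators: L4:2   # 2 L4 GPUs per node

num_nodes: 2           # 2 nodes total

workdir: .             # sync the local working directory to every node

setup: |
  pip install torch    # assumed; install whatever the example actually needs

run: |
  # ... the torchrun command shown above ...
```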
`rdzv` is an alternative backend for distributed training:
```bash
sky launch -c train-rdzv train-rdzv.yaml
```
In `train-rdzv.yaml`, we again use `torchrun` with SkyPilot's environment variables, but let the `c10d` rendezvous backend assign ranks instead of passing `--node_rank` explicitly.
```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
```
If you would like to scale up the training, you can simply change the resource requirements, and SkyPilot's built-in environment variables will be updated automatically.
For example, the following command will spawn 4 nodes with 4 L4 GPUs each:

```bash
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```
We also increase `--cpus` to `8+` so that the CPUs do not become a performance bottleneck.
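Equivalently, instead of overriding resources on the command line, you could bake the larger request into the task YAML. The sketch below uses standard SkyPilot `resources` and `num_nodes` fields; double-check it against your actual `train.yaml` before relying on it.

```yaml
# Sketch: the same scale-up expressed in the task YAML instead of CLI flags.
resources:
  accelerators: L4:4   # 4 L4 GPUs per node
  cpus: 8+             # at least 8 vCPUs per node, to avoid a CPU bottleneck

num_nodes: 4           # 4 nodes; SKYPILOT_NUM_NODES etc. update automatically
```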