Distributed Training with Training Operator

How Training Operator performs distributed training on Kubernetes

This page shows the different distributed training strategies that can be achieved with the Training Operator.

Distributed Training for PyTorch

This diagram shows how the Training Operator creates PyTorch workers for the ring all-reduce algorithm.

Distributed PyTorchJob

The user is responsible for writing the training code using native PyTorch Distributed APIs and creating a PyTorchJob with the required number of workers and GPUs using the Training Operator Python SDK. The Training Operator then creates Kubernetes pods with the appropriate environment variables for the torchrun CLI to start the distributed PyTorch training job.
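For example, here is a minimal sketch of creating such a job with the kubeflow-training Python SDK, assuming the TrainingClient.create_job API; the job name, worker count, and resource values are illustrative, not prescriptive:

```python
from kubeflow.training import TrainingClient

def train_func():
    # Runs inside every worker pod; torchrun has already set RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for this process.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

# Create a PyTorchJob with 4 workers and 1 GPU per worker (illustrative values).
TrainingClient().create_job(
    name="pytorch-ring-allreduce",
    train_func=train_func,
    num_workers=4,
    resources_per_worker={"gpu": "1"},
)
```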

At the end of the ring all-reduce algorithm, the gradients are synchronized across every worker (g1, g2, g3, g4) and the model is trained.

You can define various distributed strategies supported by PyTorch (e.g. PyTorch FSDP) in your training code, and the Training Operator will set the appropriate environment variables for torchrun.
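As a hedged sketch, training code using FSDP might look like the following; it assumes the NCCL backend and GPU workers, and the model, batch size, and training loop are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun, launched inside the operator-created pods, provides RANK,
    # LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model for illustration; FSDP shards its parameters, gradients,
    # and optimizer state across the workers.
    model = FSDP(torch.nn.Linear(1024, 1024).cuda())
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):  # placeholder training loop
        inputs = torch.randn(32, 1024, device=local_rank)
        loss = model(inputs).square().mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```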

Distributed Training for TensorFlow

This diagram shows how the Training Operator creates the TensorFlow parameter server (PS) and workers for PS distributed training.

Distributed TFJob

The user is responsible for writing the training code using native TensorFlow Distributed APIs and creating a TFJob with the required number of PSs, workers, and GPUs using the Training Operator Python SDK. The Training Operator then creates Kubernetes pods with the appropriate TF_CONFIG environment variable to start the distributed TensorFlow training job.
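For reference, TF_CONFIG is a single JSON-encoded environment variable describing the cluster layout and the pod's own role. The sketch below shows what its value might look like on worker 0 of a job with one PS and two workers; the service names are illustrative placeholders, not the operator's exact naming:

```python
import json
import os

# Illustrative TF_CONFIG value for worker 0 of a TFJob with one
# parameter server and two workers (addresses are placeholders).
example_tf_config = {
    "cluster": {
        "ps": ["tfjob-example-ps-0.default.svc:2222"],
        "worker": [
            "tfjob-example-worker-0.default.svc:2222",
            "tfjob-example-worker-1.default.svc:2222",
        ],
    },
    "task": {"type": "worker", "index": 0},
}

# Inside a pod, the training code reads the real value from the environment.
tf_config = json.loads(os.environ.get("TF_CONFIG", json.dumps(example_tf_config)))
print(tf_config["task"])
```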

The parameter server splits the training data for every worker and averages the model weights based on the gradients produced by each worker.

You can define various distributed strategies supported by TensorFlow in your training code, and the Training Operator will set the appropriate TF_CONFIG environment variable.
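As one possible sketch, training code can pick up TF_CONFIG through TensorFlow's TFConfigClusterResolver; this example assumes tf.distribute.experimental.ParameterServerStrategy with a chief acting as coordinator, and the model shown is a placeholder:

```python
import tensorflow as tf

# TFConfigClusterResolver reads the TF_CONFIG value set by the operator,
# so no extra cluster wiring is needed in the training code.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if cluster_resolver.task_type in ("worker", "ps"):
    # Workers and parameter servers start a gRPC server and wait for work.
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol="grpc",
    )
    server.join()
else:
    # The chief acts as the coordinator and drives training.
    strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # toy model
        model.compile(optimizer="adam", loss="mse")
    # ... model.fit(...) with a distributed dataset ...
```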
