Distributed Training with Training Operator
This page describes the distributed training strategies that can be achieved with Training Operator.
Distributed Training for PyTorch
This diagram shows how Training Operator creates PyTorch workers for the ring all-reduce algorithm.
You are responsible for writing the training code using native PyTorch Distributed APIs and for creating a PyTorchJob with the required number of workers and GPUs using the Training Operator Python SDK.
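For example, a PyTorchJob could be submitted from Python roughly as follows. This is a minimal sketch: the TrainingClient.create_job call and its num_workers and resources_per_worker arguments reflect the kubeflow-training SDK, but the exact names can differ between SDK versions, and train_func is a placeholder for your training code (a sketch of such a function is shown further below).

```python
from kubeflow.training import TrainingClient

def train_func():
    # Your training code using native PyTorch Distributed APIs goes here
    # (see the PyTorch Distributed sketch further below).
    ...

client = TrainingClient()

# Submit a PyTorchJob with 4 workers, each requesting 1 GPU.
# Argument names are assumptions based on the kubeflow-training SDK
# and may vary by SDK version.
client.create_job(
    name="pytorch-ddp-example",
    train_func=train_func,
    num_workers=4,
    resources_per_worker={"gpu": 1},
)
```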
Then, Training Operator creates Kubernetes pods with the appropriate environment variables for the torchrun CLI to start the distributed PyTorch training job.
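For reference, the per-worker training code can initialize the process group straight from the environment, because torchrun provides RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. The sketch below uses a placeholder model and random data; it is not a prescribed template.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_func():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT,
    # so the default "env://" initialization picks them up automatically.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Placeholder model; replace with your own network.
    model = torch.nn.Linear(10, 1).to(device)
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()
```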
At the end of the ring all-reduce algorithm, the gradients are synchronized on every worker (g1, g2, g3, g4) and the model is trained.
You can define various distributed strategies supported by PyTorch in your training code (e.g. PyTorch FSDP), and Training Operator will set the appropriate environment variables for torchrun.
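For example, switching from DDP to FSDP is only a change inside the training function; the rendezvous environment variables provided by torchrun stay the same. A rough sketch with a placeholder model:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train_func():
    # torchrun supplies the same rendezvous environment variables as for DDP.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Linear(10, 1).cuda()  # placeholder model
    model = FSDP(model)  # parameters are sharded across all workers

    # ... regular training loop, as in the DDP sketch above ...

    dist.destroy_process_group()
```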
Distributed Training for TensorFlow
This diagram shows how Training Operator creates the TensorFlow parameter servers (PS) and workers for PS distributed training.
You are responsible for writing the training code using native TensorFlow Distributed APIs and for creating a TFJob with the required number of PSs, workers, and GPUs using the Training Operator Python SDK.
Then, Training Operator creates Kubernetes pods with the appropriate TF_CONFIG environment variable to start the distributed TensorFlow training job.
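TF_CONFIG is a JSON document describing the cluster and the role of the current pod. The sketch below shows roughly what it looks like and how a cluster resolver can pick it up; the hostnames are illustrative, the exact cluster layout depends on the replica specs of your TFJob, and in TensorFlow 2 the parameter server strategy is typically constructed on the chief/coordinator task.

```python
import json
import os
import tensorflow as tf

# TF_CONFIG set by Training Operator looks roughly like this
# (hostnames and ports are illustrative):
# {
#   "cluster": {
#     "ps":     ["tfjob-ps-0:2222"],
#     "worker": ["tfjob-worker-0:2222", "tfjob-worker-1:2222"]
#   },
#   "task": {"type": "worker", "index": 0}
# }
tf_config = json.loads(os.environ["TF_CONFIG"])
print("Running as", tf_config["task"]["type"], tf_config["task"]["index"])

# The resolver reads TF_CONFIG directly, so no addresses are hard-coded.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)
```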
The training data is split across the workers, and the parameter servers update the shared model weights from the gradients produced by every worker.
You can define various distributed strategies supported by TensorFlow in your training code, and Training Operator will set the appropriate TF_CONFIG environment variable.
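For example, synchronous multi-worker training works without any operator-side changes, since MultiWorkerMirroredStrategy also discovers the cluster from TF_CONFIG; a minimal sketch with a placeholder Keras model:

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; replace with your own network.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) then runs synchronously across all TFJob workers.
```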