Installation
This guide describes how to install Training Operator on your Kubernetes cluster. Training Operator is a lightweight Kubernetes controller that orchestrates appropriate Kubernetes workloads to perform distributed ML training and fine-tuning.
Prerequisites
These are minimal requirements to install Training Operator:
- Kubernetes >= 1.27
kubectl
>= 1.27- Python >= 3.7
Installing Training Operator
You need to install Training Operator control plane and Python SDK to create training jobs.
Installing Control Plane
You can skip these steps if you have already installed Kubeflow platform using manifests or package distributions. Kubeflow platform includes Training Operator.
You can install Training Operator as a standalone component.
Run the following command to install the stable release of Training Operator control plane: v1.7.0
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.7.0"
Run the following command to install the latest changes of Training Operator control plane:
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=master"
After installing it, you can verify that Training Operator controller is running as follows:
$ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
training-operator-658c68d697-46zmn 1/1 Running 0 90s
Run this command to check installed Kubernetes CRDs for each supported ML framework:
$ kubectl get crd
mpijobs.kubeflow.org 2023-06-09T00:31:07Z
mxjobs.kubeflow.org 2023-06-09T00:31:05Z
paddlejobs.kubeflow.org 2023-06-09T00:31:09Z
pytorchjobs.kubeflow.org 2023-06-09T00:31:06Z
tfjobs.kubeflow.org 2023-06-09T00:31:04Z
xgboostjobs.kubeflow.org 2023-06-09T00:31:04Z
Installing Python SDK
Training Operator implements Python SDK to simplify creation of distributed training and fine-tuning jobs for Data Scientists.
Run the following command to install the stable release of Training Operator SDK:
pip install -U kubeflow-training
You can also install the Python SDK using the specific GitHub commit, for example:
pip install git+https://github.com/kubeflow/training-operator.git@7345e33b333ba5084127efe027774dd7bed8f6e6#subdirectory=sdk/python
Install Python SDK with Fine-Tuning Capabilities
If you want to use train
API for LLM fine-tuning with Training Operator, install the Python SDK
with the additional packages from HuggingFace:
pip install -U kubeflow-training[huggingface]
Next steps
Run your first Training Operator Job by following the Getting Started guide.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.