LLM Fine-Tuning with Training Operator
This page shows how the Training Operator implements the `train` API to fine-tune LLMs.
Architecture
The following diagram shows how the `train` Python API works:
1. Once the user executes the `train` API, the Training Operator creates a PyTorchJob with the appropriate resources to fine-tune the LLM (see the example call after this list).
2. The storage initializer InitContainer is added to PyTorchJob worker 0 to download the pre-trained model and dataset with the provided parameters.
3. A PVC with the `ReadOnlyMany` access mode is attached to each PyTorchJob worker to distribute the model and dataset across Pods. Note: your Kubernetes cluster must support volumes with the `ReadOnlyMany` access mode; otherwise, use a single PyTorchJob worker.
4. Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.
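For reference, a `train` call might look like the sketch below. It is based on the Kubeflow Training Python SDK; the model name, dataset, and resource values are illustrative, and exact parameter names can vary between SDK versions.

```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# Submit a PyTorchJob that fine-tunes an LLM across two workers.
TrainingClient().train(
    name="fine-tune-bert",
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "10G"},
    # Model provider: downloads the pre-trained model from HuggingFace Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset provider: downloads the dataset from HuggingFace Hub.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # LLM Trainer: HuggingFace training arguments plus an optional LoRA config.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
    ),
)
```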
The Training Operator implements the `train` API with these pre-created components:
Model Provider
The model provider downloads the pre-trained model. Currently, the Training Operator supports the HuggingFace model provider, which downloads the model from the HuggingFace Hub.
You can implement your own model provider by using this abstract base class.
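As a rough illustration, a custom provider could look like the sketch below. The hook names `load_config` and `download_model` are assumptions about the base-class interface; consult the abstract base class linked above for the actual contract.

```python
import json
from dataclasses import dataclass

# Illustrative only: in practice, subclass the SDK's abstract model
# provider base class; the method names below are assumptions.

@dataclass
class MyRegistryModelParams:
    model_uri: str                       # e.g. "myregistry://models/my-llm"
    download_dir: str = "/workspace/model"

class MyRegistryModelProvider:
    def load_config(self, serialized_args: str):
        # The storage initializer passes provider parameters as a
        # serialized string; deserialize them into a typed config.
        self.config = MyRegistryModelParams(**json.loads(serialized_args))

    def download_model(self):
        # Fetch the pre-trained model into the shared volume so that
        # every PyTorchJob worker can read it.
        raise NotImplementedError("fetch weights from your model registry here")
```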
Dataset Provider
The dataset provider downloads the dataset. Currently, the Training Operator supports the AWS S3 and HuggingFace dataset providers.
You can implement your own dataset provider by using this abstract base class.
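For example, to pull the dataset from S3 instead of the HuggingFace Hub, you would swap the dataset parameters in the `train` call shown earlier. The sketch below assumes the SDK's `S3DatasetParams` class and its field names, which may differ across SDK versions; the bucket, key, and credential values are placeholders.

```python
from kubeflow.storage_initializer.s3 import S3DatasetParams

# Illustrative S3 dataset provider parameters.
dataset_provider_parameters = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-training-data",      # hypothetical bucket
    file_key="datasets/reviews.csv",     # hypothetical object key
    region_name="us-east-1",
    access_key="<AWS_ACCESS_KEY_ID>",
    secret_key="<AWS_SECRET_ACCESS_KEY>",
)
```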
LLM Trainer
The trainer implements the training loop to fine-tune the LLM. Currently, the Training Operator supports the HuggingFace trainer to fine-tune LLMs.
You can implement your own trainer for other ML use cases such as image classification, voice recognition, etc.
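At its core, a custom trainer implements a training loop that runs on every PyTorchJob worker. The sketch below shows the general shape using plain PyTorch distributed data parallelism; it is not the Training Operator's trainer interface, just an illustration of what such a loop does.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def train_loop(model, data_loader, epochs=1, lr=1e-4):
    # Each PyTorchJob worker joins the process group using the environment
    # variables injected by the operator (RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT).
    dist.init_process_group("gloo")
    model = DistributedDataParallel(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()   # gradients are all-reduced across workers
            optimizer.step()
    dist.destroy_process_group()
```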