LLM Fine-Tuning with Training Operator
This page shows how the Training Operator implements the `train` API to fine-tune LLMs.
Architecture
The following diagram shows how the `train` Python API works:
1. Once the user executes the `train` API, the Training Operator creates a PyTorchJob with the appropriate resources to fine-tune the LLM (see the example call after this list).
2. The storage initializer InitContainer is added to PyTorchJob worker 0 to download the pre-trained model and dataset with the provided parameters.
3. A PVC with the `ReadOnlyMany` access mode is attached to each PyTorchJob worker to distribute the model and dataset across Pods. Note: your Kubernetes cluster must support volumes with the `ReadOnlyMany` access mode; otherwise, use a single PyTorchJob worker.
4. Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.
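For reference, a `train` call might look like the sketch below. It is based on the Kubeflow Training Python SDK; the model name, dataset, and resource values are illustrative, and exact parameter names can vary between SDK versions.

```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# Submit a PyTorchJob that fine-tunes an LLM across two workers.
TrainingClient().train(
    name="fine-tune-bert",
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "10G"},
    # Model provider: downloads the pre-trained model from HuggingFace Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset provider: downloads the dataset from HuggingFace Hub.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # LLM Trainer: HuggingFace training arguments plus an optional LoRA config.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
    ),
)
```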
The Training Operator implements the `train` API with these pre-created components:
Model Provider
The model provider downloads the pre-trained model. Currently, the Training Operator supports the HuggingFace model provider, which downloads the model from the HuggingFace Hub.
You can implement your own model provider by using this abstract base class.
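As a rough illustration, a custom provider could look like the sketch below. The hook names `load_config` and `download_model` are assumptions about the base-class interface; consult the abstract base class linked above for the actual contract.

```python
import json
from dataclasses import dataclass

# Illustrative only: in practice, subclass the SDK's abstract model
# provider base class; the method names below are assumptions.

@dataclass
class MyRegistryModelParams:
    model_uri: str                       # e.g. "myregistry://models/my-llm"
    download_dir: str = "/workspace/model"

class MyRegistryModelProvider:
    def load_config(self, serialized_args: str):
        # The storage initializer passes provider parameters as a
        # serialized string; deserialize them into a typed config.
        self.config = MyRegistryModelParams(**json.loads(serialized_args))

    def download_model(self):
        # Fetch the pre-trained model into the shared volume so that
        # every PyTorchJob worker can read it.
        raise NotImplementedError("fetch weights from your model registry here")
```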
Dataset Provider
The dataset provider downloads the dataset. Currently, the Training Operator supports the AWS S3 and HuggingFace dataset providers.
You can implement your own dataset provider by using this abstract base class.
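For example, to pull the dataset from S3 instead of the HuggingFace Hub, you would swap the dataset parameters in the `train` call shown earlier. The sketch below assumes the SDK's `S3DatasetParams` class and its field names, which may differ across SDK versions; the bucket, key, and credential values are placeholders.

```python
from kubeflow.storage_initializer.s3 import S3DatasetParams

# Illustrative S3 dataset provider parameters.
dataset_provider_parameters = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-training-data",      # hypothetical bucket
    file_key="datasets/reviews.csv",     # hypothetical object key
    region_name="us-east-1",
    access_key="<AWS_ACCESS_KEY_ID>",
    secret_key="<AWS_SECRET_ACCESS_KEY>",
)
```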
LLM Trainer
The trainer implements the training loop to fine-tune the LLM. Currently, the Training Operator supports the HuggingFace trainer to fine-tune LLMs.
You can implement your own trainer for other ML use cases such as image classification, voice recognition, etc.
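At its core, a custom trainer implements a training loop that runs on every PyTorchJob worker. The sketch below shows the general shape using plain PyTorch distributed data parallelism; it is not the Training Operator's trainer interface, just an illustration of what such a loop does.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def train_loop(model, data_loader, epochs=1, lr=1e-4):
    # Each PyTorchJob worker joins the process group using the environment
    # variables injected by the operator (RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT).
    dist.init_process_group("gloo")
    model = DistributedDataParallel(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()   # gradients are all-reduced across workers
            optimizer.step()
    dist.destroy_process_group()
```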