Getting Started

Get started with Training Operator

This guide describes how to get started with Training Operator and run a few simple examples.

Prerequisites

You need to install the following components to run examples:

Training Operator control plane installed.
Training Python SDK installed.

Getting Started with PyTorchJob

You can create your first Training Operator distributed PyTorchJob using Python SDK. Define the training function that implements end-to-end model training. Each Worker will execute this function on the appropriate Kubernetes Pod. Usually, this function contains logic to download dataset, create model, and train the model.

Training Operator will automatically set WORLD_SIZE and RANK for the appropriate PyTorchJob worker to perform PyTorch Distributed Data Parallel (DDP).

If you install Training Operator as part of Kubeflow Platform, you can open a new Kubeflow Notebook to run this script. If you install Training Operator standalone, make sure that you configure local kubeconfig to access your Kubernetes cluster where you installed Training Operator.

def train_func():
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms

    # [1] Setup PyTorch DDP. WORLD_SIZE and RANK environments will be set by Training Operator.
    torch.distributed.init_process_group(backend="nccl")
    Distributor = torch.nn.parallel.DistributedDataParallel

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to GPU and distributor.
    device = "cuda"
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

# Start PyTorchJob with 3 Workers and 1 GPUs per Worker.
TrainingClient().create_job(
    name="pytorch-ddp",
    train_func=train_func,
    num_workers=3,
    resources_per_worker={"gpu": "1"},
)

Getting Started with TFJob

Similar to PyTorchJob example, you can use the Python SDK to create your first distributed TensorFlow job. Run the following script to create TFJob with pre-created Docker image: docker.io/kubeflow/tf-mnist-with-summaries:latest that contains distributed TensorFlow code:

from kubeflow.training import TrainingClient

TrainingClient().create_job(
    name="tensorflow-dist",
    job_kind="TFJob",
    base_image="docker.io/kubeflow/tf-mnist-with-summaries:latest",
    num_workers=3,
)

Run the following API to get logs from your TFJob:

TrainingClient().get_job_logs(
    name="tensorflow-dist",
    job_kind="TFJob",
    follow=True,
)

Next steps

Run FashionMNIST example with using Training Operator Python SDK.
Learn more about the PyTorchJob APIs.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified May 8, 2024: Katib: Reorganized Katib Docs (#3723) (9903837f)