Create, use, pass, and track ML artifacts

Most machine learning pipelines aim to create one or more machine learning artifacts, such as a model, dataset, evaluation metrics, etc.

KFP provides first-class support for creating machine learning artifacts via the dsl.Artifact class and other artifact subclasses. KFP maps these artifacts to their underlying ML Metadata schema title, the canonical name for the artifact type.

In general, artifacts and their associated annotations serve several purposes:

  • To provide logical groupings of component/pipeline input/output types
  • To provide a convenient mechanism for writing to object storage via the task’s local filesystem
  • To enable type checking of pipelines that create ML artifacts
  • To make the contents of some artifact types easily observable via special UI rendering

The following training_component demonstrates usage of both input and output artifacts using the traditional artifact syntax:

from kfp.dsl import Input, Output, Dataset, Model

@dsl.component
def training_component(dataset: Input[Dataset], model: Output[Model]):
    """Trains an output Model on an input Dataset."""
    with open(dataset.path) as f:
        contents = f.read()

    # ... train tf_model model on contents of dataset ...

    tf_model.save(model.path)
    tf_model.metadata['framework'] = 'tensorflow'

This training_component does the following:

  1. Accepts an input dataset and declares an output model
  2. Reads the input dataset’s content from the local filesystem
  3. Trains a model (omitted)
  4. Saves the model as a component output
  5. Sets some metadata about the saved model

As illustrated by training_component, artifacts are simply a thin wrapper around some artifact properties, including the .path from which the artifact can be read/written and the artifact’s .metadata. The following sections describe these properties and other aspects of artifacts in detail.

Artifact properties

To use create and consume artifacts from components, you’ll use the available properties on artifact instances. Artifacts feature four properties:

  • name, the name of the artifact (cannot be overwritten on Vertex Pipelines).
  • .uri, the location of your artifact object. For input artifacts, this is where the object resides currently. For output artifacts, this is where you will write the artifact from within your component.
  • .metadata, additional key-value pairs about the artifact.
  • .path, a local path that corresponds to the artifact’s .uri.

The artifact .path attribute is particularly helpful. When you write the contents of your artifact to the location provided by the artifact’s .path attribute, the pipelines backend will handle copying the file at .path to the URI at .uri automatically, allowing you to create artifact files within a component by only interacting with the task’s local filesystem.

As you will see more in the other examples in this section, each of these properties are accessible on artifacts inside components:

from kfp import dsl
from kfp.dsl import Dataset
from kfp.dsl import Input

@dsl.component
def print_artifact_properties(dataset: Input[Dataset]):
    with open(dataset.path) as f:
        lines = f.readlines()
    
    print('Information about the artifact')
    print('Name:', dataset.name)
    print('URI:', dataset.uri)
    print('Path:', dataset.path)
    print('Metadata:', dataset.metadata)
    
    return len(lines)

Note that input artifacts should be treated as immutable. You should not try to modify the contents of the file at .path and any changes to the artifact’s properties will not affect the artifact’s metadata in ML Metadata.

Artifacts in components

The KFP SDK supports two forms of artifact authoring syntax for components: traditional and Pythonic.

The traditional artifact authoring syntax is the original artifact authoring style provided by the KFP SDK. The traditional artifact authoring syntax is supported for both Python Components and Container Components. It is supported at runtime by the open source KFP backend and the Google Cloud Vertex Pipelines backend.

The Pythonic artifact authoring syntax provides an alterative artifact I/O syntax that is familiar to Python developers. The Pythonic artifact authoring syntax is supported for Python Components only. This syntax is not supported for Container Components. It is currently only supported at runtime by the Google Cloud Vertex Pipelines backend.

Traditional artifact syntax

When using the traditional artifact authoring syntax, all artifacts are provided to the component function as an input wrapped in an Input or Output type marker.

def my_component(in_artifact: Input[Artifact], out_artifact: Output[Artifact]):
    ...

For input artifacts, you can read the artifact using its .uri or .path attribute.

For output artifacts, a pre-constructed output artifact will be passed into the component. You can update the output artifact’s properties in place and write the artifact’s contents to the artifact’s .path or .uri attribute. You should not return the artifact instance from your component. For example:

from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output

@dsl.component
def train_model(dataset: Input[Dataset], model: Output[Model]):
    with open(dataset.path) as f:
        dataset_lines = f.readlines()

    # train a model
    trained_model = ...
    
    trained_model.save(model.path)
    model.metadata['samples'] = len(dataset_lines)

New Pythonic artifact syntax

To use the Pythonic artifact authoring syntax, simply annotate your components with the artifact class as you would when writing normal Python.

def my_component(in_artifact: Artifact) -> Artifact:
    ...

Inside the body of your component, you can read artifacts passed in as input (no change from the traditional artifact authoring syntax). For artifact outputs, you’ll construct the artifact in your component code, then return the artifact as an output. For example:

from kfp import dsl
from kfp.dsl import Dataset, Model

@dsl.component
def train_model(dataset: Dataset) -> Model:
    with open(dataset.path) as f:
        dataset_lines = f.readlines()

    # train a model
    trained_model = ...

    model_artifact = Model(uri=dsl.get_uri(), metadata={'samples': len(dataset_lines)})
    trained_model.save(model_artifact.path)
    
    return model_artifact

For a typical output artifact which is written to one or more files, the dsl.get_uri function can be used at runtime to obtain a unique object storage URI that corresponds to the current task. The optional suffix parameter is useful for avoiding path collisions when your component has multiple artifact outputs.

Multiple output artifacts should be specified similarly to multiple output parameters:

from kfp import dsl
from kfp.dsl import Dataset, Model
from typing import NamedTuple

@dsl.component
def train_multiple_models(
    dataset: Dataset,
) -> NamedTuple('outputs', model1=Model, model2=Model):
    with open(dataset.path) as f:
        dataset_lines = f.readlines()

    # train a model
    trained_model1 = ...
    trained_model2 = ...
    
    model_artifact1 = Model(uri=dsl.get_uri(suffix='model1'), metadata={'samples': len(dataset_lines)})
    trained_model1.save(model_artifact1.path)
    
    model_artifact2 = Model(uri=dsl.get_uri(suffix='model2'), metadata={'samples': len(dataset_lines)})
    trained_model2.save(model_artifact2.path)
    
    outputs = NamedTuple('outputs', model1=Model, model2=Model)
    return outputs(model1=model_artifact1, model2=model_artifact2)

Artifacts in pipelines

Irrespective of whether your components use the Pythonic or traditional artifact authoring syntax, pipelines that use artifacts should be annotated with the Pythonic artifact syntax:

def my_pipeline(in_artifact: Artifact) -> Artifact:
    ...

See the following pipeline which accepts a Dataset as input and outputs a Model, surfaced from the inner component train_model:

from kfp import dsl
from kfp.dsl import Dataset, Model

@dsl.pipeline
def augment_and_train(dataset: Dataset) -> Model:
    augment_task = augment_dataset(dataset=dataset)
    return train_model(dataset=augment_task.output).output

The KFP SDK compiler will type check artifact usage according to the rules described in Type Checking.

Please see Pipeline Basics for comprehensive documentation on how to author a pipeline.

Lists of artifacts

KFP supports input lists of artifacts, annotated as List[Artifact] or Input[List[Artifact]]. This is useful for collecting output artifacts from a loop of tasks using the dsl.ParallelFor and dsl.Collected control flow objects.

Pipelines can also return an output list of artifacts by using a -> List[Artifact] return annotation and returning a dsl.Collected instance.

Both consuming an input list of artifacts and returning an output list of artifacts from a pipeline are described in Pipeline Control Flow: Parallel looping. Creating output lists of artifacts from a single-step component is not currently supported.

Artifact types

The artifact annotation indicates the type of the artifact. KFP provides several artifact types within the DSL:

DSL objectArtifact schema title
Artifactsystem.Artifact
Datasetsystem.Dataset
Modelsystem.Model
Metricssystem.Metrics
ClassificationMetricssystem.ClassificationMetrics
SlicedClassificationMetricssystem.SlicedClassificationMetrics
HTMLsystem.HTML
Markdownsystem.Markdown

Artifact, Dataset, Model, and Metrics are the most generic and commonly used artifact types. Artifact is the default artifact base type and should be used in cases where the artifact type does not fit neatly into another artifact category. Artifact is also compatible with all other artifact types. In this sense, the Artifact type is also an artifact “any” type.

On the KFP open source UI, ClassificationMetrics, SlicedClassificationMetrics, HTML, and Markdown provide special UI rendering to make the contents of the artifact easily observable.

Feedback

Was this page helpful?


Last modified June 11, 2024: Fixed broken links (9665cfe8)