Building Components

A tutorial on how to create components and use them in a pipeline

A pipeline component is a self-contained set of code that performs one step in your ML workflow. This document describes the concepts required to build components, and demonstrates how to get started building components.

Before you begin

Run the following command to install the Kubeflow Pipelines SDK.

$ pip3 install kfp --upgrade

For more information about the Kubeflow Pipelines SDK, see the SDK reference guide.

Understanding pipeline components

Pipeline components are self-contained sets of code that perform one step in your ML workflow, such as preprocessing data or training a model. To create a component, you must build the component’s implementation and define the component specification.

Your component’s implementation includes the component’s executable code and the Docker container image that the code runs in. Learn more about designing a pipeline component.

Once you have built your component’s implementation, you can define your component’s interface as a component specification. A component specification defines:

  • The component’s inputs and outputs.
  • The container image that your component’s code runs in, the command to use to run your component’s code, and the command-line arguments to pass to your component’s code.
  • The component’s metadata, such as the name and description.

Learn more about creating a component specification.

If your component’s code is implemented as a Python function, use the Kubeflow Pipelines SDK to package your function as a component. Learn more about building Python function-based components.

Designing a pipeline component

When Kubeflow Pipelines executes a component, it starts a container from your container image in a Kubernetes Pod and passes your component’s inputs in as command-line arguments. You can pass small inputs, such as strings and numbers, by value. Larger inputs, such as CSV data, must be passed as paths to files. When your component has finished, the component’s outputs are returned as files.

When you design your component’s code, consider the following:

  • Which inputs can be passed to your component by value? Examples of inputs that you can pass by value include numbers, booleans, and short strings. Any value that you could reasonably pass as a command-line argument can be passed to your component by value. All other inputs are passed to your component by a reference to the input’s path.
  • To return an output from your component, the output’s data must be stored as a file. When you define your component, you let Kubeflow Pipelines know what outputs your component produces. When your pipeline runs, Kubeflow Pipelines passes the paths that you use to store your component’s outputs as inputs to your component.
  • Outputs are typically written to a single file. In some cases, you may need to return a directory of files as an output. In this case, create a directory at the output path and write the output files to that location. In both cases, it may be necessary to create parent directories if they do not exist.
  • Your component’s goal may be to create a dataset in an external service, such as a BigQuery table. In this case, it may make sense for the component to output an identifier for the produced data, such as a table name, instead of the data itself. We recommend that you limit this pattern to cases where the data must be put into an external system instead of keeping it inside the Kubeflow Pipelines system.
  • Since your inputs and output paths are passed in as command-line arguments, your component’s code must be able to read inputs from the command line. If your component is built with Python, libraries such as argparse and absl.flags make it easier to read your component’s inputs.
  • Your component’s code can be implemented in any language, so long as it can run in a container image.
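
As a minimal sketch of the directory case described above, the following Python snippet treats the output path as a directory, creates it together with any missing parents, and writes several files into it. The function name write_output_directory and the file names are hypothetical, and a temporary directory stands in for the path that Kubeflow Pipelines would pass on the command line.

```python
import tempfile
from pathlib import Path


def write_output_directory(output_path, files):
    """Write several files under output_path, treating it as a directory."""
    out_dir = Path(output_path)
    # The output path and its parent directories may not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    for file_name, text in files.items():
        (out_dir / file_name).write_text(text)


# A temporary location stands in for the real output path (an assumption
# made only so that this sketch can run anywhere).
with tempfile.TemporaryDirectory() as tmp:
    example_output_path = str(Path(tmp) / "outputs" / "shards")
    write_output_directory(
        example_output_path,
        {"part-0.txt": "one\ntwo\n", "part-1.txt": "three\n"},
    )
    written = sorted(p.name for p in Path(example_output_path).iterdir())
```

A downstream component that receives this output as an input path can then list the directory to find the individual files.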

The following is an example program written in Python 3. This program reads a given number of lines from an input file and writes those lines to an output file. This means that the program accepts three command-line parameters:

  • The path to the input file.
  • The number of lines to read.
  • The path to the output file.

#!/usr/bin/env python3
import argparse
from pathlib import Path

# Function doing the actual work (outputs the first N lines of a text file)
def do_work(input1_file, output1_file, param1):
    for x, line in enumerate(input1_file):
        if x >= param1:
            break
        output1_file.write(line)
  
# Defining and parsing the command-line arguments
parser = argparse.ArgumentParser(description='My program description')
# Paths must be passed in, not hardcoded
parser.add_argument('--input1-path', type=str,
  help='Path of the local file containing the Input 1 data.')
parser.add_argument('--output1-path', type=str,
  help='Path of the local file where the Output 1 data should be written.')
parser.add_argument('--param1', type=int, default=100,
  help='The number of lines to read from the input and write to the output.')
args = parser.parse_args()

# Creating the directory where the output file is created (the directory
# may or may not exist).
Path(args.output1_path).parent.mkdir(parents=True, exist_ok=True)

with open(args.input1_path, 'r') as input1_file:
    with open(args.output1_path, 'w') as output1_file:
        do_work(input1_file, output1_file, args.param1)

If this program is saved as program.py, the command-line invocation of this program is:

python3 program.py --input1-path <path-to-the-input-file> \
  --param1 <number-of-lines-to-read> \
  --output1-path <path-to-write-the-output-to> 
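
Before containerizing the program, it can help to check its core logic in isolation. The following sketch copies do_work from the program above and exercises it with in-memory files from the standard io module, so no files or command-line arguments are needed. The sample text is made up for illustration.

```python
import io


def do_work(input1_file, output1_file, param1):
    # Same logic as in program.py: copy at most param1 lines.
    for x, line in enumerate(input1_file):
        if x >= param1:
            break
        output1_file.write(line)


# In-memory files stand in for the real input and output files.
source = io.StringIO("one\ntwo\nthree\nfour\n")
target = io.StringIO()
do_work(source, target, 2)
result = target.getvalue()
```

Here result holds only the first two lines of the input, which is the behavior that the --param1 argument controls.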

Containerize your component’s code

For Kubeflow Pipelines to run your component, your component must be packaged as a Docker container image and published to a container registry that your Kubernetes cluster can access. The steps to create a container image are not specific to Kubeflow Pipelines. To make things easier for you, this section provides some guidelines on standard container creation.

  1. Create a Dockerfile for your container. A Dockerfile specifies:

    • The base container image. For example, the operating system that your code runs on.
    • Any dependencies that need to be installed for your code to run.
    • Files to copy into the container, such as the runnable code for this component.

    The following is an example Dockerfile.

    FROM python:3.7
    RUN python3 -m pip install keras
    COPY ./src /pipelines/component/src
    

    In this example:

    • The base container image is python:3.7.
    • The keras Python package is installed in the container image.
    • Files in your ./src directory are copied into /pipelines/component/src in the container image.
  2. Create a script named build_image.sh that uses Docker to build your container image and push your container image to a container registry. Your Kubernetes cluster must be able to access your container registry to run your component. Examples of container registries include Google Container Registry and Docker Hub.

    The following example builds a container image, pushes it to a container registry, and outputs the strict image name. It is a best practice to use the strict image name in your component specification to ensure that you are using the expected version of a container image in each component execution.

    #!/bin/bash -e
    image_name=gcr.io/my-org/my-image
    image_tag=latest
    full_image_name=${image_name}:${image_tag}
    
    cd "$(dirname "$0")" 
    docker build -t "${full_image_name}" .
    docker push "$full_image_name"
    
    # Output the strict image name, which contains the sha256 image digest
    docker inspect --format="{{index .RepoDigests 0}}" "${full_image_name}"
    

    In the preceding example:

    • The image_name specifies the full name of your container image in the container registry.
    • The image_tag specifies that this image should be tagged as latest.

    Save this file and run the following to make this script executable.

    chmod +x build_image.sh
    
  3. Run your build_image.sh script to build your container image and push it to a container registry.

  4. Use docker run to test your container image locally. If necessary, revise your application and Dockerfile until your application works as expected in the container.
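
One way to approach the local test in step 4 is to bind-mount a host directory into the container and run the program against files in that directory. The sketch below only builds the docker run argument list. The helper name build_docker_run_command, the /data mount point, and the host directory are hypothetical, and the sketch assumes the image defines no ENTRYPOINT, as in the example Dockerfile.

```python
import os


def build_docker_run_command(image, host_dir, input_name, output_name, lines):
    """Build an argv list that runs program.py inside the container image."""
    mount_point = "/data"  # hypothetical mount point inside the container
    return [
        "docker", "run", "--rm",
        # Bind-mount a host directory so the container can read the input
        # file and write the output file.
        "-v", f"{os.path.abspath(host_dir)}:{mount_point}",
        image,
        "python3", "/pipelines/component/src/program.py",
        "--input1-path", f"{mount_point}/{input_name}",
        "--param1", str(lines),
        "--output1-path", f"{mount_point}/{output_name}",
    ]


command = build_docker_run_command(
    "gcr.io/my-org/my-image:latest", "/tmp/component-test",
    "input.txt", "output.txt", 5,
)
# To run the container: subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting problems when paths contain spaces.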

Creating a component specification

To create a component from your containerized program, you must create a component specification that defines the component’s interface and implementation. The following sections provide an overview of how to create a component specification by demonstrating how to define the component’s implementation, interface, and metadata.

To learn more about defining a component specification, see the component specification reference guide.

Define your component’s implementation

The following example creates a component specification YAML and defines the component’s implementation.

  1. Create a file named component.yaml and open it in a text editor.

  2. Create your component’s implementation section and specify the strict name of your container image. The strict image name is provided when you run your build_image.sh script.

    implementation:
      container:
        # The strict name of a container image that you've pushed to a container registry.
        image: gcr.io/my-org/my-image@sha256:a172..752f
    
  3. Define a command for your component’s implementation. This field specifies the command-line arguments that are used to run your program in the container.

    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f
        # command is a list of strings (command-line arguments). 
        # The YAML language has two syntaxes for lists and you can use either of them. 
        # Here we use the "flow syntax" - comma-separated strings inside square brackets.
        command: [
          python3, 
          # Path of the program inside the container
          /pipelines/component/src/program.py,
          --input1-path,
          {inputPath: Input 1},
          --param1, 
          {inputValue: Parameter 1},
          --output1-path, 
          {outputPath: Output 1},
        ]
    

    The command is formatted as a list of strings. Each string in the command is a command-line argument or a placeholder. At runtime, placeholders are replaced with an input or output. In the preceding example, two inputs and one output path are passed into a Python script at /pipelines/component/src/program.py.

    There are three types of input/output placeholders:

    • {inputValue: <input-name>}: This placeholder is replaced with the value of the specified input. This is useful for small pieces of input data, such as numbers or small strings.

    • {inputPath: <input-name>}: This placeholder is replaced with the path to this input as a file. Your component can read the contents of that input at that path during the pipeline run.

    • {outputPath: <output-name>}: This placeholder is replaced with the path where your program writes this output’s data. This lets the Kubeflow Pipelines system read the contents of the file and store it as the value of the specified output.

    The <input-name> name must match the name of an input in the inputs section of your component specification. The <output-name> name must match the name of an output in the outputs section of your component specification.
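
To make the placeholder substitution concrete, the following toy resolver replaces each placeholder in a command list with a value or a path. It is an illustration only, not how Kubeflow Pipelines implements substitution. The function name resolve_command and the /tmp paths are hypothetical, because the system chooses the real paths at runtime.

```python
def resolve_command(command, input_values, input_paths, output_paths):
    """Replace placeholder dicts in a command list with concrete strings."""
    resolved = []
    for item in command:
        if not isinstance(item, dict):
            resolved.append(item)  # plain command-line argument
            continue
        # Each placeholder is a single-key mapping, e.g. {"inputPath": "Input 1"}.
        kind, name = next(iter(item.items()))
        if kind == "inputValue":
            resolved.append(str(input_values[name]))
        elif kind == "inputPath":
            resolved.append(input_paths[name])
        elif kind == "outputPath":
            resolved.append(output_paths[name])
        else:
            raise ValueError(f"Unknown placeholder type: {kind}")
    return resolved


command = [
    "python3", "/pipelines/component/src/program.py",
    "--input1-path", {"inputPath": "Input 1"},
    "--param1", {"inputValue": "Parameter 1"},
    "--output1-path", {"outputPath": "Output 1"},
]

argv = resolve_command(
    command,
    input_values={"Parameter 1": 5},
    # Hypothetical paths, chosen only for this illustration.
    input_paths={"Input 1": "/tmp/inputs/input_1/data"},
    output_paths={"Output 1": "/tmp/outputs/output_1/data"},
)
```

The resulting argv has the same shape as the command-line invocation of program.py shown earlier, with every placeholder replaced by a plain string.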

Define your component’s interface

The following examples demonstrate how to specify your component’s interface.

  1. To define an input in your component.yaml, add an item to the inputs list with the following attributes:

    • name: Human-readable name of this input. Each input’s name must be unique.
    • description: (Optional.) Human-readable description of the input.
    • default: (Optional.) Specifies the default value for this input.
    • type: (Optional.) Specifies the input’s type. Learn more about the types defined in the Kubeflow Pipelines SDK and how type checking works in pipelines and components.
    • optional: Specifies if this input is optional. The value of this attribute is of type Bool, and defaults to False.

    In this example, the Python program has two inputs:

    • Input 1 contains String data.
    • Parameter 1 contains an Integer.

    inputs:
    - {name: Input 1, type: String, description: 'Data for input 1'}
    - {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}
    

    Note: Input 1 and Parameter 1 do not specify any details about how they are stored or how much data they contain. Consider using naming conventions to indicate if inputs are expected to be small enough to pass by value.

  2. After your component finishes its task, the component’s outputs are passed to your pipeline as paths. At runtime, Kubeflow Pipelines creates a path for each of your component’s outputs. These paths are passed as inputs to your component’s implementation.

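    To define an output in your component specification YAML, add an item to the outputs list with the following attributes:

    • name: Human-readable name of this output. Each output’s name must be unique.
    • description: (Optional.) Human-readable description of the output.
    • type: (Optional.) Specifies the output’s type.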

    In this example, the Python program returns one output. The output is named Output 1 and it contains String data.

    outputs:
    - {name: Output 1, type: String, description: 'Output 1 data.'}
    

    Note: Consider using naming conventions to indicate if this output is expected to be small enough to pass by value. You should limit the amount of data that is passed by value to 200 KB per pipeline run.

  3. After you define your component’s interface, your component.yaml should look similar to the following:

    inputs:
    - {name: Input 1, type: String, description: 'Data for input 1'}
    - {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}
    
    outputs:
    - {name: Output 1, type: String, description: 'Output 1 data.'}
    
    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f
        # command is a list of strings (command-line arguments). 
        # The YAML language has two syntaxes for lists and you can use either of them. 
        # Here we use the "flow syntax" - comma-separated strings inside square brackets.
        command: [
          python3, 
          # Path of the program inside the container
          /pipelines/component/src/program.py,
          --input1-path,
          {inputPath: Input 1},
          --param1, 
          {inputValue: Parameter 1},
          --output1-path, 
          {outputPath: Output 1},
        ]
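
Because every placeholder name must match a declared input or output, a quick consistency check can catch typos before you run a pipeline. The sketch below works on a plain Python dictionary that mirrors the structure of component.yaml, as a YAML parser would produce it. The helper name find_undeclared_names is hypothetical, and the misspelled output name is deliberate.

```python
def find_undeclared_names(spec):
    """Return (placeholder type, name) pairs that the interface does not declare."""
    input_names = {item["name"] for item in spec.get("inputs", [])}
    output_names = {item["name"] for item in spec.get("outputs", [])}
    problems = []
    for arg in spec["implementation"]["container"]["command"]:
        if not isinstance(arg, dict):
            continue  # plain string argument, nothing to check
        for kind, name in arg.items():
            declared = output_names if kind == "outputPath" else input_names
            if name not in declared:
                problems.append((kind, name))
    return problems


spec = {
    "inputs": [
        {"name": "Input 1", "type": "String"},
        {"name": "Parameter 1", "type": "Integer", "default": "100"},
    ],
    "outputs": [{"name": "Output 1", "type": "String"}],
    "implementation": {"container": {
        "image": "gcr.io/my-org/my-image@sha256:a172..752f",
        "command": [
            "python3", "/pipelines/component/src/program.py",
            "--input1-path", {"inputPath": "Input 1"},
            "--param1", {"inputValue": "Parameter 1"},
            "--output1-path", {"outputPath": "Output 2"},  # deliberate typo
        ],
    }},
}

problems = find_undeclared_names(spec)
```

In this sketch, problems contains a single entry for the misspelled Output 2, and an empty list would indicate that all placeholders refer to declared names.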
    

Specify your component’s metadata

To define your component’s metadata, add the name and description fields to your component.yaml:

name: Get Lines
description: Gets the specified number of lines from the input file.

inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}

implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments). 
    # The YAML language has two syntaxes for lists and you can use either of them. 
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3, 
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1, 
      {inputValue: Parameter 1},
      --output1-path, 
      {outputPath: Output 1},
    ]

Using your component in a pipeline

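You can use the Kubeflow Pipelines SDK to load your component using methods such as the following:

  • kfp.components.load_component_from_file: Loads your component from the path to a component.yaml file.
  • kfp.components.load_component_from_url: Loads a component.yaml file from a URL.
  • kfp.components.load_component_from_text: Loads your component specification YAML from a string.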

These functions create a factory function that you can use to create ContainerOp instances to use as steps in your pipeline. This factory function’s input arguments include your component’s inputs and the paths to your component’s outputs. The function signature may be modified in the following ways to ensure that it is valid and Pythonic.

  • Inputs with default values are placed after the inputs without default values and after the outputs.
  • Input and output names are converted to Pythonic names (spaces and symbols are replaced with underscores and letters are converted to lowercase). For example, an input named Input 1 is converted to input_1.
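
The two rules above can be approximated in a few lines of Python. This is a simplified sketch and not the SDK’s own conversion code: the helper names to_pythonic_name and order_parameters are hypothetical, and collapsing runs of symbols into a single underscore and trimming leading or trailing underscores are assumptions.

```python
import re


def to_pythonic_name(name):
    """Lowercase the name and replace spaces and symbols with underscores."""
    # Collapsing repeated symbols and trimming underscores is an assumption.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def order_parameters(inputs, outputs):
    """Inputs without defaults first, then outputs, then inputs with defaults."""
    required = [to_pythonic_name(i["name"]) for i in inputs if "default" not in i]
    with_default = [to_pythonic_name(i["name"]) for i in inputs if "default" in i]
    output_names = [to_pythonic_name(o["name"]) for o in outputs]
    return required + output_names + with_default


parameter_order = order_parameters(
    inputs=[
        {"name": "Parameter 1", "default": "100"},
        {"name": "Input 1"},
    ],
    outputs=[{"name": "Output 1"}],
)
```

In this sketch, Parameter 1 moves to the end of the list because it is the only input with a default value.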

The following example demonstrates how to load the text of your component specification and run it in a single-step pipeline. Before you run this example, update the component specification to use the component specification you defined in the previous sections.

import kfp
import kfp.components as comp

create_step_get_lines = comp.load_component_from_text("""
name: Get Lines
description: Gets the specified number of lines from the input file.

inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}

implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments). 
    # The YAML language has two syntaxes for lists and you can use either of them. 
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3, 
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1, 
      {inputValue: Parameter 1},
      --output1-path, 
      {outputPath: Output 1},
    ]""")

# create_step_get_lines is a "factory function" that accepts the arguments
# for the component's inputs and output paths and returns a pipeline step
# (ContainerOp instance).
#
# To inspect the create_step_get_lines function in Jupyter Notebook, enter
# "create_step_get_lines(" in a cell and press Shift+Tab.
# You can also get help by entering `help(create_step_get_lines)`,
# `create_step_get_lines?`, or `create_step_get_lines??`.

# Define your pipeline 
def my_pipeline():
    get_lines_step = create_step_get_lines(
        # Input name "Input 1" is converted to pythonic parameter name "input_1"
        input_1='one\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\nten',
        parameter_1='5',
    )

# If you run this command on a Jupyter notebook running on Kubeflow,
# you can exclude the host parameter.
# client = kfp.Client()
client = kfp.Client(host='<your-kubeflow-pipelines-host-name>')

# Compile, upload, and submit this pipeline for execution.
client.create_run_from_pipeline_func(my_pipeline, arguments={})

Organizing the component files

This section provides a recommended way to organize a component’s files. You are not required to organize your files this way. However, using the standard organization makes it possible to reuse the same scripts for testing, image building, and component versioning.

components/<component group>/<component name>/

    src/*            # Component source code files
    tests/*          # Unit tests
    run_tests.sh     # Small script that runs the tests
    README.md        # Documentation. If multiple files are needed, move to docs/.

    Dockerfile       # Dockerfile to build the component container image
    build_image.sh   # Small script that runs docker build and docker push

    component.yaml   # Component definition in YAML format
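
If you create components often, a small script can scaffold this layout for you. The following sketch creates empty placeholder files for one component inside a temporary directory. The helper name scaffold_component, the group and component names, and the file names program.py and test_program.py are hypothetical.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the layout above. "program.py" and
# "test_program.py" are hypothetical file names.
LAYOUT = [
    "src/program.py",
    "tests/test_program.py",
    "run_tests.sh",
    "README.md",
    "Dockerfile",
    "build_image.sh",
    "component.yaml",
]


def scaffold_component(root, group, name):
    """Create empty placeholder files for one component and return its directory."""
    component_dir = Path(root) / "components" / group / name
    for relative in LAYOUT:
        target = component_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return component_dir


# A temporary directory is used here so that the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_component(tmp, "text", "get_lines")
    created_files = sorted(
        p.relative_to(created).as_posix() for p in created.rglob("*") if p.is_file()
    )
```

To scaffold a real component, pass the root directory of your repository instead of a temporary directory.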

See this sample component for a real-life component example.

Next steps