A pipeline component is a self-contained set of code that performs one step in the ML workflow (pipeline), such as data preprocessing, data transformation, or model training. A component is analogous to a function, in that it has a name, parameters, return values, and a body.
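To make the function analogy concrete, here is a minimal Python sketch. The function `preprocess` and its parameters are hypothetical and are not part of Kubeflow Pipelines. They only illustrate the four parts a component shares with a function.

```python
import inspect


def preprocess(raw_text: str, lowercase: bool = True) -> str:
    """Body: strip whitespace and optionally lowercase the text."""
    cleaned = raw_text.strip()
    return cleaned.lower() if lowercase else cleaned


signature = inspect.signature(preprocess)
print(preprocess.__name__)          # the name
print(list(signature.parameters))   # the parameters
print(signature.return_annotation)  # the type of the return value
```

A component exposes the same information, but through its specification rather than through a language-level signature.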
The code for each component includes the following:
- Client code: The code that talks to endpoints to submit jobs. For example, code to talk to the Google Dataproc API to submit a Spark job.
- Runtime code: The code that does the actual job and usually runs in the cluster. For example, Spark code that transforms raw data into preprocessed data.
Note the naming convention for client code and runtime code—for a task named “mytask”:
- The mytask.py program contains the client code.
- The mytask directory contains all the runtime code.
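The split between client code and runtime code might look like the following sketch of a mytask.py client. Everything here is an assumption made for illustration: the runtime script name `transform.py`, the request fields, and the `submit_job` placeholder. A real client would call the job-submission API of its target service instead.

```python
import argparse
from pathlib import PurePosixPath

# Per the naming convention, the runtime code lives in the "mytask" directory.
RUNTIME_DIR = PurePosixPath("mytask")


def build_job_request(input_uri, output_uri):
    """Describe the job to run on the cluster (hypothetical field names)."""
    return {
        "main_file": str(RUNTIME_DIR / "transform.py"),  # hypothetical runtime script
        "args": ["--input", input_uri, "--output", output_uri],
    }


def submit_job(request):
    """Placeholder for a call to a real job-submission endpoint."""
    print(f"submitting {request['main_file']} with args {request['args']}")
    return "job-0001"  # made-up job ID


def main(argv=None):
    """Client entry point: parse arguments, build the request, submit it."""
    parser = argparse.ArgumentParser(description="Client code for mytask")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return submit_job(build_job_request(args.input, args.output))


# Example invocation with made-up locations.
job_id = main(["--input", "data/raw", "--output", "data/preprocessed"])
```

The client only describes and submits the work. The transformation itself stays in the runtime code and runs in the cluster.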
A component specification in YAML format describes the component for the Kubeflow Pipelines system. A component definition has the following parts:
- Metadata: name, description, etc.
- Interface: input/output specifications (name, type, description, default value, etc.).
- Implementation: A specification of how to run the component given a set of argument values for the component’s inputs. The implementation section also describes how to get the output values from the component once the component has finished running.
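The three parts can be sketched as a Python dictionary that mirrors the layout of the YAML file. The key names (`inputs`, `outputs`, `implementation`, `container`, `inputValue`, `outputPath`) reflect one reading of the component specification and should be checked against it. The component itself, the image name, and the `resolve_command` helper are hypothetical. The helper is a toy stand-in for what the system does, not its actual implementation.

```python
component = {
    # Metadata
    "name": "Preprocess data",
    "description": "Turns raw rows into preprocessed rows.",
    # Interface
    "inputs": [
        {"name": "raw_data", "type": "String", "description": "Location of the raw data"},
        {"name": "lowercase", "type": "String", "default": "true"},
    ],
    "outputs": [
        {"name": "preprocessed_data", "type": "String"},
    ],
    # Implementation: how to run the program and where its output lands
    "implementation": {
        "container": {
            "image": "example.com/mytask:latest",  # hypothetical image
            "command": ["python", "mytask.py"],
            "args": [
                "--input", {"inputValue": "raw_data"},
                "--lowercase", {"inputValue": "lowercase"},
                "--output", {"outputPath": "preprocessed_data"},
            ],
        }
    },
}


def resolve_command(spec, arguments, output_dir="/tmp/outputs"):
    """Fill placeholders with argument values, falling back to defaults."""
    defaults = {i["name"]: i["default"] for i in spec["inputs"] if "default" in i}
    values = {**defaults, **arguments}
    container = spec["implementation"]["container"]
    resolved = list(container["command"])
    for item in container["args"]:
        if isinstance(item, dict) and "inputValue" in item:
            resolved.append(str(values[item["inputValue"]]))
        elif isinstance(item, dict) and "outputPath" in item:
            # The system chooses where the output file goes; this path is made up.
            resolved.append(f"{output_dir}/{item['outputPath']}")
        else:
            resolved.append(item)
    return resolved


print(resolve_command(component, {"raw_data": "data/raw.csv"}))
```

Given only a value for `raw_data`, the sketch takes the default for `lowercase` and produces a complete command line, which is the role the implementation section plays for the real system.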
For the complete definition of a component, see the component specification.
You must package your component as a Docker image. Components represent a specific program or entry point inside a container.
Each component in a pipeline executes independently. The components do not run in the same process and cannot directly share in-memory data. You must serialize (to strings or files) all the data pieces that you pass between the components so that the data can travel over the distributed network. You must then deserialize the data for use in the downstream component.
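The serialize-then-deserialize handoff can be sketched with two plain Python functions that share nothing but a file. The function names, the JSON format, and the temporary directory are assumptions for illustration. In a real pipeline each step runs in its own container and the system moves the file between them.

```python
import json
import tempfile
from pathlib import Path


def upstream_step(output_path):
    """Produce an in-memory result, then serialize it to a file."""
    stats = {"rows": 3, "columns": ["id", "label"]}
    output_path.write_text(json.dumps(stats))


def downstream_step(input_path):
    """Deserialize the file written upstream and use its contents."""
    stats = json.loads(input_path.read_text())
    return stats["rows"] * len(stats["columns"])


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "stats.json"  # the only thing the two steps share
    upstream_step(handoff)
    cells = downstream_step(handoff)

print(cells)  # 3 rows * 2 columns
```

Because the downstream step sees only the file, any format works as long as both sides agree on it. JSON is used here only because it is in the standard library.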
- Read an overview of Kubeflow Pipelines.
- Follow the pipelines quickstart guide to deploy Kubeflow and run a sample pipeline directly from the Kubeflow Pipelines UI.
- Build your own component and pipeline.
- Build a reusable component for sharing in multiple pipelines.