Overview of Trial Templates

How to specify trial template parameters and support a custom resource (CRD) in Katib

This guide describes how to configure trial template parameters and use a custom Kubernetes CRD in Katib. You will learn how to change the trial template specification, how to use Kubernetes ConfigMaps to store templates, and how to modify the Katib controller to support your Kubernetes CRD in Katib experiments.

Katib has upstream CRD examples for Kubernetes Job, Kubeflow TFJob, Kubeflow PyTorchJob, Kubeflow MPIJob and Tekton Pipeline.

To use your own Kubernetes resource, follow the steps below.

For details on how to configure and run your experiment, see the running an experiment guide.

Use trial template to submit experiment

To run a Katib experiment, you have to specify a trial template for the worker job where the actual training runs. Learn more about Katib concepts in the overview guide.

Configure trial template specification

The trial template specification is located under .spec.trialTemplate of your experiment. For the API overview, refer to the TrialTemplate type.

To define the experiment’s trial, specify these parameters in .spec.trialTemplate:

  • trialParameters - list of the parameters that are used in the trial template during experiment execution. For example, your training container can receive hyperparameters as command-line arguments or as environment variables. Note: Your trial template must contain each parameter from trialParameters. You can set these parameters in any field of your template except .metadata.name and .metadata.namespace. See below for how to use trial metadata parameters in your template.

    Your experiment’s suggestion produces trialParameters before running the trial. Each trialParameter has this structure:

    • name - the parameter name that is replaced in your template.

    • description (optional) - the description of the parameter.

    • reference - the parameter name that the experiment’s suggestion returns. For hyperparameter tuning, parameter references usually match the parameter names in the experiment’s search space. For example, in the grid example the search space has three parameters (lr, num-layers and optimizer), and trialParameters contains each of them in reference.

  • Define your experiment’s trial template in one of two sources: trialSpec or configMap. Note: Your template must omit .metadata.name and .metadata.namespace.

    To set the parameters from the trialParameters, you need to use this expression: ${trialParameters.<parameter-name>} in your template. Katib automatically replaces it with the appropriate values from the experiment’s suggestion.

    For example, in --lr=${trialParameters.learningRate}, Katib replaces the expression with the value of the learningRate parameter.

    • trialSpec - the experiment’s trial template in unstructured format. The template must be valid YAML. Check the grid example.

    • configMap - the Kubernetes ConfigMap specification where the experiment’s trial template is located. This ConfigMap must have the label app: katib-trial-templates and contain key-value pairs, where key: <template-name>, value: <template-yaml>. Check the example of the ConfigMap with trial templates.

      The configMap specification should have:

      1. configMapName - the ConfigMap name with the trial templates.

      2. configMapNamespace - the ConfigMap namespace with the trial templates.

      3. templatePath - the ConfigMap’s data path to the template.

      Check the example with ConfigMap source for the trial template.
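
The relationship between name, reference and the ${trialParameters.<parameter-name>} expression can be sketched in Python. This is only an illustration of the substitution described above, not Katib's actual implementation: render_trial_template is a hypothetical helper, the assignment values are made up, and the template is treated as plain text.

```python
import re

# Matches ${trialParameters.<parameter-name>} placeholders.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([^}]+)\}")


def render_trial_template(template, trial_parameters, assignments):
    """Replace each placeholder with the value suggested for its reference.

    trial_parameters: list of dicts with "name" and "reference" keys.
    assignments: dict mapping a reference (search space name) to a value.
    """
    # Map the template-side name to the suggestion-side reference.
    name_to_reference = {p["name"]: p["reference"] for p in trial_parameters}

    # The template must use every parameter listed in trialParameters.
    used = set(PLACEHOLDER.findall(template))
    missing = set(name_to_reference) - used
    if missing:
        raise ValueError(f"template does not use parameters: {sorted(missing)}")

    def substitute(match):
        name = match.group(1)
        if name not in name_to_reference:
            raise ValueError(f"unknown trial parameter: {name}")
        return str(assignments[name_to_reference[name]])

    return PLACEHOLDER.sub(substitute, template)


trial_parameters = [
    {"name": "learningRate", "reference": "lr"},
    {"name": "numberLayers", "reference": "num-layers"},
]
template = (
    "--lr=${trialParameters.learningRate} "
    "--num-layers=${trialParameters.numberLayers}"
)
# Hypothetical values returned by the suggestion for one trial.
assignments = {"lr": "0.01", "num-layers": "3"}

rendered = render_trial_template(template, trial_parameters, assignments)
print(rendered)  # --lr=0.01 --num-layers=3
```

The sketch raises an error when a listed parameter is never used, which mirrors the rule that the template must contain each parameter from trialParameters.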

The .spec.trialTemplate parameters below control trial behavior. If a parameter has a default value, you can omit it from the experiment YAML.

  • retain - indicates that the trial’s resources are not cleaned up after the trial is complete. Check the example with retain: true parameter.

    The default value is false

  • primaryPodLabels - the labels of the trial worker’s Pod or Pods. Katib injects the metrics collector into these Pods. Note: If primaryPodLabels is omitted, the metrics collector wraps all of the worker’s Pods. Learn more about the Katib metrics collector in the running an experiment guide. Check the example with primaryPodLabels.

    The default value for Kubeflow TFJob and PyTorchJob is job-role: master

    The primaryPodLabels default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set primaryPodLabels.

  • primaryContainerName - the name of the training container where the actual model training runs. The Katib metrics collector wraps this container to collect the metrics required for a single experiment optimization step.

  • successCondition - the trial worker’s object status in which the trial’s job has succeeded. This condition must be in GJSON format. Check the example with successCondition.

    The default value for Kubernetes Job is status.conditions.#(type=="Complete")#|#(status=="True")#

    The default value for Kubeflow TFJob and PyTorchJob is status.conditions.#(type=="Succeeded")#|#(status=="True")#

    The successCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set successCondition.

  • failureCondition - the trial worker’s object status in which the trial’s job has failed. This condition must be in GJSON format. Check the example with failureCondition.

    The default value for Kubernetes Job is status.conditions.#(type=="Failed")#|#(status=="True")#

    The default value for Kubeflow TFJob and PyTorchJob is status.conditions.#(type=="Failed")#|#(status=="True")#

    The failureCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set failureCondition.
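
To make the default conditions concrete, the sketch below reproduces in plain Python what a query such as status.conditions.#(type=="Complete")#|#(status=="True")# selects. It is only an illustration: matching_conditions is a hypothetical helper, the sample Job status is made up, and the sketch assumes that a condition counts as met when the query selects at least one entry.

```python
def matching_conditions(resource, condition_type):
    """Return conditions with the given type whose status is "True".

    Mirrors the two chained filters of the default conditions:
    first #(type=="<type>")#, then #(status=="True")# on the result.
    """
    conditions = resource.get("status", {}).get("conditions", [])
    by_type = [c for c in conditions if c.get("type") == condition_type]
    return [c for c in by_type if c.get("status") == "True"]


# Made-up status of a finished Kubernetes Job.
job = {
    "status": {
        "conditions": [
            {"type": "Complete", "status": "True"},
        ]
    }
}

# Assumption: a non-empty selection means the condition is met.
succeeded = bool(matching_conditions(job, "Complete"))
failed = bool(matching_conditions(job, "Failed"))
print(succeeded, failed)  # True False
```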

Use trial metadata in template

You can’t specify .metadata.name and .metadata.namespace in your trial template, but you can get this data during the experiment run. This is useful, for example, if you want to append the trial’s name to your model storage.

To do this, point .trialParameters[x].reference to the appropriate metadata parameter and use .trialParameters[x].name in your trial template.

The table below shows the connection between .trialParameters[x].reference value and trial metadata.

Reference                               Trial metadata
--------------------------------------  ---------------------------------------------------
${trialSpec.Name}                       Trial name
${trialSpec.Namespace}                  Trial namespace
${trialSpec.Kind}                       Kubernetes resource kind for the trial's worker
${trialSpec.APIVersion}                 Kubernetes resource APIVersion for the trial's worker
${trialSpec.Labels[custom-key]}         Trial's worker label with custom-key key
${trialSpec.Annotations[custom-key]}    Trial's worker annotation with custom-key key

Check the example of using trial metadata.
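
As a rough illustration of the table above, the following Python sketch resolves a metadata reference against a trial description. The resolve_metadata_reference helper and the shape of the trial dictionary are hypothetical and chosen only for this example; Katib performs this lookup internally.

```python
import re

# Matches ${trialSpec.Labels[key]} and ${trialSpec.Annotations[key]}.
KEYED = re.compile(r"^\$\{trialSpec\.(Labels|Annotations)\[(.+)\]\}$")


def resolve_metadata_reference(reference, trial):
    """Return the trial metadata value for a reference from the table."""
    simple = {
        "${trialSpec.Name}": trial["name"],
        "${trialSpec.Namespace}": trial["namespace"],
        "${trialSpec.Kind}": trial["kind"],
        "${trialSpec.APIVersion}": trial["apiVersion"],
    }
    if reference in simple:
        return simple[reference]

    match = KEYED.match(reference)
    if match:
        field = "labels" if match.group(1) == "Labels" else "annotations"
        return trial[field][match.group(2)]

    raise ValueError(f"not a trial metadata reference: {reference}")


# Hypothetical trial metadata, for illustration only.
trial = {
    "name": "example-trial",
    "namespace": "kubeflow",
    "kind": "Job",
    "apiVersion": "batch/v1",
    "labels": {"custom-key": "custom-value"},
    "annotations": {},
}

print(resolve_metadata_reference("${trialSpec.Name}", trial))
print(resolve_metadata_reference("${trialSpec.Labels[custom-key]}", trial))
```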

Use custom Kubernetes resource as a trial template

By default, you can define your trial worker as a Kubernetes Job, Kubeflow TFJob, Kubeflow PyTorchJob, Kubeflow MPIJob or Tekton Pipeline.

Note: To use Tekton Pipeline, you need to modify the Tekton installation to change the nop image. See the Tekton integration guide to learn more.

It is possible to use your own Kubernetes CRD or another Kubernetes resource (e.g. Kubernetes Deployment) as a trial worker without modifying the Katib controller source code and building a new image. As long as your CRD creates Kubernetes Pods, allows a sidecar container to be injected into these Pods, and has succeeded and failed statuses, you can use it in Katib.

To do that, you need to modify Katib components before installing Katib on your Kubernetes cluster. You therefore need to know your CRD’s API group and version and the CRD object’s kind. You also need to know which resources your custom object creates. Check the Kubernetes guide to learn more about CRDs.

Follow these two simple steps to integrate your custom CRD in Katib:

  1. Add a new rule to the Katib controller ClusterRole’s rules to give Katib access to all resources that are created by the trial. To learn more about ClusterRole, check the Kubernetes guide.

    In the case of Tekton Pipeline, the trial creates a Tekton PipelineRun, which in turn creates a Tekton TaskRun. Therefore, the Katib controller ClusterRole needs access to pipelineruns and taskruns:

    - apiGroups:
        - tekton.dev
      resources:
        - pipelineruns
        - taskruns
      verbs:
        - "*"
    
  2. Add the new --trial-resources=<object-kind>.<object-API-version>.<object-API-group> flag to the Katib controller Deployment’s args.

    For example, to support Tekton Pipeline:

    - "--trial-resources=PipelineRun.v1beta1.tekton.dev"
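
The flag value follows the <object-kind>.<object-API-version>.<object-API-group> layout, where the API group may itself contain dots. The Python sketch below builds and splits such a value; both helper names are hypothetical and exist only to illustrate the layout.

```python
def trial_resources_flag(kind, version, group):
    """Build the flag as --trial-resources=<kind>.<version>.<group>."""
    for part_name, part in (("kind", kind), ("version", version), ("group", group)):
        if not part:
            raise ValueError(f"{part_name} must not be empty")
    return f"--trial-resources={kind}.{version}.{group}"


def split_trial_resource(value):
    """Split <kind>.<version>.<group> into its three parts.

    Only the first two dots are separators, because the API group
    (for example tekton.dev) can contain dots itself.
    """
    kind, version, group = value.split(".", 2)
    return kind, version, group


flag = trial_resources_flag("PipelineRun", "v1beta1", "tekton.dev")
print(flag)  # --trial-resources=PipelineRun.v1beta1.tekton.dev
print(split_trial_resource("PipelineRun.v1beta1.tekton.dev"))
```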
    

After these changes, deploy Katib as described in the getting started guide and wait until the katib-controller Pod is created. You can check the Katib controller logs to verify your resource integration:

kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow

Expected output for the Tekton Pipeline:

{"level":"info","ts":1604325430.9762623,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: tekton.dev/v1beta1, Kind=PipelineRun"}
{"level":"info","ts":1604325430.9763885,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
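
If you want to automate this check, one option is to scan the controller logs for the "Job watch added successfully" entry. The Python sketch below does this for the two log lines shown above; watch_added is a hypothetical helper, and it assumes the controller keeps emitting one JSON object per line with the same field names.

```python
import json


def watch_added(log_text, group, version, kind):
    """Return True if the logs report a successful watch for the resource."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not JSON.
        if not isinstance(entry, dict):
            continue
        if (
            entry.get("msg") == "Job watch added successfully"
            and entry.get("CRD Group") == group
            and entry.get("CRD Version") == version
            and entry.get("CRD Kind") == kind
        ):
            return True
    return False


# The expected Tekton Pipeline output from this guide.
sample_logs = '''
{"level":"info","ts":1604325430.9762623,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: tekton.dev/v1beta1, Kind=PipelineRun"}
{"level":"info","ts":1604325430.9763885,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
'''

print(watch_added(sample_logs, "tekton.dev", "v1beta1", "PipelineRun"))  # True
```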

If you ran the above steps successfully, you should be able to use your custom object YAML in the experiment’s trial template source spec.

We appreciate your feedback on using various CRDs in Katib. It would be great if you let us know about your experiments. The developer guide is a good starting point for learning how to contribute to the project.

Next steps