Katib Configuration Overview

How to make changes in Katib configuration

This guide describes the Katib config, the Kubernetes Config Map that contains information about:

  1. Current metrics collectors (key = metrics-collector-sidecar).

  2. Current algorithms (suggestions) (key = suggestion).

  3. Current early stopping algorithms (key = early-stopping).

The Katib Config Map must be deployed in the KATIB_CORE_NAMESPACE namespace with the katib-config name. The Katib controller parses the Katib config when you submit your experiment.

You can edit this Config Map even after deploying Katib.

If you are deploying Katib in the Kubeflow namespace, run this command to edit your Katib config:

kubectl edit configMap katib-config -n kubeflow
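
After editing, it can help to confirm that each key still holds valid JSON. The following is a minimal sketch, assuming the Config Map's data field has already been fetched (for example with kubectl get configMap katib-config -n kubeflow -o json). The parse_katib_config helper and the sample data are hypothetical and are not part of Katib.

```python
import json

# The three keys described in this guide.
KATIB_CONFIG_KEYS = ("metrics-collector-sidecar", "suggestion", "early-stopping")


def parse_katib_config(data):
    """Parse the JSON string stored under each known key of the Config Map data.

    `data` is the Config Map's `data` mapping (key -> JSON string).
    Absent keys are skipped; invalid JSON raises ValueError.
    """
    parsed = {}
    for key in KATIB_CONFIG_KEYS:
        if key not in data:
            continue
        try:
            parsed[key] = json.loads(data[key])
        except json.JSONDecodeError as err:
            raise ValueError(f"key {key!r} does not contain valid JSON: {err}") from err
    return parsed


# Hypothetical sample data, trimmed to one entry per key.
sample_data = {
    "metrics-collector-sidecar": '{"File": {"image": "docker.io/kubeflowkatib/file-metrics-collector"}}',
    "suggestion": '{"random": {"image": "docker.io/kubeflowkatib/suggestion-hyperopt"}}',
    "early-stopping": '{"medianstop": {"image": "docker.io/kubeflowkatib/earlystopping-medianstop"}}',
}

config = parse_katib_config(sample_data)
for key, entries in config.items():
    # Print each key with the names of the entries configured under it.
    print(key, "->", sorted(entries))
```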

Metrics Collector Sidecar settings

These settings are related to Katib metrics collectors, where:

  • key: metrics-collector-sidecar
  • value: corresponding JSON settings for each metrics collector kind

Example for the File metrics collector with all settings:

metrics-collector-sidecar: |-
{
  "File": {
    "image": "docker.io/kubeflowkatib/file-metrics-collector",
    "imagePullPolicy": "Always",
    "resources": {
      "requests": {
        "memory": "200Mi",
        "cpu": "250m",
        "ephemeral-storage": "200Mi"
      },
      "limits": {
        "memory": "1Gi",
        "cpu": "500m",
        "ephemeral-storage": "2Gi"
      }
    },
    "waitAllProcesses": false
  },
  ...
}

All of these settings except image can be omitted. Any setting that you omit is set to its default value automatically.

  1. image - a Docker image for the File metrics collector’s container (must be specified).

  2. imagePullPolicy - an image pull policy for the File metrics collector’s container.

    The default value is IfNotPresent

  3. resources - resources for the File metrics collector’s container. The example above shows how to specify limits and requests. Currently, you can specify only memory, cpu, and ephemeral-storage resources.

    The default values for the requests are:

    • memory = 10Mi
    • cpu = 50m
    • ephemeral-storage = 500Mi

    The default values for the limits are:

    • memory = 100Mi
    • cpu = 500m
    • ephemeral-storage = 5Gi

    You can run your metrics collector’s container without requesting the cpu, memory, or ephemeral-storage resource from the Kubernetes cluster. For instance, you have to remove ephemeral-storage from the container resources to use the Google Kubernetes Engine cluster autoscaler.

    To remove specific resources from the metrics collector’s container, set negative values for them in requests and limits in your Katib config as follows:

    "requests": {
      "cpu": "-1",
      "memory": "-1",
      "ephemeral-storage": "-1"
    },
    "limits": {
      "cpu": "-1",
      "memory": "-1",
      "ephemeral-storage": "-1"
    }
    
  4. waitAllProcesses - a flag that defines whether the metrics collector should wait until all processes in the training container have finished before it starts collecting metrics.

    The default value is true
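
The defaulting rules and the negative-value convention described above can be sketched as follows. This is an illustration of the documented behavior only, not the Katib controller's code: the resolve_collector_settings helper is hypothetical, and merging defaults per individual resource is an assumption.

```python
import copy

# Default values listed in this guide for the metrics collector's container.
DEFAULT_REQUESTS = {"memory": "10Mi", "cpu": "50m", "ephemeral-storage": "500Mi"}
DEFAULT_LIMITS = {"memory": "100Mi", "cpu": "500m", "ephemeral-storage": "5Gi"}


def resolve_collector_settings(settings):
    """Return the effective settings for one metrics collector kind."""
    if not settings.get("image"):
        raise ValueError("image must be specified")

    resolved = copy.deepcopy(settings)  # leave the caller's dict untouched
    resolved.setdefault("imagePullPolicy", "IfNotPresent")
    resolved.setdefault("waitAllProcesses", True)

    resources = resolved.setdefault("resources", {})
    for section, defaults in (("requests", DEFAULT_REQUESTS), ("limits", DEFAULT_LIMITS)):
        # Assumption: user values override defaults one resource at a time.
        merged = {**defaults, **resources.get(section, {})}
        # A negative value means "do not set this resource on the container".
        resources[section] = {
            name: value for name, value in merged.items() if not value.startswith("-")
        }
    return resolved


effective = resolve_collector_settings({
    "image": "docker.io/kubeflowkatib/file-metrics-collector",
    "resources": {
        "requests": {"ephemeral-storage": "-1"},
        "limits": {"ephemeral-storage": "-1", "memory": "1Gi"},
    },
})
print(effective["resources"])
```

In this sketch, ephemeral-storage disappears from both requests and limits, while memory and cpu fall back to the defaults unless overridden.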

Suggestion settings

These settings are related to Katib suggestions, where:

  • key: suggestion
  • value: corresponding JSON settings for each algorithm name

If you want to use a new algorithm, you need to update the Katib config. For example, using a random algorithm with all settings looks as follows:

suggestion: |-
{
  "random": {
    "image": "docker.io/kubeflowkatib/suggestion-hyperopt",
    "imagePullPolicy": "Always",
    "resources": {
      "requests": {
        "memory": "100Mi",
        "cpu": "100m",
        "ephemeral-storage": "100Mi"
      },
      "limits": {
        "memory": "500Mi",
        "cpu": "500m",
        "ephemeral-storage": "3Gi"
      }
    },
    "serviceAccountName": "random-sa"
  },
  ...
}

All of these settings except image can be omitted. Any setting that you omit is set to its default value automatically.

  1. image - a Docker image for the suggestion’s container with a random algorithm (must be specified).

    Image example: docker.io/kubeflowkatib/<suggestion-name>

    For each algorithm (suggestion) you can specify one of the following suggestion names in the Docker image:

    Suggestion name      | List of supported algorithms                             | Description
    ---------------------|----------------------------------------------------------|---------------------------------------
    suggestion-hyperopt  | random, tpe                                              | Hyperopt optimization framework
    suggestion-chocolate | grid, random, quasirandom, bayesianoptimization, mocmaes | Chocolate optimization framework
    suggestion-skopt     | bayesianoptimization                                     | Scikit-optimize optimization framework
    suggestion-goptuna   | cmaes, random, tpe                                       | Goptuna optimization framework
    suggestion-hyperband | hyperband                                                | Katib Hyperband implementation
    suggestion-enas      | enas                                                     | Katib ENAS implementation
    suggestion-darts     | darts                                                    | Katib DARTS implementation

  2. imagePullPolicy - an image pull policy for the suggestion’s container with a random algorithm.

    The default value is IfNotPresent

  3. resources - resources for the suggestion’s container with a random algorithm. The example above shows how to specify limits and requests. Currently, you can specify only memory, cpu, and ephemeral-storage resources.

    The default values for the requests are:

    • memory = 10Mi
    • cpu = 50m
    • ephemeral-storage = 500Mi

    The default values for the limits are:

    • memory = 100Mi
    • cpu = 500m
    • ephemeral-storage = 5Gi

    You can run your suggestion’s container without requesting the cpu, memory, or ephemeral-storage resource from the Kubernetes cluster. For instance, you have to remove ephemeral-storage from the container resources to use the Google Kubernetes Engine cluster autoscaler.

    To remove specific resources from the suggestion’s container, set negative values for them in requests and limits in your Katib config as follows:

    "requests": {
      "cpu": "-1",
      "memory": "-1",
      "ephemeral-storage": "-1"
    },
    "limits": {
      "cpu": "-1",
      "memory": "-1",
      "ephemeral-storage": "-1"
    }
    
  4. serviceAccountName - a service account for the suggestion’s container with a random algorithm.

    In the above example, the random-sa service account is attached to each experiment’s suggestion with a random algorithm until you change or delete this service account in the Katib config.

    By default, the suggestion pod doesn’t have any specific service account, in which case, the pod uses the default service account.

    Note: If you want to run your experiments with early stopping, the suggestion’s deployment must have permission to update the experiment’s trial status. If you don’t specify a service account in the Katib config, the Katib controller creates the required Kubernetes role-based access control (RBAC) resources for the suggestion.

    If you need your own service account for the experiment’s suggestion with early stopping, follow these rules:

    • The service account name can’t be equal to <experiment-name>-<experiment-algorithm>

    • The service account must have sufficient permissions to update the experiment’s trial status.
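
Two of the points above lend themselves to a quick check before you edit the Katib config: which suggestion images support a given algorithm, and whether a custom service account name collides with the reserved pattern. The following is a minimal sketch. The algorithm sets are copied from the table in this section, while the images_for and check_service_account_name helpers are hypothetical and are not part of Katib.

```python
# Suggestion names and the algorithms each supports, copied from the table above.
SUGGESTION_IMAGES = {
    "suggestion-hyperopt": {"random", "tpe"},
    "suggestion-chocolate": {"grid", "random", "quasirandom", "bayesianoptimization", "mocmaes"},
    "suggestion-skopt": {"bayesianoptimization"},
    "suggestion-goptuna": {"cmaes", "random", "tpe"},
    "suggestion-hyperband": {"hyperband"},
    "suggestion-enas": {"enas"},
    "suggestion-darts": {"darts"},
}


def images_for(algorithm):
    """Return the sorted suggestion names whose image supports `algorithm`."""
    return sorted(name for name, algos in SUGGESTION_IMAGES.items() if algorithm in algos)


def check_service_account_name(service_account, experiment_name, algorithm):
    """Reject a name equal to <experiment-name>-<experiment-algorithm>."""
    reserved = f"{experiment_name}-{algorithm}"
    if service_account == reserved:
        raise ValueError(f"service account name must differ from {reserved!r}")
    return service_account


print(images_for("random"))
print(check_service_account_name("random-sa", "my-experiment", "random"))
```

The name check covers only the first rule. Whether the service account has sufficient permissions still has to be verified in the cluster.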

Suggestion volume settings

When you create an experiment with the FromVolume resume policy, you can specify PersistentVolume (PV) and PersistentVolumeClaim (PVC) settings for the experiment’s suggestion. Learn more about Katib concepts in the overview guide.

If the PV settings are empty, the Katib controller creates only the PVC. If you want to use the default volume specification, you can omit these settings.

The following example shows these settings for the random algorithm:

suggestion: |-
{
  "random": {
    "image": "docker.io/kubeflowkatib/suggestion-hyperopt",
    "volumeMountPath": "/opt/suggestion/data",
    "persistentVolumeClaimSpec": {
      "accessModes": [
        "ReadWriteMany"
      ],
      "resources": {
        "requests": {
          "storage": "3Gi"
        }
      },
      "storageClassName": "katib-suggestion"
    },
    "persistentVolumeSpec": {
      "accessModes": [
        "ReadWriteMany"
      ],
      "capacity": {
        "storage": "3Gi"
      },
      "hostPath": {
        "path": "/tmp/suggestion/unique/path"
      },
      "storageClassName": "katib-suggestion"
    },
    "persistentVolumeLabels": {
      "type": "local"
    }
  },
  ...
}

  1. volumeMountPath - a mount path for the suggestion’s container with a random algorithm.

    The default value is /opt/katib/data

  2. persistentVolumeClaimSpec - a PVC specification for the suggestion’s PVC.

    Default values are applied to the following settings if you don’t specify them:

    • persistentVolumeClaimSpec.accessModes[0] - the default value is ReadWriteOnce

    • persistentVolumeClaimSpec.resources.requests.storage - the default value is 1Gi

  3. persistentVolumeSpec - a PV specification for the suggestion’s PV.

    The PV persistentVolumeReclaimPolicy is always set to Delete so that all resources are properly removed once the Katib experiment is deleted. To learn more about PV reclaim policies, see the Kubernetes documentation.

  4. persistentVolumeLabels - PV labels for the suggestion’s PV.
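
The volume defaults listed above can be sketched as follows. This illustrates the documented behavior only, not the Katib controller's code, and the resolve_volume_settings helper is hypothetical.

```python
import copy


def resolve_volume_settings(settings):
    """Return suggestion volume settings with the documented defaults filled in."""
    resolved = copy.deepcopy(settings)  # leave the caller's dict untouched
    resolved.setdefault("volumeMountPath", "/opt/katib/data")

    # PVC defaults: ReadWriteOnce access mode and 1Gi of storage.
    pvc = resolved.setdefault("persistentVolumeClaimSpec", {})
    if not pvc.get("accessModes"):
        pvc["accessModes"] = ["ReadWriteOnce"]
    requests = pvc.setdefault("resources", {}).setdefault("requests", {})
    requests.setdefault("storage", "1Gi")

    # With empty PV settings only the PVC is created, so no PV spec is added.
    pv = resolved.get("persistentVolumeSpec")
    if pv:
        # The reclaim policy is always Delete, whatever the input says.
        pv["persistentVolumeReclaimPolicy"] = "Delete"
    return resolved


print(resolve_volume_settings({"image": "docker.io/kubeflowkatib/suggestion-hyperopt"}))
```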

Early stopping settings

These settings are related to Katib early stopping, where:

  • key: early-stopping
  • value: corresponding JSON settings for each early stopping algorithm name

If you want to use a new early stopping algorithm, you need to update the Katib config. For example, using a medianstop early stopping algorithm with all settings looks as follows:

early-stopping: |-
{
  "medianstop": {
    "image": "docker.io/kubeflowkatib/earlystopping-medianstop",
    "imagePullPolicy": "Always"
  },
  ...
}

All of these settings except image can be omitted. Any setting that you omit is set to its default value automatically.

  1. image - a Docker image for the early stopping’s container with a medianstop algorithm (must be specified).

    Image example: docker.io/kubeflowkatib/<early-stopping-name>

    For each early stopping algorithm you can specify one of the following early stopping names in the Docker image:

    Early stopping name      | Early stopping algorithm | Description
    -------------------------|--------------------------|-------------------------------------
    earlystopping-medianstop | medianstop               | Katib Median Stopping implementation

  2. imagePullPolicy - an image pull policy for the early stopping’s container with a medianstop algorithm.

    The default value is IfNotPresent
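
Because each key in the Katib config stores its settings as a JSON string, it can be convenient to generate that string rather than write it by hand. The following is a minimal sketch that renders one entry as a YAML block scalar with the JSON indented beneath the key. The render_entry helper is hypothetical, and the two-space indent is an assumption about where the entry sits in your manifest.

```python
import json
import textwrap


def render_entry(key, settings, indent=2):
    """Render one Config Map data entry as a YAML block scalar holding JSON."""
    for name, entry in settings.items():
        # image is the only setting that must be specified.
        if not entry.get("image"):
            raise ValueError(f"{name}: image must be specified")
    body = json.dumps(settings, indent=2)
    return f"{key}: |-\n" + textwrap.indent(body, " " * indent)


rendered = render_entry("early-stopping", {
    "medianstop": {
        "image": "docker.io/kubeflowkatib/earlystopping-medianstop",
        "imagePullPolicy": "Always",
    }
})
print(rendered)
```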

Next steps