Resuming an Experiment
This guide describes how to modify running experiments and restart completed experiments. You will learn how to change the experiment execution process and how to use the various resume policies for a Katib experiment.
For the details on how to configure and run your experiment, follow the running an experiment guide.
Modify running experiment
While the experiment is running, you can change its trial count parameters. For example, you can decrease the maximum number of hyperparameter sets that are trained in parallel.
You can change only the parallelTrialCount, maxTrialCount and maxFailedTrialCount experiment parameters.
Use the Kubernetes API or a kubectl in-place update of resources to make experiment changes. For example, run:
kubectl edit experiment <experiment-name> -n <experiment-namespace>
Make the appropriate changes and save them. The controller automatically processes the new parameters and makes the necessary changes.
If you want to increase or decrease parallel trial execution, modify parallelTrialCount. The controller creates or deletes trials in line with the new parallelTrialCount value.
If you want to increase or decrease the maximum trial count, modify maxTrialCount. maxTrialCount should be greater than the current count of Succeeded trials. You can remove the maxTrialCount parameter if your experiment should run endlessly with parallelTrialCount parallel trials until the experiment reaches its goal or maxFailedTrialCount.
If you want to increase or decrease the maximum failed trial count, modify maxFailedTrialCount. You can remove the maxFailedTrialCount parameter if the experiment should not reach Failed status.
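As a non-interactive alternative to kubectl edit, the same three parameters can be changed with kubectl patch. The following is a minimal sketch rather than part of the Katib tooling: the helper name set_trial_count is hypothetical, and it assumes a JSON merge patch (--type merge), which is the patch type kubectl accepts for custom resources.

```shell
# Hypothetical helper (not part of Katib): set one trial count field
# on an experiment with a JSON merge patch.
# Usage: set_trial_count <experiment-name> <experiment-namespace> <field> <value>
#   <field> is parallelTrialCount, maxTrialCount or maxFailedTrialCount.
#   <value> is an integer, or null to remove the field.
set_trial_count() {
  kubectl patch experiment "$1" -n "$2" --type merge \
    -p "{\"spec\":{\"$3\":$4}}"
}
```

For example, `set_trial_count <experiment-name> <experiment-namespace> parallelTrialCount 2` limits the experiment to two parallel trials. Passing null as the value relies on merge patch semantics, where a null value removes the field.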
Resume succeeded experiment
A Katib experiment is restartable only if it is in Succeeded status because maxTrialCount has been reached. To check the current experiment status, run:
kubectl get experiment <experiment-name> -n <experiment-namespace>
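When the status needs to be read from a script, the conditions recorded on the experiment can be printed with a JSONPath query. This is a sketch under the assumption that the experiment exposes its conditions under .status.conditions with type, status and reason fields. The helper name experiment_conditions is hypothetical.

```shell
# Hypothetical helper: print the type, status and reason of every
# condition recorded in the experiment status, one condition per line.
# Usage: experiment_conditions <experiment-name> <experiment-namespace>
experiment_conditions() {
  kubectl get experiment "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}
```

A Succeeded condition together with its reason shows whether the experiment finished because maxTrialCount was reached. The exact reason strings are not listed in this guide, so check them against your Katib version.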
To restart an experiment, you can change only parallelTrialCount, maxTrialCount and maxFailedTrialCount, as described above.
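Since a restartable experiment stopped because maxTrialCount was reached, restarting it amounts to raising that value. The sketch below reads the current .spec.maxTrialCount and adds a number of extra trials to it. The helper name restart_experiment is hypothetical, and the sketch assumes the experiment's resume policy still allows a restart.

```shell
# Hypothetical helper: restart a succeeded experiment by raising
# maxTrialCount by a given number of additional trials.
# Usage: restart_experiment <experiment-name> <experiment-namespace> <extra-trials>
restart_experiment() {
  # Read the current limit from the experiment spec.
  current_max=$(kubectl get experiment "$1" -n "$2" \
    -o jsonpath='{.spec.maxTrialCount}')
  # Treat a missing value as 0 so the arithmetic below stays valid.
  new_max=$(( ${current_max:-0} + $3 ))
  kubectl patch experiment "$1" -n "$2" --type merge \
    -p "{\"spec\":{\"maxTrialCount\":$new_max}}"
}
```

Adding to the current limit keeps the new maxTrialCount above the number of Succeeded trials, which is the condition stated earlier in this guide.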
To control the resume behavior, you can specify .spec.resumePolicy for the experiment.
Refer to the ResumePolicyType definition in the Katib API for the supported values.
Resume policy: Never
Use this policy if your experiment should not be resumed at any time. After the experiment has finished, the suggestion’s Deployment and Service are deleted and you can’t restart the experiment. Learn more about Katib concepts in the overview guide.
Check the Never resume policy example for more details.
Resume policy: LongRunning
Use this policy if you intend to restart the experiment. After the experiment has finished, the suggestion’s Deployment and Service stay running. Modify the experiment’s trial count parameters to restart the experiment.
When you delete the experiment, the suggestion’s Deployment and Service are deleted.
This is the default policy for all Katib experiments.
You can omit the .spec.resumePolicy parameter to get this behavior.
Resume policy: FromVolume
Use this policy if you intend to restart the experiment. In that case, a volume is attached to the suggestion’s Deployment.
The Katib controller creates a PersistentVolumeClaim (PVC) in addition to the suggestion’s Deployment and Service.
Note: Your Kubernetes cluster must have dynamic volume provisioning to automatically provision storage for the created PVC. Otherwise, you have to define the suggestion’s PersistentVolume (PV) specification in the Katib configuration settings, and the Katib controller will create both the PVC and the PV. Follow the Katib configuration guide to set up the suggestion’s volume settings.
The PVC is deployed with the name <suggestion-name>-<suggestion-algorithm> in the suggestion namespace.
PV is deployed with the name:
After the experiment has finished, the suggestion’s Deployment and Service are deleted. Suggestion data can be retained in the volume. When you restart the experiment, the suggestion’s Deployment and Service are created and suggestion statistics can be recovered from the volume.
When you delete the experiment, the suggestion’s Deployment, Service, PVC and PV are deleted automatically.
Check the FromVolume resume policy example for more details.
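To confirm that the volume exists while an experiment with the FromVolume policy is running or finished, the PVC can be looked up by the naming scheme described above. This is a minimal sketch. Both helper names are hypothetical, and the suggestion name and algorithm must be supplied by you.

```shell
# Build the PVC name from the naming scheme in this guide:
# <suggestion-name>-<suggestion-algorithm>
# Usage: suggestion_pvc_name <suggestion-name> <suggestion-algorithm>
suggestion_pvc_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Hypothetical helper: show the suggestion PVC in the suggestion namespace.
# Usage: show_suggestion_pvc <suggestion-name> <suggestion-algorithm> <suggestion-namespace>
show_suggestion_pvc() {
  kubectl get pvc "$(suggestion_pvc_name "$1" "$2")" -n "$3"
}
```

After the experiment is deleted, the same lookup is expected to report that the PVC is not found, because the PVC is removed together with the suggestion’s other resources.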
Next steps
Learn how to configure and run your Katib experiments.
Check the Katib Configuration (Katib config).
Learn how to set up environment variables for each Katib component.