How to Resume Experiments
This guide describes how to modify running Experiments and restart completed Experiments. You will learn how to change the Experiment execution process and how to use the various resume policies for a Katib Experiment.
Modify Running Experiment
While the Experiment is running you are able to change Trial count parameters. For example, you can decrease the maximum number of hyperparameter sets that are trained in parallel.
You can change only parallelTrialCount, maxTrialCount and maxFailedTrialCount Experiment parameters.
Use the Kubernetes API or kubectl in-place update of resources to make Experiment changes. For example, run:
kubectl edit experiment <experiment-name> -n <experiment-namespace>
Make the appropriate changes and save them. The controller automatically processes the new parameters and makes the necessary changes.
- If you want to increase or decrease parallel Trial execution, modify parallelTrialCount. The controller creates or deletes Trials accordingly, in line with the parallelTrialCount value.
- If you want to increase or decrease the maximum Trial count, modify maxTrialCount. maxTrialCount should be greater than the current count of Succeeded Trials. You can remove the maxTrialCount parameter if your Experiment should run endlessly with parallelTrialCount parallel Trials until the Experiment reaches Goal or maxFailedTrialCount.
- If you want to increase or decrease the maximum failed Trial count, modify maxFailedTrialCount. You can remove the maxFailedTrialCount parameter if the Experiment should not reach Failed status.
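The rules for the three editable parameters can be sketched as a small helper that builds a JSON merge patch body. This is only an illustration and not part of Katib: the function name build_trial_count_patch and the succeeded_trials argument are hypothetical, and the caller is assumed to look up the current number of Succeeded Trials. A value of None is serialized as JSON null, which a JSON merge patch treats as a request to remove the field.

```python
import json

# The only Experiment spec fields the guide says may be changed.
EDITABLE_FIELDS = {"parallelTrialCount", "maxTrialCount", "maxFailedTrialCount"}


def build_trial_count_patch(changes, succeeded_trials=0):
    """Build a JSON merge patch string for an Experiment's Trial counts.

    changes: field name -> new integer value, or None to remove the field.
    succeeded_trials: current number of Succeeded Trials (supplied by the caller).
    """
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"not editable: {sorted(unknown)}")

    new_max = changes.get("maxTrialCount")
    if new_max is not None and new_max <= succeeded_trials:
        raise ValueError(
            "maxTrialCount should be greater than the count of Succeeded Trials"
        )

    # None becomes JSON null, which a merge patch treats as "remove this field".
    return json.dumps({"spec": dict(changes)}, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values: run 2 Trials in parallel and lift the Trial cap.
    print(build_trial_count_patch({"parallelTrialCount": 2, "maxTrialCount": None}))
```

Assuming the cluster accepts merge patches for the Experiment resource, the resulting string could be passed to kubectl patch experiment <experiment-name> -n <experiment-namespace> --type merge -p '<patch>' as an alternative to kubectl edit.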
Resume Succeeded Experiment
A Katib Experiment is restartable only if it is in Succeeded status because maxTrialCount has been reached. To check the current Experiment status, run:
kubectl get experiment <experiment-name> -n <experiment-namespace>
To restart an Experiment, you can change only parallelTrialCount, maxTrialCount and maxFailedTrialCount, as described above.
To control various resume policies, you can specify .spec.resumePolicy for the Experiment.
Resume policy: Never
Use this policy if your Experiment should not be resumed at any time. After the Experiment has finished, the Suggestion’s Deployment and Service are deleted and you can’t restart the Experiment.
This is the default policy for all Katib Experiments. You can omit the .spec.resumePolicy parameter to get this behavior.
Resume policy: LongRunning
Use this policy if you intend to restart the Experiment. After the Experiment has finished, the Suggestion’s Deployment and Service stay running until you delete your Experiment. Modify the Experiment’s Trial count parameters to restart the Experiment.
Check the long-running-resume.yaml example for more details.
Resume policy: FromVolume
Use this policy if you intend to restart the Experiment. In that case, a volume is attached to the Suggestion’s Deployment.
The Katib controller creates a PersistentVolumeClaim (PVC) in addition to the Suggestion’s Deployment and Service.
- The PVC is deployed with the name <suggestion-name>-<suggestion-algorithm> in the Suggestion namespace.
- The PV is deployed with the name <suggestion-name>-<suggestion-algorithm>-<suggestion-namespace>.
After the Experiment has finished, the Suggestion’s Deployment and Service are deleted. Suggestion data can be retained in the volume. When you restart the Experiment, the Suggestion’s Deployment and Service are created and Suggestion statistics can be recovered from the volume.
When you delete the Experiment, the Suggestion’s Deployment, Service, PVC and PV are deleted automatically.
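The PVC and PV naming scheme described above can be illustrated with a short sketch that derives both names from their parts. The function name and the sample values (a Suggestion called my-experiment that uses the random algorithm in the kubeflow namespace) are hypothetical and chosen only for illustration.

```python
def suggestion_volume_names(suggestion_name, algorithm, namespace):
    """Return the PVC and PV names this guide describes for FromVolume."""
    pvc_name = f"{suggestion_name}-{algorithm}"
    return {
        "pvc": pvc_name,                  # created in the Suggestion namespace
        "pvc_namespace": namespace,
        "pv": f"{pvc_name}-{namespace}",  # namespace is appended for the PV
    }


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real cluster.
    names = suggestion_volume_names("my-experiment", "random", "kubeflow")
    print(names["pvc"])  # my-experiment-random
    print(names["pv"])   # my-experiment-random-kubeflow
```

With the names in hand, the resources can be inspected with kubectl get pvc <pvc-name> -n <suggestion-namespace> and kubectl get pv <pv-name>.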
Check the from-volume-resume.yaml example for more details.
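The three resume policies can be summarized in a small lookup table. The sketch below only restates the behavior described in this guide; the dictionary layout and the set_resume_policy helper are hypothetical, and the manifest is a plain dictionary standing in for a parsed Experiment YAML. Because the guide lists only the Trial count parameters as editable, the sketch assumes the policy is chosen in the manifest before the Experiment is created.

```python
# What happens to the Suggestion's resources after the Experiment finishes,
# as described in this guide.
RESUME_POLICIES = {
    "Never": {"deployment_and_service_kept": False, "volume": False, "restartable": False},
    "LongRunning": {"deployment_and_service_kept": True, "volume": False, "restartable": True},
    "FromVolume": {"deployment_and_service_kept": False, "volume": True, "restartable": True},
}


def set_resume_policy(manifest, policy):
    """Return a copy of the manifest with .spec.resumePolicy set to `policy`."""
    if policy not in RESUME_POLICIES:
        raise ValueError(f"unknown resume policy: {policy!r}")
    updated = dict(manifest)
    updated["spec"] = {**manifest.get("spec", {}), "resumePolicy": policy}
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed manifest; a real Experiment needs more fields.
    experiment = {"spec": {"parallelTrialCount": 3, "maxTrialCount": 12}}
    experiment = set_resume_policy(experiment, "FromVolume")
    print(experiment["spec"]["resumePolicy"])            # FromVolume
    print(RESUME_POLICIES["FromVolume"]["restartable"])  # True
```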
Next steps
- Learn how to configure and run your Katib Experiments.