# Upgrade Sandboxes
This document describes how to upgrade sandboxes managed by OpenKruise Agents, including both pre-warmed pool sandboxes (managed by SandboxSet) and claimed sandboxes (already allocated to users).
All features described in this document are available since v0.3.0.
## Overview
OpenKruise Agents supports two upgrade scenarios:
| Scenario | Target Resource | Upgrade Method | Description |
|---|---|---|---|
| Pre-warmed pool | SandboxSet | Rolling update | Modify spec.template in SandboxSet to trigger a rolling upgrade of idle sandboxes |
| Claimed sandboxes | SandboxUpdateOps | Batch recreate | Create a SandboxUpdateOps resource to batch upgrade sandboxes that are already claimed and running |
## Prerequisites
- OpenKruise Agents controller is installed and running.
- CRDs (`Sandbox`, `SandboxSet`, `SandboxUpdateOps`) are registered in the cluster.
- For claimed sandbox upgrades with lifecycle hooks, the `agent-runtime` sidecar must be enabled (for executing pre/post upgrade scripts) and `csi` must be enabled (if persistent storage mounts are needed).
## Upgrade Pre-warmed Pool Sandboxes (SandboxSet)
### How It Works
When you modify the `spec.template` field of a SandboxSet, the controller detects the template change and performs a rolling update of the sandboxes in the pool. The controller:

- Computes a new `updateRevision` hash from the updated template.
- Deletes old-revision sandboxes in batches (respecting `maxUnavailable`).
- Creates new sandboxes with the updated template to maintain the desired replica count.

During scale-up, newly created sandboxes use the latest template. During scale-down, sandboxes with the old revision are removed first.
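The batch size derived from `maxUnavailable` can be sketched in Go. This is a minimal illustration, assuming the usual Kubernetes convention of scaling a percentage against the replica count and rounding down; the controller's exact rounding and minimum-batch behavior are assumptions here, not confirmed by the source:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// scaledMaxUnavailable resolves a maxUnavailable value (an absolute
// number like "5" or a percentage like "20%") against the total
// replica count. Percentages are scaled and rounded down; the result
// is clamped to at least 1 so the update can always make progress.
func scaledMaxUnavailable(value string, replicas int) (int, error) {
	var n int
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		n = replicas * pct / 100
	} else {
		var err error
		n, err = strconv.Atoi(value)
		if err != nil {
			return 0, err
		}
	}
	if n < 1 {
		n = 1
	}
	return n, nil
}

func main() {
	// With the default "20%" and 10 replicas, at most 2 old-revision
	// sandboxes are deleted per batch.
	n, _ := scaledMaxUnavailable("20%", 10)
	fmt.Println(n)
}
```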
### Configuration
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: my-sandbox-pool
  namespace: default
spec:
  replicas: 10
  updateStrategy:
    # Maximum number or percentage of sandboxes that can be unavailable during the update.
    # Can be an absolute number (e.g., 5) or a percentage (e.g., "10%").
    # Default: "20%"
    maxUnavailable: "20%"
  template:
    spec:
      containers:
      - name: sandbox
        image: my-registry/sandbox-image:v2 # Update the image version here
        resources:
          requests:
            cpu: "1"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "1Gi"
```
To trigger an upgrade, simply update any field under spec.template (e.g., container image, resources, environment variables) and apply the change:
```bash
kubectl apply -f sandboxset.yaml
```
### Monitoring Progress
Check the SandboxSet status to monitor the rolling update:
```bash
kubectl get sandboxset my-sandbox-pool -o wide
```
Example output:
```
NAME              REPLICAS   AVAILABLE   UPDATEDREPLICAS   UPDATEDAVAILABLEREPLICAS   UPDATEREVISION   AGE
my-sandbox-pool   10         8           6                 5                          a1b2c3d4         30m
```
| Field | Description |
|---|---|
| `REPLICAS` | Total number of sandboxes (creating + available + running + paused) |
| `AVAILABLE` | Number of sandboxes ready to be claimed |
| `UPDATEDREPLICAS` | Number of sandboxes that have been updated to the latest revision |
| `UPDATEDAVAILABLEREPLICAS` | Number of updated sandboxes that are available |
| `UPDATEREVISION` | Hash of the current desired template version |
The rolling update is complete when `UPDATEDAVAILABLEREPLICAS` equals the desired `REPLICAS` count.
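That completion rule can be expressed as a small predicate, useful in automation that polls the status. A minimal Go sketch; the struct below is illustrative and mirrors the two columns involved, not the real API type:

```go
package main

import "fmt"

// poolStatus holds the two counters that matter for rollout progress,
// corresponding to the REPLICAS and UPDATEDAVAILABLEREPLICAS columns.
type poolStatus struct {
	desiredReplicas          int
	updatedAvailableReplicas int
}

// rolloutComplete reports whether the rolling update has finished:
// every desired replica has been updated and is available again.
func rolloutComplete(s poolStatus) bool {
	return s.updatedAvailableReplicas >= s.desiredReplicas
}

func main() {
	// Mid-rollout (as in the example output above): not complete.
	fmt.Println(rolloutComplete(poolStatus{desiredReplicas: 10, updatedAvailableReplicas: 5}))
	// All replicas updated and available: complete.
	fmt.Println(rolloutComplete(poolStatus{desiredReplicas: 10, updatedAvailableReplicas: 10}))
}
```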
You can also inspect individual sandbox revisions:
```bash
kubectl get sandboxes -l agents.kruise.io/sandbox-template=my-sandbox-pool -o custom-columns=\
NAME:.metadata.name,\
PHASE:.status.phase,\
REVISION:.status.updateRevision
```
## Upgrade Claimed Sandboxes (SandboxUpdateOps)
### How It Works
Claimed sandboxes are already allocated to users and running workloads. To upgrade them, you create a SandboxUpdateOps resource that:

- Selects target sandboxes by label selector.
- Applies a Strategic Merge Patch to each sandbox's template.
- Optionally sets lifecycle hooks (pre/post upgrade) for data backup and restoration.

The sandbox controller then performs a Recreate upgrade for each sandbox, executing the three-phase lifecycle:

`PreUpgrade` (backup) → `UpgradePod` (delete old pod + create new pod) → `PostUpgrade` (restore)
### Important Constraints
- Service interruption: The Recreate upgrade deletes the old pod and creates a new one. The sandbox will be unavailable during the upgrade.
- Memory and IP loss: Memory state and the IP address are lost during the recreate process (the new pod gets a new IP).
- One active SandboxUpdateOps per namespace: Only one SandboxUpdateOps can be actively updating sandboxes in a namespace at a time.
- Lifecycle hooks require agent-runtime: Pre/post upgrade scripts are executed via the `agent-runtime` sidecar. Make sure `runtimes: [{name: "agent-runtime"}]` is configured on the target sandboxes.
- Persistent storage requires CSI: If the sandbox needs to preserve filesystem data across upgrades, `runtimes: [{name: "csi"}]` must be configured to mount external storage.
### Configuration
#### Basic Example (Image Update Only)
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
  name: upgrade-my-sandboxes
  namespace: default
spec:
  selector:
    matchLabels:
      agents.kruise.io/sandbox-template: my-sandbox-pool
  updateStrategy:
    # Maximum number of sandboxes that can be upgrading simultaneously.
    # Can be an absolute number or a percentage.
    maxUnavailable: 2
  patch:
    spec:
      containers:
      - name: sandbox
        image: my-registry/sandbox-image:v2
```
#### Full Example with Lifecycle Hooks
Use lifecycle hooks to back up workspace data before the upgrade and restore it after the new pod is running:
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
  name: upgrade-with-backup
  namespace: default
spec:
  selector:
    matchLabels:
      agents.kruise.io/sandbox-template: my-sandbox-pool
  updateStrategy:
    maxUnavailable: 1
  patch:
    spec:
      containers:
      - name: sandbox
        image: my-registry/sandbox-image:v2
  lifecycle:
    preUpgrade:
      exec:
        command:
        - /bin/bash
        - -c
        - |
          # Back up workspace data to external storage before the pod is destroyed
          tar czf /mnt/shared/backup-$(hostname).tar.gz -C /home/user/workspace .
          echo "Backup completed"
      timeoutSeconds: 120
    postUpgrade:
      exec:
        command:
        - /bin/bash
        - -c
        - |
          # Restore workspace data from external storage after the new pod is ready
          if [ -f /mnt/shared/backup-$(hostname).tar.gz ]; then
            tar xzf /mnt/shared/backup-$(hostname).tar.gz -C /home/user/workspace
            rm -f /mnt/shared/backup-$(hostname).tar.gz
            echo "Restore completed"
          fi
      timeoutSeconds: 120
```
Notes on lifecycle hooks:
- `preUpgrade`: Executed before the old pod is deleted. Use it to save state (e.g., back up files to external/shared storage).
- `postUpgrade`: Executed after the new pod is running and ready. Use it to restore state.
- `timeoutSeconds`: Maximum time (in seconds) to wait for the hook to complete. Default is 60 seconds.
- The `exec.command` runs inside the sandbox via the agent-runtime (envd) interface.
### Applying the Upgrade
```bash
kubectl apply -f sandboxupdateops.yaml
```
### Monitoring Progress
Check the SandboxUpdateOps status:
```bash
kubectl get sandboxupdateops upgrade-my-sandboxes
```
Example output:
```
NAME                   PHASE      TOTAL   UPDATED   UPDATING   FAILED   AGE
upgrade-my-sandboxes   Updating   10      3         2          0        5m
```
| Field | Description |
|---|---|
| `PHASE` | Current phase: `Pending`, `Updating`, `Completed`, or `Failed` |
| `TOTAL` | Total number of sandboxes selected for update |
| `UPDATED` | Number of sandboxes successfully updated |
| `UPDATING` | Number of sandboxes currently being updated |
| `FAILED` | Number of sandboxes that failed to update |
Check individual sandbox status during upgrade:
```bash
kubectl get sandbox -l agents.kruise.io/sandbox-template=my-sandbox-pool
```
A sandbox going through the upgrade will show phase `Upgrading` and transition back to `Running` once the upgrade is complete.
### Detailed Sandbox Upgrade Status
Inspect the conditions on a specific sandbox to see the upgrade progress:
```bash
kubectl get sandbox <sandbox-name> -o yaml
```
The `Upgrading` condition's reason indicates the current stage:
| Condition Reason | Description |
|---|---|
| `PreUpgrade` | Executing the pre-upgrade lifecycle hook |
| `PreUpgradeFailed` | Pre-upgrade hook failed |
| `UpgradePod` | Deleting the old pod and creating the new pod |
| `UpgradePodFailed` | New pod failed to start (e.g., image pull error, container crash) |
| `PostUpgrade` | Executing the post-upgrade lifecycle hook |
| `PostUpgradeFailed` | Post-upgrade hook failed |
| `Succeeded` | Upgrade completed successfully |
Example condition during upgrade:
```yaml
status:
  phase: Upgrading
  conditions:
  - type: Ready
    status: "False"
    reason: Upgrading
    message: "sandbox is upgrading"
  - type: Upgrading
    status: "False"
    reason: UpgradePod
    message: ""
```
### Upgrade Lifecycle Flow
The following diagram illustrates the three-phase Recreate upgrade lifecycle for a single sandbox:
```
                        Sandbox Upgrade Flow

 Template Changed (revision mismatch detected)
        |
        v
 +------------+  success  +--------------+  success  +-------------+
 | PreUpgrade | --------> |  UpgradePod  | --------> | PostUpgrade |
 |  (backup)  |           | (delete old, |           |  (restore)  |
 +-----+------+           |  create new) |           +------+------+
       |                  +------+-------+                  |
       | fail                    | fail                     | fail
       v                         v                          v
 PreUpgradeFailed         UpgradePodFailed          PostUpgradeFailed

 On success of PostUpgrade:
   Phase: Upgrading -> Running
   Ready: False -> True
   Upgrading condition: Succeeded
```
## Troubleshooting
### Diagnosing Upgrade Failures
Check the sandbox conditions for detailed error information:
```bash
kubectl get sandbox <sandbox-name> -o jsonpath='{.status.conditions}' | jq .
```
### Pre-Upgrade Hook Failed
Symptom: Sandbox stuck in `Upgrading` phase with condition reason `PreUpgradeFailed`.
Resolution:

- Check the error message in the `Upgrading` condition for script failure details.
- Fix the pre-upgrade script.
- Delete the failed SandboxUpdateOps and create a new one to retry.

If backup is not needed, remove the `lifecycle` section from the SandboxUpdateOps spec.
### UpgradePod Failed
Symptom: Sandbox stuck in `Upgrading` phase with condition reason `UpgradePodFailed`.
Common causes:
- Image pull errors (wrong image name or registry authentication issues).
- Container crash loops (application startup failures).
- Insufficient node resources.
Resolution:

- Check the condition message for container error details.
- Fix the underlying issue (image, resources, etc.).
- Delete the failed SandboxUpdateOps and create a new one. If `postUpgrade` hooks are configured, create a new SandboxUpdateOps with only the `postUpgrade` hook (no `preUpgrade` and no `patch`) to complete the recovery.
### Post-Upgrade Hook Failed
Symptom: Sandbox stuck in `Upgrading` phase with condition reason `PostUpgradeFailed`.
Resolution:

- Check the error message for script failure details.
- Fix the post-upgrade script.
- Delete the failed SandboxUpdateOps and create a new one with only the `postUpgrade` lifecycle (no `preUpgrade` or `patch`) to retry the restoration step.
### Rollback
To roll back a failed upgrade:
- Delete the current SandboxUpdateOps:

  ```bash
  kubectl delete sandboxupdateops <name>
  ```

- Create a new SandboxUpdateOps that reverts the patch (set the original image/configuration):
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
  name: rollback-sandboxes
  namespace: default
spec:
  selector:
    matchLabels:
      agents.kruise.io/sandbox-template: my-sandbox-pool
  patch:
    spec:
      containers:
      - name: sandbox
        image: my-registry/sandbox-image:v1 # Revert to the previous version
  lifecycle:
    postUpgrade:
      exec:
        command:
        - /bin/bash
        - -c
        - |
          # Restore data if needed
          if [ -f /mnt/shared/backup-$(hostname).tar.gz ]; then
            tar xzf /mnt/shared/backup-$(hostname).tar.gz -C /home/user/workspace
          fi
      timeoutSeconds: 120
```
## Pausing an Update Operation
You can pause an ongoing SandboxUpdateOps to stop it from upgrading additional sandboxes:
```yaml
spec:
  paused: true
```
This prevents new sandboxes from being picked up for upgrade but does not affect sandboxes that are already mid-upgrade.