
· 8 min read
Siyu Wang

We’re pleased to announce the release of OpenKruise v1.1, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes that focuses on application automation, such as deployment, upgrade, operations, and availability protection. Most features provided by OpenKruise are built on CRD extensions and work in pure Kubernetes clusters without any other dependencies.

What's new?​

Release v1.1 optimizes several existing features and improves performance in large-scale clusters. Here we introduce some of the notable changes.

Note that OpenKruise v1.1 bumps its Kubernetes dependencies to v1.22, which means fields added up to K8s v1.22 can now be used in the Pod template of workloads like CloneSet and Advanced StatefulSet. However, OpenKruise can still be installed in Kubernetes clusters of version >= 1.16.

1. Keep containers order for in-place update​

In release v1.0, published last year, OpenKruise introduced Container Launch Priority, which lets you define different priorities for containers in a Pod and guarantees their start order during Pod creation.

However, in v1.0 it could only control the order during Pod creation. If you updated the containers in-place, they would all be updated at the same time.

Recently, the community discussed this with companies such as LinkedIn and gathered more input from users. In some scenarios, the containers in a Pod have special relationships: for example, a base-container should update its configuration before the app-container is updated, or multiple containers must not be updated together so that a log-container does not lose the logs of the app-container.

So, OpenKruise supports in-place update with container priorities since v1.1.

There are no extra options; just make sure the containers have launch priorities defined at Pod creation. If you modify multiple containers in one in-place update, Kruise will first update the containers with higher priority and will not update the lower-priority containers until the higher-priority ones have been updated successfully.

The in-place update here includes both image modification and env from metadata; read the concept doc for more details.

  • For pods without container launch priorities, there is no guarantee of execution order when multiple containers are updated in-place.
  • For pods with container launch priorities:
    • the execution order is kept when in-place updating multiple containers with different priorities.
    • there is no guarantee of execution order when in-place updating multiple containers with the same priority.

For example, consider a CloneSet that includes two containers with different priorities:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v1 ..."
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v1
  updateStrategy:
    type: InPlaceIfPossible

When we update the CloneSet to change both the app-config annotation and the image of the main container, both the sidecar and main containers need to be updated. Kruise will first in-place update each Pod by recreating the sidecar container with the new env from the annotation.

At this moment, we can find the apps.kruise.io/inplace-update-state annotation in the updated Pod and see its value:

{
  "revision": "{CLONESET_NAME}-{HASH}",              // the target revision name of this in-place update
  "updateTimestamp": "2022-03-22T09:06:55Z",         // the start time of this whole update
  "nextContainerImages": {"main": "main-image:v2"},  // the next containers whose images should be updated
  // "nextContainerRefMetadata": {...},               // the next containers whose env from annotations/labels should be updated
  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]}, // the pre-check that must be satisfied before the next containers can be updated
  "containerBatchesRecord": [
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]} // the first batch of containers that have been updated (this only means the containers' spec has been updated, such as the image in pod.spec.containers or annotations/labels, not that the real containers on the node have been updated completely)
  ]
}

When the sidecar container has been updated successfully, Kruise will update the main container next. Finally, the apps.kruise.io/inplace-update-state annotation will look like:

{
  "revision": "{CLONESET_NAME}-{HASH}",
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "lastContainerStatuses": {"main": {"imageID": "THE IMAGE ID OF OLD MAIN CONTAINER"}},
  "containerBatchesRecord": [
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]},
    {"timestamp":"2022-03-22T09:07:20Z","containers":["main"]}
  ]
}

Usually, users only need to check containerBatchesRecord to confirm that the containers are updated in separate batches. If the Pod gets stuck during in-place update, check nextContainerImages/nextContainerRefMetadata and verify whether the previous containers listed in preCheckBeforeNext have been updated successfully and are ready.

2. StatefulSetAutoDeletePVC​

Since Kubernetes v1.23, the upstream StatefulSet has supported the StatefulSetAutoDeletePVC feature, which controls whether and how PVCs are deleted during the lifecycle of a StatefulSet; refer to this doc.

Advanced StatefulSet has rebased this feature from upstream. It also requires you to enable the StatefulSetAutoDeletePVC feature-gate when installing or upgrading Kruise.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  ...
  persistentVolumeClaimRetentionPolicy: # optional
    whenDeleted: Retain | Delete
    whenScaled: Retain | Delete
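
For reference, when installing or upgrading Kruise with the official Helm chart, the gate can typically be turned on through the chart's featureGates value. A minimal values sketch, assuming the value name used by the chart defaults:

# Helm values fragment (assumed chart layout); gates are given as a comma-separated list
featureGates: "StatefulSetAutoDeletePVC=true"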

Once enabled, there are two policies you can configure for each StatefulSet:

  • whenDeleted: configures the volume retention behavior that applies when the StatefulSet is deleted.
  • whenScaled: configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when scaling down the set.

For each policy that you can configure, you can set the value to either Delete or Retain.

  • Retain (default): PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.
  • Delete: The PVCs created from the volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.

Note that:

  1. StatefulSetAutoDeletePVC only deletes PVCs created from the volumeClaimTemplate, not PVCs created by users or otherwise associated with the StatefulSet Pods.
  2. The policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.

3. Advanced DaemonSet refactor, lifecycle hook​

The behavior of Advanced DaemonSet used to differ slightly from the upstream controller; for example, it required extra configuration to decide whether not-ready and unschedulable nodes should be handled, which confused users and made it hard to understand.

In release v1.1, we refactored Advanced DaemonSet to rebase it on upstream. The default behavior of Advanced DaemonSet is now the same as the upstream DaemonSet, which means users can simply modify the apiVersion field to convert a built-in DaemonSet to an Advanced DaemonSet.

Meanwhile, we also added a lifecycle hook for Advanced DaemonSet. Currently it supports a preDelete hook, which allows users to perform some action (for example, checking node resources) before a Pod is deleted.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  ...
  # define with label
  lifecycle:
    preDelete:
      labelsHandler:
        example.io/block-deleting: "true"

When Advanced DaemonSet deletes a Pod (including scale-in and recreate update):

  • If no lifecycle hook is defined or the Pod does not match the preDelete hook, it is deleted directly.
  • Otherwise, Advanced DaemonSet first updates the Pod to the PreparingDelete state and waits until the user controller removes the label/finalizer so that the Pod no longer matches the preDelete hook (see the sketch below).
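
As a rough sketch, a hooked Pod waiting in this state might look like the following; the lifecycle state label key is assumed here from the Kruise lifecycle convention:

apiVersion: v1
kind: Pod
metadata:
  labels:
    example.io/block-deleting: "true"                # matched by the preDelete labelsHandler above
    lifecycle.apps.kruise.io/state: PreparingDelete  # assumed state label set by Kruise; remove the blocking label to let deletion proceed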

4. Improve performance by disabling DeepCopy​

By default, when we write an Operator/Controller with controller-runtime and use the Client interface in sigs.k8s.io/controller-runtime/pkg/client to get/list typed objects, the client fetches the objects from the Informer cache. This is well known.

What many people don't know is that controller-runtime first deep copies every object retrieved from the Informer and then returns the copies.

This design prevents developers from directly modifying the objects in the Informer. After DeepCopy, no matter how developers modify the objects returned by get/list, the objects in the Informer, which are only synced by ListWatch from kube-apiserver, stay untouched.

However, in some large-scale clusters, multiple OpenKruise controllers and their workers reconcile concurrently, which can trigger an enormous number of DeepCopy operations. For example, a cluster may have many CloneSets, some of which manage thousands of Pods; each worker then lists all Pods of its CloneSet during Reconcile, and there are multiple workers. This puts CPU and memory pressure on kruise-manager and sometimes even causes it to go Out-Of-Memory.

So I submitted and merged the DisableDeepCopy feature upstream, which is included in controller-runtime >= v0.10. It allows developers to specify resource types whose objects are returned directly from the Informer without DeepCopy during get/list.

For example, we can add cache options when initializing the Manager in main.go to avoid DeepCopy for Pod objects.

    mgr, err := ctrl.NewManager(cfg, ctrl.Options{
        ...
        NewCache: cache.BuilderWithOptions(cache.Options{
            UnsafeDisableDeepCopyByObject: map[client.Object]bool{
                &v1.Pod{}: true,
            },
        }),
    })

In Kruise v1.1, however, we re-implemented the delegating client instead of using the controller-runtime feature directly. It allows developers to avoid DeepCopy with the DisableDeepCopy ListOption at any list call site, which is more flexible.

    if err := r.List(context.TODO(), &podList, client.InNamespace("default"), utilclient.DisableDeepCopy); err != nil {
        return nil, nil, err
    }

5. Other changes​

For more changes, their authors and commits, you can read the Github release.

Get Involved​

Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

  • Join the community on Slack (English).
  • Join the community on DingTalk: Search GroupID 23330762 (Chinese).
  • Join the community on WeChat (new): Search User openkruise and let the robot invite you (Chinese).

· 7 min read
Siyu Wang

We’re pleased to announce the release of OpenKruise v1.0, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes that focuses on application automation, such as deployment, upgrade, operations, and availability protection. Most features provided by OpenKruise are built on CRD extensions and work in pure Kubernetes clusters without any other dependencies.

[Figure: overview of OpenKruise features]

Overall, OpenKruise currently provides features in these areas:

  • Application workloads: enhanced deployment and upgrade strategies for stateless/stateful/daemon applications, such as in-place update and canary/flowing upgrade.
  • Sidecar container management: supports defining sidecar containers on their own, which means OpenKruise can inject sidecar containers, upgrade them without affecting application containers, and even hot upgrade them.
  • Enhanced operations: such as restarting containers in-place, pre-downloading images on specific nodes, keeping container launch priority in a Pod, and distributing one resource to multiple namespaces.
  • Application availability protection: protects the availability of applications deployed in Kubernetes.

What's new?​

1. InPlace Update for environments​

Author: @FillZpp

OpenKruise has supported in-place update since a very early version, mostly for workloads like CloneSet and Advanced StatefulSet. Compared to recreating Pods during an upgrade, in-place update only has to modify fields in the existing Pods.

[Figure: comparison of recreate upgrade and in-place update]

As the figure above shows, only the image field in the Pod is modified during in-place update. As a result:

  • The additional cost of scheduling, allocating IPs, and allocating and mounting volumes is avoided.
  • Image pulling is faster, because most image layers pulled for the old image can be re-used and only a few new layers need to be pulled.
  • While a container is being updated in-place, the other containers in the Pod are not affected and keep running.

However, OpenKruise could only in-place update the image field in a Pod and had to recreate Pods if other fields needed to change. Over time, more and more users hoped OpenKruise could in-place update more fields such as env, which is hard to implement because of limitations in kube-apiserver.

After our unremitting efforts, OpenKruise finally supports in-place update of environment variables via the Downward API since v1.0. Take the CloneSet YAML below as an example: the user puts the configuration in an annotation and defines an env that references it. After that, the user only needs to modify the annotation value when changing the configuration, and Kruise will restart all containers in the Pod that consume the env from that annotation to pick up the new configuration.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... the real env value ..."
    spec:
      containers:
      - name: app
        env:
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
  updateStrategy:
    type: InPlaceIfPossible

At the same time, we removed the imageID restriction for in-place update, which means you can update to a new image that has the same imageID as the old one.

For more details please read documentation.

2. Distribute resources over multiple namespaces​

Author: @veophi

In scenarios where namespace-scoped resources such as Secret and ConfigMap need to be distributed or synchronized to different namespaces, native Kubernetes currently only supports manual, one-by-one distribution and synchronization by users, which is very inconvenient.

Typical examples:

  • When users want to use the imagePullSecrets capability of SidecarSet, they must repeatedly create corresponding Secrets in relevant namespaces, and ensure the correctness and consistency of these Secret configurations;
  • When users want to configure some common environment variables, they probably need to distribute ConfigMaps to multiple namespaces, and the subsequent modifications of these ConfigMaps might require synchronization among these namespaces.

Therefore, for these scenarios that require resource distribution and continuous synchronization across namespaces, we provide a tool, ResourceDistribution, to do this automatically.

Currently, ResourceDistribution supports two kinds of resources: Secret and ConfigMap.

apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
  name: sample
spec:
  resource:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: game-demo
    data:
      ...
  targets:
    namespaceLabelSelector:
      ...
    # or includedNamespaces, excludedNamespaces

As you can see, ResourceDistribution is a cluster-scoped CRD that is mainly composed of two fields: resource and targets.

  • resource is a complete and correct resource structure in YAML style.
  • targets indicates the target namespaces that the resource should be distributed into.

For more details please read documentation.

3. Container launch priority​

Author: @Concurrensee

Containers in the same Pod may have dependencies, which means the application in one container depends on another container to run. For example:

  1. Container A has to start first. Container B can start only if A is already running.
  2. Container B has to exit first. Container A can stop only if B has already exited.

Currently, the start and stop sequences of containers are controlled by the kubelet. Kubernetes used to have a KEP that planned to add a type field to containers to identify start and stop priority. However, it was rejected because sig-node thought it would bring too large a change to the code.

So OpenKruise provides a feature named Container Launch Priority, which helps users control the start sequence of containers in a Pod.

  1. The user only has to add the annotation apps.kruise.io/container-launch-priority: Ordered to a Pod, and Kruise will ensure all containers in this Pod start in the order of the pod.spec.containers list (see the sketch after this list).
  2. If you want to customize the launch sequence, you can add the KRUISE_CONTAINER_PRIORITY environment variable to a container. The value range is [-2147483647, 2147483647], and a container with higher priority is guaranteed to start before others with lower priority.
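
For instance, a minimal Pod sketch using the Ordered annotation (the name and images are only illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: launch-priority-demo              # hypothetical name
  annotations:
    apps.kruise.io/container-launch-priority: Ordered
spec:
  containers:
  - name: init-config                     # listed first, so it starts first
    image: busybox
  - name: app                             # starts only after init-config has started
    image: nginx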

For more details please read documentation.

4. kubectl-kruise commandline tool​

Author: @hantmac

OpenKruise already provides SDKs such as kruise-api and client-java for several programming languages, which can be imported into users' projects. On the other hand, some users also need to operate the workload resources from the command line in test environments.

However, the rollout and set image commands in the original kubectl only work for built-in workloads, such as Deployment and StatefulSet.

So OpenKruise now provides a command-line tool named kubectl-kruise, which is a standard kubectl plugin and works with OpenKruise workload types.

# rollout undo cloneset
$ kubectl kruise rollout undo cloneset/nginx

# rollout status advanced statefulset
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts-demo

# set image of a cloneset
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1

For more details please read documentation.

5. Other changes​

CloneSet:

  • Add maxUnavailable field in scaleStrategy to support rate limiting of scaling up.
  • Mark a revision as stable when all Pods have been updated to it, without waiting for all Pods to be ready.

WorkloadSpread:

  • Manage the Pods that were created before the WorkloadSpread.
  • Optimize the update and retry logic for webhook injection.

Advanced DaemonSet:

  • Support in-place update of Daemon Pods.
  • Support the progressive annotation to control whether Pod creation should be limited by partition.

SidecarSet:

  • Fix SidecarSet filter active pods.
  • Add SourceContainerNameFrom and EnvNames fields in transferenv to make the container name flexible and the list shorter.

PodUnavailableBudget:

  • Add no pub-protection annotation to skip validation for the specific Pod.
  • PodUnavailableBudget controller watches workload replicas changed.

NodeImage:

  • Add --nodeimage-creation-delay flag to delay NodeImage creation after Node ready.

UnitedDeployment:

  • Fix pod NodeSelectorTerms length 0 when UnitedDeployment NodeSelectorTerms is nil.

Other optimization:

  • kruise-daemon list and watch pods using protobuf.
  • Export cache resync args and defaults to be 0 in chart value.
  • Fix http checker reloading after webhook certs updated.
  • Generate CRDs with original controller-tools and markers.

Get Involved​

Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

  • Join the community on Slack (English).
  • Join the community on DingTalk: Search GroupID 23330762 (Chinese).
  • Join the community on WeChat: Search User openkruise and let the robot invite you (Chinese).

· 5 min read
Siyu Wang

On Sep 6th, 2021, OpenKruise released the latest version v0.10.0, with new features, such as WorkloadSpread and PodUnavailableBudget. This article provides an overview of this new version.

WorkloadSpread​

WorkloadSpread can distribute the Pods of a workload across different types of Nodes according to certain policies, which gives a single workload the abilities of multi-domain deployment and elastic deployment.

Some common policies include:

  • fault-tolerant spread (for example, spread evenly among hosts, AZs, etc.)
  • spread according to a specified ratio (for example, deploy Pods to several specified AZs proportionally)
  • subset management with priority, such as
    • deploy Pods to ecs first, and then deploy to eci when its resources are insufficient.
    • deploy a fixed number of Pods to ecs first, and the rest Pods are deployed to eci.
  • subset management with customization, such as
    • control how many pods in a workload are deployed in different cpu arch
    • enable pods in different cpu arch to have different resource requirements

WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains called subsets, and each subset may set a limit on the number of Pod replicas it runs, called maxReplicas. WorkloadSpread injects the domain configuration into Pods via webhook, and it also controls the order of scale-in and scale-out.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10 | 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b

A WorkloadSpread is associated with a workload via targetRef. When a Pod is created by the workload, Kruise injects topology policies into it according to the rules in the WorkloadSpread.

Note that WorkloadSpread uses Pod Deletion Cost to control the priority of scale down. So:

  • If the Workload type is CloneSet, it already supports the feature.
  • If the Workload type is Deployment or ReplicaSet, it requires your Kubernetes version >= 1.22.

You also have to enable the WorkloadSpread feature-gate when you install or upgrade Kruise.

PodUnavailableBudget​

Kubernetes offers Pod Disruption Budget to help you run highly available applications even when you introduce frequent voluntary disruptions. PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. However, it can only constrain the voluntary disruption triggered by the Eviction API. For example, when you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service.

However, in the following voluntary disruption scenarios, business disruption or SLA degradation can still occur:

  1. The application owner updates the Deployment's pod template for a routine upgrade, while the cluster administrator drains nodes to scale the cluster down (learn about Cluster Autoscaling).
  2. The middleware team is using SidecarSet to roll out an upgrade of the cluster's sidecar containers, e.g. the service-mesh Envoy, while HPA triggers a scale-down of business applications.
  3. The application owner and the middleware team release to the same Pods at the same time using OpenKruise CloneSet and SidecarSet in-place upgrades.

In these voluntary disruption scenarios, PodUnavailableBudget can prevent application disruption or SLA degradation, which greatly improves the high availability of application services.

apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet | StatefulSet | ...
    name: web-server
  # selector is a label query over the Pods managed by this budget.
  # selector and targetRef are mutually exclusive; targetRef takes priority.
  # selector is commonly used when an application is deployed across multiple workloads,
  # while targetRef protects a single workload.
  # selector:
  #   matchLabels:
  #     app: web-server
  # Maximum number of unavailable Pods for the current CloneSet; in this example cloneset.replicas(5) * 60% = 3.
  # maxUnavailable and minAvailable are mutually exclusive; maxUnavailable takes priority.
  maxUnavailable: 60%
  # Minimum number of available Pods for the current CloneSet; in this example cloneset.replicas(5) * 40% = 2.
  # minAvailable: 40%

You have to enable the corresponding feature-gates when installing or upgrading Kruise:

  • PodUnavailableBudgetDeleteGate: protect Pod deletion or eviction.
  • PodUnavailableBudgetUpdateGate: protect Pod update operations, such as in-place update.

CloneSet supports scaledown priority by Spread Constraints​

When the replicas of a CloneSet are decreased, it uses the following order to choose which Pods to delete:

  1. Node unassigned < assigned
  2. PodPending < PodUnknown < PodRunning
  3. Not ready < ready
  4. Lower pod-deletion cost < higher pod-deletion-cost
  5. Higher spread rank < lower spread rank
  6. Been ready for empty time < less time < more time
  7. Pods with containers with higher restart counts < lower restart counts
  8. Empty creation time pods < newer pods < older pods

Rule 4 was introduced in Kruise v0.9.0 and is also used by WorkloadSpread to control Pod deletion. Rule 5 is added in Kruise v0.10.0 to sort Pods by their Topology Spread Constraints during scale-down.

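For illustration, here is a minimal CloneSet sketch whose template declares a zone spread constraint; with rule 5, scale-down prefers deleting Pods in the zone that currently holds more replicas than an even spread would require (names, images, and values are only illustrative):

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: spread-demo                        # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: spread-demo
  template:
    metadata:
      labels:
        app: spread-demo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: spread-demo
      containers:
      - name: app
        image: nginx
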
Advanced StatefulSet supports scaleup with rate limit​

To avoid a large number of failed Pods after a user creates an incorrect Advanced StatefulSet, Kruise adds a maxUnavailable field to its scaleStrategy.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number

When the field is set, Advanced StatefulSet guarantees that the number of unavailable Pods does not exceed this value during Pod creation.

Note that the feature can only be used in StatefulSet with podManagementPolicy=Parallel.

More​

For more changes, please refer to the release page or ChangeLog.

· 13 min read
Siyu Wang

On May 20, 2021, OpenKruise released the latest version v0.9.0, with new features, such as Pod restart and resource cascading deletion protection. This article provides an overview of this new version.

Pod Restart and Recreation​

Restarting a container is a necessity in daily operations and a common recovery technique. In native Kubernetes, nothing can be operated at container granularity; the Pod, as the minimum operational unit, can only be created or deleted.

Some may ask: why do users still need to pay attention to the operation such as container restart in the cloud-native era? Aren't the services the only thing for users to focus on in the ideal Serverless model?

To answer this question, we need to see the differences between cloud-native architecture and traditional infrastructures. In the era of traditional physical and virtual machines, multiple application instances are deployed and run on one machine, but the lifecycles of the machine and applications are separated. Thus, application instance restart may only require a systemctl or supervisor command but not the restart of the entire machine. However, in the era of containers and cloud-native, the lifecycle of the application is bound to that of the Pod container. In other words, under normal circumstances, one container only runs one application process, and one Pod provides services for only one application instance.

Due to these restrictions, current native Kubernetes provides no API for the container (application) restart for upper-layer services. OpenKruise v0.9.0 supports restarting containers in a single Pod, compatible with standard Kubernetes clusters of version 1.16 or later. After installing or upgrading OpenKruise, users only need to create a ContainerRecreateRequest (CRR) object to initiate a restart process. The simplest YAML file is listed below:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar

The namespace must be the same as the namespace of the Pod to be operated on, and the name can be set as needed. In the spec, podName indicates the Pod name, and containers is a list specifying one or more container names in the Pod to restart.

In addition to the required fields above, CRR also provides a variety of optional restart policies:

spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
    activeDeadlineSeconds: 300
    ttlSecondsAfterFinished: 1800

  • failurePolicy: Values: Fail or Ignore. Default value: Fail. If any container stops or fails to recreate, CRR ends immediately.
  • orderedRecreate: Default value: false. Value true indicates when the list contains multiple containers, the new container will only be recreated after the previous recreation is finished.
  • terminationGracePeriodSeconds: The time for the container to gracefully exit. If this parameter is not specified, the time defined for the Pod is used.
  • unreadyGracePeriodSeconds: Set the Pod to the unready state before recreation and wait for the time expiration to execute recreation.
    • Note: This feature needs the feature-gate KruisePodReadinessGate to be enabled, which will inject a readinessGate when a Pod is created. Otherwise, only the pods created by the OpenKruise workload are injected with readinessGate by default. It means only these Pods can use the unreadyGracePeriodSeconds parameter during the CRR recreation.
  • minStartedSeconds: The minimal period that the new container remains running to judge whether the container is recreated successfully.
  • activeDeadlineSeconds: The expiration period set for CRR execution to mark as ended (unfinished container will be marked as failed.)
  • ttlSecondsAfterFinished: The period after which the CRR will be deleted automatically after the execution ends.

How it works under the hood: after a CRR is created, it is processed by the kruise-manager and then sent to the kruise-daemon (on the node where the Pod resides) for execution. The execution process is listed below:

  1. If preStop is specified for a Pod, the kruise-daemon will first call the CRI to run the command specified by preStop in the container.
  2. If no preStop exists or preStop execution is completed, the kruise-daemon will call the CRI to stop the container.
  3. When the kubelet detects the container exiting, it creates a new container with an increasing "serial number" and starts it. postStart will be executed at the same time.
  4. When the kruise-daemon detects the start of the new container, it reports to CRR that the restart is completed.

[Figure: ContainerRecreateRequest execution workflow]

The container "serial number" corresponds to the restartCount reported by kubelet in the Pod status. Therefore, the restartCount of the Pod increases after the container is restarted. Temporary files written to the rootfs in the old container will be lost due to the container recreation, but data in the volume mount remains.

Cascading Deletion Protection​

The level triggered automation of Kubernetes is a double-edged sword. It brings declarative deployment capabilities to applications while potentially enlarging the influence of mistakes at a final-state scale. For example, with the cascading deletion mechanism, once an owning resource is deleted under normal circumstances (non-orphan deletion), all owned resources associated will be deleted by the following rules:

  1. If a CRD is deleted, all its corresponding CR will be cleared.
  2. If a namespace is deleted, all resources in this namespace, including Pods, will be cleared.
  3. If a workload (Deployment, StatefulSet, etc) is deleted, all Pods under it will be cleared.

Due to failures caused by cascading deletion, we have heard many complaints from Kubernetes users and developers in the community. It is unbearable for any enterprise to mistakenly delete objects at such a large scale in the production environment.

Therefore, in OpenKruise v0.9.0, we contributed the cascading deletion protection feature to the community in the hope of ensuring stability for more users. If you want to use this feature in the current version, the feature-gate of ResourcesDeletionProtection needs to be explicitly enabled when installing or upgrading OpenKruise.

A policy.kruise.io/delete-protection label can be added to resource objects that require protection. Its value can be one of the following (see the example after the list):

  • Always: The object cannot be deleted unless the label is removed.
  • Cascading: The object cannot be deleted if any subordinate resources are available.
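
For example, a Namespace protected against cascading deletion might look like the following minimal sketch (the namespace name is only illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: prod                                       # hypothetical namespace
  labels:
    policy.kruise.io/delete-protection: Cascading  # deletion is rejected while active Pods remain in it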

The following table lists the supported resource types and cascading relationships:

| Kind | Group | Version | Cascading judgement |
| --- | --- | --- | --- |
| Namespace | core | v1 | whether there are active Pods in this namespace |
| CustomResourceDefinition | apiextensions.k8s.io | v1beta1, v1 | whether there are existing CRs of this CRD |
| Deployment | apps | v1 | whether the replicas is 0 |
| StatefulSet | apps | v1 | whether the replicas is 0 |
| ReplicaSet | apps | v1 | whether the replicas is 0 |
| CloneSet | apps.kruise.io | v1alpha1 | whether the replicas is 0 |
| StatefulSet | apps.kruise.io | v1alpha1, v1beta1 | whether the replicas is 0 |
| UnitedDeployment | apps.kruise.io | v1alpha1 | whether the replicas is 0 |

New Features of CloneSet​

Deletion Priority​

The controller.kubernetes.io/pod-deletion-cost annotation was added to Kubernetes after version 1.21. ReplicaSet sorts Pods according to this cost value during scale-in. CloneSet has supported the same feature since OpenKruise v0.9.0.

Users can configure this annotation on a Pod. Its integer value indicates the deletion cost of this Pod compared with other Pods under the same CloneSet, and Pods with a lower cost have a higher deletion priority. If the annotation is not set, the deletion cost of the Pod defaults to 0.
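
For example, a Pod marked as a preferred deletion candidate might carry an annotation like this (the Pod name is only illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app-candidate                                   # hypothetical Pod managed by a CloneSet
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"  # lower cost means this Pod is deleted earlier during scale-in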

Note: This deletion order is not determined solely by deletion cost. The real order serves like this:

  1. Not scheduled < scheduled
  2. PodPending < PodUnknown < PodRunning
  3. Not ready < ready
  4. Smaller pod-deletion cost < larger pod-deletion cost
  5. Period in the Ready state: short < long
  6. Containers restart: more times < fewer times
  7. Creation time: short < long

Image Pre-Download for In-Place Update​

When CloneSet is used for the in-place update of an application, only the container image is updated and the Pod is not rebuilt, so the node where the Pod is located does not change. Therefore, if the new image is pulled in advance on all the nodes hosting the Pods, the Pod in-place update speed will be improved substantially in subsequent batch releases.

If you want to use this feature in the current version, the feature-gate of PreDownloadImageForInPlaceUpdate needs to be explicitly enabled when installing or upgrading OpenKruise. If you update the images in the CloneSet template and the publish policy supports in-place update, CloneSet will create an ImagePullJob object automatically (the batch image pre-download function provided by OpenKruise) to download new images in advance on the node where the Pod is located.

By default, CloneSet sets the parallelism of the ImagePullJob to 1, which means the image is pulled on one node after another. To adjust it, you can set the parallelism via a CloneSet annotation:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"

Pod Replacement by Scale Out and Scale In​

In previous versions, the maxUnavailable and maxSurge policies of CloneSet only took effect during the application release process. In OpenKruise v0.9.0 and later versions, these two policies also apply when deleting a specified Pod.

When the user specifies one or more Pods to be deleted through podsToDelete or apps.kruise.io/specified-delete: true, CloneSet will only execute deletion when the number of unavailable Pods (of the total replicas) is less than the value of maxUnavailable. In addition, if the user has configured the maxSurge policy, the CloneSet will possibly create a new Pod first, wait for the new Pod to be ready, and then delete the old specified Pod.
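
A minimal sketch (fragment) of specifying a Pod to delete via scaleStrategy.podsToDelete, assuming a CloneSet named demo and a Pod named pod-b:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: demo                       # hypothetical name
spec:
  replicas: 5
  scaleStrategy:
    podsToDelete:
    - pod-b                        # the specified Pod; replaced while respecting maxUnavailable/maxSurge
  updateStrategy:
    maxUnavailable: 1
    maxSurge: 1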

The replacement method depends on the value of maxUnavailable and the number of unavailable Pods. For example:

  • For a CloneSet, maxUnavailable=2, maxSurge=1 and only pod-a is unavailable. If you specify pod-b to be deleted, CloneSet will delete it promptly and create a new Pod.
  • For a CloneSet, maxUnavailable=1, maxSurge=1 and only pod-a is unavailable. If you specify pod-b to be deleted, CloneSet will create a new Pod, wait for it to be ready, and then delete the pod-b.
  • For a CloneSet, maxUnavailable=1, maxSurge=1 and only pod-a is unavailable. If you specify this pod-a to be deleted, CloneSet will delete it promptly and create a new Pod.

Efficient Rollback Based on Partition Final State​

Among the native workloads, Deployment does not support phased release, while StatefulSet provides partition semantics to allow users to control gray-scale upgrade batches. OpenKruise workloads, such as CloneSet and Advanced StatefulSet, also provide partitions to support phased release.

For CloneSet, the semantics of Partition is the number or percentage of Pods remaining in the old version. For example, for a CloneSet with 100 replicas, if the partition value is changed in the sequence of 80 ➡️ 60 ➡️ 40 ➡️ 20 ➡️ 0 by steps during the image upgrade, the CloneSet is released in five batches.

However, in the past, whether for Deployment, StatefulSet, or CloneSet, if a rollback was required during the release process, the template information (image) had to be changed back to the old version. During the phased release of StatefulSet and CloneSet, reducing the partition value triggers the upgrade to the new version, but increasing the partition value does not trigger a rollback to the old version.

Since v0.9.0, the partition of CloneSet supports the "final state rollback" function. If the feature-gate CloneSetPartitionRollback is enabled when installing or upgrading OpenKruise, increasing the partition value will trigger CloneSet to roll back the corresponding number of new Pods to the old version.
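
For example, with a CloneSet of 100 replicas already released down to partition 20 (80 Pods on the new revision), raising the partition rolls Pods back to the old revision once the feature-gate is enabled; a minimal sketch (fragment):

spec:
  replicas: 100
  updateStrategy:
    partition: 60   # raising 20 -> 60 rolls 40 updated Pods back to the old revision (requires CloneSetPartitionRollback)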

There is a clear advantage here. During the phased release, only the partition value needs to be adjusted to flexibly control the numbers of old and new versions. However, the "old and new versions" for CloneSet correspond to updateRevision and currentRevision in its status:

  • updateRevision: The version of the template defined by the current CloneSet.
  • currentRevision: The template version of CloneSet during the previous successful full release.

Short Hash​

By default, the value of the controller-revision-hash label that CloneSet sets on a Pod is the full name of the ControllerRevision. For example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994

The name is the concatenation of the CloneSet name and the ControllerRevision hash value. The hash value is generally 8 to 10 characters long, and a label value in Kubernetes cannot exceed 63 characters. Therefore, the name of a CloneSet cannot exceed 52 characters, or the Pod cannot be created.

In v0.9.0, the new feature-gate CloneSetShortHash is introduced. If it is enabled, CloneSet will set the value of controller-revision-hash in the Pod to a hash value only, like 956df7994. Therefore, the length restriction of the CloneSet name is eliminated. (CloneSet can still recognize and manage the Pod with revision labels in the full format, even if this function is enabled.)

New Features of SidecarSet​

Sidecar Hot Upgrade Function​

SidecarSet is a workload provided by OpenKruise to manage sidecar containers separately. Users can inject and upgrade specified sidecar containers within a certain range of Pods using SidecarSet.

By default, for the independent in-place sidecar upgrade, the sidecar stops the container of the old version first and then creates a container of the new version. This method applies to sidecar containers that do not affect the Pod service availability, such as the log collection agent. However, for sidecar containers acting as a proxy such as Istio Envoy, this upgrade method is defective. Envoy, as a proxy container in the Pod, handles all the traffic. If users restart and upgrade directly, service availability will be affected. Thus, you need a complex grace termination and coordination mechanism to upgrade the envoy sidecar separately. Therefore, we offer a new solution for the upgrade of this kind of sidecar containers, namely, hot upgrade:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0

  • upgradeType: HotUpgrade indicates that the sidecar container uses the hot upgrade solution.
  • hotUpgradeEmptyImage: When performing a hot upgrade on the sidecar container, an empty container is required to switch services during the upgrade. The empty container has almost the same configuration as the sidecar container (for example, command, lifecycle, and probe), except for the image address, but it does no actual work.
  • lifecycle.postStart: State migration. This procedure completes the state migration during the hot upgrade. The script needs to be executed according to business characteristics. For example, NGINX hot upgrade requires shared Listen FD and traffic reloading.

More​

For more changes, please refer to the release page or ChangeLog.