OpenKruise v1.1, features enhanced, improve performance in large-scale clusters

March 29, 2022 · 8 min read

Maintainer of OpenKruise

We’re pleased to announce the release of Kubernetes 1.1, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes, which mainly focuses on application automations, such as deployment, upgrade, ops and availability protection. Mostly features provided by OpenKruise are built primarily based on CRD extensions. They can work in pure Kubernetes clusters without any other dependences.

What's new?

In release v1.1, OpenKruise optimizes some existing features, and improves its performance in large-scale clusters. Here we are going to introduce some changes of it.

Note that OpenKruise v1.1 bumps Kubernetes dependencies to v1.22, which means we can use new fields of up to K8s v1.22 in Pod template of workloads like CloneSet and Advanced StatefulSet. But OpenKruise can still be used in Kubernetes cluster >= 1.16 version.

1. Keep containers order for in-place update

In the release v1.0 we published last year, OpenKruise has intruduced Container Launch Priority, which supports to define different priorities for containers in a Pod and keeps their start order during Pod creation.

But in v1.0, it can only control the order in Pod creation. If you try to update the containers in-place, they will be updated at the same time.

Recently, the community has discussed with some companies such as LinkedIn and get more input from the users. In some scenarios, the containers in Pod may have special relationship, for example base-container should firstly update its configuration before app-container update, or we have to forbid multiple containers updating together to avoid log-container losing the logs of app-container.

So, OpenKruise supports in-place update with container priorities since v1.1.

There is no extra options, just make sure containers have their launch priorities since Pod creation. If you modify them both in once in-place update, Kruise will firstly update the containers with higher priority. Then Kruise will not update the containers with lower priority util the higher one has updated successfully.

The in-place udpate here includes both modification of image and env from metadata, read the concept doc for more details

For pods without container launch priorities, no guarantees of the execution order during in-place update multiple containers.
For pods with container launch priorities:
- keep execution order during in-place update multiple containers with different priorities.
- no guarantees of the execution order during in-place update multiple containers with the same priority.

For example, we have the CloneSet that includes two containers with different priorities:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v1 ..."
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v1
  updateStrategy:
    type: InPlaceIfPossible

When we update the CloneSet to change app-config annotation and image of main container, which means both sidecar and main containers need to update, Kruise will firstly in-place update pods that recreates sidecar container with the new env from annotation.

At this moment, we can find the apps.kruise.io/inplace-update-state annotation in updated Pod and see its value:

{
  "revision": "{CLONESET_NAME}-{HASH}",         // the target revision name of this in-place update
  "updateTimestamp": "2022-03-22T09:06:55Z",    // the start time of this whole update
  "nextContainerImages": {"main": "main-image:v2"},                // the next containers that should update images
  // "nextContainerRefMetadata": {...},                            // the next containers that should update env from annotations/labels
  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]},  // the pre-check must be satisfied before the next containers can update
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]}  // the first batch of containers that have updated (it just means the spec of containers has updated, such as images in pod.spec.container or annotaions/labels, but dosn't mean the real containers on node have been updated completely)
  ]
}

When the sidecar container has been updated successfully, Kruise will update the next main container. Finally, you will find the apps.kruise.io/inplace-update-state annotation looks like:

{
  "revision": "{CLONESET_NAME}-{HASH}",
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "lastContainerStatuses":{"main":{"imageID":"THE IMAGE ID OF OLD MAIN CONTAINER"}},
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]},
    {"timestamp":"2022-03-22T09:07:20Z","containers":["main"]}
  ]
}

Usually, users only have to care about the containerBatchesRecord to make sure the containers are updated in different batches. If the Pod is blocking during in-place update, you should check the nextContainerImages/nextContainerRefMetadata and see if the previous containers in preCheckBeforeNext have been updated successfully and ready.

2. StatefulSetAutoDeletePVC

Since Kubernetes v1.23, the upstream StatefulSet has supported StatefulSetAutoDeletePVC feature, it controls if and how PVCs are deleted during the lifecycle of a StatefulSet, refer to this doc.

So, Advanced StatefulSet has rebased this feature from upstream, which also requires you to enable StatefulSetAutoDeletePVC feature-gate during install/upgrade Kruise.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  ...
  persistentVolumeClaimRetentionPolicy:  # optional
    whenDeleted: Retain | Delete
    whenScaled: Retain | Delete

Once enabled, there are two policies you can configure for each StatefulSet:

whenDeleted: configures the volume retention behavior that applies when the StatefulSet is deleted.
whenScaled: configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when scaling down the set.

For each policy that you can configure, you can set the value to either Delete or Retain.

Retain (default): PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.
Delete: The PVCs created from the volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.

Note that:

StatefulSetAutoDeletePVC only deletes PVCs created by volumeClaimTemplate instead of the PVCs created by user or related to StatefulSet Pod.
The policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.

3. Advanced DaemonSet refactor, lifecycle hook

The behavior of Advanced DaemonSet used to be a little different with the upstream controller, such as it required extra configuration to choose whether not-ready and unschedulable nodes should be handled, which makes users confused and hard to understand.

In release v1.1, we have refactored Advanced DaemonSet to make it rebase with upstream. Now, the default behavior of Advanced DaemonSet should be same with the upstream DaemonSet, which means users can conveniently modify the apiVersion field to convert a built-in DaemonSet to Advanced DaemonSet.

Meanwhile, we also add lifecycle hook for Advanced DaemonSet. Currently it supports preDelete hook, which allows users to do something (for example check node resources) before Pod deleting.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  ...
  # define with label
  lifecycle:
    preDelete:
      labelsHandler:
        example.io/block-deleting: "true"

When Advanced DaemonSet delete a Pod (including scale in and recreate update):

Delete it directly if no lifecycle hook definition or Pod not matched preDelete hook
Otherwise, Advanced DaemonSet will firstly update Pod to PreparingDelete state and wait for user controller to remove the label/finalizer and Pod not matched preDelete hook

4. Improve performance by disable DeepCopy

By default, when we are writing Operator/Controller with controller-runtime and use the Client interface in sigs.k8s.io/controller-runtime/pkg/client to get/list typed objects, it will always get objects from Informer. That's known by most people.

But what's many people don't know, is that controller-runtime will firstly deep copy all the objects got from Informer and then return the copied objects.

This design aims to avoid developers directly modifying the objects in Informer. After DeepCopy, no matter how developers modify the objected returned by get/list, it will not change the objects in Informer, which are only synced by ListWatch from kube-apiserver.

However, in some large-scale clusters, mutliple controllers of OpenKruise and their workers are reconciling together, which may bring so many DeepCopy operations. For example, there are a lot of application CloneSets and some of them have managed thousands of Pods, then each worker will list all Pod of the CloneSet during Reconcile and there exists multiple workers. It brings CPU and Memory pressure to kruise-manager and even sometimes makes it Out-Of-Memory.

So I have submitted and merged DisableDeepCopy feature in upstream, which contains in controller-runtime >= v0.10 version. It allows developers to specify some resource types that will directly return the objects from Informer without DeepCopy during get/list.

For example, we can add cache options when initialize Manager in main.go to avoid DeepCopy for Pod objects.

    mgr, err := ctrl.NewManager(cfg, ctrl.Options{
		...
		NewCache: cache.BuilderWithOptions(cache.Options{
			UnsafeDisableDeepCopyByObject: map[client.Object]bool{
				&v1.Pod{}: true,
			},
		}),
	})

But in Kruise v1.1, we re-implement Delegating Client instead of using the feature of controller-runtime. It allows developers to avoid DeepCopy with DisableDeepCopy ListOption in any list places, which is more flexible.

    if err := r.List(context.TODO(), &podList, client.InNamespace("default"), utilclient.DisableDeepCopy); err != nil {
		return nil, nil, err
	}

5. Other changes

For more changes, their authors and commits, you can read the Github release.

Get Involved

Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

Join the community on Slack (English).
Join the community on DingTalk: Search GroupID 23330762 (Chinese).
Join the community on WeChat (new): Search User openkruise and let the robot invite you (Chinese).

What's new?​

1. Keep containers order for in-place update​

2. StatefulSetAutoDeletePVC​

3. Advanced DaemonSet refactor, lifecycle hook​

4. Improve performance by disable DeepCopy​

5. Other changes​

Get Involved​