Skip to main content

OpenKruise v1.3, New Custom Pod Probe Capabilities and Significant Performance Improvements for Large-Scale Clusters

· 9 min read
Mingshan Zhao

We’re pleased to announce the release of OpenKruise 1.3, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes, which mainly focuses on application automations, such as deployment, upgrade, ops and availability protection. Mostly features provided by OpenKruise are built primarily based on CRD extensions. They can work in pure Kubernetes clusters without any other dependences.

What's new?

In release v1.3, OpenKruise provides a new CRD named PodProbeMarker, improves its performance in large-scale clusters, Advanced DaemonSet support pre-download image, and some new features have been added to CloneSet, WorkloadSpread, AdvancedCronJob, SidecarSet etc.

Here we are going to introduce some changes of it.

1. New CRD and Controller: PodProbeMarker

Kubernetes provides three Pod lifecycle management:

  • Readiness Probe Used to determine whether the business container is ready to respond to user requests. If the probe fails, the Pod will be removed from Service Endpoints.
  • Liveness Probe Used to determine the health status of the container. If the probe fails, the kubelet will restart the container.
  • Startup Probe Used to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds.

So the Probe capabilities provided in Kubernetes have defined specific semantics and related behaviors. In addition, there is actually a need to customize Probe semantics and related behaviors, such as:

  • GameServer defines Idle Probe to determine whether the Pod currently has a game match, if not, from the perspective of cost optimization, the Pod can be scaled down.
  • K8S Operator defines the main-secondary probe to determine the role of the current Pod (main or secondary). When upgrading, the secondary can be upgraded first, so as to achieve the behavior of selecting the main only once during the upgrade process, reducing the service interruption time during the upgrade process.

OpenKruise provides the ability to customize the Probe and return the result to the Pod Status, and the user can decide the follow-up behavior based on the probe result.

An object of PodProbeMarker may look like this:

apiVersion: apps.kruise.io/v1alpha1
kind: PodProbeMarker
metadata:
name: game-server-probe
namespace: ns
spec:
selector:
matchLabels:
app: game-server
probes:
- name: Idle
containerName: game-server
probe:
exec: /home/game/idle.sh
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
markerPolicy:
- state: Succeeded
labels:
gameserver-idle: 'true'
annotations:
controller.kubernetes.io/pod-deletion-cost: '-10'
- state: Failed
labels:
gameserver-idle: 'false'
annotations:
controller.kubernetes.io/pod-deletion-cost: '10'
podConditionType: game.io/idle

PodProbeMarker results can be viewed at Pod Object:

apiVersion: v1
kind: Pod
metadata:
labels:
app: game-server
gameserver-idle: 'true'
annotations:
controller.kubernetes.io/pod-deletion-cost: '-10'
name: game-server-58cb9f5688-7sbd8
namespace: ns
spec:
...
status:
conditions:
# podConditionType
- type: game.io/idle
# Probe State 'Succeeded' indicates 'True', and 'Failed' indicates 'False'
status: "True"
lastProbeTime: "2022-09-09T07:13:04Z"
lastTransitionTime: "2022-09-09T07:13:04Z"
# If the probe fails to execute, the message is stderr
message: ""

2. Performance optimization: significant performance improvements for large-scale clusters

  • #1026 The introduction of a delayed queueing mechanism significantly optimizes the CloneSet controller work queue buildup problem when kruise-manager is pulled up in large-scale application clusters, ideally reducing initialization time by more than 80%.
  • #1027 Optimize PodUnavailableBudget controller Event Handler logic to reduce the number of irrelevant Pods in the queue.
  • #1011 The caching mechanism optimizes the CPU and Memory consumption of Advanced DaemonSet's repetitive simulation of Pod scheduling computations in large-scale clusters.
  • #1015, #1068 Significantly reduce runtime memory consumption in large clusters. Complete the Disable DeepCopy feature started in v1.1, and reduce the conversion consumption of expressions type label selector.

3. SidecarSet support inject specific historical version

SidecarSet will record historical revision of some fields such as containers, volumes, initContainers, imagePullSecrets and patchPodMetadata via ControllerRevision. Based on this feature, you can easily select a specific historical revision to inject when creating Pods, rather than always inject the latest revision of sidecar.

SidecarSet records ControllerRevision in the same namespace as Kruse Manager. You can execute kubectl get controllerrvisions -n kruise-system -l kruise.io/sidecarset-name=<your-sidecarset-name> to list the ControllerRevisions of your SidecarSet. Moreover, the ControllerRevision name of current SidecarSet revision is shown in status.latestRevision field, so you can record it very easily.

There are two configuration methods as follows:

select revision via ControllerRevision name

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
...
updateStrategy:
partition: 90%
injectionStrategy:
revisionName: <specific-controllerrevision-name>

select revision via custom version label

You can add or update the label apps.kruise.io/sidecarset-custom-version=<your-version-id> to SidecarSet when creating or publishing SidecarSet, to mark each historical revision. SidecarSet will bring this label down to the corresponding ControllerRevision object, and you can easily use the <your-version-id> to describe which historical revision you want to inject.

Assume that you are publishing version-2 in canary way (you wish only 10% Pods will be upgraded), and you want to inject the stable version-1 to newly-created Pods to reduce the risk of the canary publishing:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
labels:
apps.kruise.io/sidecarset-custom-version: example/version-2
spec:
...
updateStrategy:
partition: 90%
injectionStrategy:
customVersion: example/version-1

4. SidecarSet support inject pod annotations

SidecarSet support inject pod annotations, as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
containers:
...
patchPodMetadata:
- annotations:
oom-score: '{"log-agent": 1}'
custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
patchPolicy: MergePatchJson
- annotations:
apps.kruise.io/container-launch-priority: Ordered
patchPolicy: Overwrite | Retain

patchPolicy is the injected policy, as follows:

  • Retain: By default, if annotation[key]=value exists in the Pod, the original value of the Pod will be retained. Inject annotations[key]=value2 only if annotation[key] does not exist in the Pod.
  • Overwrite: Corresponding to Retain, when annotation[key]=value exists in the Pod, it will be overwritten value2.
  • MergePatchJson: Corresponding to Overwrite, the annotations value is a json string. If the annotations[key] does not exist in the Pod, it will be injected directly. If it exists, do a json value merge. For example: annotations[oom-score]='{"main": 2}' exists in the Pod, after injection, the value json is merged into annotations[oom-score]='{"log-agent": 1, "main": 2}'.

Note: When the patchPolicy is Overwrite and MergePatchJson, the annotations can be updated synchronously when the SidecarSet in-place update the Sidecar Container. However, if only the annotations are modified, it will not take effect. It must be in-place update together with the sidecar container image. When patchPolicy is Retain, the annotations will not be updated when the SidecarSet in-place update the Sidecar Container.

According to the above configuration, when the sidecarSet is injected into the sidecar container, it will inject Pod annotations synchronously, as follows:

apiVersion: v1
kind: Pod
metadata:
annotations:
apps.kruise.io/container-launch-priority: Ordered
oom-score: '{"log-agent": 1, "main": 2}'
custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
name: test-pod
spec:
containers:
...

Note: SidecarSet should not modify any configuration outside the sidecar container for security consideration, so if you want to use this capability, you need to first configure SidecarSet_PatchPodMetadata_WhiteList whitelist or turn off whitelist checks via Feature-gate SidecarSetPatchPodMetadataDefaultsAllowed=true.

5. Advanced DaemonSet support pre-downloading image for update

If you have enabled the PreDownloadImageForDaemonSetUpdate feature-gate, DaemonSet controller will automatically pre-download the image you want to update to the nodes of all old Pods. It is quite useful to accelerate the progress of applications upgrade.

The parallelism of each new image pre-downloading by DaemonSet is 1, which means the image is downloaded on nodes one by one. You can change the parallelism using apps.kruise.io/image-predownload-parallelism annotation on DaemonSet according to the capability of image registry, for registries with more bandwidth and P2P image downloading ability, a larger parallelism can speed up the pre-download process.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "10"

6. CloneSet Scaling with PreparingDelete

CloneSet considers Pods in PreparingDelete state as normal by default, which means these Pods will still be calculated in the replicas number.

In this situation:

  • if you scale down replicas from N to N-1, when the Pod to be deleted is still in PreparingDelete, you scale up replicas to N, the CloneSet will move the Pod back to Normal.
  • if you scale down replicas from N to N-1 and put a Pod into podsToDelete, when the specific Pod is still in PreparingDelete, you scale up replicas to N, the CloneSet will not create a new Pod until the specific Pod goes into terminating.
  • if you specificly delete a Pod without replicas changed, when the specific Pod is still in PreparingDelete, the CloneSet will not create a new Pod until the specific Pod goes into terminating.

Since Kruise v1.3.0, you can put a apps.kruise.io/cloneset-scaling-exclude-preparing-delete: "true" label into CloneSet, which indicates Pods in PreparingDelete will not be calculated in the replicas number.

In this situation:

  • if you scale down replicas from N to N-1, when the Pod to be deleted is still in PreparingDelete, you scale up replicas to N, the CloneSet will move the Pod back to Normal.
  • if you scale down replicas from N to N-1 and put a Pod into podsToDelete, even if the specific Pod is still in PreparingDelete, you scale up replicas to N, the CloneSet will create a new Pod immediately.
  • if you specificly delete a Pod without replicas changed, even if the specific Pod is still in PreparingDelete, the CloneSet will create a new Pod immediately.

7. Advanced CronJob Time zones

All CronJob schedule: times are based on the timezone of the kruise-controller-manager by default, which means the timezone set for the kruise-controller-manager container determines the timezone that the cron job controller uses.

However, we have introduce a spec.timeZone field in v1.3.0. You can set it to the name of a valid time zone name. For example, setting spec.timeZone: "Etc/UTC" instructs Kruise to interpret the schedule relative to Coordinated Universal Time.

A time zone database from the Go standard library is included in the binaries and used as a fallback in case an external database is not available on the system.

8. Other changes

For more changes, their authors and commits, you can read the Github release.

Get Involved

Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

  • Join the community on Slack (English).
  • Join the community on DingTalk: Search GroupID 23330762 (Chinese).
  • Join the community on WeChat (new): Search User openkruise and let the robot invite you (Chinese).