OpenKruise v0.10.0: New Features for Multi-Domain Management and Application Protection

September 6, 2021 · 5 min read

Maintainer of OpenKruise

On September 6, 2021, OpenKruise released v0.10.0, which adds new features such as WorkloadSpread and PodUnavailableBudget. This article gives an overview of the new version.

WorkloadSpread

WorkloadSpread can distribute the Pods of a workload across different types of Nodes according to a set of policies, which gives a single workload multi-domain and elastic deployment capabilities.

Some common policies include:

- fault-tolerant spread (for example, spreading Pods evenly across hosts, availability zones, etc.)
- spread by a specified ratio (for example, deploying Pods to several specified availability zones in a given proportion)
- subset management with priority, such as:
  - deploy Pods to ECS first, and then to ECI when ECS resources are insufficient.
  - deploy a fixed number of Pods to ECS first, and deploy the remaining Pods to ECI.
- subset management with customization, such as:
  - control how many Pods of a workload are deployed on each CPU architecture
  - let Pods on different CPU architectures have different resource requirements

WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains, called subsets. Each subset may set maxReplicas, the maximum number of Pod replicas that can run in it. WorkloadSpread injects the subset configuration into Pods through a webhook, and it also controls the order of scale-out and scale-in.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10 | 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b

A WorkloadSpread is bound to a workload via targetRef. When the workload creates a Pod, Kruise injects topology rules into the Pod according to the subsets defined in the WorkloadSpread.
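To make the role of maxReplicas concrete, here is a minimal Python sketch of how Pods could be divided between subsets in declaration order. It is a simplified model and not Kruise's actual implementation: the function names are hypothetical, and rounding percentages up is an assumption of this sketch.

```python
def resolve_max_replicas(max_replicas, workload_replicas):
    """Turn an absolute number or a percentage string into a Pod count."""
    if max_replicas is None:
        return None  # no limit declared for this subset
    if isinstance(max_replicas, str) and max_replicas.endswith("%"):
        percent = int(max_replicas[:-1])
        # Assumption of this sketch: percentages are rounded up.
        return (workload_replicas * percent + 99) // 100
    return int(max_replicas)


def assign_subsets(replicas, subsets):
    """Fill subsets in declaration order; subsets is a list of (name, maxReplicas)."""
    placement = {}
    remaining = replicas
    for name, max_replicas in subsets:
        limit = resolve_max_replicas(max_replicas, replicas)
        count = remaining if limit is None else min(remaining, limit)
        placement[name] = count
        remaining -= count
    return placement


# Mirrors the example above: subset-a is capped, subset-b takes the rest.
print(assign_subsets(15, [("subset-a", 10), ("subset-b", None)]))
print(assign_subsets(15, [("subset-a", "30%"), ("subset-b", None)]))
```

With 15 replicas, the first call places 10 Pods in subset-a and 5 in subset-b; with the 30% limit, subset-a receives 5 Pods and subset-b the remaining 10.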

Note that WorkloadSpread uses Pod Deletion Cost to control the priority of scale-down, so:

- If the workload type is CloneSet, the feature is already supported.
- If the workload type is Deployment or ReplicaSet, your Kubernetes version must be >= 1.22.

You also have to enable the WorkloadSpread feature gate when you install or upgrade Kruise.

PodUnavailableBudget

Kubernetes offers Pod Disruption Budget to help you run highly available applications even when you introduce frequent voluntary disruptions. PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. However, it can only constrain the voluntary disruption triggered by the Eviction API. For example, when you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service.

Even with PDB, the following voluntary disruption scenarios can still cause business disruption or SLA degradation:

- The application owner updates the Deployment's Pod template for a regular upgrade while the cluster administrator drains nodes to scale the cluster down (learn about Cluster Autoscaling).
- The middleware team uses SidecarSet to roll out an upgrade of the sidecar containers in the cluster, e.g. the ServiceMesh Envoy sidecar, while HPA triggers a scale-down of business applications.
- The application owner and the middleware team release the same Pods at the same time through OpenKruise CloneSet and SidecarSet in-place upgrades.

In these voluntary disruption scenarios, PodUnavailableBudget can prevent application disruption or SLA degradation, which improves the availability of application services.

apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet | StatefulSet | ...
    name: web-server
  # selector label query over pods managed by the budget
  # selector and targetRef are mutually exclusive; targetRef takes precedence if both are set.
  # selector is commonly used in scenarios where applications are deployed using multiple workloads,
  # and targetRef is used for protection against a single workload.
# selector:
#   matchLabels:
#     app: web-server
  # maximum number of unavailable Pods for the current CloneSet; in this example, cloneset.replicas(5) * 60% = 3
  # maxUnavailable and minAvailable are mutually exclusive; maxUnavailable takes precedence if both are set
  maxUnavailable: 60%
  # minimum number of available Pods for the current CloneSet; in this example, cloneset.replicas(5) * 40% = 2
# minAvailable: 40%

You have to enable the following feature gates when you install or upgrade Kruise:

- PodUnavailableBudgetDeleteGate: guards Pod deletion and eviction.
- PodUnavailableBudgetUpdateGate: guards Pod update operations, such as in-place update.
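The percentage arithmetic in the PodUnavailableBudget example comments (5 replicas, maxUnavailable: 60%, minAvailable: 40%) can be sketched in Python as follows. This is only an illustration of the budget calculation, not Kruise's code: the function names are hypothetical, and rounding percentages up is an assumption of this sketch.

```python
def resolve_budget(value, replicas):
    """Resolve an absolute number or a percentage string against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        # Assumption of this sketch: percentages are rounded up.
        return (replicas * int(value[:-1]) + 99) // 100
    return int(value)


def unavailable_allowed(replicas, max_unavailable=None, min_available=None):
    """Number of Pods that may be unavailable at the same time."""
    if max_unavailable is not None:
        # maxUnavailable takes precedence when both fields are given.
        return resolve_budget(max_unavailable, replicas)
    if min_available is not None:
        return max(0, replicas - resolve_budget(min_available, replicas))
    raise ValueError("either max_unavailable or min_available is required")


def can_disrupt(replicas, currently_unavailable, **budget):
    """True if one more Pod may become unavailable without exceeding the budget."""
    return currently_unavailable < unavailable_allowed(replicas, **budget)


print(unavailable_allowed(5, max_unavailable="60%"))
print(unavailable_allowed(5, min_available="40%"))
print(can_disrupt(5, 3, max_unavailable="60%"))
```

For the five-replica CloneSet in the example, both settings allow three unavailable Pods, so a fourth deletion, eviction, or in-place update would be rejected in this model while three Pods are already unavailable.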

CloneSet supports scaledown priority by Spread Constraints

When the replicas of a CloneSet are decreased, it uses the following ordering to choose which Pods to delete. In each rule, Pods on the left side of "<" are deleted first:

1. Node unassigned < assigned
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. Lower pod-deletion-cost < higher pod-deletion-cost
5. Higher spread rank < lower spread rank
6. Been ready for empty time < less time < more time
7. Pods with containers with higher restart counts < lower restart counts
8. Empty creation time pods < newer pods < older pods

Rule 4 has been available since Kruise v0.9.0 and is also used by WorkloadSpread to control Pod deletion. Rule 5 is new in Kruise v0.10.0 and sorts Pods by their Topology Spread Constraints during scale-down.

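The ordering rules above can be read as a lexicographic sort key. The Python sketch below models only the first five rules and is not the CloneSet controller's code: the Pod class, its fields, and the integer spread_rank are hypothetical stand-ins for the real Pod status and for the rank that Kruise derives from Topology Spread Constraints.

```python
from dataclasses import dataclass

# Phases listed from "deleted first" to "deleted last".
PHASE_ORDER = {"Pending": 0, "Unknown": 1, "Running": 2}


@dataclass
class Pod:
    name: str
    node_assigned: bool = True
    phase: str = "Running"
    ready: bool = True
    deletion_cost: int = 0  # hypothetical: parsed pod-deletion-cost value
    spread_rank: int = 0    # hypothetical: higher means deleted earlier


def deletion_key(pod):
    """Smaller keys sort first, which means the Pod is deleted first."""
    return (
        pod.node_assigned,        # unassigned (False) before assigned (True)
        PHASE_ORDER[pod.phase],   # Pending < Unknown < Running
        pod.ready,                # not ready before ready
        pod.deletion_cost,        # lower cost before higher cost
        -pod.spread_rank,         # higher spread rank before lower rank
    )


def pods_to_delete(pods, count):
    """Return the names of the first `count` Pods in deletion order."""
    return [p.name for p in sorted(pods, key=deletion_key)[:count]]


pods = [
    Pod("a"),
    Pod("b", deletion_cost=-10),
    Pod("c", ready=False),
    Pod("d", spread_rank=2),
    Pod("e", node_assigned=False, phase="Pending", ready=False),
]
print(pods_to_delete(pods, 3))
```

Because the key is a tuple, a later rule only matters when all earlier rules tie, which matches the priority order of the list.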
Advanced StatefulSet supports scaleup with rate limit

To avoid a large number of failed Pods after a user creates an incorrect Advanced StatefulSet, Kruise adds a maxUnavailable field to its scaleStrategy.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number

When the field is set, Advanced StatefulSet guarantees that the number of unavailable Pods does not exceed this value during Pod creation.

Note that the feature can only be used in a StatefulSet with podManagementPolicy=Parallel.
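As a rough illustration of the rate limit, the Python sketch below computes how many Pods could be created in one step under scaleStrategy.maxUnavailable. It is a simplified model rather than the controller's logic: the function names are hypothetical, and resolving the percentage against replicas with rounding up is an assumption of this sketch.

```python
def resolve_max_unavailable(value, replicas):
    """Resolve an absolute number or a percentage string against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        # Assumption of this sketch: percentages are rounded up.
        return (replicas * int(value[:-1]) + 99) // 100
    return int(value)


def creatable_now(replicas, existing, unavailable, max_unavailable):
    """How many Pods may be created without exceeding maxUnavailable."""
    budget = resolve_max_unavailable(max_unavailable, replicas)
    missing = max(0, replicas - existing)   # Pods still to be created
    room = max(0, budget - unavailable)     # budget not yet used up
    return min(missing, room)


# replicas: 100 with maxUnavailable: 10%, as in the example above
print(creatable_now(100, existing=0, unavailable=0, max_unavailable="10%"))
print(creatable_now(100, existing=10, unavailable=10, max_unavailable="10%"))
print(creatable_now(100, existing=10, unavailable=4, max_unavailable="10%"))
```

In this model the first step creates 10 Pods, nothing more is created while all 10 are still unavailable, and 6 more can be created once 6 of them have become available. If the Pod template is broken and no Pod ever becomes available, creation stops at 10 Pods instead of 100.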

For more changes, please refer to the release page or ChangeLog.
