Skip to main content

OpenKruise v1.2, new PersistentPodState feature to achieve IP retention

· 8 min read
Siyu Wang

We’re pleased to announce the release of Kubernetes 1.2, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes, which mainly focuses on application automations, such as deployment, upgrade, ops and availability protection. Mostly features provided by OpenKruise are built primarily based on CRD extensions. They can work in pure Kubernetes clusters without any other dependences.

What's new?

In release v1.2, OpenKruise provides a new CRD named PersistentPodState, some new fields of CloneSet status and lifecycle hook, and optimization of PodUnavailableBudget.

Here we are going to introduce some changes of it.

1. New CRD and Controller: PersistentPodState

With the development of cloud native, more and more companies start to deploy stateful services (e.g., Etcd, MQ) using Kubernetes. K8S StatefulSet is a workload for managing stateful services, and it considers the deployment characteristics of stateful services in many aspects. However, StatefulSet persistent only limited pod state, such as Pod Name is ordered and unchanging, PVC persistence, and can not cover other states, e.g. Pod IP retention, priority scheduling to previously deployed Nodes, etc. Typical Cases:

  • Service Discovery Middleware services are exceptionally sensitive to the Pod IP after deployment, requiring that the IP cannot be changed.

  • Database services persist data to the host disk, and changes to the Node to which they belong will result in data loss.

In response to the above description, by customizing PersistentPodState CRD, Kruise is able to persistent other states of the Pod, such as "IP Retention".

An object of PersistentPodState may look like this:

kind: PersistentPodState
name: echoserver
namespace: echoserver
# Native k8s or kruise StatefulSet
# only support StatefulSet
kind: StatefulSet
name: echoserver
# required node affinity. As follows, Pod rebuild will force deployment to the same zone
nodeTopologyKeys:[,other node labels]
# preferred node affinity. As follows, Pod rebuild will preferred deployment to the same node
- preference:
nodeTopologyKeys:[,other node labels]
# int [1, 100]
weight: 100

"IP Retention" should be a common requirement for K8S deployments of stateful services. It does not mean "Specified Pod IP", but requires that the Pod IP does not change after the first deployment, either by service release or by machine eviction. To achieve this, we need the K8S network component to support Pod IP retention and the ability to keep the IP as unchanged as possible. In this article, we have modified the Host-local plugin in the flannel network component so that it can achieve the effect of keeping the Pod IP unchanged under the same Node. Related principles will not be stated here, please refer to the code: host-local.

IP retention seems to be supported by the network component, how is it related with PersistentPodState? Well, there are some limitations to the implementation of "Pod IP unchanged" by network components. For example, flannel can only support the same Node to keep the Pod IP unchanged. However, the most important feature of K8S scheduling is "uncertainty", so "how to ensure that Pods are rebuilt and scheduled to the same Node" is the problem that PersistentPodState solves.

Also you can add the annotations below on your StatefulSet or Advanced StatefulSet, to let Kruise automatically create a PersistentPodState object for the StatefulSet. So you don't have to create it manually.

kind: StatefulSet
# auto generate PersistentPodState "true"
# preferred node affinity, As follows, Pod rebuild will preferred deployment to the same node[,other node labels]
# required node affinity, As follows, Pod rebuild will force deployment to the same zone[,other node labels]

2. CloneSet percentage partition calculation changed (breaking), and a new field in its status

Previously, CloneSet calculates its partition with round up if it is a percentage value, which means even you set partition to be a percentage less than 100%, it might update no Pods to the new revision. For example, the real partition of a CloneSet with replicas=8 and partition=90% will be calculated as 8 because of 8 * 90% with round up, so it will not update any Pod. This is a little confused, especially when we are using a rollout component like Kruise Rollout or Argo.

So since v1.2, CloneSet will make sure there is at lease one Pod should be updated when partition is a percentage less than 100%, unless the CloneSet has replicas <= 1.

However, it might be difficult for users to understand this arithmetic, but they have to known the expected updated number of Pods after a percentage partition was set.

So we also provide a new field expectedUpdatedReplicas in CloneSet status, which directly shows the expected updated number of Pods based on the given partition. Users only have to compare status.updatedReplicas >= status.expectedUpdatedReplicas to decide whether their CloneSet has finished rolling out new revision under partition restriction or not.

kind: CloneSet
replicas: 8
partition: 90%
replicas: 8
expectedUpdatedReplicas: 1
updatedReplicas: 1
updatedReadyReplicas: 1

3. Able to mark Pod not-ready for lifecycle hook

Kruise has already provided lifecycle hook in previous versions. CloneSet and Advanced StatefulSet support both PreDelete and InPlaceUpdate hooks, while Advanced DaemonSet only supports PreDelete hook.

Previously, the hooks only pause the operation and allow users to do something (for example remove pod from service endpoints) during Pod deleting and before/after in-place update. But the Pod is probably Ready during the hook state, so that removing it from some custom service implementation may break the rule of Kubernetes that we'd better only remove NotReady Pods from the endpoints.

So that a new field has been added into the lifecycle hook, markPodNotReady indicates the hooked Pod should be marked as NotReady or not.

type LifecycleStateType string

// Lifecycle contains the hooks for Pod lifecycle.
type Lifecycle struct
// PreDelete is the hook before Pod to be deleted.
PreDelete *LifecycleHook `json:"preDelete,omitempty"`
// InPlaceUpdate is the hook before Pod to update and after Pod has been updated.
InPlaceUpdate *LifecycleHook `json:"inPlaceUpdate,omitempty"`

type LifecycleHook struct {
LabelsHandler map[string]string `json:"labelsHandler,omitempty"`
FinalizersHandler []string `json:"finalizersHandler,omitempty"`

/********************** FEATURE STATE: 1.2.0 ************************/
// MarkPodNotReady = true means:
// - Pod will be set to 'NotReady' at preparingDelete/preparingUpdate state.
// - Pod will be restored to 'Ready' at Updated state if it was set to 'NotReady' at preparingUpdate state.
// Default to false.
MarkPodNotReady bool `json:"markPodNotReady,omitempty"`

For PreDelete hook, it will set Pod to be NotReady during PreparingDelete state if markPodNotReady is true, and the Pod can not be changed back to normal even if the replicas is increased again.

For InPlaceUpdate hook, it will set Pod to be NotReady during PreparingUpdate state if markPodNotReady is true, and the NotReady condition will be removed during Updated state.

4. PodUnavailableBudget supports any custom workloads and performance optimization

Kubernetes offers PodDisruptionBudget to help users run highly available applications even when you introduce frequent voluntary disruptions, but it can only constrain the voluntary disruption triggered by the Eviction API.

In voluntary disruption scenarios, PodUnavailableBudget can achieve the effect of preventing application disruption or SLA degradation, which greatly improves the high availability of application services. It can not only protect application Pods from eviction but also deletion, in-place update and other operations that could make Pods not ready.

Previously, PodUnavailableBudget only supports some specific workloads like CloneSet and Deployment. But it can not recognize unknown workloads that may be defined by users themself.

Since v1.2 release, PodUnavailableBudget has supported any custom workloads to protect their Pods from unavailable operations. All you have to do is to declare scale subresource for those custom workloads.

It looks like this in CRD:

labelSelectorPath: .status.labelSelector
specReplicasPath: .spec.replicas
statusReplicasPath: .status.replicas

But if you are using kubebuilder or operator-sdk to generate your project, one line comment on your workload struct will be fine:

// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.labelSelector

Besides, PodUnavailableBudget also optimizes its performance for large-scale clusters by disable DeepCopy from client list.

5. Other changes

For more changes, their authors and commits, you can read the Github release.

Get Involved

Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

  • Join the community on Slack (English).
  • Join the community on DingTalk: Search GroupID 23330762 (Chinese).
  • Join the community on WeChat (new): Search User openkruise and let the robot invite you (Chinese).