Skip to main content

· 7 min read
Fei Guo

Ironically, probably every cloud user knew (or should realized that) failures in Cloud resources are inevitable. Hence, high availability is probably one of the most desirable features that Cloud Provider offers for cloud users. For example, in AWS, each geographic region has multiple isolated locations known as Availability Zones (AZs). AWS provides various AZ-aware solutions to allow the compute or storage resources of the user applications to be distributed across multiple AZs in order to tolerate AZ failure, which indeed happened in the past.

In Kubernetes, the concept of AZ is not realized by an API object. Instead, an AZ is usually represented by a group of hosts that have the same location label. Although hosts within the same AZ can be identified by labels, the capability of distributing Pods across AZs was missing in Kubernetes default scheduler. Hence it was difficult to use single StatefulSet or Deployment to perform AZ-aware Pods deployment. Fortunately, in Kubernetes 1.16, a new feature called "Pod Topology Spread Constraints" was introduced. Users now can add new constraints in the Pod Spec, and scheduler will enforce the constraints so that Pods can be distributed across failure domains such as AZs, regions or nodes, in a uniform fashion.

In Kruise, UnitedDeploymemt provides an alternative to achieve high availability in a cluster that consists of multiple fault domains - that is, managing multiple homogeneous workloads, and each workload is dedicated to a single Subset. Pod distribution across AZs is determined by the replica number of each workload. Since each Subset is associated with a workload, UnitedDeployment can support finer-grained rollout and deployment strategies. In addition, UnitedDeploymemt can be further extended to support multiple clusters! Let us reveal how UnitedDeployment is designed.

Using Subsets to describe domain topology#

UnitedDeploymemt uses Subset to represent a failure domain. Subset API primarily specifies the nodes that forms the domain and the number of replicas, or the percentage of total replicas, run in this domain. UnitedDeployment manages subset workloads against a specific domain topology, described by a Subset array.

type Topology struct {    // Contains the details of each subset.    Subsets []Subset}
type Subset struct {    // Indicates the name of this subset, which will be used to generate    // subset workload name prefix in the format '<deployment-name>-<subset-name>-'.    Name string
    // Indicates the node select strategy to form the subset.    NodeSelector corev1.NodeSelector
    // Indicates the number of the subset replicas or percentage of it on the    // UnitedDeployment replicas.    Replicas *intstr.IntOrString}

The specification of the subset workload is saved in Spec.Template. UnitedDeployment only supports StatefulSet subset workload as of now. An interesting part of Subset design is that now user can specify customized Pod distribution across AZs, which is not necessarily a uniform distribution in some cases. For example, if the AZ utilization or capacity are not homogeneous, evenly distributing Pods may lead to Pod deployment failure due to lack of resources. If users have prior knowledge about AZ resource capacity/usage, UnitedDeployment can help to apply an optimal Pod distribution to ensure overall cluster utilization remains balanced. Of course, if not specified, a uniform Pod distribution will be applied to maximize availability.

Customized subset rollout Partitions#

User can update all the UnitedDeployment subset workloads by providing a new version of subset workload template. Note that UnitedDeployment does not control the entire rollout process of all subset workloads, which is typically done by another rollout controller built on top of it. Since the replica number in each Subset can be different, it will be much more convenient to allow user to specify the individual rollout Partition of each subset workload instead of using one Partition to rule all, so that they can be upgraded in the same pace. UnitedDeployment provides ManualUpdate strategy to customize per subset rollout Partition.

type UnitedDeploymentUpdateStrategy struct {    // Type of UnitedDeployment update.    Type UpdateStrategyType    // Indicates the partition of each subset.    ManualUpdate *ManualUpdate}
type ManualUpdate struct {    // Indicates number of subset partition.    Partitions map[string]int32}

multi-cluster controller

This makes it fairly easy to coordinate multiple subsets rollout. For example, as illustrated in Figure 1, assuming UnitedDeployment manages three subsets and their replica numbers are 4, 2, 2 respectively, a rollout controller can realize a canary release plan of upgrading 50% of Pods in each subset at a time by setting subset partitions to 2, 1, 1 respectively. The same cannot be easily achieved by using a single workload controller like StatefulSet or Deployment.

Multi-Cluster application management (In future)#

UnitedDeployment can be extended to support multi-cluster workload management. The idea is that Subsets may not only reside in one cluster, but also spread over multiple clusters. More specifically, domain topology specification will associate a ClusterRegistryQuerySpec, which describes the clusters that UnitedDeployment may distribute Pods to. Each cluster is represented by a custom resource managed by a ClusterRegistry controller using Kubernetes cluster registry APIs.

type Topology struct {  // ClusterRegistryQuerySpec is used to find the all the clusters that  // the workload may be deployed to.   ClusterRegistry *ClusterRegistryQuerySpec  // Contains the details of each subset including the target cluster name and  // the node selector in target cluster.  Subsets []Subset}
type ClusterRegistryQuerySpec struct {  // Namespaces that the cluster objects reside.  // If not specified, default namespace is used.  Namespaces []string  // Selector is the label matcher to find all qualified clusters.  Selector   map[string]string  // Describe the kind and APIversion of the cluster object.  ClusterType metav1.TypeMeta}
type Subset struct {  Name string
  // The name of target cluster. The controller will validate that  // the TargetCluster exits based on Topology.ClusterRegistry.  TargetCluster *TargetCluster
  // Indicate the node select strategy in the Subset.TargetCluster.  // If Subset.TargetCluster is not set, node selector strategy refers to  // current cluster.  NodeSelector corev1.NodeSelector
  Replicas *intstr.IntOrString }
type TargetCluster struct {  // Namespace of the target cluster CRD  Namespace string  // Target cluster name  Name string}

A new TargetCluster field is added to the Subset API. If it presents, the NodeSelector indicates the node selection logic in the target cluster. Now UnitedDeployment controller can distribute application Pods to multiple clusters by instantiating a StatefulSet workload in each target cluster with a specific replica number (or a percentage of total replica), as illustrated in Figure 2.

multi-cluster	controller

At a first glance, UnitedDeployment looks more like a federation controller following the design pattern of Kubefed, but it isn't. The fundamental difference is that Kubefed focuses on propagating arbitrary object types to remote clusters instead of managing an application across clusters. In this example, had a Kubefed style controller been used, each StatefulSet workload in individual cluster would have a replica of 100. UnitedDeployment focuses more on providing the ability of managing multiple workloads in multiple clusters on behalf of one application, which is absent in Kubernetes community to the best of our knowledge.


This blog post introduces UnitedDeployment, a new controller which helps managing application spread over multiple domains (in arbitrary clusters). It not only allows evenly distributing Pods over AZs, which arguably can be more efficiently done using the new Pod Topology Spread Constraint APIs though, but also enables flexible workload deployment/rollout and supports multi-cluster use cases in the future.

· 4 min read
Fei Guo

The concept of controller in Kubernete is one of the most important reasons that make it successful. Controller is the core mechanism that supports Kubernetes APIs to ensure the system reaches the desired state. By leveraging CRDs/controllers and operators, it is fairly easy for other systems to integrate with Kubernetes.

Controller runtime library and the corresponding controller tool KubeBuilder are widely used by many developers to build their customized Kubernetes controllers. In Kruise project, we also use Kubebuilder to generate scaffolding codes that implement the "reconciling" logic. In this blog post, I will share some learnings from Kruise controller development, particularly, about concurrent reconciling.

Some people may already notice that controller runtime supports concurrent reconciling. Check for the options (source) used to create new controller:

type Options struct {    // MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1.    MaxConcurrentReconciles int
    // Reconciler reconciles an object    Reconciler reconcile.Reconciler}

Concurrent reconciling is quite useful when the states of the controller's watched objects change so frequently that a large amount of reconcile requests are sent to and queued in the reconcile queue. Multiple reconcile loops do help drain the reconcile queue much more quickly compared to the default single reconcile loop case. Although this is a great feature for performance, without digging into the code, an immediate concern that a developer may raise is that will this introduce consistency issue? i.e., is it possible that two reconcile loops handle the same object at the same time?

The answer is NO, as you may expect. The "magic" is enforced by the workqueue implementation in Kubernetes client-go, which is used by controller runtime reconcile queue. The workqueue algorithm (source) is demonstrated in Figure 1.


Basically, the workqueue uses a queue and two sets to coordinate the process of handling multiple reconciling requests against the same object. Figure 1(a) presents the initial state of handling four reconcile requests, two of which target the same object A. When a request arrives, the target object is first added to the dirty set or dropped if it presents in dirty set, and then pushed to the queue only if it is not presented in processing set. Figure 1(b) shows the case of adding three requests consecutively. When a reconciling loop is ready to serve a request, it gets the target object from the front of the queue. The object is also added to the processing set and removed from the dirty set (Figure 1(c)). Now if a request of the processing object arrives, the object is only added to the dirty set, not to the queue (Figure 1(d)). This guarantees that an object is only handled by one reconciling loop. When reconciling is done, the object is removed from the processing set. If the object is also shown in the dirty set, it is added back to the back of the queue (Figure 1(e)).

The above algorithm has following implications:

  • It avoids concurrent reconciling for the same object.
  • The object processing order can be different from arriving order even if there is only one reconciling thread. This usually would not be a problem since the controller still reconciles to the final cluster state. However, the out of order reconciling may cause a significant delay for a request. workqueue-starve.... For example, as illustrated in Figure 2, assuming there is only one reconciling thread and two requests targeting the same object A arrive, one of them will be processed and object A will be added to the dirty set (Figure 2(b)). If the reconciling takes a long time and during which a large number of new reconciling requests arrive, the queue will be filled up by the new requests (Figure 2(c)). When reconciling is done, object A will be added to the back of the queue (Figure 2(d)). It would not be handled until all the requests coming after had been handled, which can cause a noticeable long delay. The workaround is actually simple - USE CONCURRENT RECONCILES. Since the cost of an idle go routine is fairly small, the overhead of having multiple reconcile threads is low even if the controller is idle. It seems that the MaxConcurrentReconciles value should be overwritten to a value larger than the default 1 (CloneSet uses 10 for example).
  • Last but not the least, reconcile requests can be dropped (if the target exists in dirty set). This means that we cannot assume that the controller can track all the object state change events. Recalling a presentation given by Tim Hockin, Kubernetes controller is level triggered, not edge triggered. It reconciles for state, not for events.

Thanks for reading the post, hope it helps.

· 6 min read
Fei Guo
Siyu Wang

Kubernetes does not provide a clear guidance about which controller is the best fit for a user application. Sometimes, this does not seem to be a big problem if users understand both the application and workload well. For example, users usually know when to choose Job/CronJob or DaemonSet since the concepts of these workload are straightforward - the former is designed for temporal batch style applications and the latter is suitable for long running Pod which is distributed in every node. On the other hand, the usage boundary between Deployment and StatefulSet is vague. An application managed by a Deployment conceptually can be managed by a StatefulSet as well, the opposite may also apply as long as the Pod OrderedReady capability of StatefulSet is not mandatory. Furthermore, as more and more customized controllers/operators become available in Kubernetes community, finding suitable controller can be a nonnegligible user problem especially when some controllers have functional overlaps.

Kruise attempts to mitigate the problem from two aspects:

  • Carefully design the new controllers in the Kruise suite to avoid unnecessary functional duplications that may confuse users.
  • Establish a classification mechanism for existing workload controllers so that user can more easily understand the use cases of them. We will elaborate this more in this post. The first and most intuitive criterion for classification is the controller name.

Controller Name Convention#

An easily understandable controller name can certainly help adoption. After consulting with many internal/external Kubernetes users, we decide to use the following naming conventions in Kruise. Note that these conventions are not contradicted with the controller names used in upstream controllers.

  • Set -suffix names: This type of controller manages Pods directly. Examples include CloneSet, ReplicaSet and SidecarSet. It supports various depolyment/rollout strategies in Pod level.

  • Deployment -suffix names: This type of controller does not manage Pods directly. Instead, it manages one or many Set -suffix workload instances which are created on behalf of one application. The controller can provide capabilities to orchestrate the deployment/rollout of multiple instances. For example, Deployment manages ReplicaSet and provides rollout capability which is not available in ReplicaSet. UnitedDeployment (planned in M3 release) manages multiple StatefulSet created in respect of multiple domains (i.e., fault domains) within one cluster.

  • Job -suffix names: This type of controller manages batch style applications with different depolyment/rollout strategies. For example, BroadcastJob distributes a job style Pod to every node in the cluster.

Set, Deployment and Job are widely adopted terms in Kubernetes community. Kruise leverages them with certain extensions.

Can we further distinguish controllers with the same name suffix? Normally the string prior to the suffix should be self-explainable, but in many cases it is hard to find a right word to describe what the controller does. Check to see how StatefulSet is originated in this thread. It takes four months for community to decide to use the name StatefulSet to replace the original name PetSet although the new name still confuse people by looking at its API documentation. This example showcases that sometimes a well-thought-out name may not be helpful to identify controller. Again, Kruise does not plan to resolve this problem. As an incremental effort, Kruise considers the following criterion to help classify Set -suffix controllers.

Fixed Pod Name#

One unique property of StatefulSet is that it maintains consistent identities for Pod network and storage. Essentially, this is done by fixing Pod names. Pod name can identify both network and storage since it is part of DNS record and can be used to name Pod volume claim. Why is this property needed given that all Pods in StatefulSet are created from the same Pod template? A well known use case is to manage distributed coordination server application such as etcd or Zookeeper. This type of application requires the cluster member (i.e., the Pod) to access the same data (in Pod volume) whenever a member is reconstructed upon failure, in order to function correctly. To differentiate the term State in StatefulSet from the same term used in other computer science areas, I'd like to associate State with Pod name in this document. That being said, controllers like ReplicaSet and DaemonSet are Stateless since they don't require to reuse the old Pod name when a Pod is recreated.

Supporting Stateful does lead to inflexibility for controller. StatefulSet relies on ordinal numbers to realize fixing Pod names. The workload rollout and scaling has to be done in a strict order. As a consequence, some useful enhancements to StatefulSet become impossible. For example,

  • Selective Pod upgrade and Pod deletion (when scale in). These features can be helpful when Pods are spread across different regions or fault domains.
  • The ability of taking control over existing Pods with arbitrary names. There are cases where Pod creation is done by one controller but Pod lifecycle management is done by another controller (e.g., StatefulSet).

We found that many containerized applications do not require the Stateful property of fixing Pod names, and StatefulSet is hard to be extended for those applications in many cases. To fill the gap, Kruise has released a new controller called CloneSet to manage the Stateless applications. In a nutshell, CloneSet provides PVC support and enriched rollout and management capabilities. The following table roughly compares Advanced StatefulSet and CloneSet in a few aspects.

FeaturesAdvanced StatefulSetCloneSet
Pod nameOrderedRandom
Inplace upgradeYesYes
Max unavailableYesYes
Selective deletionNoYes
Selective upgradeNoYes
Change Pod ownershipNoYes

Now, a clear recommendation to Kruise users is if your applications require fixed Pod names (identities for Pod network and storage), you can start with Advanced StatefulSet. Otherwise, CloneSet is the primary choice of Set -suffix controllers (if DaemonSet is not applicable).


Kruise aims to provide intuitive names for new controllers. As a supplement, this post provides additional guidance for Kruise users to pick the right controller for their applications. Hope it helps!