7 posts tagged with "workload"

A Flexible and Configurable Serverless Elastic Solution at the Workload Level

February 19, 2025 · 13 min read

Member of OpenKruise

Serverless represents an extension of cloud computing, inheriting its most significant feature: on-demand elastic scaling. This model design allows developers to focus on application logic without concerning themselves with deployment resources, thereby fully leveraging resource scalability to provide superior elasticity capabilities. Enterprises can also genuinely benefit from true pay-as-you-go characteristics. Consequently, more cloud providers are converging towards this new architectural paradigm.

The core capability of "flexible configurability" in Serverless technology focuses on enabling specific cloud usage scenarios to fully utilize cloud resources through simple, minimally invasive, and highly configurable methods. Its essence lies in resolving the conflict between capacity planning and actual cluster load configuration. This article will sequentially introduce two configurable components — WorkloadSpread and UnitedDeployment — discussing their core capabilities, technical principles, advantages and disadvantages, as well as real-world applications. Through these discussions, we aim to share OpenKruise's technical evolution and considerations in addressing Serverless workload elasticity.

Overview of Elastic Scenarios

As Serverless technology matures, more enterprises prefer using cloud resources (such as Alibaba Cloud ACS Serverless container instances) over on-premise resources (like managed resource pools or on-premise IDC data centers) to host applications with temporary, tidal, or bursty characteristics. This approach enhances resource utilization efficiency and reduces overall costs by adopting a pay-as-you-go model. Below are some typical elastic scenarios:

Prioritize using on-premise resources in offline IDC data centers; scale application to the cloud when resources are insufficient.
Prefer using pre-paid resource pool in the cloud; use pay-as-you-go Serverless instances for additional replicas when resources are insufficient.
Use high-quality stable compute power (e.g., dedicated cloud server instances) first; then use lower-quality compute power (e.g., Spot instances).
Configure different resource quantities for container replicas deployed on different compute platforms (e.g., X86, ARM, Serverless instances) to achieve similar performance.
Inject different middleware configurations into replicas deployed on nodes versus Serverless environments (e.g., shared Daemon on nodes, Sidecar injection on Serverless).

These components introduced in this article offer distinct advantages in solving the above problems. Users can choose appropriate capabilities based on their specific scenarios to effectively leverage elastic compute power.

Capabilities and Advantageous Scenarios of Two Components

WorkloadSpread: Utilizes a Mutating Webhook to intercept Pod creation requests that meet certain criteria and apply Patch operations to inject differentiated configurations. Suitable for existing applications requiring multiple elastic partitions with customized Pod Metadata and Spec fields.
UnitedDeployment: A workload with built-in capability of elastic partitioning and pod customization, offering stronger elasticity and capacity planning capabilities. Ideal for new applications needing detailed partitioning and individual configurations for each partition.

WorkloadSpread: An Elastic Strategy Plugin Based on Pod Mutating Webhook

WorkloadSpread is a bypass component provided by the OpenKruise community that spreads target workload Pods across different types of subsets according to specific rules, enhancing multi-region and elastic deployment capabilities without modifying the original workload. It supports almost all native or custom Kubernetes workloads, ensuring adaptability and flexibility in various environments.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef: # Supports almost all native or custom Kubernetes workloads
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
    - name: subset-a
      # The first three replicas will be scheduled to this Subset
      maxReplicas: 3
      # Pod affinity configuration
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - zone-a
      patch:
        # Inject a custom label to Pods scheduled to this Subset
        metadata:
          labels:
            xxx-specific-label: xxx
    - name: subset-b
      # Deploy to Serverless clusters, no capacity and unlimited replicas
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - acs-cn-hangzhou
  scheduleStrategy:
    # Adaptive mode will reschedule failed Pods to other Subsets
    type: Adaptive | Fixed
    adaptive:
      rescheduleCriticalSeconds: 30

Powerful Partitioning Capability

WorkloadSpread spreads Pods into different elastic partitions using Subsets, scaling up forward and scaling down backward based on Subset order.

Flexible Scheduling Configuration

At the Subset level, WorkloadSpread supports selecting nodes via Labels and configuring advanced options such as taints and tolerations. For example, requiredNodeSelectorTerm specifies mandatory node attributes, preferredNodeSelectorTermssets preferred node attributes, and tolerations configures Pod tolerance for node taints. These configurations allow precise control over Pod scheduling and distribution.

At the global level, WorkloadSpread supports two scheduling strategies via the scheduleStrategy field: Fixed and Adaptive. The Fixed strategy ensures strict adherence to predefined Subset distributions, while the Adaptive strategy provides higher flexibility by automatically rescheduling Pods to other available Subsets when necessary.

Detailed Pod Customization

In Subset configurations, the patch field allows for fine-grained customization of Pods scheduled to that subset. Supported fields include container images, resource limits, environment variables, volume mounts, startup commands, probe configurations, and labels. This decouples Pod specifications from environment adaptations, enabling flexible workload adjustments for various partition environments.

...
# patch pod with a topology label:
patch:
  metadata:
    labels:
      topology.application.deploy/zone: "zone-a"
...

The example above demonstrates how to add or modify a label to all Pods in a Subset.

...
# patch pod container resources:
patch:
  spec:
    containers:
      - name: main
        resources:
          limit:
            cpu: "2"
            memory: 800Mi
...

The example above demonstrates how to add or modify the Pod Spec.

...
# patch pod container env with a zone name:
patch:
  spec:
    containers:
      - name: main
        env:
          - name: K8S_AZ_NAME
            value: zone-a
...

The example above demonstrates how to add or modify a container environment variable.

WorkloadSpread's Pod Mutating Webhook Mechanism

WorkloadSpread operates directly on Pods created by the target workload via Pod Mutating Webhook, ensuring non-intrusive operation. When a Pod creation request meets the criteria, the Webhook intercepts it, reads the corresponding WorkloadSpread configuration, selects an appropriate Subset, and modifies the Pod configuration accordingly. The controller maintains the controller.kubernetes.io/pod-deletion-cost label to ensure correct downsizing order.

Limitations of WorkloadSpread

Potential Risks of Webhook

WorkloadSpread depends on Pod Mutating Webhook to function, which intercepts all Pod creation requests in the cluster. If the Webhook Pod (kruise-manager) experiences performance issues or failures, it may prevent new Pods from being created. Additionally, during large-scale scaling operations, Webhook can become a performance bottleneck.

Limitations of Acting on Pods

While acting on Pods reduces business intrusion, it introduces limitations. For instance, CloneSet's gray release ratio cannot be controlled per Subset.

Case Study 1: Bandwidth Package Allocation in Large-Scale Load Testing

A company needed to perform load testing before a major shopping festival. They developed a load-agent program to generate requests and used a CloneSet to manage agent replicas. To save costs, they purchased 10 shared bandwidth packages (each supporting 300 Pods) and aimed to dynamically allocate them to elastic agent replicas.

They configured a WorkloadSpread with 11 Subsets: the first 10 Subsets had a capacity of 300 and patched Pod Annotations to bind specific bandwidth packages; the last Subset had no capacity and no bandwidth package, preventing extra bandwidth allocation if more than 3000 replicas were created.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: bandwidth-spread
  namespace: loadtest
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: load-agent-XXXXX
  subsets:
    - name: bandwidthPackage-1
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id1>

    - ...

    - name: bandwidthPackage-10
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id10>
    - name: no-eip

Case Study 2: Compatibility for Scaling Managed K8S Cluster Services to Serverless Instances

A company had a web service running on an IDC that needed to scale up due to business growth but could not expand the local data center. They chose to use virtual nodes to access cloud-based Serverless elastic compute power, forming a hybrid cloud. Their application used acceleration services like Fluid, which were pre-deployed on nodes in the IDC but not available in the serverless subset. Therefore, they needed to inject a sidecar into cloud Pods to provide acceleration capabilities.

To achieve this without modifying the existing Deployment's 8 replicas, they used WorkloadSpread to add a label to Pods scaled to each subset, which controlled the Fluid sidecar injection.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: data-processor-spread
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  subsets:
    - name: local
      maxReplicas: 8
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "false"
    - name: aliyun-acs
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "true"

UnitedDeployment: A Native Workload with Built-in Elasticity

UnitedDeployment is an advanced workload provided by the OpenKruise community that natively supports partition management. Unlike WorkloadSpread, which enhances basic workloads, UnitedDeployment offers a new mode for managing partitioned elastic applications. It defines applications through a single template, and the controller creates and manages multiple secondary workloads to match different subsets. UnitedDeployment manages the entire lifecycle of applications within a single resource, including definition, partitioning, scaling, and upgrades.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: sample-ud
spec:
  replicas: 6
  selector:
    matchLabels:
      app: sample
  template:
    cloneSetTemplate:
      metadata:
        labels:
          app: sample
      spec:
        # CloneSet Spec
        ...
  topology:
    subsets:
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - ecs
        maxReplicas: 2
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - acs-virtual-kubelet

Advantages of UnitedDeployment

All-In-One Elastic Application Management

UnitedDeployment offers comprehensive all-in-one application management, enabling users to define applications, manage subsets, scale, and upgrade using a single resource.

The UnitedDeployment controller manages a corresponding type of secondary workload for each subset based on the workload template, without requiring additional attention from the user. Users only need to manage the application template and subsets; the UnitedDeployment controller will handle subsequent management tasks for each secondary workload, including creation, modification, and deletion. The controller also monitors the status of Pods created by these workloads when necessary to make corresponding adjustments.

It is the secondary workload controllers implement the specific scaling and updating operations. Thus, scaling and updating using UnitedDeployment produces exactly the same effect as directly using the corresponding workload. For example, a UnitedDeployment will inherit the same grayscale publishing and in-place upgrade capabilities from CloneSet when created with a CloneSet template.

Advanced Subset Management

UnitedDeployment incorporates two capacity allocation algorithms, enabling users to handle various scenarios of elastic applications through detailed subset capacity configurations.

The elastic allocation algorithm implements a classic elastic capacity allocation method similar to WorkloadSpread: by setting upper and lower capacity limits for each subset, Pods are scaled up in the defined order of subsets and scaled down in reverse order. This method has been thoroughly introduced earlier, so it will not be elaborated further here.

The specified allocation algorithm represents a new approach to capacity allocation. It directly assigns fixed numbers or percentages to some subsets and reserves at least one elastic subset to distribute the remaining replicas.

In addition to capacity allocation, UnitedDeployment also allows customizing any Pod Spec fields (including container images) for each subset, which is similar to WorkloadSpread. This grants UnitedDeployment's subset configuration with powerful flexibility.

Adaptive Elasticity

UnitedDeployment offers robust adaptive elasticity, automating scaling and rescheduling operations to reduce operational overhead. It supports Kubernetes Horizontal Pod Autoscaler (HPA), enabling automatic scaling based on predefined conditions while adhering strictly to subset configurations.

UnitedDeployment also offers adaptive Pod rescheduling capabilities similar to WorkloadSpread. Additionally, it allows configuration of timeout durations for scheduling failures and recovery times for subsets from unscheduable status, providing enhanced control over adaptive scheduling.

Limitations of UnitedDeployment

The many advantages of UnitedDeployment stem from its all-in-one management capabilities as an independent workload. However, this also leads to the drawback of higher business transformation intrusiveness. For users' existing application, it is necessary to modify PaaS systems and tools (such as operation and maintenance systems, release systems, etc.) to switch from existing workloads like Deployment and CloneSet to UnitedDeployment.

Case Study 1: Elastic Scaling of Pods to Virtual Nodes with Adaptation for Serverless Containers

Cloud providers typically offer three types of Kubernetes services:

Managed clusters with fixed nodes using cloud servers purchased by users.
Serverless clusters delivering container computing power directly via virtual node technology.
Hybrid clusters containing both managed nodes and virtual nodes.

In this case, a company planned to launch a new service with significant peak-to-valley traffic differences (up to tenfold). To handle this characteristic, they purchased a batch of cloud servers to form a managed cluster nodepool for handling baseline traffic and intended to quickly scale out new replicas to a serverless subset during peak hours. Additionally, their application required extra configuration to run in the Serverless environment. Below is an example configuration:

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: elastic-app
spec:
  # Omitted business workload template
  ...
  topology:
    # Enable Adaptive scheduling to dispatch Pod replicas to ECS node pools and ACS instances adaptively
    scheduleStrategy:
      type: Adaptive
      adaptive:
        # Start scheduling to ACS Serverless instances 10 seconds after ECS node scheduling failure
        rescheduleCriticalSeconds: 10
        # Do not schedule to ECS nodes within one hour after the above scheduling failure
        unschedulableLastSeconds: 3600
    subsets:
      # Prioritize ECS without an upper limit; only schedule to ACS when ECS fails
      # During scale-in, delete ACS instances first, then ECS node pool Pods
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: NotIn
              values:
                - acs-virtual-kubelet
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: In
              values:
                - acs-virtual-kubelet
          # Use patch to modify environment variables for Pods scheduled to elastic computing power, enabling Serverless mode
        patch:
          spec:
            containers:
              - name: main
                env:
                  - name: APP_RUNTIME_MODE
                    value: SERVERLESS
---
# Combine with HPA for automatic scaling
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: elastic-app-hpa
spec:
  minReplicas: 1
  maxReplicas: 100
  metrics:
    - resource:
        name: cpu
        targetAverageUtilization: 2
      type: Resource
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: UnitedDeployment
    name: elastic-app

Case Study 2: Allocating Different Resources to Pods with Different CPU Types

In this case, a company purchased several cloud server instances with Intel, AMD, and ARM platform CPUs to prepare for launching a new service. They wanted Pods scheduled on different platforms to exhibit similar performance. After stress testing, it was found that, compared to Intel CPUs as the benchmark, AMD platforms needed more CPU cores, while ARM platforms required more memory.

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    deploymentTemplate:
      ... # Omitted business workload template
  topology:
    # Intel, AMD, and Yitian 710 ARM machines carry 50%, 25%, and 25% of the replicas respectively
    subsets:
      - name: intel
        replicas: 50%
        nodeSelectorTerm:
          ... # Select Intel node pool through labels
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 4000Mi
      - name: amd64
        replicas: 25%
        nodeSelectorTerm:
          ... # Select AMD node pool through labels
        # Allocate more CPU to AMD platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 3000m
                    memory: 4000Mi
      - name: yitian-arm
        replicas: 25%
        nodeSelectorTerm:
          ... # Select ARM node pool through labels
        # Allocate more memory to ARM platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 6000Mi

Summary

Elastic computing power can significantly reduce business costs and effectively increase the performance ceiling of services. To make good use of elastic computing power, it is necessary to choose appropriate elastic components based on specific application characteristics. The following table summarizes the capabilities of the two components introduced in this article, hoping to provide some reference.

Component	Partition Principle	Ease of Modification	Granularity of Partition	Elasticity Capability
WorkloadSpread	Modify Pods via Webhook	High	Medium	Medium
UnitedDeployment	Create multiple workloads via templates	Low	High	High

WorkloadSpread - Interpretation for OpenKruise v0.10.0 new feature

September 22, 2021 · 12 min read

GuangLei Cao

Contributor of OpenKruise

Weixiang Sun

Member of OpenKruise

Background

Deploying an application in different zones, different hardware types, and even different clusters and cloud vendors is becoming a very common requirement with the development of cloud native techniques. For examples, these are some cases:

Cases about disaster tolerant:

Application pods is scattered according to the nodes to avoid stacking.
Application pods is scattered according to available zones.
Different nodes/zones/domains require different scale of pods.

Cases about cost control:

People deploy an applications preferentially to their own resource pool, and then deployed to elastic resource pool, such as ECI on Aliyun and Fragate on AWS, when own resources are insufficient. When shrinking, the elastic node is preferred to shrink to save cost.

In the most of the cases, people always split their application into multiple workloads (such as several Deployment) to deploy. However，this solution often requires manual management by SRE team, or a deeply customized PAAS to support the careful management of multiple workloads for this one application.

In order to solve this problem, WorkloadSpread feature has been proposed in version v0.10.0 OpenKruise. It can support multi-kind of workloads, such as Deployment, Replicaset, Job, and Cloneset, to manage the partition deployment or elastic scaling. The application scenario and implementation principle of WorkloadSpread will be introduced in detail below to help users better understand this feature.

Introduction

More details about WorkloadSpread can be found in Offical Document.

In short, WorkloadSpread can distribute pods of a workload to different types of nodes according to certain rules, so as to meet the above fragmentation and elasticity scenarios. WorkloadSpread is non-invasive, "plug and play", and can be effective for stock workloads.

Let's make a simple comparison with some related works in the community.

「1」Pod Topology Spread Constrains

Pod topology spread constraints is a solution provided by Kubernetes community. It can horizontally scatter pods according to topology key. The scheduler will select the node that matches the conditions according to the configuration if users defined this rule.

Since Pod Topology Spread is evenly dispersed, it cannot support exact customized partition number and proportion configuration. Furthermore, the distribution of pods will be destroyed when scaling down. Using WorkloadSpread can avoid these problems.

「2」UnitedDeploymen

UnitedDeployment is a solution provided by the OpenKruise community. It can manage pods in multiple regions by creating and managing multiple workloads.

UnitedDeployment supports the requirements of fragmentation and flexibility very well. But, it is a new workload, and the use cost and migration costs will be relatively high, whereas WorkloadSpread is a lightweight solution, which only needs to apply a simple configuration to associate the workload.

Use Case

In the section, I will list some application scenarios of WorkloadSpread and give corresponding configurations to help users quickly understand the WorkloadSpread feature.

「1」Deploy 100 pods to normal node pool, rest pods to elastic node pool

case-1

subsets:
- name: subset-normal
  maxReplicas: 100
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic 
# maxReplicas==nil means no limit for replicas
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

When the workload has less than 100 replicas, all pods will be deployed to the normal node pool, and more than 100 are deployed to the elastic node pool. When scaling down, the pods on the elastic node will be deleted first.

Since workload spread limits the distribution of workload, but does not invade workload. Users can also dynamically adjust the number of replicas according to the resource load in combination with HPA.

In this way, it will be automatically scheduled to the elastic node pool when receiving peak flow, and give priority to releasing the resources in the elastic resource pool when the peak gone.

「1」Deploy pods to normal node pool first, to elastic resource pool when normal node pool is insufficient

case-2

scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
- name: subset-normal
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

Both subsets have no limit on the number of replicas, and the Adaptive rescheduling policy are enabled. The goal is to preferentially deploy to the normal node pool. When normal resources are insufficient, webhook will select elastic nodes through simulated scheduling. When the pod in the normal node pool is in the pending state and exceeds the 30s threshold, the WorkloadSpread controller will delete the pod to trigger pod reconstruction, and the new pod will be scheduled to the elastic node pool. During volume reduction, the pod on the elastic node is also preferentially reduced to save costs for users.

「3」Scatter to 3 zones, the scale is 1:1:3

case-3

subsets:
- name: subset-a
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-a
- name: subset-b
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-b
- name: subset-c
  maxReplicas: 60%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-c   

WorkloadSpread ensures that the pods are scheduled according to the defined proportion when scaling up and down.

「4」Configures different resource quotas on different CPU architecture

case-4

subsets:
- name: subset-x86-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: x86
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "500m"
            memory: "800Mi"
- name: subset-arm-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: arm
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "300m"
            memory: "600Mi"

From the above example, we have patched different labels and container resources for the pods of two subsets, which is convenient for us to manage the pod more finely. When workload pods are distributed on nodes of different CPU architectures, configure different resource quotas to make better use of hardware resources.

Implementation

WorkloadSpread is a pure bypass elastic/topology control solution. Users only need to create a corresponding WorkloadSpread config for their Deployment/Cloneset/Job/ReplicaSet Workloads. There is no need to change the them, and users will be no additional cost to use the WorkloadSpread.

arch

「1」 How to decide the priority when scaling up?

Multiple subsets are defined in WorkloadSpread, and each subset represents a logical domain. Users can freely define subsets according to node configuration, hardware type, zone, etc. In particular, we defined the priority of subsets:

The priority is defined from high to low in the order from front to back, for example subset[i] has higher priority than subset[j] if i < j.
The pods will be scheduled to the subsets with higher priority first.

「2」 How to decide the priority when scaling down?

Theoretically, the bypass solution of WorkloadSpread cannot interfere with the scaling logic in the workload controller.

However, this problem has been solved in the near future. Through the unremitting efforts (feedback) of users, k8s since version 1.21, it has been supported for ReplicaSet (deployment) to specify the "deletion cost" of the pods by setting the annotation controller.kubernetes.io/pod-deletion-cost: the higher the deletion cost, the lower the priority of deletion.

Since version v0.9.0 OpenKruise, the deletion cost feature has been supported in cloneset.

Therefore, the WorkloadSpread controller controls the scaling down order of the pods by adjusting their deletion cost.

For example, an WorkloadSpread associated a CloneSet with 10 replicas is as follows:

  subsets:
  - name: subset-a
    maxReplicas: 8
  - name: subset-b

Then the deletion cost value and deletion order are as follows:

8 pods in subset-a will have 200 deletion cost;
2 pods in subset-b will have 100 deletion cost, and will be deleted first;

If user modify WorkloadSpread as:

  subsets:
  - name: subset-a
    maxReplicas: 5 # 8->5, 
  - name: subset-b

Then the deletion cost value and deletion order will also changed as follows:

5 pods in subset-a will have 200 deletion cost;
3 pods in subset-a will have -100 deletion cost, and will be deleted first;
2 pods in subset-b will have 100 deletion cost;

In this way, workload can preferentially scale down those pods that exceed the subset maxReplicas limit.

「3」 How to solve the counting problems?

How to ensure that webhook injects pod rules in strict accordance with the priority order of subset and the number of maxReplicas is a key problem at the implementation of WorkloadSpread.

3.1 solving concurrency consistency problem

Sine there may be several kruise-controller-manager pods and lots of webhook Goroutines to process the same WorkloadSpread, the concurrency consistency problem must exist.

In the status of WorkloadSpread, there are the subsetStatuses field corresponding to each subset. The missingReplicas field in it indicates the number of pods required by the subset, and - 1 indicates that there is no quantity limit (subset.maxReplicas == nil).

spec:
  subsets:
  - name: subset-a
    maxReplicas: 1
  - name: subset-b
  # ...
status:
  subsetStatuses:
  - name: subset-a
    missingReplicas: 1
  - name: subset-b
    missingReplicas: -1
  # ...

When webhook receives a pod create request:

Find a suitable subset with missingReplicas greater than 0 or equals to -1 according to the subset order.
After finding a suitable subset, if missingReplicas is greater than 0, subtract 1 first and try to update the WorkloadSpread status.
If the update is successful, inject the rules defined by the subset into the pod.
If the update fails, get the WorkloadSpread again to get the latest status, and return to step 1 (there is a certain limit on the number of retries).

Similarly, when webhook receives a pod delete or eviction request, MisingReplicas will add 1 to missingreplicas and update it.

There is no doubt that we are using optimistic locks to solve update conflicts. However, it is not appropriate to only use optimistic locks, because workload will create a large number of pods in parallel, and APIServer will send many pod create requests to webhook in an instant, resulting in a lot of conflicts in parallel processing. As we all know, optimistic lock is not suitable for too many conflicts, because the retry cost of solving conflicts is very high. To this end, we also added a WorkloadSpread level mutex to limit parallel processing to serial processing. There is a new problem in adding mutex locks, that is, after the current root obtains the lock, it is very likely that the WorkloadSpread obtained from infomer is not up-to-date, and will conflict as well. Therefore, after updating the WorkloadSpread, the Goroutine caches the latest WorkloadSpread and then releases the lock, so that the new Goroutine can directly get the latest WorkloadSpread from the cache after obtaining the lock. Of course, in the case of multiple webhooks, we still need to combine the optimistic lock mechanism to solve the conflict.

3.2 solving data consistency problem

So, is the missingReplicas field controlled by the webhook? The answer is NO, because:

The pod create request received by webhook may not really succeed in the end (for example, pod is illegal or fails in subsequent quota verification).
The pod delete/eviction request received by webhook may not really succeed in the end (for example, it is intercepted by PDB, PUB, etc.).
There are always various possibilities in k8s, leading to the end or disappearance of the pods without going through webhook (for example, phase enters succeeded/failed, or ETCD data is lost, etc.).
At the same time, this is not in line with the end state oriented design concept.

Therefore, the WorkloadSpread status is controlled by webhook in collaboration with the controller:

Webhook requests link interception in pod create/delete/ eviction, and modifies the missingReplicas.
At the same time, the controller's reconcile will also get all pods under the current workload, classify them according to the subset, and update missingReplicas to the actual missing quantity.
From the above analysis, it is likely that there is a delay for the controller to obtain the pod from the informer, so we also added the creatingPods map in the status. When the pod is injected at webhook, the key will be recorded as pod name and value are timestamp to the map, and the controller maintains the real missingReplicas in combination with the map. Similarly, there is also a deleteingPods map to record the delete/eviction event of the pod.

「4」How to do if pod schedule failed?

The configuration of reschedule strategy is supported in WorkloadSpread. By default, the type is fixed, that is, the pod is scheduled to the corresponding subset according to the sequence of each subset and the maxReplicas limit.

However, in real scenarios, many times, the resources of subset may not fully meet the number of maxReplicas due to some reasons, such as insufficient resources. Users need a more flexible reschedule strategy.

The adaptive capabilities provided by WorkloadSpread are logically divided into two types:

SimulationSchedule: scheduling records exists in informer, so we want to simulate the scheduling of pods in webhook. That is, simple filtering is performed through nodeSelector/Affinity, Tolerances, and basic resources resources. (not applicable to virtual-kubelet)
Reschedule: After scheduling the pod to a subset, if the scheduling failure exceeds the rescheduleCriticalSeconds time, mark the subset as unscheduled temporarily, and delete the pod to trigger reconstruction. By default, unscheduled will be reserved for 5min, that is, pod creation within 5min will skip this subset.

Conclusion

WorkloadSpread combines some existing features of Kubernetes to give workload the ability of elastic and multi-domain deployment in the form of bypass. We hope that users can reduce workload deployment complexity by using WorkloadSpread and effectively reduce costs by taking advantage of its elastic scalability.

At present, WorkloadSpread is applied to some project in Alibaba, and adjustments in the use will be fed back to the community in time. In the future, there are some new capability plans for WorkloadSpread, such as managing the existing pods, supporting batch workloads, and even using label to match the pod across different workloads. Some of these capabilities need to actually consider the needs and scenarios of community users. I hope you can participate in kruise community, mention Issues and PRs, help users solve the problems of more cloud native deployment, and build a better community.

Reference

WorkloadSpread: https://openkruise.io/docs/user-manuals/workloadspread
Pod Topology Spread Constrains: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
UnitedDeployment: https://openkruise.io/docs/user-manuals/uniteddeployment

OpenKruise 0.9.0, SidecarSet Helps Mesh Container Hot Upgrade

June 10, 2021 · 8 min read

Mingshan Zhao

Member of OpenKruise

OpenKruise is an open source management suite developed by Alibaba Cloud for cloud native application automation. It is currently a Sandbox project hosted under the Cloud Native Computing Foundation (CNCF). Based on years of Alibaba's experience in container and cloud native technologies, OpenKruise is a Kubernetes-based standard extension component that has been widely used in the Alibaba internal production environment, together with technical concepts and best practices for large-scale Internet scenarios.

OpenKruise released v0.8.0 on March 4, 2021, with enhanced SidecarSet capabilities, especially for log management of Sidecar.

Background - How to Upgrading Mesh Containers Independently

SidecarSet is a workload provided by Kruise to manage sidecar containers. Users can complete automatic injection and independent upgrades conveniently using SidecarSet.

By default, sidecar upgrade will first stop the old container and start a new one. This method is particularly suitable for sidecar containers that do not affect Pod service availability, such as log collection agents. However, for many proxies or sidecar containers for runtime, such as Istio Envoy, this upgrade method does not work. Envoy functions as a Proxy container in the Pod to handle all traffic. If users restart in this scenario to upgrade directly, the service availability of the Pod will be affected. Therefore, the release and capacity of the application should be taken into consideration. The sidecar release cannot be independent of the application.

how update mesh sidecar

Tens of thousands of pods in Alibaba Group communicate with each other based on Service Mesh. Mesh container upgrades may make business pods unavailable. Therefore, the upgrade of the mesh containers hinders the iteration of Service Mesh. To address this scenario, we worked with the Service Mesh team to implement the hot upgrade capability of the mesh container. This article focuses on the important role SidecarSet is playing during the implementation of the hot upgrade capability of mesh containers.

SidecarSet Helps Lossless Hot Upgrade of Mesh Containers

Mesh containers cannot perform direct in-place upgrades like the log collection class container. The mesh container must provide services without interruption, but an independent upgrade will make the mesh service unavailable for some time. Some well-known mesh services in the community, such as Envoy and Mosn, provide smooth upgrade capabilities by default. However, these upgrade methods cannot be integrated properly with cloud-native, and Kubernetes does not have an upgrade solution for such sidecar containers.

OpenKruise SidecarSet provides the sidecar hot upgrade mechanism for the mesh container. Thus, lossless Mesh container hot upgrade can be implemented in a cloud-native manner.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: hotupgrade-sidecarset
spec:
  selector:
    matchLabels:
      app: hotupgrade
  containers:
  - name: sidecar
    image: openkruise/hotupgrade-sample:sidecarv1
    imagePullPolicy: Always
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - /migrate.sh
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: openkruise/hotupgrade-sample:empty

upgradeType: “HotUpgrade” indicates this type of sidecar container, which is hot upgrade.
hotUpgradeEmptyImage: When performing hot upgrade on sidecar containers, businesses need to provide an empty container for container switchover. The Empty container has the same configuration as the sidecar container (except for the image address), such as command, lifecycle, and probe.

The SidecarSet hot upgrade mechanism includes two steps: injection of Sidecar containers of the hot upgrade type and Mesh container smooth upgrade.

Inject Sidecar Containers of the Hot Upgrade Type

For Sidecar containers of the hot upgrade type, two containers will be injected by SidercarSet Webhook when creating the Pod:

{sidecar.name}-1: As shown in the following figure, envoy-1 represents a running sidecar container, for example, envoy:1.16.0.
{sidecar.name}-2: As shown in the following figure, envoy-2 represents the “hotUpgradeEmptyImage” container provided by the business, for example, empty:1.0.

inject sidecar

This Empty container does not have any practical work while running the Mesh container.

Smooth Mesh Container Upgrade

The hot upgrade process is divided into three steps:

Upgrade: Replace the Empty container with the sidecar container of the latest version, for example, envoy-2.Image = envoy:1.17.0
Migration: Run the “PostStartHook” script of the sidecar container to upgrade the mesh service smoothly
Reset: After the mesh service is upgraded, replace the sidecar container of the earlier version with an Empty container, for example, envoy-1.Image = empty:1.0

update sidecar

The preceding three steps represent the entire process of the hot upgrade. If multiple hot upgrades on a Pod are required, users only need to repeat the three steps listed above.

Core Logic of Migration

The SidecarSet hot upgrade mechanism completes the mesh container switching and provides the coordination mechanism (PostStartHook) for containers of old and new versions. However, this is only the first step. The Mesh container also needs to provide the PostStartHook script to upgrade the mesh service smoothly (please see the preceding migration process), such as Envoy hot restart and Mosn lossless restart.

Mesh containers generally provide external services by listening to a fixed port. The migration process of mesh containers can be summarized as: pass ListenFD through UDS, stop Accept, and start drainage. For mesh containers that do not support hot restart, you can follow this process to modify the mesh containers. The logic is listed below:

migration

Migration Demo

Different mesh containers provide different services and have different internal implementation logics, so the specific Migrations are also different. The preceding logic only presents some important points, with hopes to benefit everyone in need. We have also provided a hot upgrade Migration Demo on GitHub for reference. Next, we will introduce some of the key codes:

Consultation Mechanism First, users must check whether it is the first startup or hot upgrade smooth migration to start the Mesh container. Kruise injects two environment variables called SIDECARSET_VERSION and SIDECARSET_VERSION_ALT to two sidecar containers to reduce the communication cost of the mesh container. The two environment variables determine whether it is running the hot upgrade process and whether the current sidecar container version is new or old.

// return two parameters:
// 1. (bool) indicates whether it is hot upgrade process
// 2. (bool ) when isHotUpgrading=true, the current sidecar is newer or older
func isHotUpgradeProcess() (bool, bool) {
  // Version of the current sidecar container
  version := os.Getenv("SIDECARSET_VERSION")
  // Version of the peer sidecar container
  versionAlt := os.Getenv("SIDECARSET_VERSION_ALT")
  // If the version of the peer sidecar container is "0", hot upgrade is not underway
  if versionAlt == "0" {
    return false, false
  }
  // Hot upgrade is underway
  versionInt, _ := strconv.Atoi(version)
  versionAltInt, _ := strconv.Atoi(versionAlt)
  // version is of int type and monotonically increases, which means the version value of the new-version container will be greater
  return true, versionInt > versionAltInt
}

ListenFD Migration Use the Unix Domain Socket to migrate ListenFD between containers. This step is also a critical step in the hot upgrade. The code example is listed below:

  // For code conciseness, all failures will not be captured

  /* The old sidecar migrates ListenFD to the new sidecar through Unix Domain Socket */
  // tcpLn *net.TCPListener
  f, _ := tcpLn.File()
  fdnum := f.Fd()
  data := syscall.UnixRights(int(fdnum))
  // Establish a connection with the new sidecar container through Unix Domain Socket
  raddr, _ := net.ResolveUnixAddr("unix", "/dev/shm/migrate.sock")
  uds, _ := net.DialUnix("unix", nil, raddr)
  // Use UDS to send ListenFD to the new sidecar container
  uds.WriteMsgUnix(nil, data, nil)
  // Stop receiving new requests and start the drainage phase, for example, http2 GOAWAY
  tcpLn.Close()

  /* The new sidecar receives ListenFD and starts to provide external services */
  // Listen to UDS
  addr, _ := net.ResolveUnixAddr("unix", "/dev/shm/migrate.sock")
  unixLn, _ := net.ListenUnix("unix", addr)
  conn, _ := unixLn.AcceptUnix()
  buf := make([]byte, 32)
  oob := make([]byte, 32)
  // Receive ListenFD
  _, oobn, _, _, _ := conn.ReadMsgUnix(buf, oob)
  scms, _ := syscall.ParseSocketControlMessage(oob[:oobn])
  if len(scms) > 0 {
    // Parse FD and convert to *net.TCPListener
    fds, _ := syscall.ParseUnixRights(&(scms[0]))
    f := os.NewFile(uintptr(fds[0]), "")
    ln, _ := net.FileListener(f)
    tcpLn, _ := ln.(*net.TCPListener)
    // Start to provide external services based on the received Listener. The http service is used as an example
    http.Serve(tcpLn, serveMux)
  }

Successful Mesh Container Hot Upgrade Cases

Alibaba Cloud Service Mesh (ASM) provides a fully managed service mesh platform compatible with open-source Istio service mesh from the community. Currently, ASM implements the Sidecar hot upgrade capability (Beta) in the data plane based on the hot upgrade capability of OpenKruise SidecarSet. Users can upgrade the data plane version of service mesh without affecting applications.

In addition to hot upgrades, ASM supports capabilities, such as configuration diagnosis, operation audit, log access, monitoring, and service registration, to improve the user experience of service mesh. You are welcome to try it out!

Summary

The hot upgrade of mesh containers in cloud-native has always been an urgent but thorny problem. The solution in this article is only one exploration of Alibaba Group, giving feedback to the community with hopes of encouraging better ideas. We also welcome everyone to participate in the OpenKruise community. Together, we can build mature Kubernetes application management, delivery, and extension capabilities that can be applied to more large-scale, complex, and high-performance scenarios.

OpenKruise 0.8.0, A Powerful Tool for Sidecar Container Management

March 15, 2021 · 11 min read

Mingshan Zhao

Member of OpenKruise

OpenKruise released v0.8.0 on March 4, 2021, with enhanced SidecarSet capabilities, especially for log management of Sidecar.

Background

Sidecar is a very important cloud native container design mode. It can create an independent Sidecar container by separating the auxiliary capabilities from the main container. In microservice architectures, the Sidecar mode is also used to separate general capabilities such as configuration management, service discovery, routing, and circuit breaking from main programs, thus making the microservice architectures less complicated. Since the popularity of Service Mesh has led to the prevalence of the Sidecar mode, the Sidecar mode has also been widely used within Alibaba Group to implement common capabilities such as O&M, security, and message-oriented middleware.

In Kubernetes clusters, pods can not only support the construction of main containers and Sidecar containers, but also many powerful workloads, such as deployment and statefulset to manage and upgrade the main containers and Sidecar containers. However, with the ever-growing businesses in Kubernetes clusters day by day, there have also been various Sidecar containers with a larger scale. Therefore, management and upgrades of online Sidecar containers are more complex:

A business pod contains multiple Sidecar containers, such as O&M, security, and proxy containers. The business team should not only configure the main containers, but also learn to configure these Sidecar containers. This increases the workloads of the business team and the risks in Sidecar container configuration.
The Sidecar container needs to be restarted together with the main business container after the upgrade. The Sidecar container supports hundreds of online businesses, so it is extremely difficult to coordinate and promote the upgrades of a large number of online Sidecar containers.
If there are no effective updates for Sidecar containers with different online configurations and versions, it will pose great potential risks to the management of Sidecar containers.

Alibaba Group has millions of containers with thousands of businesses. Therefore, the management and upgrades of Sidecar containers have become a major target for improvement. To this end, many internal requirements for the Sidecar containers have been summarized and integrated into OpenKruise. Finally, these requirements were abstracted as SidecarSet, a powerful tool to manage and upgrade a wide range of Sidecar containers.

OpenKruise SidecarSet

SidecarSet is an abstracted concept for Sidecar from OpenKruise. As one of the core workloads of OpenKruise, it is used to inject and upgrade the Sidecar containers in Kubernetes clusters. SidecarSet provides a variety of features so that users can easily manage Sidecar containers. The main features are as follows:

Separate configuration management: Each Sidecar container is configured with separate SidecarSet configuration to facilitate management.
Automatic injection: Automatic Sidecar container injection is implemented in scenarios of Pod creation, Pod scale-out and Pod reconstruction.
In-place upgrade: Sidecar containers can be upgraded in-place without the reconstruction of any pods, so that the main business container is not affected. In addition, a wide range of gray release policies are included.

Note: For a Pod that contains multiple container modes, the container that provides the main business logic to the external is the main container. Other containers provide auxiliary capabilities such as log collection, security, and proxy are Sidecar containers. For example, if a pod provides web capabilities outward, the nginx container that provides major web server capabilities is the main container. The logtail container is the Sidecar container that is responsible for collecting and reporting nginx logs. The SidecarSet resource abstraction in this article also solves some problems of the Sidecar containers.

Sidecar logging architectures

Application logs allow you to see the internal running status of your application. Logs are useful for debugging problems and monitoring cluster activities. After the application is containerized, the simplest and most widely used logging is to write standard output and errors.

However, in the current distributed systems and large-scale clusters, the above solution is not enough to meet the production environment standards. First, for distributed systems, logs are scattered in every single container, without a unified place for congregation. Logs may be lost in scenarios such as container crashes and Pod eviction. Therefore, there is a need for a more reliable log solution that is independent of the container lifecycle.

Sidecar logging architectures places the logging agent in an independent Sidecar container to collect container logs by sharing the log directory. Then, the logs are stored in the back-end storage of the log platform. logsidecar

This architecture is also used by Alibaba and Ant Group to realize the log collection of containers. Next, this article will explain how OpenKruise SidecarSet helps a large-scale implementation of the Sidecar log architecture in Kubernetes clusters.

Automatic Injection

OpenKruise SidecarSet has implemented automatic Sidecar container injection based on Kubernetes AdmissionWebhook mechanism. Therefore, as long as the Sidecar is configured in SidecarSet, the defined Sidecar container will be injected into the scaled pods with any deployment patterns, such as CloneSet, Deployment, or StatefulSet. inject sidecar

The owner of Sidecar containers only needs to configure SidecarSet to inject the Sidecar containers without affecting the business. This greatly reduces the threshold for using the Sidecar containers, and facilitates the management of Sidecar owners. In addition to containers, SidecarSet also extends the following fields to meet various scenarios of Sidecar injection:

# sidecarset.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: test-sidecarset
spec:
  # Select Pods through the selector
  selector:
    matchLabels:
      app: web-server
  # Specify a namespace to take effect
  namespace: ns-1
  # container definition
  containers:
  - name: logtail
    image: logtail:1.0.0
    # Share the specified volume
    volumeMounts:
    - name: web-log
      mountPath: /var/log/web
    # Share all volumes
    shareVolumePolicy:
      type: disabled
    # Share environment variables
    transferEnv:
    - sourceContainerName: web-server
      envName: TZ
  volumes:
    - name: web-log
      emptyDir: {}

Pod selector

The selector is supported to select the pods to be injected. In the above example, the pod of labels[app] = web-server is selected to inject the logtail container. Alternatively, labels[inject/logtail] = true can be added in all pods to inject a global Sidecar.
namespace: SidecarSet is globally valid by default. This parameter can be configured to make it valid to a specific namespace.

Share the specified volume: Use volumeMounts and volumes to share a specified volume with the main container. In the above example, a web-log volume is shared to achieve log collection.
Share all volumes: Use shareVolumePolicy = enabled | disabled to specify whether to mount all volumes in the Pod's main container, which is often used for Sidecar containers such as log collection. If the configuration is enabled, all mount points in the application container are injected into the same Sidecar path, unless there are data volumes and mount points declared by Sidecar.）

Use transferEnv to obtain environment variables from other containers, which copies the environment variable named envName in the sourceContainerName container to the current Sidecar container. In the above example, the Sidecar container of logs shares the time zone of the main container, which is especially common in overseas environments.

Note: The number of containers for the created Pods cannot be changed in the Kubernetes community. Therefore, the injection capability described above can only occur during the Pod creation phase. For the created Pods, Pod reconstruction is required for injection.

In-place Upgrade

SidecarSet not only allows to inject the Sidecar containers, but also reuses the in-place update feature of OpenKruise. This realizes the upgrade of the Sidecar containers without restarting the Pod and the main container. Since this upgrade method does not affect the business, upgrading the Sidecar containers is no longer a pain point. Thus, it brings a lot of conveniences to Sidecar owners and speeds up the Sidecar version iteration.

inplace sidecar

Note: Only the modification on the container.image fields for the created Pods is allowed by the Kubernetes community. Therefore, the modification on other fields of Sidecar containers requires the reconstruction of Pod, and the in-place upgrade is not supported.

To meet the requirements in some complex Sidecar upgrade scenarios, SidecarSet provides the in-place upgrade and a wide range of gray release strategies.

Gray Release

Gray release is a common method that allows a Sidecar container to be released smoothly. It is highly recommended that this method is used in large-scale clusters. Here is an example of Pod's rolling release based on the maximum unavailability after the first batch of Pod release is suspended. Suppose that there are 1,000 Pods to be released:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    partition: 980
    maxUnavailable: 10%

The configuration above is suspended after the former release of 20 pods (1000 – 980 = 20). After the Sidecar container has been normal for a period of time in business, adjust the update SidecarSet configuration:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    maxUnavailable: 10%

As such, the remaining 980 Pods will be released in the order of the maximum unavailable numbers (10% * 1000=100) until all Pods are released.

Partition indicates that the number or percentage of Pods of the old version is retained, with the default value of 0. Here, the partition does not represent any order number. If the partition is set up during the release process:

If it is a number, the controller will update the pods with (replicas – partition) to the latest version.
If it is a percentage, the controller will update the pods with (replicas * (100% - partition)) to the latest version.

MaxUnavailable indicates the maximum unavailable number of pods at the same time during the release, with the default value of 1. Users can set the MaxUnavailable value as absolute value or percentage. The percentage is used by the controller to calculate the absolute value based on the number of selected pods.

Note: The values of maxUnavailable and partition are not necessarily associated. For example:

Under {matched pod} = 100, partition = 50, and maxUnavailable = 10, the controller will release 50 pods to the new version, but the release is limited to 10. That is, only 10 pods are released at the same time. A pod is released one after another until 50 pods are all released.
Under {matched pod} = 100, partition = 80, and maxUnavailable = 30, the controller will release 20 Pods to the new version. The controller releases all 20 pods at the same time because the number of maxUnavailable is met.

Canary Release

For businesses that require canary release, strategy.selector can be considered as a choice. The solution is to mark the labels[canary.release] = true into the Pods that require canary release, and then use strategy.selector.matchLabels to select the pods.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    selector:
      matchLabels:
        canary.release: "true"
    maxUnavailable: 10%

The above configuration only releases the containers marked with canary labels. After the canary verification is completed, rolling release is performed based on the maximum unavailability by removing the configuration of updateStrategy.selector.

Scatter Release

The upgrade sequence of pods in SidecarSet is subject to the following rules by default:

For the pod set upgrade, multiple upgrades with the same order are guaranteed.
The selection priority is (the smaller, the more prioritized): unscheduled < scheduled, pending < unknown < running, not-ready < ready, newer pods < older pods.

In addition to the above default release order, the scatter release policy allows users to scatter the pods that match certain labels to the entire release process. For example, for a global sidecar container like logtail, dozens of business pods may be injected into a cluster. Thus, the logtail can be released after being scattered based on the application name, realizing scattered and gray release among applications. And it can be performed together with the maximum unavailability.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
spec:
# ...
  updateStrategy:
    type: RollingUpdate
    scatterStrategy:
    - key: app_name
      value: nginx
    - key: app_name
      value: web-server
    - key: app_name
      value: api-gateway
    maxUnavailable: 10%

Note: In the current version, all application names must be listed. In the next version, an intelligent scattered release will be supported with only the label key configured.

Summary

In the OpenKruise v0.8.0, the SidecarSet has been improved in terms of log management in Sidecar scenarios. In the later exploration of the stability and performance of SidecarSet, more scenarios will be covered at the same time. For example, Service Mesh scenario will be supported in the next version. Moreover, more people are welcomed to participate OpenKruise community to improve the application management and delivery extensibility of Kubernetes for scenarios featuring large scale, complexity and extreme performance.

UnitedDeploymemt - Supporting Multi-domain Workload Management

November 20, 2019 · 7 min read

Fei Guo

Maintainer of OpenKruise

Ironically, probably every cloud user knew (or should realized that) failures in Cloud resources are inevitable. Hence, high availability is probably one of the most desirable features that Cloud Provider offers for cloud users. For example, in AWS, each geographic region has multiple isolated locations known as Availability Zones (AZs). AWS provides various AZ-aware solutions to allow the compute or storage resources of the user applications to be distributed across multiple AZs in order to tolerate AZ failure, which indeed happened in the past.

In Kubernetes, the concept of AZ is not realized by an API object. Instead, an AZ is usually represented by a group of hosts that have the same location label. Although hosts within the same AZ can be identified by labels, the capability of distributing Pods across AZs was missing in Kubernetes default scheduler. Hence it was difficult to use single StatefulSet or Deployment to perform AZ-aware Pods deployment. Fortunately, in Kubernetes 1.16, a new feature called "Pod Topology Spread Constraints" was introduced. Users now can add new constraints in the Pod Spec, and scheduler will enforce the constraints so that Pods can be distributed across failure domains such as AZs, regions or nodes, in a uniform fashion.

In Kruise, UnitedDeploymemt provides an alternative to achieve high availability in a cluster that consists of multiple fault domains - that is, managing multiple homogeneous workloads, and each workload is dedicated to a single Subset. Pod distribution across AZs is determined by the replica number of each workload. Since each Subset is associated with a workload, UnitedDeployment can support finer-grained rollout and deployment strategies. In addition, UnitedDeploymemt can be further extended to support multiple clusters! Let us reveal how UnitedDeployment is designed.

Using `Subsets` to describe domain topology

UnitedDeploymemt uses Subset to represent a failure domain. Subset API primarily specifies the nodes that forms the domain and the number of replicas, or the percentage of total replicas, run in this domain. UnitedDeployment manages subset workloads against a specific domain topology, described by a Subset array.

type Topology struct {
	// Contains the details of each subset.
	Subsets []Subset
}

type Subset struct {
	// Indicates the name of this subset, which will be used to generate
	// subset workload name prefix in the format '<deployment-name>-<subset-name>-'.
	Name string

	// Indicates the node select strategy to form the subset.
	NodeSelector corev1.NodeSelector

	// Indicates the number of the subset replicas or percentage of it on the
	// UnitedDeployment replicas.
	Replicas *intstr.IntOrString
}

The specification of the subset workload is saved in Spec.Template. UnitedDeployment only supports StatefulSet subset workload as of now. An interesting part of Subset design is that now user can specify customized Pod distribution across AZs, which is not necessarily a uniform distribution in some cases. For example, if the AZ utilization or capacity are not homogeneous, evenly distributing Pods may lead to Pod deployment failure due to lack of resources. If users have prior knowledge about AZ resource capacity/usage, UnitedDeployment can help to apply an optimal Pod distribution to ensure overall cluster utilization remains balanced. Of course, if not specified, a uniform Pod distribution will be applied to maximize availability.

Customized subset rollout `Partitions`

User can update all the UnitedDeployment subset workloads by providing a new version of subset workload template. Note that UnitedDeployment does not control the entire rollout process of all subset workloads, which is typically done by another rollout controller built on top of it. Since the replica number in each Subset can be different, it will be much more convenient to allow user to specify the individual rollout Partition of each subset workload instead of using one Partition to rule all, so that they can be upgraded in the same pace. UnitedDeployment provides ManualUpdate strategy to customize per subset rollout Partition.

type UnitedDeploymentUpdateStrategy struct {
	// Type of UnitedDeployment update.
	Type UpdateStrategyType
	// Indicates the partition of each subset.
	ManualUpdate *ManualUpdate
}

type ManualUpdate struct {
	// Indicates number of subset partition.
	Partitions map[string]int32
}

multi-cluster controller

This makes it fairly easy to coordinate multiple subsets rollout. For example, as illustrated in Figure 1, assuming UnitedDeployment manages three subsets and their replica numbers are 4, 2, 2 respectively, a rollout controller can realize a canary release plan of upgrading 50% of Pods in each subset at a time by setting subset partitions to 2, 1, 1 respectively. The same cannot be easily achieved by using a single workload controller like StatefulSet or Deployment.

Multi-Cluster application management (In future)

UnitedDeployment can be extended to support multi-cluster workload management. The idea is that Subsets may not only reside in one cluster, but also spread over multiple clusters. More specifically, domain topology specification will associate a ClusterRegistryQuerySpec, which describes the clusters that UnitedDeployment may distribute Pods to. Each cluster is represented by a custom resource managed by a ClusterRegistry controller using Kubernetes cluster registry APIs.

type Topology struct {
  // ClusterRegistryQuerySpec is used to find the all the clusters that
  // the workload may be deployed to. 
  ClusterRegistry *ClusterRegistryQuerySpec
  // Contains the details of each subset including the target cluster name and
  // the node selector in target cluster.
  Subsets []Subset
}

type ClusterRegistryQuerySpec struct {
  // Namespaces that the cluster objects reside.
  // If not specified, default namespace is used.
  Namespaces []string
  // Selector is the label matcher to find all qualified clusters.
  Selector   map[string]string
  // Describe the kind and APIversion of the cluster object.
  ClusterType metav1.TypeMeta
}

type Subset struct {
  Name string

  // The name of target cluster. The controller will validate that
  // the TargetCluster exits based on Topology.ClusterRegistry.
  TargetCluster *TargetCluster

  // Indicate the node select strategy in the Subset.TargetCluster.
  // If Subset.TargetCluster is not set, node selector strategy refers to
  // current cluster.
  NodeSelector corev1.NodeSelector

  Replicas *intstr.IntOrString 
}

type TargetCluster struct {
  // Namespace of the target cluster CRD
  Namespace string
  // Target cluster name
  Name string
}

A new TargetCluster field is added to the Subset API. If it presents, the NodeSelector indicates the node selection logic in the target cluster. Now UnitedDeployment controller can distribute application Pods to multiple clusters by instantiating a StatefulSet workload in each target cluster with a specific replica number (or a percentage of total replica), as illustrated in Figure 2.

multi-cluster controller

At a first glance, UnitedDeployment looks more like a federation controller following the design pattern of Kubefed, but it isn't. The fundamental difference is that Kubefed focuses on propagating arbitrary object types to remote clusters instead of managing an application across clusters. In this example, had a Kubefed style controller been used, each StatefulSet workload in individual cluster would have a replica of 100. UnitedDeployment focuses more on providing the ability of managing multiple workloads in multiple clusters on behalf of one application, which is absent in Kubernetes community to the best of our knowledge.

Summary

This blog post introduces UnitedDeployment, a new controller which helps managing application spread over multiple domains (in arbitrary clusters). It not only allows evenly distributing Pods over AZs, which arguably can be more efficiently done using the new Pod Topology Spread Constraint APIs though, but also enables flexible workload deployment/rollout and supports multi-cluster use cases in the future.

Learning Concurrent Reconciling

November 10, 2019 · 4 min read

Fei Guo

Maintainer of OpenKruise

The concept of controller in Kubernete is one of the most important reasons that make it successful. Controller is the core mechanism that supports Kubernetes APIs to ensure the system reaches the desired state. By leveraging CRDs/controllers and operators, it is fairly easy for other systems to integrate with Kubernetes.

Controller runtime library and the corresponding controller tool KubeBuilder are widely used by many developers to build their customized Kubernetes controllers. In Kruise project, we also use Kubebuilder to generate scaffolding codes that implement the "reconciling" logic. In this blog post, I will share some learnings from Kruise controller development, particularly, about concurrent reconciling.

Some people may already notice that controller runtime supports concurrent reconciling. Check for the options (source) used to create new controller:

type Options struct {
	// MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1.
	MaxConcurrentReconciles int

	// Reconciler reconciles an object
	Reconciler reconcile.Reconciler
}

Concurrent reconciling is quite useful when the states of the controller's watched objects change so frequently that a large amount of reconcile requests are sent to and queued in the reconcile queue. Multiple reconcile loops do help drain the reconcile queue much more quickly compared to the default single reconcile loop case. Although this is a great feature for performance, without digging into the code, an immediate concern that a developer may raise is that will this introduce consistency issue? i.e., is it possible that two reconcile loops handle the same object at the same time?

The answer is NO, as you may expect. The "magic" is enforced by the workqueue implementation in Kubernetes client-go, which is used by controller runtime reconcile queue. The workqueue algorithm (source) is demonstrated in Figure 1.

workqueue

Basically, the workqueue uses a queue and two sets to coordinate the process of handling multiple reconciling requests against the same object. Figure 1(a) presents the initial state of handling four reconcile requests, two of which target the same object A. When a request arrives, the target object is first added to the dirty set or dropped if it presents in dirty set, and then pushed to the queue only if it is not presented in processing set. Figure 1(b) shows the case of adding three requests consecutively. When a reconciling loop is ready to serve a request, it gets the target object from the front of the queue. The object is also added to the processing set and removed from the dirty set (Figure 1(c)). Now if a request of the processing object arrives, the object is only added to the dirty set, not to the queue (Figure 1(d)). This guarantees that an object is only handled by one reconciling loop. When reconciling is done, the object is removed from the processing set. If the object is also shown in the dirty set, it is added back to the back of the queue (Figure 1(e)).

The above algorithm has following implications:

It avoids concurrent reconciling for the same object.
The object processing order can be different from arriving order even if there is only one reconciling thread. This usually would not be a problem since the controller still reconciles to the final cluster state. However, the out of order reconciling may cause a significant delay for a request. .... For example, as illustrated in Figure 2, assuming there is only one reconciling thread and two requests targeting the same object A arrive, one of them will be processed and object A will be added to the dirty set (Figure 2(b)). If the reconciling takes a long time and during which a large number of new reconciling requests arrive, the queue will be filled up by the new requests (Figure 2(c)). When reconciling is done, object A will be added to the back of the queue (Figure 2(d)). It would not be handled until all the requests coming after had been handled, which can cause a noticeable long delay. The workaround is actually simple - USE CONCURRENT RECONCILES. Since the cost of an idle go routine is fairly small, the overhead of having multiple reconcile threads is low even if the controller is idle. It seems that the MaxConcurrentReconciles value should be overwritten to a value larger than the default 1 (CloneSet uses 10 for example).
Last but not the least, reconcile requests can be dropped (if the target exists in dirty set). This means that we cannot assume that the controller can track all the object state change events. Recalling a presentation given by Tim Hockin, Kubernetes controller is level triggered, not edge triggered. It reconciles for state, not for events.

Thanks for reading the post, hope it helps.

Kruise Workload Classification Guidance

October 10, 2019 · 6 min read

Fei Guo

Maintainer of OpenKruise

Siyu Wang

Maintainer of OpenKruise

Kubernetes does not provide a clear guidance about which controller is the best fit for a user application. Sometimes, this does not seem to be a big problem if users understand both the application and workload well. For example, users usually know when to choose Job/CronJob or DaemonSet since the concepts of these workload are straightforward - the former is designed for temporal batch style applications and the latter is suitable for long running Pod which is distributed in every node. On the other hand, the usage boundary between Deployment and StatefulSet is vague. An application managed by a Deployment conceptually can be managed by a StatefulSet as well, the opposite may also apply as long as the Pod OrderedReady capability of StatefulSet is not mandatory. Furthermore, as more and more customized controllers/operators become available in Kubernetes community, finding suitable controller can be a nonnegligible user problem especially when some controllers have functional overlaps.

Kruise attempts to mitigate the problem from two aspects:

Carefully design the new controllers in the Kruise suite to avoid unnecessary functional duplications that may confuse users.
Establish a classification mechanism for existing workload controllers so that user can more easily understand the use cases of them. We will elaborate this more in this post. The first and most intuitive criterion for classification is the controller name.

Controller Name Convention

An easily understandable controller name can certainly help adoption. After consulting with many internal/external Kubernetes users, we decide to use the following naming conventions in Kruise. Note that these conventions are not contradicted with the controller names used in upstream controllers.

Set -suffix names: This type of controller manages Pods directly. Examples include CloneSet, ReplicaSet and SidecarSet. It supports various depolyment/rollout strategies in Pod level.
Deployment -suffix names: This type of controller does not manage Pods directly. Instead, it manages one or many Set -suffix workload instances which are created on behalf of one application. The controller can provide capabilities to orchestrate the deployment/rollout of multiple instances. For example, Deployment manages ReplicaSet and provides rollout capability which is not available in ReplicaSet. UnitedDeployment (planned in M3 release) manages multiple StatefulSet created in respect of multiple domains (i.e., fault domains) within one cluster.
Job -suffix names: This type of controller manages batch style applications with different depolyment/rollout strategies. For example, BroadcastJob distributes a job style Pod to every node in the cluster.

Set, Deployment and Job are widely adopted terms in Kubernetes community. Kruise leverages them with certain extensions.

Can we further distinguish controllers with the same name suffix? Normally the string prior to the suffix should be self-explainable, but in many cases it is hard to find a right word to describe what the controller does. Check to see how StatefulSet is originated in this thread. It takes four months for community to decide to use the name StatefulSet to replace the original name PetSet although the new name still confuse people by looking at its API documentation. This example showcases that sometimes a well-thought-out name may not be helpful to identify controller. Again, Kruise does not plan to resolve this problem. As an incremental effort, Kruise considers the following criterion to help classify Set -suffix controllers.

Fixed Pod Name

One unique property of StatefulSet is that it maintains consistent identities for Pod network and storage. Essentially, this is done by fixing Pod names. Pod name can identify both network and storage since it is part of DNS record and can be used to name Pod volume claim. Why is this property needed given that all Pods in StatefulSet are created from the same Pod template? A well known use case is to manage distributed coordination server application such as etcd or Zookeeper. This type of application requires the cluster member (i.e., the Pod) to access the same data (in Pod volume) whenever a member is reconstructed upon failure, in order to function correctly. To differentiate the term State in StatefulSet from the same term used in other computer science areas, I'd like to associate State with Pod name in this document. That being said, controllers like ReplicaSet and DaemonSet are Stateless since they don't require to reuse the old Pod name when a Pod is recreated.

Supporting Stateful does lead to inflexibility for controller. StatefulSet relies on ordinal numbers to realize fixing Pod names. The workload rollout and scaling has to be done in a strict order. As a consequence, some useful enhancements to StatefulSet become impossible. For example,

Selective Pod upgrade and Pod deletion (when scale in). These features can be helpful when Pods are spread across different regions or fault domains.
The ability of taking control over existing Pods with arbitrary names. There are cases where Pod creation is done by one controller but Pod lifecycle management is done by another controller (e.g., StatefulSet).

We found that many containerized applications do not require the Stateful property of fixing Pod names, and StatefulSet is hard to be extended for those applications in many cases. To fill the gap, Kruise has released a new controller called CloneSet to manage the Stateless applications. In a nutshell, CloneSet provides PVC support and enriched rollout and management capabilities. The following table roughly compares Advanced StatefulSet and CloneSet in a few aspects.

Features	Advanced StatefulSet	CloneSet
PVC	Yes	Yes
Pod name	Ordered	Random
Inplace upgrade	Yes	Yes
Max unavailable	Yes	Yes
Selective deletion	No	Yes
Selective upgrade	No	Yes
Change Pod ownership	No	Yes

Now, a clear recommendation to Kruise users is if your applications require fixed Pod names (identities for Pod network and storage), you can start with Advanced StatefulSet. Otherwise, CloneSet is the primary choice of Set -suffix controllers (if DaemonSet is not applicable).

Summary

Kruise aims to provide intuitive names for new controllers. As a supplement, this post provides additional guidance for Kruise users to pick the right controller for their applications. Hope it helps!

Overview of Elastic Scenarios

Capabilities and Advantageous Scenarios of Two Components

WorkloadSpread: An Elastic Strategy Plugin Based on Pod Mutating Webhook

Example Configuration​

Powerful Partitioning Capability​

Flexible Scheduling Configuration​

Detailed Pod Customization​

WorkloadSpread's Pod Mutating Webhook Mechanism​

Limitations of WorkloadSpread​

Potential Risks of Webhook​

Limitations of Acting on Pods​

Case Study 1: Bandwidth Package Allocation in Large-Scale Load Testing​

Case Study 2: Compatibility for Scaling Managed K8S Cluster Services to Serverless Instances​

UnitedDeployment: A Native Workload with Built-in Elasticity

Example Configuration​

Advantages of UnitedDeployment​

All-In-One Elastic Application Management​

Advanced Subset Management​

Adaptive Elasticity​

Limitations of UnitedDeployment​

Case Study 1: Elastic Scaling of Pods to Virtual Nodes with Adaptation for Serverless Containers​

Case Study 2: Allocating Different Resources to Pods with Different CPU Types​

Summary

Background​

Introduction​

Comparison with related works​

「1」Pod Topology Spread Constrains​

「2」UnitedDeploymen​

Use Case​

「1」Deploy 100 pods to normal node pool, rest pods to elastic node pool​

「1」Deploy pods to normal node pool first, to elastic resource pool when normal node pool is insufficient​

「3」Scatter to 3 zones, the scale is 1:1:3​

「4」Configures different resource quotas on different CPU architecture​

Implementation​

「1」 How to decide the priority when scaling up?​

「2」 How to decide the priority when scaling down?​

「3」 How to solve the counting problems?​

3.1 solving concurrency consistency problem​

3.2 solving data consistency problem​

「4」How to do if pod schedule failed?​

Conclusion​

Reference​

Background - How to Upgrading Mesh Containers Independently

SidecarSet Helps Lossless Hot Upgrade of Mesh Containers

Inject Sidecar Containers of the Hot Upgrade Type​

Smooth Mesh Container Upgrade​

Core Logic of Migration​

Migration Demo​

Successful Mesh Container Hot Upgrade Cases

Summary

Background

OpenKruise SidecarSet

Sidecar logging architectures​

Automatic Injection​

Pod selector​

Data volume sharing​

Share environment variables​

In-place Upgrade​

Gray Release​

Canary Release​

Scatter Release​

Summary

Using Subsets to describe domain topology​

Customized subset rollout Partitions​

Multi-Cluster application management (In future)​

Summary​

Controller Name Convention​

Fixed Pod Name​

Summary​

Example Configuration

Powerful Partitioning Capability

Flexible Scheduling Configuration

Detailed Pod Customization

WorkloadSpread's Pod Mutating Webhook Mechanism

Limitations of WorkloadSpread

Potential Risks of Webhook

Limitations of Acting on Pods

Case Study 1: Bandwidth Package Allocation in Large-Scale Load Testing

Case Study 2: Compatibility for Scaling Managed K8S Cluster Services to Serverless Instances

Example Configuration

Advantages of UnitedDeployment

All-In-One Elastic Application Management

Advanced Subset Management

Adaptive Elasticity

Limitations of UnitedDeployment

Case Study 1: Elastic Scaling of Pods to Virtual Nodes with Adaptation for Serverless Containers

Case Study 2: Allocating Different Resources to Pods with Different CPU Types

Background

Introduction

Comparison with related works

「1」Pod Topology Spread Constrains

「2」UnitedDeploymen

Use Case

「1」Deploy 100 pods to normal node pool, rest pods to elastic node pool

「1」Deploy pods to normal node pool first, to elastic resource pool when normal node pool is insufficient

「3」Scatter to 3 zones, the scale is 1:1:3

「4」Configures different resource quotas on different CPU architecture

Implementation

「1」 How to decide the priority when scaling up?

「2」 How to decide the priority when scaling down?

「3」 How to solve the counting problems?

3.1 solving concurrency consistency problem

3.2 solving data consistency problem

「4」How to do if pod schedule failed?

Conclusion

Reference

Inject Sidecar Containers of the Hot Upgrade Type

Smooth Mesh Container Upgrade

Core Logic of Migration

Migration Demo

Sidecar logging architectures

Automatic Injection

Pod selector

Data volume sharing

Share environment variables

In-place Upgrade

Gray Release

Canary Release

Scatter Release

Using `Subsets` to describe domain topology

Customized subset rollout `Partitions`

Multi-Cluster application management (In future)

Summary

Controller Name Convention

Fixed Pod Name

Summary