
A Flexible and Configurable Serverless Elastic Solution at the Workload Level

· 13 min read
Tianyun Zhong
Member of OpenKruise

Serverless is an extension of cloud computing that inherits its most significant feature: on-demand elastic scaling. This model lets developers focus on application logic without worrying about deployment resources, fully leveraging resource scalability to provide superior elasticity, while enterprises benefit from genuine pay-as-you-go billing. Consequently, more cloud providers are converging on this architectural paradigm.

The core capability of "flexible configurability" in Serverless technology focuses on enabling specific cloud usage scenarios to fully utilize cloud resources through simple, minimally invasive, and highly configurable methods. Its essence lies in resolving the conflict between capacity planning and actual cluster load configuration. This article will sequentially introduce two configurable components — WorkloadSpread and UnitedDeployment — discussing their core capabilities, technical principles, advantages and disadvantages, as well as real-world applications. Through these discussions, we aim to share OpenKruise's technical evolution and considerations in addressing Serverless workload elasticity.

Overview of Elastic Scenarios

As Serverless technology matures, more enterprises prefer using cloud resources (such as Alibaba Cloud ACS Serverless container instances) over on-premise resources (like managed resource pools or on-premise IDC data centers) to host applications with temporary, tidal, or bursty characteristics. This approach enhances resource utilization efficiency and reduces overall costs by adopting a pay-as-you-go model. Below are some typical elastic scenarios:

  1. Prioritize using on-premise resources in offline IDC data centers; scale the application to the cloud when resources are insufficient.
  2. Prefer using a pre-paid resource pool in the cloud; use pay-as-you-go Serverless instances for additional replicas when resources are insufficient.
  3. Use high-quality stable compute power (e.g., dedicated cloud server instances) first; then use lower-quality compute power (e.g., Spot instances).
  4. Configure different resource quantities for container replicas deployed on different compute platforms (e.g., X86, ARM, Serverless instances) to achieve similar performance.
  5. Inject different middleware configurations into replicas deployed on nodes versus Serverless environments (e.g., shared Daemon on nodes, Sidecar injection on Serverless).

The components introduced in this article offer distinct advantages in solving the above problems. Users can choose the appropriate capabilities based on their specific scenarios to effectively leverage elastic compute power.

Capabilities and Advantageous Scenarios of Two Components

  • WorkloadSpread: Utilizes a Mutating Webhook to intercept Pod creation requests that meet certain criteria and apply Patch operations to inject differentiated configurations. Suitable for existing applications requiring multiple elastic partitions with customized Pod Metadata and Spec fields.
  • UnitedDeployment: A workload with built-in capability of elastic partitioning and pod customization, offering stronger elasticity and capacity planning capabilities. Ideal for new applications needing detailed partitioning and individual configurations for each partition.

WorkloadSpread: An Elastic Strategy Plugin Based on Pod Mutating Webhook

WorkloadSpread is a bypass component provided by the OpenKruise community that spreads target workload Pods across different types of subsets according to specific rules, enhancing multi-region and elastic deployment capabilities without modifying the original workload. It supports almost all native or custom Kubernetes workloads, ensuring adaptability and flexibility in various environments.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef: # Supports almost all native or custom Kubernetes workloads
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
    - name: subset-a
      # The first three replicas will be scheduled to this Subset
      maxReplicas: 3
      # Pod affinity configuration
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - zone-a
      patch:
        # Inject a custom label into Pods scheduled to this Subset
        metadata:
          labels:
            xxx-specific-label: xxx
    - name: subset-b
      # Deploy to a Serverless cluster: no capacity limit, unlimited replicas
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - acs-cn-hangzhou
  scheduleStrategy:
    # Adaptive mode reschedules failed Pods to other Subsets
    type: Adaptive | Fixed
    adaptive:
      rescheduleCriticalSeconds: 30

Powerful Partitioning Capability

WorkloadSpread spreads Pods into different elastic partitions using Subsets, scaling up in the defined Subset order and scaling down in reverse order.

Flexible Scheduling Configuration

At the Subset level, WorkloadSpread supports selecting nodes via Labels and configuring advanced options such as taints and tolerations. For example, requiredNodeSelectorTerm specifies mandatory node attributes, preferredNodeSelectorTerms sets preferred node attributes, and tolerations configures Pod tolerance for node taints. These configurations allow precise control over Pod scheduling and distribution.
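
For illustration, a single subset combining these options might look like the sketch below (the spot-related label and taint are hypothetical examples, not fields required by WorkloadSpread):

subsets:
  - name: subset-spot
    # Hard requirement: only nodes in zone-a
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-a
    # Soft preference: favor nodes carrying a (hypothetical) spot label
    preferredNodeSelectorTerms:
      - weight: 100
        preference:
          matchExpressions:
            - key: node.example.com/capacity-type # hypothetical label
              operator: In
              values:
                - spot
    # Tolerate the (hypothetical) taint carried by spot nodes
    tolerations:
      - key: example.com/spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule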

At the global level, WorkloadSpread supports two scheduling strategies via the scheduleStrategy field: Fixed and Adaptive. The Fixed strategy ensures strict adherence to predefined Subset distributions, while the Adaptive strategy provides higher flexibility by automatically rescheduling Pods to other available Subsets when necessary.

Detailed Pod Customization

In Subset configurations, the patch field allows for fine-grained customization of Pods scheduled to that subset. Supported fields include container images, resource limits, environment variables, volume mounts, startup commands, probe configurations, and labels. This decouples Pod specifications from environment adaptations, enabling flexible workload adjustments for various partition environments.

...
# patch pod with a topology label:
patch:
  metadata:
    labels:
      topology.application.deploy/zone: "zone-a"
...

The example above demonstrates how to add or modify a label to all Pods in a Subset.

...
# patch pod container resources:
patch:
  spec:
    containers:
      - name: main
        resources:
          limits:
            cpu: "2"
            memory: 800Mi
...

The example above demonstrates how to modify fields in the Pod Spec, such as container resource limits.

...
# patch pod container env with a zone name:
patch:
  spec:
    containers:
      - name: main
        env:
          - name: K8S_AZ_NAME
            value: zone-a
...

The example above demonstrates how to add or modify a container environment variable.

WorkloadSpread's Pod Mutating Webhook Mechanism

WorkloadSpread operates directly on Pods created by the target workload via a Pod Mutating Webhook, ensuring non-intrusive operation. When a Pod creation request meets the criteria, the Webhook intercepts it, reads the corresponding WorkloadSpread configuration, selects an appropriate Subset, and modifies the Pod configuration accordingly. The controller maintains the controller.kubernetes.io/pod-deletion-cost annotation to ensure the correct scale-down order.
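
Conceptually, a Pod admitted into subset-a of the earlier demo would then carry the subset's patched fields plus the deletion-cost annotation; a simplified, illustrative view of its metadata:

# Illustrative only: metadata of a Pod admitted into subset-a
metadata:
  labels:
    xxx-specific-label: xxx # injected by the subset's patch
  annotations:
    # maintained by the WorkloadSpread controller; higher cost means deleted later
    controller.kubernetes.io/pod-deletion-cost: "200"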

Limitations of WorkloadSpread

Potential Risks of Webhook

WorkloadSpread depends on Pod Mutating Webhook to function, which intercepts all Pod creation requests in the cluster. If the Webhook Pod (kruise-manager) experiences performance issues or failures, it may prevent new Pods from being created. Additionally, during large-scale scaling operations, Webhook can become a performance bottleneck.

Limitations of Acting on Pods

While acting on Pods reduces business intrusion, it introduces limitations. For instance, CloneSet's gray release ratio cannot be controlled per Subset.

Case Study 1: Bandwidth Package Allocation in Large-Scale Load Testing

A company needed to perform load testing before a major shopping festival. They developed a load-agent program to generate requests and used a CloneSet to manage agent replicas. To save costs, they purchased 10 shared bandwidth packages (each supporting 300 Pods) and aimed to dynamically allocate them to elastic agent replicas.

They configured a WorkloadSpread with 11 Subsets: the first 10 Subsets had a capacity of 300 and patched Pod Annotations to bind specific bandwidth packages; the last Subset had no capacity and no bandwidth package, preventing extra bandwidth allocation if more than 3000 replicas were created.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: bandwidth-spread
  namespace: loadtest
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: load-agent-XXXXX
  subsets:
    - name: bandwidthPackage-1
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id1>
    - ...
    - name: bandwidthPackage-10
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id10>
    - name: no-eip

Case Study 2: Compatibility for Scaling Managed K8S Cluster Services to Serverless Instances

A company had a web service running in an IDC that needed to scale up due to business growth, but the local data center could not be expanded. They chose to use virtual nodes to access cloud-based Serverless elastic compute power, forming a hybrid cloud. Their application used acceleration services such as Fluid, which were pre-deployed on nodes in the IDC but not available in the serverless subset, so a sidecar had to be injected into cloud Pods to provide acceleration capabilities.

To achieve this without modifying the existing Deployment's 8 replicas, they used WorkloadSpread to add a label to Pods scaled to each subset, which controlled the Fluid sidecar injection.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: data-processor-spread
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  subsets:
    - name: local
      maxReplicas: 8
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "false"
    - name: aliyun-acs
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "true"

UnitedDeployment: A Native Workload with Built-in Elasticity

UnitedDeployment is an advanced workload provided by the OpenKruise community that natively supports partition management. Unlike WorkloadSpread, which enhances basic workloads, UnitedDeployment offers a new mode for managing partitioned elastic applications. It defines applications through a single template, and the controller creates and manages multiple secondary workloads to match different subsets. UnitedDeployment manages the entire lifecycle of applications within a single resource, including definition, partitioning, scaling, and upgrades.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: sample-ud
spec:
  replicas: 6
  selector:
    matchLabels:
      app: sample
  template:
    cloneSetTemplate:
      metadata:
        labels:
          app: sample
      spec:
        # CloneSet Spec
        ...
  topology:
    subsets:
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - ecs
        maxReplicas: 2
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - acs-virtual-kubelet

Advantages of UnitedDeployment

All-In-One Elastic Application Management

UnitedDeployment offers comprehensive all-in-one application management, enabling users to define applications, manage subsets, scale, and upgrade using a single resource.

The UnitedDeployment controller manages a corresponding type of secondary workload for each subset based on the workload template, without requiring additional attention from the user. Users only need to manage the application template and subsets; the UnitedDeployment controller will handle subsequent management tasks for each secondary workload, including creation, modification, and deletion. The controller also monitors the status of Pods created by these workloads when necessary to make corresponding adjustments.

It is the secondary workload controllers that implement the specific scaling and update operations, so scaling and updating through UnitedDeployment produces exactly the same effect as using the corresponding workload directly. For example, a UnitedDeployment created with a CloneSet template inherits CloneSet's grayscale release and in-place upgrade capabilities.
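
For instance, a CloneSet-based template can carry CloneSet's own update strategy, which the secondary workloads then execute as usual. A minimal sketch, assuming the standard CloneSet updateStrategy fields:

template:
  cloneSetTemplate:
    metadata:
      labels:
        app: sample
    spec:
      updateStrategy:
        # In-place upgrade and grayscale control come from CloneSet itself
        type: InPlaceIfPossible
        partition: 3 # keep 3 Pods on the old revision during a gray release
      # ... remaining CloneSet spec omitted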

Advanced Subset Management

UnitedDeployment incorporates two capacity allocation algorithms, enabling users to handle various scenarios of elastic applications through detailed subset capacity configurations.

The elastic allocation algorithm implements a classic elastic capacity allocation method similar to WorkloadSpread: by setting upper and lower capacity limits for each subset, Pods are scaled up in the defined order of subsets and scaled down in reverse order. This method has been thoroughly introduced earlier, so it will not be elaborated further here.

The specified allocation algorithm represents a new approach to capacity allocation. It directly assigns fixed numbers or percentages to some subsets and reserves at least one elastic subset to distribute the remaining replicas.
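
As a sketch, the subset-level replicas field (also used in Case Study 2 below) can pin fixed counts or percentages, while a subset without replicas absorbs the remainder; the names here are illustrative:

topology:
  subsets:
    - name: fixed-pool
      replicas: 4 # always holds exactly 4 replicas
      # nodeSelectorTerm omitted
    - name: quarter-pool
      replicas: 25% # always holds 25% of the total replicas
      # nodeSelectorTerm omitted
    - name: elastic-pool
      # no replicas field: receives all remaining replicas
      # nodeSelectorTerm omitted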

In addition to capacity allocation, UnitedDeployment also allows customizing any Pod Spec field (including container images) for each subset, similar to WorkloadSpread. This gives UnitedDeployment's subset configuration powerful flexibility.

Adaptive Elasticity

UnitedDeployment offers robust adaptive elasticity, automating scaling and rescheduling operations to reduce operational overhead. It supports Kubernetes Horizontal Pod Autoscaler (HPA), enabling automatic scaling based on predefined conditions while adhering strictly to subset configurations.

UnitedDeployment also offers adaptive Pod rescheduling capabilities similar to WorkloadSpread. Additionally, it allows configuring a timeout for scheduling failures and a recovery time for subsets to return from the unschedulable status, providing finer control over adaptive scheduling.

Limitations of UnitedDeployment

The many advantages of UnitedDeployment stem from its all-in-one management capability as an independent workload. However, this also makes adoption more intrusive: for existing applications, PaaS systems and tools (such as operations and release systems) must be modified to switch from workloads like Deployment or CloneSet to UnitedDeployment.

Case Study 1: Elastic Scaling of Pods to Virtual Nodes with Adaptation for Serverless Containers

Cloud providers typically offer three types of Kubernetes services:

  1. Managed clusters with fixed nodes using cloud servers purchased by users.
  2. Serverless clusters delivering container computing power directly via virtual node technology.
  3. Hybrid clusters containing both managed nodes and virtual nodes.

In this case, a company planned to launch a new service with significant peak-to-valley traffic differences (up to tenfold). To handle this characteristic, they purchased a batch of cloud servers to form a managed cluster nodepool for handling baseline traffic and intended to quickly scale out new replicas to a serverless subset during peak hours. Additionally, their application required extra configuration to run in the Serverless environment. Below is an example configuration:

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: elastic-app
spec:
  # Omitted business workload template
  ...
  topology:
    # Enable Adaptive scheduling to dispatch Pod replicas to ECS node pools and ACS instances adaptively
    scheduleStrategy:
      type: Adaptive
      adaptive:
        # Start scheduling to ACS Serverless instances 10 seconds after ECS node scheduling failure
        rescheduleCriticalSeconds: 10
        # Do not schedule to ECS nodes within one hour after the above scheduling failure
        unschedulableLastSeconds: 3600
    subsets:
      # Prioritize ECS without an upper limit; only schedule to ACS when ECS fails
      # During scale-in, delete ACS instances first, then ECS node pool Pods
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: NotIn
              values:
                - acs-virtual-kubelet
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: In
              values:
                - acs-virtual-kubelet
        # Use patch to modify environment variables for Pods scheduled to elastic compute power, enabling Serverless mode
        patch:
          spec:
            containers:
              - name: main
                env:
                  - name: APP_RUNTIME_MODE
                    value: SERVERLESS
---
# Combine with HPA for automatic scaling
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: elastic-app-hpa
spec:
  minReplicas: 1
  maxReplicas: 100
  metrics:
    - resource:
        name: cpu
        targetAverageUtilization: 2
      type: Resource
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: UnitedDeployment
    name: elastic-app

Case Study 2: Allocating Different Resources to Pods with Different CPU Types

In this case, a company purchased several cloud server instances with Intel, AMD, and ARM platform CPUs to prepare for launching a new service. They wanted Pods scheduled on different platforms to exhibit similar performance. After stress testing, it was found that, compared to Intel CPUs as the benchmark, AMD platforms needed more CPU cores, while ARM platforms required more memory.

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    deploymentTemplate:
      ... # Omitted business workload template
  topology:
    # Intel, AMD, and Yitian 710 ARM machines carry 50%, 25%, and 25% of the replicas respectively
    subsets:
      - name: intel
        replicas: 50%
        nodeSelectorTerm:
          ... # Select the Intel node pool through labels
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 4000Mi
      - name: amd64
        replicas: 25%
        nodeSelectorTerm:
          ... # Select the AMD node pool through labels
        # Allocate more CPU to the AMD platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 3000m
                    memory: 4000Mi
      - name: yitian-arm
        replicas: 25%
        nodeSelectorTerm:
          ... # Select the ARM node pool through labels
        # Allocate more memory to the ARM platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 6000Mi

Summary

Elastic computing power can significantly reduce business costs and effectively increase the performance ceiling of services. To make good use of elastic computing power, it is necessary to choose appropriate elastic components based on specific application characteristics. The following table summarizes the capabilities of the two components introduced in this article, hoping to provide some reference.

| Component | Partition Principle | Ease of Modification | Granularity of Partition | Elasticity Capability |
| --- | --- | --- | --- | --- |
| WorkloadSpread | Modify Pods via Webhook | High | Medium | Medium |
| UnitedDeployment | Create multiple workloads via templates | Low | High | High |

WorkloadSpread: Interpreting a New Feature in OpenKruise v0.10.0

· 12 min read
GuangLei Cao
Contributor of OpenKruise
Weixiang Sun
Member of OpenKruise

Background

Deploying an application in different zones, on different hardware types, and even in different clusters and cloud vendors is becoming a very common requirement with the development of cloud-native techniques. For example:

  1. Disaster-tolerance cases:
  • Application pods are scattered across nodes to avoid stacking.
  • Application pods are scattered across available zones.
  • Different nodes/zones/domains require different numbers of pods.
  2. Cost-control cases:
  • People deploy an application preferentially to their own resource pool, and then to an elastic resource pool, such as ECI on Aliyun and Fargate on AWS, when their own resources are insufficient. When scaling down, the elastic nodes are removed first to save cost.

In most cases, people split their application into multiple workloads (such as several Deployments) to deploy. However, this solution often requires manual management by the SRE team, or a deeply customized PaaS to support careful management of multiple workloads for a single application.

In order to solve this problem, the WorkloadSpread feature was proposed in OpenKruise v0.10.0. It supports multiple kinds of workloads, such as Deployment, ReplicaSet, Job, and CloneSet, to manage partitioned deployment or elastic scaling. The application scenarios and implementation principles of WorkloadSpread are introduced in detail below to help users better understand this feature.


Introduction

More details about WorkloadSpread can be found in the official documentation.

In short, WorkloadSpread can distribute pods of a workload to different types of nodes according to certain rules, so as to meet the above partitioning and elasticity scenarios. WorkloadSpread is non-invasive and "plug and play", and it takes effect even on existing workloads.


Let's make a simple comparison with some related works in the community.

「1」Pod Topology Spread Constraints

Pod topology spread constraints are a solution provided by the Kubernetes community. They scatter pods horizontally according to a topology key: when users define such a rule, the scheduler selects nodes that match the configured conditions.

Since Pod Topology Spread disperses pods evenly, it cannot support exact customized partition numbers or proportions. Furthermore, the pod distribution may be destroyed when scaling down. Using WorkloadSpread avoids these problems.
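
For contrast, a minimal native constraint looks like this (standard Kubernetes API; the app label and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: my-app
spec:
  topologySpreadConstraints:
    - maxSkew: 1 # pod counts across zones may differ by at most 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
  containers:
    - name: main
      image: nginx # placeholder image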

「2」UnitedDeployment

UnitedDeployment is a solution provided by the OpenKruise community. It can manage pods in multiple regions by creating and managing multiple workloads.

UnitedDeployment supports the requirements of partitioning and elasticity very well. However, it is a new workload, so the adoption and migration costs are relatively high, whereas WorkloadSpread is a lightweight solution that only needs a simple configuration associated with the workload.


Use Case

In this section, I will list some application scenarios of WorkloadSpread and give the corresponding configurations to help users quickly understand this feature.

「1」Deploy 100 pods to the normal node pool, the rest to the elastic node pool


subsets:
  - name: subset-normal
    maxReplicas: 100
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - normal
  - name: subset-elastic
    # maxReplicas==nil means no limit on replicas
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - elastic

When the workload has 100 or fewer replicas, all pods are deployed to the normal node pool; beyond 100, the extra pods go to the elastic node pool. When scaling down, the pods on elastic nodes are deleted first.

Since WorkloadSpread only limits the distribution of the workload without invading it, users can also combine it with HPA to dynamically adjust the number of replicas according to resource load, as sketched below.

In this way, pods are automatically scheduled to the elastic node pool at peak traffic, and the resources in the elastic pool are released first once the peak has passed.
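
A minimal HPA sketch for such a setup, assuming the target is a Deployment named web-app (all names and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app # WorkloadSpread reshapes the pods this HPA adds or removes
  minReplicas: 100 # baseline held by the normal node pool
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70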

「2」Deploy pods to the normal node pool first, then to the elastic resource pool when the normal pool is insufficient


scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
  - name: subset-normal
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - normal
  - name: subset-elastic
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - elastic

Both subsets have no limit on the number of replicas, and the Adaptive rescheduling policy is enabled. The goal is to deploy preferentially to the normal node pool; when normal resources are insufficient, the webhook selects elastic nodes through simulated scheduling. When a pod in the normal node pool stays Pending beyond the 30s threshold, the WorkloadSpread controller deletes it to trigger reconstruction, and the new pod is scheduled to the elastic node pool. When scaling down, the pods on elastic nodes are also deleted first to save costs.

「3」Scatter across 3 zones in a 1:1:3 ratio


subsets:
  - name: subset-a
    maxReplicas: 20%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-a
  - name: subset-b
    maxReplicas: 20%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-b
  - name: subset-c
    maxReplicas: 60%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-c

WorkloadSpread ensures that the pods are scheduled according to the defined proportion when scaling up and down.

「4」Configure different resource quotas for different CPU architectures


subsets:
  - name: subset-x86-arch
    # maxReplicas...
    # requiredNodeSelectorTerm...
    patch:
      metadata:
        labels:
          resource.cpu/arch: x86
      spec:
        containers:
          - name: main
            resources:
              limits:
                cpu: "500m"
                memory: "800Mi"
  - name: subset-arm-arch
    # maxReplicas...
    # requiredNodeSelectorTerm...
    patch:
      metadata:
        labels:
          resource.cpu/arch: arm
      spec:
        containers:
          - name: main
            resources:
              limits:
                cpu: "300m"
                memory: "600Mi"

In the above example, we patched different labels and container resources for the pods of the two subsets, which makes fine-grained pod management convenient. When workload pods are distributed on nodes with different CPU architectures, configuring different resource quotas makes better use of the hardware.


Implementation

WorkloadSpread is a pure bypass elastic/topology control solution. Users only need to create a WorkloadSpread configuration for their Deployment/CloneSet/Job/ReplicaSet workloads. There is no need to change the workloads themselves, and using WorkloadSpread incurs no additional cost.


「1」 How to decide the priority when scaling up?

Multiple subsets are defined in WorkloadSpread, and each subset represents a logical domain. Users can freely define subsets according to node configuration, hardware type, zone, etc. In particular, we defined the priority of subsets:

  • The priority is defined from high to low in the order from front to back, for example subset[i] has higher priority than subset[j] if i < j.

  • The pods will be scheduled to the subsets with higher priority first.

「2」 How to decide the priority when scaling down?

Theoretically, the bypass solution of WorkloadSpread cannot interfere with the scaling logic in the workload controller.

However, this problem has recently been solved. Thanks to the unremitting feedback from users, Kubernetes has supported, since version 1.21, specifying the "deletion cost" of ReplicaSet (Deployment) pods via the annotation controller.kubernetes.io/pod-deletion-cost: the higher the deletion cost, the lower the deletion priority.

Since OpenKruise v0.9.0, the deletion cost feature has also been supported in CloneSet.

Therefore, the WorkloadSpread controller controls the scaling down order of the pods by adjusting their deletion cost.

For example, a WorkloadSpread associated with a CloneSet of 10 replicas is as follows:

subsets:
  - name: subset-a
    maxReplicas: 8
  - name: subset-b

Then the deletion cost value and deletion order are as follows:

  • 8 pods in subset-a will have 200 deletion cost;
  • 2 pods in subset-b will have 100 deletion cost, and will be deleted first;

If the user modifies the WorkloadSpread as:

subsets:
  - name: subset-a
    maxReplicas: 5 # 8 -> 5
  - name: subset-b

Then the deletion cost values and the deletion order also change as follows:

  • 5 pods in subset-a will have 200 deletion cost;
  • 3 pods in subset-a will have -100 deletion cost, and will be deleted first;
  • 2 pods in subset-b will have 100 deletion cost;

In this way, the workload preferentially scales down the pods that exceed the subset's maxReplicas limit.

「3」 How to solve the counting problems?

A key problem in the implementation of WorkloadSpread is how to ensure that the webhook injects pod rules in strict accordance with the priority order of the subsets and their maxReplicas limits.

3.1 Solving the concurrency consistency problem

Since there may be several kruise-controller-manager pods and many webhook Goroutines processing the same WorkloadSpread, concurrency consistency problems are unavoidable.

The status of WorkloadSpread contains a subsetStatuses entry for each subset. Its missingReplicas field indicates the number of pods the subset still needs, and -1 indicates no quantity limit (subset.maxReplicas == nil).

spec:
  subsets:
    - name: subset-a
      maxReplicas: 1
    - name: subset-b
  # ...
status:
  subsetStatuses:
    - name: subset-a
      missingReplicas: 1
    - name: subset-b
      missingReplicas: -1
  # ...

When the webhook receives a pod create request:

  1. Find a suitable subset whose missingReplicas is greater than 0 or equal to -1, following the subset order.
  2. After finding a suitable subset, if missingReplicas is greater than 0, subtract 1 first and try to update the WorkloadSpread status.
  3. If the update succeeds, inject the rules defined by the subset into the pod.
  4. If the update fails, get the WorkloadSpread again to obtain the latest status and return to step 1 (with a limit on the number of retries).

Similarly, when the webhook receives a pod delete or eviction request, it adds 1 to missingReplicas and updates it.

There is no doubt that we use optimistic locking to resolve update conflicts. However, optimistic locking alone is not enough: the workload creates a large number of pods in parallel, and the APIServer sends many pod create requests to the webhook in an instant, producing many conflicts during parallel processing. As is well known, optimistic locking is not suitable for high-conflict scenarios, because the retry cost of resolving conflicts is very high. We therefore also added a WorkloadSpread-level mutex to turn parallel processing into serial processing. Adding the mutex introduces a new problem: after the current Goroutine obtains the lock, the WorkloadSpread obtained from the informer is very likely not up to date and will conflict as well. Therefore, after updating the WorkloadSpread, the Goroutine caches the latest WorkloadSpread and then releases the lock, so that the next Goroutine can get the latest WorkloadSpread directly from the cache after obtaining the lock. Of course, with multiple webhooks, we still need the optimistic locking mechanism to resolve conflicts across them.

3.2 Solving the data consistency problem

So, is the missingReplicas field controlled solely by the webhook? The answer is no, because:

  1. The pod create request received by the webhook may not actually succeed in the end (for example, the pod is invalid or fails subsequent quota verification).

  2. The pod delete/eviction request received by the webhook may not actually succeed either (for example, it may be blocked by PDB, PUB, etc.).

  3. There are always various situations in k8s where pods terminate or disappear without going through the webhook (for example, the phase enters Succeeded/Failed, or etcd data is lost).

  4. Besides, relying on the webhook alone would not be in line with the end-state-oriented design concept.

Therefore, the WorkloadSpread status is controlled by the webhook in collaboration with the controller:

  • The webhook intercepts pod create/delete/eviction requests and modifies missingReplicas.

  • Meanwhile, the controller's reconcile loop gets all pods under the current workload, classifies them by subset, and updates missingReplicas to the actual missing quantity.

  • As analyzed above, the controller may receive pods from the informer with a delay, so the status also has a creatingPods map. When a pod is injected by the webhook, an entry with the pod name as key and a timestamp as value is recorded in the map, and the controller combines the map to maintain the real missingReplicas. Similarly, there is a deletingPods map to record pod delete/eviction events.

「4」What to do if pod scheduling fails?

WorkloadSpread supports configuring a reschedule strategy. By default, the type is Fixed: pods are scheduled to the corresponding subset strictly according to the subset sequence and the maxReplicas limits.

However, in real scenarios, subset resources often cannot fully satisfy maxReplicas, for example due to insufficient resources. Users need a more flexible reschedule strategy.

The adaptive capabilities provided by WorkloadSpread are logically divided into two types:

  1. SimulationSchedule: based on the scheduling records in the informer, we simulate pod scheduling in the webhook, performing simple filtering by nodeSelector/affinity, tolerations, and basic resources (not applicable to virtual-kubelet).

  2. Reschedule: after a pod is scheduled to a subset, if scheduling fails for longer than rescheduleCriticalSeconds, the subset is temporarily marked as unschedulable and the pod is deleted to trigger reconstruction. By default, the unschedulable state is kept for 5 minutes, i.e., pods created within those 5 minutes skip this subset.


Conclusion

WorkloadSpread combines some existing features of Kubernetes to give workloads elastic and multi-domain deployment capabilities in a bypass fashion. We hope that users can reduce workload deployment complexity by using WorkloadSpread and effectively cut costs by taking advantage of its elastic scalability.

At present, WorkloadSpread is applied in some projects within Alibaba, and adjustments discovered during use will be fed back to the community in time. In the future, there are plans for new WorkloadSpread capabilities, such as managing existing pods, supporting batch workloads, and even using labels to match pods across different workloads. Some of these capabilities need to be weighed against the actual needs and scenarios of community users. We hope you will participate in the Kruise community, open issues and PRs, help more users solve cloud-native deployment problems, and build a better community.
