A Flexible and Configurable Serverless Elastic Solution at the Workload Level

February 19, 2025 · 13 min read

Member of OpenKruise

Serverless represents an extension of cloud computing, inheriting its most significant feature: on-demand elastic scaling. This model design allows developers to focus on application logic without concerning themselves with deployment resources, thereby fully leveraging resource scalability to provide superior elasticity capabilities. Enterprises can also genuinely benefit from true pay-as-you-go characteristics. Consequently, more cloud providers are converging towards this new architectural paradigm.

The core capability of "flexible configurability" in Serverless technology focuses on enabling specific cloud usage scenarios to fully utilize cloud resources through simple, minimally invasive, and highly configurable methods. Its essence lies in resolving the conflict between capacity planning and actual cluster load configuration. This article will sequentially introduce two configurable components — WorkloadSpread and UnitedDeployment — discussing their core capabilities, technical principles, advantages and disadvantages, as well as real-world applications. Through these discussions, we aim to share OpenKruise's technical evolution and considerations in addressing Serverless workload elasticity.

Overview of Elastic Scenarios

As Serverless technology matures, more enterprises prefer using cloud resources (such as Alibaba Cloud ACS Serverless container instances) over on-premise resources (like managed resource pools or on-premise IDC data centers) to host applications with temporary, tidal, or bursty characteristics. This approach enhances resource utilization efficiency and reduces overall costs by adopting a pay-as-you-go model. Below are some typical elastic scenarios:

Prioritize on-premise resources in offline IDC data centers; scale the application to the cloud when those resources are insufficient.
Prefer a prepaid resource pool in the cloud; use pay-as-you-go Serverless instances for additional replicas when the pool is insufficient.
Use high-quality stable compute power (e.g., dedicated cloud server instances) first; then use lower-quality compute power (e.g., Spot instances).
Configure different resource quantities for container replicas deployed on different compute platforms (e.g., X86, ARM, Serverless instances) to achieve similar performance.
Inject different middleware configurations into replicas deployed on nodes versus Serverless environments (e.g., shared Daemon on nodes, Sidecar injection on Serverless).

The two components introduced in this article each offer distinct advantages in solving the above problems. Users can choose the appropriate capability based on their specific scenario to make effective use of elastic compute power.

Capabilities and Advantageous Scenarios of Two Components

WorkloadSpread: Utilizes a Mutating Webhook to intercept Pod creation requests that meet certain criteria and apply Patch operations to inject differentiated configurations. Suitable for existing applications requiring multiple elastic partitions with customized Pod Metadata and Spec fields.
UnitedDeployment: A workload with built-in capability of elastic partitioning and pod customization, offering stronger elasticity and capacity planning capabilities. Ideal for new applications needing detailed partitioning and individual configurations for each partition.

WorkloadSpread: An Elastic Strategy Plugin Based on Pod Mutating Webhook

WorkloadSpread is a bypass component provided by the OpenKruise community that spreads target workload Pods across different types of subsets according to specific rules, enhancing multi-region and elastic deployment capabilities without modifying the original workload. It supports almost all native or custom Kubernetes workloads, ensuring adaptability and flexibility in various environments.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef: # Supports almost all native or custom Kubernetes workloads
    apiVersion: apps/v1 # or apps.kruise.io/v1alpha1
    kind: Deployment # or CloneSet
    name: workload-xxx
  subsets:
    - name: subset-a
      # The first three replicas will be scheduled to this Subset
      maxReplicas: 3
      # Pod affinity configuration
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - zone-a
      patch:
        # Inject a custom label to Pods scheduled to this Subset
        metadata:
          labels:
            xxx-specific-label: xxx
    - name: subset-b
      # Deploy to Serverless clusters; no maxReplicas is set, so the replica count is unlimited
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - acs-cn-hangzhou
  scheduleStrategy:
    # Adaptive mode will reschedule failed Pods to other Subsets
    type: Adaptive # or Fixed
    adaptive:
      rescheduleCriticalSeconds: 30

Powerful Partitioning Capability

WorkloadSpread spreads Pods into different elastic partitions using Subsets, scaling up forward and scaling down backward based on Subset order.
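
The ordering rule can be illustrated with a small Python model. This is only a sketch of the behavior described above, not OpenKruise code: the `distribute` function and the tuple representation of a Subset are assumptions made for illustration, and `None` stands for a Subset without `maxReplicas`.

```python
from typing import Optional

# Hypothetical model of a Subset: (name, maxReplicas); None means no upper limit.
Subset = tuple[str, Optional[int]]


def distribute(replicas: int, subsets: list[Subset]) -> dict[str, int]:
    """Fill Subsets in declaration order, each up to its maxReplicas."""
    remaining = replicas
    result: dict[str, int] = {}
    for name, max_replicas in subsets:
        take = remaining if max_replicas is None else min(remaining, max_replicas)
        result[name] = take
        remaining -= take
    return result


subsets = [("subset-a", 3), ("subset-b", None)]
print(distribute(5, subsets))  # {'subset-a': 3, 'subset-b': 2}
print(distribute(2, subsets))  # {'subset-a': 2, 'subset-b': 0}
```

Going from 5 replicas back to 2 in this model empties subset-b before subset-a loses any Pod, which is the "scale down backward" behavior.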

Flexible Scheduling Configuration

At the Subset level, WorkloadSpread supports selecting nodes via Labels and configuring advanced options such as taints and tolerations. For example, requiredNodeSelectorTerm specifies mandatory node attributes, preferredNodeSelectorTerms sets preferred node attributes, and tolerations configures Pod tolerance for node taints. These configurations allow precise control over Pod scheduling and distribution.

At the global level, WorkloadSpread supports two scheduling strategies via the scheduleStrategy field: Fixed and Adaptive. The Fixed strategy ensures strict adherence to predefined Subset distributions, while the Adaptive strategy provides higher flexibility by automatically rescheduling Pods to other available Subsets when necessary.
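
The difference between the two strategies can be sketched as a Subset selection function. This is a simplified model under stated assumptions, not the actual webhook logic: `pick_subset`, the `current` Pod counts, and the `unschedulable` set are hypothetical names introduced for illustration.

```python
from typing import Optional


def pick_subset(
    subsets: list[tuple[str, Optional[int]]],
    current: dict[str, int],
    unschedulable: set[str],
    strategy: str,
) -> Optional[str]:
    """Return the first Subset, in declaration order, that can take one more Pod."""
    for name, max_replicas in subsets:
        # Only the Adaptive strategy skips Subsets known to be unschedulable.
        if strategy == "Adaptive" and name in unschedulable:
            continue
        if max_replicas is None or current.get(name, 0) < max_replicas:
            return name
    return None


subsets = [("subset-a", 3), ("subset-b", None)]
current = {"subset-a": 1}
print(pick_subset(subsets, current, {"subset-a"}, "Fixed"))     # subset-a
print(pick_subset(subsets, current, {"subset-a"}, "Adaptive"))  # subset-b
```

In this model, Fixed keeps targeting subset-a even though Pods cannot be scheduled there, while Adaptive moves on to subset-b.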

Detailed Pod Customization

In Subset configurations, the patch field allows for fine-grained customization of Pods scheduled to that subset. Supported fields include container images, resource limits, environment variables, volume mounts, startup commands, probe configurations, and labels. This decouples Pod specifications from environment adaptations, enabling flexible workload adjustments for various partition environments.

...
# patch pod with a topology label:
patch:
  metadata:
    labels:
      topology.application.deploy/zone: "zone-a"
...

The example above demonstrates how to add or modify a label on all Pods in a Subset.

...
# patch pod container resources:
patch:
  spec:
    containers:
      - name: main
        resources:
          limits:
            cpu: "2"
            memory: 800Mi
...

The example above demonstrates how to add or modify fields in the Pod Spec, here the resource limits of the main container.

...
# patch pod container env with a zone name:
patch:
  spec:
    containers:
      - name: main
        env:
          - name: K8S_AZ_NAME
            value: zone-a
...

The example above demonstrates how to add or modify a container environment variable.
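
To make the effect of such a patch concrete, the following Python sketch merges a Subset patch into a Pod. It assumes merge semantics similar to a Kubernetes strategic merge patch (maps merge recursively, list items such as containers and env entries are matched by `name`); the `merge` helper and the sample Pod are hypothetical and greatly simplified compared with the real implementation.

```python
import copy


def _all_named(items):
    """True when every list element is a dict with a "name" key."""
    return all(isinstance(item, dict) and "name" in item for item in items)


def merge(base, patch):
    """Return a new object with patch merged into base; inputs are not modified."""
    if isinstance(base, dict) and isinstance(patch, dict):
        merged = copy.deepcopy(base)
        for key, value in patch.items():
            merged[key] = merge(merged[key], value) if key in merged else copy.deepcopy(value)
        return merged
    if (isinstance(base, list) and isinstance(patch, list)
            and _all_named(base) and _all_named(patch)):
        merged = copy.deepcopy(base)
        position = {item["name"]: i for i, item in enumerate(merged)}
        for item in patch:
            if item["name"] in position:
                idx = position[item["name"]]
                merged[idx] = merge(merged[idx], item)
            else:
                merged.append(copy.deepcopy(item))
        return merged
    # Scalars and unnamed lists are simply replaced by the patch value.
    return copy.deepcopy(patch)


pod = {
    "metadata": {"labels": {"app": "demo"}},
    "spec": {"containers": [{"name": "main", "image": "demo:v1",
                             "env": [{"name": "LOG_LEVEL", "value": "info"}]}]},
}
subset_patch = {
    "metadata": {"labels": {"topology.application.deploy/zone": "zone-a"}},
    "spec": {"containers": [{"name": "main",
                             "env": [{"name": "K8S_AZ_NAME", "value": "zone-a"}]}]},
}
patched = merge(pod, subset_patch)
print(patched["metadata"]["labels"])
print(patched["spec"]["containers"][0]["env"])
```

The patched Pod keeps its original image and `LOG_LEVEL` variable and gains the zone label and the `K8S_AZ_NAME` variable.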

WorkloadSpread's Pod Mutating Webhook Mechanism

WorkloadSpread operates directly on Pods created by the target workload via a Pod Mutating Webhook, so the workload itself is left untouched. When a Pod creation request meets the criteria, the Webhook intercepts it, reads the corresponding WorkloadSpread configuration, selects an appropriate Subset, and modifies the Pod configuration accordingly. The controller maintains the controller.kubernetes.io/pod-deletion-cost annotation on these Pods to ensure the correct scale-in order.
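
How a deletion cost can enforce the scale-in order is sketched below. The concrete cost values (multiples of 100) and the helper names are assumptions made for illustration only; the sketch relies solely on the Kubernetes convention that Pods with a lower controller.kubernetes.io/pod-deletion-cost are removed first.

```python
DELETION_COST = "controller.kubernetes.io/pod-deletion-cost"


def deletion_costs(subset_names):
    """Give earlier Subsets a higher cost so that their Pods are removed last."""
    count = len(subset_names)
    # Annotation values are strings, so the cost is stored as text.
    return {name: str(100 * (count - index)) for index, name in enumerate(subset_names)}


def scale_in_order(pods):
    """Order Pods the way a cost-aware controller picks victims: lowest cost first."""
    return sorted(pods, key=lambda pod: int(pod["annotations"][DELETION_COST]))


costs = deletion_costs(["subset-a", "subset-b"])
pods = [
    {"name": "pod-1", "annotations": {DELETION_COST: costs["subset-a"]}},
    {"name": "pod-2", "annotations": {DELETION_COST: costs["subset-b"]}},
]
print(costs)                                          # {'subset-a': '200', 'subset-b': '100'}
print([pod["name"] for pod in scale_in_order(pods)])  # ['pod-2', 'pod-1']
```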

Limitations of WorkloadSpread

Potential Risks of Webhook

WorkloadSpread depends on a Pod Mutating Webhook, which intercepts all Pod creation requests in the cluster. If the Webhook Pod (kruise-manager) experiences performance issues or failures, new Pods may fail to be created. Additionally, during large-scale scaling operations, the Webhook can become a performance bottleneck.

Limitations of Acting on Pods

While acting on Pods reduces business intrusion, it introduces limitations. For instance, CloneSet's gray release ratio cannot be controlled per Subset.

Case Study 1: Bandwidth Package Allocation in Large-Scale Load Testing

A company needed to perform load testing before a major shopping festival. They developed a load-agent program to generate requests and used a CloneSet to manage agent replicas. To save costs, they purchased 10 shared bandwidth packages (each supporting 300 Pods) and aimed to dynamically allocate them to elastic agent replicas.

They configured a WorkloadSpread with 11 Subsets: the first 10 Subsets each had a capacity of 300 and patched Pod Annotations to bind a specific bandwidth package; the last Subset had no capacity limit and no bandwidth package, so any replicas beyond 3000 would be created without consuming extra bandwidth.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: bandwidth-spread
  namespace: loadtest
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: load-agent-XXXXX
  subsets:
    - name: bandwidthPackage-1
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id1>

    - ...

    - name: bandwidthPackage-10
      maxReplicas: 300
      patch:
        metadata:
          annotations:
            k8s.aliyun.com/eip-common-bandwidth-package-id: <id10>
    - name: no-eip
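
Because the ten bandwidth Subsets differ only in their name and package ID, the list can be generated rather than written by hand. The following Python sketch builds the `subsets` structure shown above as plain dictionaries; `build_subsets` is a hypothetical helper, and the package IDs are the same placeholders used in the manifest.

```python
ANNOTATION = "k8s.aliyun.com/eip-common-bandwidth-package-id"


def build_subsets(package_ids, pods_per_package=300):
    """One capped Subset per bandwidth package, then an uncapped catch-all Subset."""
    subsets = []
    for number, package_id in enumerate(package_ids, start=1):
        subsets.append({
            "name": f"bandwidthPackage-{number}",
            "maxReplicas": pods_per_package,
            "patch": {"metadata": {"annotations": {ANNOTATION: package_id}}},
        })
    # The last Subset has no maxReplicas and no patch, so it binds no package.
    subsets.append({"name": "no-eip"})
    return subsets


package_ids = [f"<id{n}>" for n in range(1, 11)]  # placeholders, not real IDs
subsets = build_subsets(package_ids)
print(len(subsets))                                   # 11
print(sum(s.get("maxReplicas", 0) for s in subsets))  # 3000
```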

Case Study 2: Compatibility for Scaling Managed K8S Cluster Services to Serverless Instances

A company had a web service running on an IDC that needed to scale up due to business growth but could not expand the local data center. They chose to use virtual nodes to access cloud-based Serverless elastic compute power, forming a hybrid cloud. Their application used acceleration services like Fluid, which were pre-deployed on nodes in the IDC but not available in the serverless subset. Therefore, they needed to inject a sidecar into cloud Pods to provide acceleration capabilities.

To achieve this without modifying the existing Deployment's 8 replicas, they used WorkloadSpread to add a label to Pods scaled to each subset, which controlled the Fluid sidecar injection.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: data-processor-spread
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  subsets:
    - name: local
      maxReplicas: 8
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "false"
    - name: aliyun-acs
      patch:
        metadata:
          labels:
            serverless.fluid.io/inject: "true"

UnitedDeployment: A Native Workload with Built-in Elasticity

UnitedDeployment is an advanced workload provided by the OpenKruise community that natively supports partition management. Unlike WorkloadSpread, which enhances basic workloads, UnitedDeployment offers a new mode for managing partitioned elastic applications. It defines applications through a single template, and the controller creates and manages multiple secondary workloads to match different subsets. UnitedDeployment manages the entire lifecycle of applications within a single resource, including definition, partitioning, scaling, and upgrades.

Example Configuration

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: sample-ud
spec:
  replicas: 6
  selector:
    matchLabels:
      app: sample
  template:
    cloneSetTemplate:
      metadata:
        labels:
          app: sample
      spec:
        # CloneSet Spec
        ...
  topology:
    subsets:
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - ecs
        maxReplicas: 2
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: node-type
              operator: In
              values:
                - acs-virtual-kubelet

Advantages of UnitedDeployment

All-In-One Elastic Application Management

UnitedDeployment offers comprehensive all-in-one application management, enabling users to define applications, manage subsets, scale, and upgrade using a single resource.

The UnitedDeployment controller manages a corresponding type of secondary workload for each subset based on the workload template, without requiring additional attention from the user. Users only need to manage the application template and subsets; the UnitedDeployment controller will handle subsequent management tasks for each secondary workload, including creation, modification, and deletion. The controller also monitors the status of Pods created by these workloads when necessary to make corresponding adjustments.

The secondary workload controllers are the ones that implement the specific scaling and updating operations. Thus, scaling and updating through UnitedDeployment produces exactly the same effect as using the corresponding workload directly. For example, a UnitedDeployment created with a CloneSet template inherits CloneSet's grayscale publishing and in-place upgrade capabilities.

Advanced Subset Management

UnitedDeployment incorporates two capacity allocation algorithms, enabling users to handle various scenarios of elastic applications through detailed subset capacity configurations.

The elastic allocation algorithm implements a classic elastic capacity allocation method similar to WorkloadSpread: by setting upper and lower capacity limits for each subset, Pods are scaled up in the defined order of subsets and scaled down in reverse order. This method has been thoroughly introduced earlier, so it will not be elaborated further here.
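
A minimal Python model of elastic allocation with both limits is shown below. It assumes one plausible reading of the rule: every subset first receives its lower limit in order, and the remaining replicas then fill the subsets up to their upper limits, again in order. The function name, the tuple layout `(name, min_replicas, max_replicas)`, and the sample numbers are assumptions made for this sketch.

```python
from typing import Optional


def elastic_allocate(
    replicas: int,
    subsets: list[tuple[str, int, Optional[int]]],
) -> dict[str, int]:
    """Allocate replicas to (name, min_replicas, max_replicas) subsets in order."""
    allocation: dict[str, int] = {}
    remaining = replicas
    # Pass 1: satisfy the lower limits in declaration order.
    for name, min_replicas, _ in subsets:
        take = min(min_replicas, remaining)
        allocation[name] = take
        remaining -= take
    # Pass 2: fill up to the upper limits in declaration order.
    for name, _, max_replicas in subsets:
        if remaining == 0:
            break
        room = remaining if max_replicas is None else max(0, max_replicas - allocation[name])
        take = min(room, remaining)
        allocation[name] += take
        remaining -= take
    return allocation


subsets = [("ecs", 2, 4), ("acs-serverless", 1, None)]
print(elastic_allocate(3, subsets))  # {'ecs': 2, 'acs-serverless': 1}
print(elastic_allocate(6, subsets))  # {'ecs': 4, 'acs-serverless': 2}
```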

The specified allocation algorithm represents a new approach to capacity allocation. It directly assigns fixed numbers or percentages to some subsets and reserves at least one elastic subset to distribute the remaining replicas.
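
The arithmetic behind specified allocation can be sketched as follows. This is an illustration under stated assumptions, not the controller's code: percentages are rounded up, and replicas left over after the fixed assignments are split evenly among the elastic subsets (those given `None` here). `specified_allocate` is a hypothetical name.

```python
def specified_allocate(replicas, subsets):
    """Allocate replicas to (name, spec) pairs; spec is an int, "N%", or None."""
    allocation = {}
    elastic = []
    for name, spec in subsets:
        if spec is None:
            elastic.append(name)
        elif isinstance(spec, str) and spec.endswith("%"):
            percent = int(spec[:-1])
            # Integer ceiling of replicas * percent / 100 (rounding up is an assumption).
            allocation[name] = (replicas * percent + 99) // 100
        else:
            allocation[name] = int(spec)
    remaining = replicas - sum(allocation.values())
    if remaining < 0 or (remaining > 0 and not elastic):
        raise ValueError("specified replicas do not fit the total")
    for index, name in enumerate(elastic):
        share, extra = divmod(remaining, len(elastic))
        allocation[name] = share + (1 if index < extra else 0)
    return allocation


print(specified_allocate(10, [("intel", "50%"), ("amd", 2), ("elastic", None)]))
# {'intel': 5, 'amd': 2, 'elastic': 3}
```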

In addition to capacity allocation, UnitedDeployment also allows customizing any Pod Spec field (including container images) for each subset, similar to WorkloadSpread. This makes UnitedDeployment's subset configuration highly flexible.

Adaptive Elasticity

UnitedDeployment offers robust adaptive elasticity, automating scaling and rescheduling operations to reduce operational overhead. It supports Kubernetes Horizontal Pod Autoscaler (HPA), enabling automatic scaling based on predefined conditions while adhering strictly to subset configurations.

UnitedDeployment also offers adaptive Pod rescheduling capabilities similar to WorkloadSpread. Additionally, it allows configuring the timeout for scheduling failures and the time a subset takes to recover from unschedulable status, providing finer control over adaptive scheduling.
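
The interplay of the two timers can be modeled as a small state machine. The sketch below is a simplified assumption about the behavior, not the controller's implementation; the parameter names mirror the `rescheduleCriticalSeconds` and `unschedulableLastSeconds` fields that appear in the case-study configuration later in this article, and `SubsetStatus` is a hypothetical type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubsetStatus:
    """Hypothetical per-subset bookkeeping kept by this sketch."""
    unschedulable_since: Optional[float] = None


def record_pending(status, pending_seconds, now, reschedule_critical_seconds):
    """Mark the subset unschedulable once a Pod has been pending for too long."""
    if pending_seconds > reschedule_critical_seconds and status.unschedulable_since is None:
        status.unschedulable_since = now


def accepts_new_pods(status, now, unschedulable_last_seconds):
    """A subset recovers after it has stayed unschedulable for the configured time."""
    if status.unschedulable_since is None:
        return True
    if now - status.unschedulable_since >= unschedulable_last_seconds:
        status.unschedulable_since = None
        return True
    return False


ecs = SubsetStatus()
record_pending(ecs, pending_seconds=11, now=100, reschedule_critical_seconds=10)
print(accepts_new_pods(ecs, now=200, unschedulable_last_seconds=3600))   # False
print(accepts_new_pods(ecs, now=3700, unschedulable_last_seconds=3600))  # True
```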

Limitations of UnitedDeployment

The many advantages of UnitedDeployment stem from its all-in-one management capabilities as an independent workload. However, this also makes it more intrusive to adopt. For existing applications, users must modify their PaaS systems and tools (such as operation and maintenance systems and release systems) to switch from existing workloads like Deployment and CloneSet to UnitedDeployment.

Case Study 1: Elastic Scaling of Pods to Virtual Nodes with Adaptation for Serverless Containers

Cloud providers typically offer three types of Kubernetes services:

Managed clusters with fixed nodes using cloud servers purchased by users.
Serverless clusters delivering container computing power directly via virtual node technology.
Hybrid clusters containing both managed nodes and virtual nodes.

In this case, a company planned to launch a new service with significant peak-to-valley traffic differences (up to tenfold). To handle this characteristic, they purchased a batch of cloud servers to form a managed cluster nodepool for handling baseline traffic and intended to quickly scale out new replicas to a serverless subset during peak hours. Additionally, their application required extra configuration to run in the Serverless environment. Below is an example configuration:

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: elastic-app
spec:
  # Omitted business workload template
  ...
  topology:
    # Enable Adaptive scheduling to dispatch Pod replicas to ECS node pools and ACS instances adaptively
    scheduleStrategy:
      type: Adaptive
      adaptive:
        # Start scheduling to ACS Serverless instances 10 seconds after ECS node scheduling failure
        rescheduleCriticalSeconds: 10
        # Do not schedule to ECS nodes within one hour after the above scheduling failure
        unschedulableLastSeconds: 3600
    subsets:
      # Prioritize ECS without an upper limit; only schedule to ACS when ECS fails
      # During scale-in, delete ACS instances first, then ECS node pool Pods
      - name: ecs
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: NotIn
              values:
                - acs-virtual-kubelet
      - name: acs-serverless
        nodeSelectorTerm:
          matchExpressions:
            - key: type
              operator: In
              values:
                - acs-virtual-kubelet
        # Use patch to modify environment variables for Pods scheduled to elastic computing power, enabling Serverless mode
        patch:
          spec:
            containers:
              - name: main
                env:
                  - name: APP_RUNTIME_MODE
                    value: SERVERLESS
---
# Combine with HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: elastic-app-hpa
spec:
  minReplicas: 1
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 2
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: UnitedDeployment
    name: elastic-app

Case Study 2: Allocating Different Resources to Pods with Different CPU Types

In this case, a company purchased several cloud server instances with Intel, AMD, and ARM platform CPUs to prepare for launching a new service. They wanted Pods scheduled on different platforms to exhibit similar performance. After stress testing, it was found that, compared to Intel CPUs as the benchmark, AMD platforms needed more CPU cores, while ARM platforms required more memory.

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    deploymentTemplate:
      ... # Omitted business workload template
  topology:
    # Intel, AMD, and Yitian 710 ARM machines carry 50%, 25%, and 25% of the replicas respectively
    subsets:
      - name: intel
        replicas: 50%
        nodeSelectorTerm:
          ... # Select Intel node pool through labels
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 4000Mi
      - name: amd64
        replicas: 25%
        nodeSelectorTerm:
          ... # Select AMD node pool through labels
        # Allocate more CPU to AMD platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 3000m
                    memory: 4000Mi
      - name: yitian-arm
        replicas: 25%
        nodeSelectorTerm:
          ... # Select ARM node pool through labels
        # Allocate more memory to ARM platform
        patch:
          spec:
            containers:
              - name: main
                resources:
                  limits:
                    cpu: 2000m
                    memory: 6000Mi

Summary

Elastic computing power can significantly reduce business costs and effectively increase the performance ceiling of services. To make good use of elastic computing power, it is necessary to choose appropriate elastic components based on specific application characteristics. The following table summarizes the capabilities of the two components introduced in this article as a reference for making that choice.

Component	Partition Principle	Ease of Modification	Granularity of Partition	Elasticity Capability
WorkloadSpread	Modify Pods via Webhook	High	Medium	Medium
UnitedDeployment	Create multiple workloads via templates	Low	High	High

OpenKruise V1.4 Release, New Job Sidecar Terminator Capability

April 18, 2023 · 7 min read

Mingshan Zhao

Member of OpenKruise

OpenKruise (https://github.com/openkruise/kruise) is an open-source cloud-native application automation management suite. It is currently an incubating project hosted by the Cloud Native Computing Foundation (CNCF). It is a standard extension component based on Kubernetes that is widely used in production at internet-scale companies. It also closely follows upstream community standards and adapts to technical improvements and best practices for internet-scale scenarios.

OpenKruise has released the latest version v1.4 on March 31, 2023 (ChangeLog), with the addition of the Job Sidecar Terminator feature. This article provides a comprehensive overview of the new version.

1. Upgrade Notice

To facilitate the use of Kruise's enhanced capabilities, some stable capabilities have been enabled by default, including ResourcesDeletionProtection, WorkloadSpread, PodUnavailableBudgetDeleteGate, InPlaceUpdateEnvFromMetadata, StatefulSetAutoDeletePVC, and PodProbeMarkerGate. Most of these capabilities require special configuration to take effect, so enabling them by default generally has no impact on existing clusters. If you do not want to use some of these features, you can turn them off during the upgrade process.
The leader election method for Kruise-Manager has been migrated from configmaps to configmapsleases to prepare for a future migration to the leases method. This is the officially provided path for a smooth upgrade and does not affect existing clusters.

2. New Job Sidecar Terminator Capability

In Kubernetes, for Job workloads, it is commonly desired that when the main container completes its task and terminates, the Pod should enter a completed state. However, when these Pods have Long-Running Sidecar containers, the Sidecar container cannot terminate itself after the main container has exited, causing the Pod to remain in an incomplete state. The community's common solution to this problem usually involves modifying both the Main and Sidecar containers to use Volume sharing to achieve the effect of the Sidecar container exiting after the Main container has completed.

While the community's solution can solve this problem, it requires modification of the containers, especially for commonly used Sidecar containers, which incurs high costs for modification and maintenance.

To address this, we have added a controller called SidecarTerminator to Kruise. This controller is specifically designed to listen for completion status of the main container in this scenario and select an appropriate time to terminate the Sidecar container in the Pod, without requiring intrusive modification of the Main and Sidecar containers.

Pods on real nodes

For pods running on regular nodes, this feature is easy to use because Kruise Daemon can be installed there. Users just need to add a special environment variable to identify the target sidecar container in the pod, and the controller will use the ContainerRecreateRequest (CRR) capability provided by Kruise Daemon to terminate these sidecar containers at the appropriate time.

kind: Job
spec:
  template:
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT
          value: "true"
      - name: main
      ...

Pods on virtual nodes

For some platforms that provide Serverless containers, such as ECI or Fargate, their pods can only run on virtual nodes such as Virtual-Kubelet. However, Kruise Daemon cannot be deployed and work on these virtual nodes, which makes it impossible to use the CRR capability to terminate containers.

Fortunately, we can use the Pod in-place upgrade mechanism provided by native Kubernetes to achieve the same goal: construct a special image whose only purpose is to make the container exit quickly once started. When the sidecar needs to exit, the original sidecar image is replaced with this fast exit image, which causes the sidecar container to terminate.

Step 1: Prepare a fast exit image

The image only needs to have a very simple logic: when the container of this image starts, it exits directly with an exit code of 0.
The image needs to be compatible with the commands and args of the original sidecar image to prevent errors when the container starts.

Step 2: Configure the special image in the Sidecar environment variable

kind: Job
spec:
  template:
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT_WITH_IMAGE
          value: "example/quick-exit:v1.0.0"
      - name: main
      ...

Replace "example/quick-exit:v1.0.0" with the fast exit image that you have prepared in step 1.

Notice

The sidecar container must be able to respond to the SIGTERM signal, and when it receives this signal, the entrypoint process needs to exit (that is, the sidecar container needs to exit), and the exit code should be 0.
This feature applies to any Pod managed by a Job type Workload, as long as their RestartPolicy is Never/OnFailure.
Containers with the environment variable KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT will be treated as sidecar containers, while other containers will be treated as main containers. The sidecar container will only be terminated after all main containers have completed:
- Under the Never restart policy, once the main container exits, it will be considered "completed".
- Under the OnFailure restart policy, the exit code of the main container must be 0 to be considered "completed".
For Pods on real nodes, KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT takes priority over KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT_WITH_IMAGE.
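
The completion rules in this notice can be summarized in a short Python sketch. It is an illustration of the rules as stated, not the SidecarTerminator source: the flattened Pod dictionary (with an `exitCode` per container, `None` while running) is a hypothetical structure, and treating the `_WITH_IMAGE` variable as a sidecar marker as well is an inference from Step 2 above.

```python
# Environment variables that mark a container as a sidecar in this sketch.
SIDECAR_ENV_NAMES = {
    "KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT",
    "KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT_WITH_IMAGE",  # inferred, see lead-in
}


def is_sidecar(container):
    return any(env["name"] in SIDECAR_ENV_NAMES for env in container.get("env", []))


def main_completed(container, restart_policy):
    exit_code = container.get("exitCode")  # None means the container is still running
    if exit_code is None:
        return False
    if restart_policy == "Never":
        return True  # any exit counts as completed
    return exit_code == 0  # OnFailure: only a clean exit counts


def should_terminate_sidecars(pod):
    policy = pod["restartPolicy"]
    if policy not in ("Never", "OnFailure"):
        return False
    mains = [c for c in pod["containers"] if not is_sidecar(c)]
    return bool(mains) and all(main_completed(c, policy) for c in mains)


pod = {
    "restartPolicy": "Never",
    "containers": [
        {"name": "sidecar",
         "env": [{"name": "KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT", "value": "true"}]},
        {"name": "main", "exitCode": 1},
    ],
}
print(should_terminate_sidecars(pod))                                   # True
print(should_terminate_sidecars({**pod, "restartPolicy": "OnFailure"}))  # False
```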

3. Advanced Workload Improvement

CloneSet Performance Optimization: New FeatureGate CloneSetEventHandlerOptimization

Currently, any change to the state or metadata of a Pod produces a Pod Update event that triggers the CloneSet reconcile logic. CloneSet Reconcile is configured with three workers by default, which is not a problem for smaller clusters.

However, in larger or busier clusters, these unnecessary reconciles block the reconciles that matter and delay changes such as CloneSet rolling updates. To solve this problem, you can turn on the feature-gate CloneSetEventHandlerOptimization to reduce unnecessary enqueueing of reconciles.

CloneSet New disablePVCReuse Field

If a Pod is directly deleted or evicted by another controller or a user, the PVCs associated with the Pod remain. When the CloneSet controller creates new Pods, it reuses the existing PVCs.

However, if the Node where the Pod is located experiences a failure, reusing existing PVCs may cause the new Pod to fail to start. For details, please refer to issue 1099. To solve this problem, you can set the disablePVCReuse=true field. After the Pod is evicted or deleted, the PVCs associated with the Pod will be automatically deleted and will no longer be reused.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  ...
  replicas: 4
  scaleStrategy:
    disablePVCReuse: true

CloneSet New PreNormal Lifecycle

CloneSet currently supports two lifecycle hooks, PreparingUpdate and PreparingDelete, which are used for graceful application termination. For details, please refer to the Community Documentation. In order to support graceful application deployment, a new state called PreNormal has been added, as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # define with finalizer
  lifecycle:
    preNormal:
      finalizersHandler:
      - example.io/unready-blocker

  # or define with label
  # lifecycle:
  #   preNormal:
  #     labelsHandler:
  #       example.io/block-unready: "true"

When CloneSet creates a Pod (including normal scaling and upgrades):

The Pod will only be considered "Available" and enter the "Normal" state if it meets the definition of the PreNormal hook.

This is useful for some post-checks when creating Pods, such as checking if the Pod has been mounted to the SLB backend, so as to avoid traffic loss caused by new instance mounting failure after the old instance is destroyed during rolling upgrade.
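
The gating condition can be sketched as a simple predicate. This is an assumed reading of "meets the definition of the PreNormal hook" (every finalizer listed in `finalizersHandler` is present on the Pod and every label in `labelsHandler` matches); the function and the flattened Pod dictionary are hypothetical.

```python
def pre_normal_satisfied(pod, hook):
    """True when the Pod carries all finalizers and labels named by the hook."""
    finalizers = set(pod.get("finalizers", []))
    labels = pod.get("labels", {})
    has_finalizers = all(f in finalizers for f in hook.get("finalizersHandler", []))
    has_labels = all(labels.get(k) == v for k, v in hook.get("labelsHandler", {}).items())
    return has_finalizers and has_labels


hook = {"finalizersHandler": ["example.io/unready-blocker"]}
new_pod = {"finalizers": []}
registered_pod = {"finalizers": ["example.io/unready-blocker"]}
print(pre_normal_satisfied(new_pod, hook))         # False
print(pre_normal_satisfied(registered_pod, hook))  # True
```

In the SLB scenario, an external controller would add the finalizer only after the Pod has been attached to the backend, at which point the Pod can move on to the Normal state.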

4. Enhanced Operations Improvement

ContainerRestart New forceRecreate Field

When creating a CRR resource, if the container is in the process of starting up, the CRR will not restart the container again. If you want to force a container restart, you can enable the following field:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
spec:
  ...
  strategy:
    forceRecreate: true

ImagePullJob Supports Attaching Metadata via the CRI Interface

When Kubelet creates a Pod, it attaches metadata to the container runtime through the CRI interface. The image repository can use this metadata to identify the business associated with the starting container. Container actions of low business value can then be degraded to protect an overloaded repository.

OpenKruise's imagePullJob also supports similar capabilities, as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
spec:
  ...
  image: nginx:1.9.1
  sandboxConfig:
    annotations:
      io.kubernetes.image.metrics.tags: "cluster=cn-shanghai"
    labels:
      io.kubernetes.image.app: "foo"

Get Involved

You are welcome to get involved with OpenKruise by joining us on GitHub/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:

Join the community on Slack (English).
Join the community on DingTalk: Search GroupID 23330762 (Chinese).
Join the community on WeChat (new): Search User openkruise and let the robot invite you (Chinese).

OpenKruise v1.3, New Custom Pod Probe Capabilities and Significant Performance Improvements for Large-Scale Clusters

October 7, 2022 · 9 min read

Mingshan Zhao

Member of OpenKruise

We’re pleased to announce the release of OpenKruise 1.3, which is a CNCF Sandbox level project.

OpenKruise is an extended component suite for Kubernetes that mainly focuses on application automation, such as deployment, upgrade, ops, and availability protection. Most features provided by OpenKruise are built primarily on CRD extensions. They can work in pure Kubernetes clusters without any other dependencies.

What's new?

In release v1.3, OpenKruise provides a new CRD named PodProbeMarker, improves its performance in large-scale clusters, adds image pre-download support to Advanced DaemonSet, and adds new features to CloneSet, WorkloadSpread, AdvancedCronJob, SidecarSet, and others.

Here we introduce some of these changes.

1. New CRD and Controller: PodProbeMarker

Kubernetes provides three kinds of probes for Pod lifecycle management:

Readiness Probe: Used to determine whether the business container is ready to respond to user requests. If the probe fails, the Pod will be removed from Service Endpoints.
Liveness Probe: Used to determine the health status of the container. If the probe fails, the kubelet will restart the container.
Startup Probe: Used to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds.

So the Probe capabilities provided in Kubernetes have defined specific semantics and related behaviors. In addition, there is actually a need to customize Probe semantics and related behaviors, such as:

GameServer defines an Idle Probe to determine whether the Pod currently hosts a game match. If not, the Pod can be scaled down to save cost.
K8S Operator defines a main-secondary probe to determine the role of the current Pod (main or secondary). During an upgrade, the secondaries can be upgraded first so that a new main is selected only once, which reduces service interruption time.

OpenKruise provides the ability to customize the Probe and return the result to the Pod Status, and the user can decide the follow-up behavior based on the probe result.

An object of PodProbeMarker may look like this:

apiVersion: apps.kruise.io/v1alpha1
kind: PodProbeMarker
metadata:
  name: game-server-probe
  namespace: ns
spec:
  selector:
    matchLabels:
      app: game-server
  probes:
  - name: Idle
    containerName: game-server
    probe:
      exec: 
        command:
        - /home/game/idle.sh
      initialDelaySeconds: 10
      timeoutSeconds: 3
      periodSeconds: 10
      successThreshold: 1
      failureThreshold: 3
    markerPolicy:
    - state: Succeeded
      labels:
        gameserver-idle: 'true'
      annotations:
        controller.kubernetes.io/pod-deletion-cost: '-10'
    - state: Failed
      labels:
        gameserver-idle: 'false'
      annotations:
        controller.kubernetes.io/pod-deletion-cost: '10'
    podConditionType: game.io/idle

PodProbeMarker results can be viewed on the Pod object:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: game-server
    gameserver-idle: 'true'
  annotations:
    controller.kubernetes.io/pod-deletion-cost: '-10'
  name: game-server-58cb9f5688-7sbd8
  namespace: ns
spec:
  ...
status:
  conditions:
    # podConditionType
  - type: game.io/idle
    # Probe State 'Succeeded' indicates 'True', and 'Failed' indicates 'False'
    status: "True"
    lastProbeTime: "2022-09-09T07:13:04Z"
    lastTransitionTime: "2022-09-09T07:13:04Z"
    # If the probe fails to execute, the message is stderr
    message: ""
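A consumer of the probe result only needs to look up the condition named by podConditionType. The following is a minimal sketch that uses only the Go standard library rather than client-go; the type podStatusView and the function probeSucceeded are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatusView is a hypothetical, minimal view of a Pod object.
// Only the fields needed to read a condition are declared.
type podStatusView struct {
	Status struct {
		Conditions []struct {
			Type    string `json:"type"`
			Status  string `json:"status"`
			Message string `json:"message"`
		} `json:"conditions"`
	} `json:"status"`
}

// probeSucceeded returns (succeeded, found, err) for the given condition type.
// found is false when the Pod does not carry the condition yet.
func probeSucceeded(podJSON []byte, conditionType string) (bool, bool, error) {
	var pod podStatusView
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return false, false, err
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == conditionType {
			// Probe state 'Succeeded' is reported as "True", 'Failed' as "False".
			return c.Status == "True", true, nil
		}
	}
	return false, false, nil
}

func main() {
	podJSON := []byte(`{"status":{"conditions":[{"type":"game.io/idle","status":"True","message":""}]}}`)
	idle, found, err := probeSucceeded(podJSON, "game.io/idle")
	if err != nil {
		panic(err)
	}
	fmt.Println("idle:", idle, "found:", found)
}
```

In a real controller the same check would normally run against the typed Pod object returned by a Kubernetes client instead of raw JSON.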

2. Performance optimization: significant performance improvements for large-scale clusters

#1026 The introduction of a delayed queueing mechanism significantly optimizes the CloneSet controller work queue buildup problem when kruise-manager is pulled up in large-scale application clusters, ideally reducing initialization time by more than 80%.
#1027 Optimize PodUnavailableBudget controller Event Handler logic to reduce the number of irrelevant Pods in the queue.
#1011 The caching mechanism optimizes the CPU and Memory consumption of Advanced DaemonSet's repetitive simulation of Pod scheduling computations in large-scale clusters.
#1015, #1068 Significantly reduce runtime memory consumption in large clusters. Complete the Disable DeepCopy feature started in v1.1, and reduce the conversion consumption of expressions type label selector.

3. SidecarSet supports injecting a specific historical version

SidecarSet records historical revisions of some fields, such as containers, volumes, initContainers, imagePullSecrets and patchPodMetadata, via ControllerRevision. Based on this feature, you can select a specific historical revision to inject when creating Pods, rather than always injecting the latest revision of the sidecar.

SidecarSet records ControllerRevision in the same namespace as Kruise Manager. You can execute kubectl get controllerrevisions -n kruise-system -l kruise.io/sidecarset-name=<your-sidecarset-name> to list the ControllerRevisions of your SidecarSet. Moreover, the ControllerRevision name of current SidecarSet revision is shown in status.latestRevision field, so you can record it very easily.

There are two configuration methods as follows:

select revision via ControllerRevision name

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
spec:
  ...
  updateStrategy:
    partition: 90%
  injectionStrategy:
    revision:
      revisionName: <specific-controllerrevision-name>

select revision via custom version label

You can add or update the label apps.kruise.io/sidecarset-custom-version=<your-version-id> to SidecarSet when creating or publishing SidecarSet, to mark each historical revision. SidecarSet will bring this label down to the corresponding ControllerRevision object, and you can easily use the <your-version-id> to describe which historical revision you want to inject.

Assume that you are publishing version-2 as a canary (only 10% of Pods should be upgraded), and you want to inject the stable version-1 into newly created Pods to reduce the risk of the canary release:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecarset
  labels:
    apps.kruise.io/sidecarset-custom-version: example/version-2
spec:
  ...
  updateStrategy:
    partition: 90%
  injectionStrategy:
    revision:
      customVersion: example/version-1

4. SidecarSet supports injecting Pod annotations

SidecarSet supports injecting Pod annotations, as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  containers:
    ...
  patchPodMetadata:
  - annotations:
      oom-score: '{"log-agent": 1}'
      custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
    patchPolicy: MergePatchJson
  - annotations:
      apps.kruise.io/container-launch-priority: Ordered
    patchPolicy: Overwrite | Retain

patchPolicy is the injection policy, as follows:

Retain: the default. If annotations[key]=value already exists in the Pod, the original value is retained; annotations[key]=value2 is injected only if annotations[key] does not exist in the Pod.
Overwrite: the counterpart of Retain. When annotations[key]=value exists in the Pod, it is overwritten with value2.
MergePatchJson: similar to Overwrite, but the annotation value is a JSON string. If annotations[key] does not exist in the Pod, it is injected directly; if it exists, the JSON values are merged. For example, if annotations[oom-score]='{"main": 2}' exists in the Pod, the value after injection is annotations[oom-score]='{"log-agent": 1, "main": 2}'.

Note: When patchPolicy is Overwrite or MergePatchJson, the annotations are updated synchronously when the SidecarSet updates the sidecar container in place. However, modifying only the annotations does not take effect; the change must be rolled out together with an in-place update of the sidecar container image. When patchPolicy is Retain, the annotations are not updated when the SidecarSet updates the sidecar container in place.

According to the above configuration, when the SidecarSet injects the sidecar container, it also injects the Pod annotations, as follows:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    apps.kruise.io/container-launch-priority: Ordered
    oom-score: '{"log-agent": 1, "main": 2}'
    custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
  name: test-pod
spec:
  containers:
    ...
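The three patchPolicy values described above can be sketched as a single function. This is an illustration of the documented semantics, not the SidecarSet webhook code: applyPatchPolicy is a hypothetical name, the merge is a shallow top-level merge, and letting the injected keys win on conflict is an assumption that the text above does not confirm.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyPatchPolicy returns the annotation value a Pod ends up with.
// existing/exists describe the value already on the Pod; injected is the
// value configured in the SidecarSet.
func applyPatchPolicy(policy, existing string, exists bool, injected string) (string, error) {
	if !exists {
		// All three policies inject the value when the key is absent.
		return injected, nil
	}
	switch policy {
	case "Retain":
		return existing, nil
	case "Overwrite":
		return injected, nil
	case "MergePatchJson":
		var merged, patch map[string]interface{}
		if err := json.Unmarshal([]byte(existing), &merged); err != nil {
			return "", err
		}
		if err := json.Unmarshal([]byte(injected), &patch); err != nil {
			return "", err
		}
		if merged == nil {
			merged = map[string]interface{}{}
		}
		for k, v := range patch {
			merged[k] = v // assumption: injected keys win on conflict
		}
		out, err := json.Marshal(merged) // map keys are emitted in sorted order
		return string(out), err
	default:
		return "", fmt.Errorf("unknown patchPolicy %q", policy)
	}
}

func main() {
	got, err := applyPatchPolicy("MergePatchJson", `{"main": 2}`, true, `{"log-agent": 1}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

With the oom-score example from the text, the merged value contains both the existing main entry and the injected log-agent entry.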

Note: SidecarSet should not modify any configuration outside the sidecar container for security consideration, so if you want to use this capability, you need to first configure SidecarSet_PatchPodMetadata_WhiteList whitelist or turn off whitelist checks via Feature-gate SidecarSetPatchPodMetadataDefaultsAllowed=true.

5. Advanced DaemonSet supports pre-downloading images for updates

If you have enabled the PreDownloadImageForDaemonSetUpdate feature-gate, the DaemonSet controller automatically pre-downloads the image you want to update to onto the nodes of all old Pods. This is quite useful for accelerating application upgrades.

By default, the parallelism of each image pre-download started by DaemonSet is 1, which means the image is downloaded on nodes one by one. You can change the parallelism with the apps.kruise.io/image-predownload-parallelism annotation on the DaemonSet, according to the capability of your image registry. For registries with more bandwidth or P2P image downloading ability, a larger parallelism can speed up the pre-download process.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "10"

6. CloneSet Scaling with PreparingDelete

CloneSet considers Pods in PreparingDelete state as normal by default, which means these Pods are still counted toward the replicas number.

In this situation:

If you scale down replicas from N to N-1 and then scale back up to N while the Pod to be deleted is still in PreparingDelete, the CloneSet moves the Pod back to Normal.
If you scale down replicas from N to N-1 and put a Pod into podsToDelete, then scale back up to N while that Pod is still in PreparingDelete, the CloneSet does not create a new Pod until that Pod starts terminating.
If you delete a specific Pod without changing replicas, the CloneSet does not create a new Pod while that Pod is still in PreparingDelete; it waits until the Pod starts terminating.

Since Kruise v1.3.0, you can add the label apps.kruise.io/cloneset-scaling-exclude-preparing-delete: "true" to a CloneSet, which indicates that Pods in PreparingDelete are not counted toward the replicas number.

In this situation:

If you scale down replicas from N to N-1 and then scale back up to N while the Pod to be deleted is still in PreparingDelete, the CloneSet moves the Pod back to Normal.
If you scale down replicas from N to N-1 and put a Pod into podsToDelete, then scale back up to N, the CloneSet creates a new Pod immediately, even if that Pod is still in PreparingDelete.
If you delete a specific Pod without changing replicas, the CloneSet creates a new Pod immediately, even if that Pod is still in PreparingDelete.

7. Advanced CronJob Time zones

All CronJob schedule: times are based on the timezone of the kruise-controller-manager by default, which means the timezone set for the kruise-controller-manager container determines the timezone that the cron job controller uses.

However, we introduced a spec.timeZone field in v1.3.0. You can set it to the name of a valid time zone. For example, setting spec.timeZone: "Etc/UTC" instructs Kruise to interpret the schedule relative to Coordinated Universal Time.

A time zone database from the Go standard library is included in the binaries and used as a fallback in case an external database is not available on the system.
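To see what a time zone name changes, the sketch below renders one instant as wall-clock time in two zones using only the Go standard library. The blank import of time/tzdata embeds Go's time zone database as a fallback, which is the standard-library mechanism for the behavior described above; whether Kruise wires it in exactly this way is an assumption, and wallClockIn is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the Go time zone database as a fallback
)

// wallClockIn parses an RFC 3339 instant and renders it as wall-clock time
// in the named zone. An unknown zone name yields an error.
func wallClockIn(instant, zone string) (string, error) {
	t, err := time.Parse(time.RFC3339, instant)
	if err != nil {
		return "", err
	}
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return "", err
	}
	return t.In(loc).Format("2006-01-02 15:04"), nil
}

func main() {
	for _, zone := range []string{"Etc/UTC", "Asia/Shanghai"} {
		s, err := wallClockIn("2022-10-07T00:00:00Z", zone)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-14s %s\n", zone, s)
	}
}
```

The same instant reads as midnight in Etc/UTC and 08:00 in Asia/Shanghai, so a schedule such as "0 8 * * *" fires at different instants depending on the zone it is interpreted in.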

8. Other changes

For more changes, their authors and commits, you can read the GitHub release.

OpenKruise v1.2, new PersistentPodState feature to achieve IP retention

June 7, 2022 · 8 min read

Siyu Wang

Maintainer of OpenKruise

We’re pleased to announce the release of OpenKruise 1.2, which is a CNCF Sandbox level project.

What's new?

In release v1.2, OpenKruise provides a new CRD named PersistentPodState, some new fields of CloneSet status and lifecycle hook, and optimization of PodUnavailableBudget.

Here we introduce some of these changes.

1. New CRD and Controller: PersistentPodState

With the development of cloud native, more and more companies have started to deploy stateful services (e.g., Etcd, MQ) on Kubernetes. K8S StatefulSet is a workload for managing stateful services, and it considers the deployment characteristics of stateful services in many aspects. However, StatefulSet persists only limited Pod state, such as ordered and unchanging Pod names and PVC persistence, and cannot cover other states, e.g. Pod IP retention or priority scheduling to previously deployed Nodes. Typical cases:

Service discovery middleware is exceptionally sensitive to the Pod IP after deployment and requires that the IP not change.
Database services persist data to the host disk, so moving them to a different Node results in data loss.

To address these cases, Kruise provides the PersistentPodState CRD, which can persist other states of the Pod, such as "IP Retention".

An object of PersistentPodState may look like this:

apiVersion: apps.kruise.io/v1alpha1
kind: PersistentPodState
metadata:
  name: echoserver
  namespace: echoserver
spec:
  targetRef:
    # Native k8s or kruise StatefulSet
    # only support StatefulSet
    apiVersion: apps.kruise.io/v1beta1
    kind: StatefulSet
    name: echoserver
  # required node affinity. As follows, Pod rebuild will force deployment to the same zone
  requiredPersistentTopology:
    nodeTopologyKeys:
      - failure-domain.beta.kubernetes.io/zone[,other node labels]
  # preferred node affinity. As follows, Pod rebuild will preferred deployment to the same node
  preferredPersistentTopology:
    - preference:
        nodeTopologyKeys:
          - kubernetes.io/hostname[,other node labels]
      # int [1, 100]
      weight: 100

"IP Retention" should be a common requirement for K8S deployments of stateful services. It does not mean "Specified Pod IP", but requires that the Pod IP does not change after the first deployment, either by service release or by machine eviction. To achieve this, we need the K8S network component to support Pod IP retention and the ability to keep the IP as unchanged as possible. In this article, we have modified the Host-local plugin in the flannel network component so that it can achieve the effect of keeping the Pod IP unchanged under the same Node. Related principles will not be stated here, please refer to the code: host-local.

If IP retention is supported by the network component, how is it related to PersistentPodState? Well, there are some limitations to the implementation of "Pod IP unchanged" by network components. For example, flannel can only keep the Pod IP unchanged on the same Node. However, the most important feature of K8S scheduling is "uncertainty", so "how to ensure that Pods are rebuilt and scheduled to the same Node" is the problem that PersistentPodState solves.

Also you can add the annotations below on your StatefulSet or Advanced StatefulSet, to let Kruise automatically create a PersistentPodState object for the StatefulSet. So you don't have to create it manually.

apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
metadata:
  annotations:
    # auto generate PersistentPodState
    kruise.io/auto-generate-persistent-pod-state: "true"
    # preferred node affinity, As follows, Pod rebuild will preferred deployment to the same node
    kruise.io/preferred-persistent-topology: kubernetes.io/hostname[,other node labels]
    # required node affinity, As follows, Pod rebuild will force deployment to the same zone
    kruise.io/required-persistent-topology: failure-domain.beta.kubernetes.io/zone[,other node labels]

2. CloneSet percentage partition calculation changed (breaking), and a new field in its status

Previously, CloneSet rounded up when calculating a percentage partition, which means that even if you set partition to a percentage less than 100%, it might not update any Pod to the new revision. For example, the real partition of a CloneSet with replicas=8 and partition=90% was calculated as 8 (8 * 90% = 7.2, rounded up), so no Pod was updated. This is a little confusing, especially when using a rollout component like Kruise Rollout or Argo.

So since v1.2, CloneSet makes sure that at least one Pod is updated when partition is a percentage less than 100%, unless the CloneSet has replicas <= 1.

However, this arithmetic might be difficult for users to understand, yet they need to know the expected number of updated Pods after a percentage partition is set.

So we also provide a new field expectedUpdatedReplicas in CloneSet status, which directly shows the expected updated number of Pods based on the given partition. Users only have to compare status.updatedReplicas >= status.expectedUpdatedReplicas to decide whether their CloneSet has finished rolling out new revision under partition restriction or not.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  replicas: 8
  updateStrategy:
    partition: 90%
status:
  replicas: 8
  expectedUpdatedReplicas: 1
  updatedReplicas: 1
  updatedReadyReplicas: 1
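The arithmetic behind expectedUpdatedReplicas can be sketched as follows. This follows only the rule stated above (round up, then guarantee at least one updated Pod for a percentage below 100% when replicas > 1); it is not the CloneSet controller's actual code, expectedUpdated is a hypothetical name, and the percentage is assumed to be an integer between 0 and 100.

```go
package main

import "fmt"

// expectedUpdated returns how many Pods should move to the new revision
// for a given replica count and percentage partition.
func expectedUpdated(replicas, partitionPercent int) int {
	// Round replicas * percent / 100 up, using integer arithmetic.
	partition := (replicas*partitionPercent + 99) / 100
	// Since v1.2: a percentage below 100% must leave at least one Pod to
	// update, unless the CloneSet has replicas <= 1.
	if partitionPercent < 100 && partition >= replicas && replicas > 1 {
		partition = replicas - 1
	}
	if partition > replicas {
		partition = replicas
	}
	return replicas - partition
}

func main() {
	fmt.Println(expectedUpdated(8, 90))  // the example above: 8 replicas, partition 90%
	fmt.Println(expectedUpdated(8, 100)) // partition 100% keeps every Pod on the old revision
}
```

For replicas=8 and partition=90%, the rounded-up partition of 8 is reduced to 7, which yields the expectedUpdatedReplicas of 1 shown in the status above.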

3. Able to mark Pod not-ready for lifecycle hook

Kruise has already provided lifecycle hook in previous versions. CloneSet and Advanced StatefulSet support both PreDelete and InPlaceUpdate hooks, while Advanced DaemonSet only supports PreDelete hook.

Previously, the hooks only paused the operation and allowed users to do something (for example, remove the Pod from service endpoints) during Pod deletion and before/after in-place update. But the Pod is probably still Ready during the hook state, so removing it from a custom service implementation may break the Kubernetes convention that only NotReady Pods should be removed from the endpoints.

So a new field, markPodNotReady, has been added to the lifecycle hook. It indicates whether the hooked Pod should be marked as NotReady.

type LifecycleStateType string

// Lifecycle contains the hooks for Pod lifecycle.
type Lifecycle struct {
    // PreDelete is the hook before Pod to be deleted. 
    PreDelete *LifecycleHook `json:"preDelete,omitempty"` 
    // InPlaceUpdate is the hook before Pod to update and after Pod has been updated. 
    InPlaceUpdate *LifecycleHook `json:"inPlaceUpdate,omitempty"`
}

type LifecycleHook struct {
    LabelsHandler     map[string]string `json:"labelsHandler,omitempty"`
    FinalizersHandler []string          `json:"finalizersHandler,omitempty"`
	
    /**********************  FEATURE STATE: 1.2.0 ************************/
    // MarkPodNotReady = true means:
    // - Pod will be set to 'NotReady' at preparingDelete/preparingUpdate state.
    // - Pod will be restored to 'Ready' at Updated state if it was set to 'NotReady' at preparingUpdate state.
    // Default to false.
    MarkPodNotReady bool `json:"markPodNotReady,omitempty"`
    /*********************************************************************/	
}

For PreDelete hook, it will set Pod to be NotReady during PreparingDelete state if markPodNotReady is true, and the Pod can not be changed back to normal even if the replicas is increased again.

For InPlaceUpdate hook, it will set Pod to be NotReady during PreparingUpdate state if markPodNotReady is true, and the NotReady condition will be removed during Updated state.

4. PodUnavailableBudget supports any custom workloads and performance optimization

Kubernetes offers PodDisruptionBudget to help users run highly available applications even when you introduce frequent voluntary disruptions, but it can only constrain the voluntary disruption triggered by the Eviction API.

In voluntary disruption scenarios, PodUnavailableBudget can prevent application disruption or SLA degradation, which greatly improves the availability of application services. It protects application Pods not only from eviction but also from deletion, in-place update, and other operations that could make Pods not ready.

Previously, PodUnavailableBudget only supported some specific workloads like CloneSet and Deployment, and it could not recognize unknown workloads defined by users themselves.

Since v1.2 release, PodUnavailableBudget has supported any custom workloads to protect their Pods from unavailable operations. All you have to do is to declare scale subresource for those custom workloads.

It looks like this in CRD:

    subresources:
      scale:
        labelSelectorPath: .status.labelSelector
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas

But if you are using kubebuilder or operator-sdk to generate your project, one line comment on your workload struct will be fine:

// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.labelSelector

Besides, PodUnavailableBudget also optimizes its performance for large-scale clusters by disabling DeepCopy when listing from the client.

5. Other changes

For more changes, their authors and commits, you can read the GitHub release.

OpenKruise v1.1, features enhanced, improve performance in large-scale clusters

March 29, 2022 · 8 min read

Siyu Wang

Maintainer of OpenKruise

We’re pleased to announce the release of OpenKruise 1.1, which is a CNCF Sandbox level project.

What's new?

In release v1.1, OpenKruise optimizes some existing features and improves its performance in large-scale clusters. Here we introduce some of these changes.

Note that OpenKruise v1.1 bumps Kubernetes dependencies to v1.22, which means we can use new fields of up to K8s v1.22 in Pod template of workloads like CloneSet and Advanced StatefulSet. But OpenKruise can still be used in Kubernetes cluster >= 1.16 version.

1. Keep containers order for in-place update

In release v1.0, published last year, OpenKruise introduced Container Launch Priority, which lets you define different priorities for containers in a Pod and keeps their start order during Pod creation.

But in v1.0, it can only control the order in Pod creation. If you try to update the containers in-place, they will be updated at the same time.

Recently, the community has discussed this with some companies such as LinkedIn and gathered more input from users. In some scenarios, the containers in a Pod have a special relationship: for example, base-container must update its configuration before app-container updates, or multiple containers must not update together so that log-container does not lose the logs of app-container.

So, OpenKruise supports in-place update with container priorities since v1.1.

There are no extra options; just make sure the containers have had their launch priorities since Pod creation. If you modify several of them in one in-place update, Kruise first updates the containers with higher priority and does not update the containers with lower priority until the higher-priority ones have updated successfully.

The in-place update here includes both modification of image and env from metadata; read the concept doc for more details.

For Pods without container launch priorities, the execution order is not guaranteed when multiple containers are updated in place.
For Pods with container launch priorities:
- containers with different priorities are updated in priority order.
- no execution order is guaranteed among containers with the same priority.
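The ordering rules above amount to grouping containers into batches by priority, highest first. The sketch below illustrates that grouping only; updateBatches is a hypothetical helper, not Kruise code, and treating a container without a priority as priority 0 is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// updateBatches groups container names into update batches, highest priority
// first. Containers sharing a priority land in the same batch.
func updateBatches(priorities map[string]int) [][]string {
	byPriority := map[int][]string{}
	for name, p := range priorities {
		byPriority[p] = append(byPriority[p], name)
	}
	levels := make([]int, 0, len(byPriority))
	for p := range byPriority {
		levels = append(levels, p)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(levels)))

	batches := make([][]string, 0, len(levels))
	for _, p := range levels {
		names := byPriority[p]
		// Sorted only to make the output stable; no order is guaranteed
		// among containers with the same priority.
		sort.Strings(names)
		batches = append(batches, names)
	}
	return batches
}

func main() {
	// Assumption: "main" has no explicit priority and is treated as 0.
	fmt.Println(updateBatches(map[string]int{"sidecar": 10, "main": 0}))
}
```

Each batch would start only after the containers in the previous batch have updated successfully.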

For example, we have the CloneSet that includes two containers with different priorities:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v1 ..."
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v1
  updateStrategy:
    type: InPlaceIfPossible

When we update the CloneSet to change both the app-config annotation and the image of the main container, both the sidecar and main containers need to update. Kruise first updates the Pod in place by recreating the sidecar container with the new env from the annotation.

At this moment, we can find the apps.kruise.io/inplace-update-state annotation in updated Pod and see its value:

{
  "revision": "{CLONESET_NAME}-{HASH}",         // the target revision name of this in-place update
  "updateTimestamp": "2022-03-22T09:06:55Z",    // the start time of this whole update
  "nextContainerImages": {"main": "main-image:v2"},                // the next containers that should update images
  // "nextContainerRefMetadata": {...},                            // the next containers that should update env from annotations/labels
  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]},  // the pre-check must be satisfied before the next containers can update
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]}  // the first batch of containers that have updated (it just means the spec of containers has updated, such as images in pod.spec.container or annotations/labels, but doesn't mean the real containers on node have been updated completely)
  ]
}

When the sidecar container has been updated successfully, Kruise will update the next main container. Finally, you will find the apps.kruise.io/inplace-update-state annotation looks like:

{
  "revision": "{CLONESET_NAME}-{HASH}",
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "lastContainerStatuses":{"main":{"imageID":"THE IMAGE ID OF OLD MAIN CONTAINER"}},
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]},
    {"timestamp":"2022-03-22T09:07:20Z","containers":["main"]}
  ]
}

Usually, users only have to care about the containerBatchesRecord to make sure the containers are updated in different batches. If the Pod is blocked during in-place update, you should check nextContainerImages/nextContainerRefMetadata and see whether the previous containers listed in preCheckBeforeNext have been updated successfully and are ready.
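The check described above can be sketched by decoding the annotation value. The field names come from the JSON shown earlier; the Go type inPlaceUpdateState and the function summarize are hypothetical, and treating nextContainerRefMetadata as a JSON object of unknown shape is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inPlaceUpdateState mirrors only the annotation fields used below.
type inPlaceUpdateState struct {
	NextContainerImages      map[string]string          `json:"nextContainerImages"`
	NextContainerRefMetadata map[string]json.RawMessage `json:"nextContainerRefMetadata"`
	PreCheckBeforeNext       *struct {
		ContainersRequiredReady []string `json:"containersRequiredReady"`
	} `json:"preCheckBeforeNext"`
	ContainerBatchesRecord []struct {
		Timestamp  string   `json:"timestamp"`
		Containers []string `json:"containers"`
	} `json:"containerBatchesRecord"`
}

// summarize returns the number of recorded batches and, when more containers
// are still pending, the containers that must be ready before the next batch.
func summarize(annotation string) (int, []string, error) {
	var state inPlaceUpdateState
	if err := json.Unmarshal([]byte(annotation), &state); err != nil {
		return 0, nil, err
	}
	batches := len(state.ContainerBatchesRecord)
	pending := len(state.NextContainerImages)+len(state.NextContainerRefMetadata) > 0
	if pending && state.PreCheckBeforeNext != nil {
		return batches, state.PreCheckBeforeNext.ContainersRequiredReady, nil
	}
	return batches, nil, nil
}

func main() {
	annotation := `{
	  "nextContainerImages": {"main": "main-image:v2"},
	  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]},
	  "containerBatchesRecord": [
	    {"timestamp": "2022-03-22T09:06:55Z", "containers": ["sidecar"]}
	  ]
	}`
	batches, waitingOn, err := summarize(annotation)
	if err != nil {
		panic(err)
	}
	fmt.Println("batches done:", batches, "waiting on:", waitingOn)
}
```

Note that the annotated JSON shown earlier contains explanatory // comments; the real annotation value is plain JSON, which is what this sketch expects.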

2. StatefulSetAutoDeletePVC

Since Kubernetes v1.23, the upstream StatefulSet has supported the StatefulSetAutoDeletePVC feature, which controls if and how PVCs are deleted during the lifecycle of a StatefulSet; refer to this doc.

So, Advanced StatefulSet has rebased this feature from upstream, which also requires you to enable the StatefulSetAutoDeletePVC feature-gate when installing or upgrading Kruise.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  ...
  persistentVolumeClaimRetentionPolicy:  # optional
    whenDeleted: Retain | Delete
    whenScaled: Retain | Delete

Once enabled, there are two policies you can configure for each StatefulSet:

whenDeleted: configures the volume retention behavior that applies when the StatefulSet is deleted.
whenScaled: configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when scaling down the set.

For each policy that you can configure, you can set the value to either Delete or Retain.

Retain (default): PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.
Delete: The PVCs created from the volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.

Note that:

StatefulSetAutoDeletePVC only deletes PVCs created from the volumeClaimTemplate, not PVCs created by users or otherwise related to StatefulSet Pods.
The policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.
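The two policies can be summarized as a small decision function. This is a sketch of the rules stated above, not the controller implementation; pvcOrdinalsToDelete is a hypothetical name, and it assumes that Pod ordinals start at 0 and that scaling down removes the highest ordinals.

```go
package main

import "fmt"

// pvcOrdinalsToDelete returns the ordinals of Pods whose volumeClaimTemplate
// PVCs would be deleted. setDeleted is true when the StatefulSet itself is
// being deleted; otherwise the call describes a change from oldReplicas to
// newReplicas.
func pvcOrdinalsToDelete(whenDeleted, whenScaled string, oldReplicas, newReplicas int, setDeleted bool) []int {
	from, to := 0, 0
	switch {
	case setDeleted && whenDeleted == "Delete":
		// All PVCs from the volumeClaimTemplate go away with the set.
		from, to = 0, oldReplicas
	case !setDeleted && newReplicas < oldReplicas && whenScaled == "Delete":
		// Only PVCs of the replicas being scaled down are deleted.
		from, to = newReplicas, oldReplicas
	}
	var ordinals []int
	for i := from; i < to; i++ {
		ordinals = append(ordinals, i)
	}
	return ordinals // nil means every PVC is retained
}

func main() {
	fmt.Println(pvcOrdinalsToDelete("Retain", "Delete", 5, 3, false)) // scale down from 5 to 3
	fmt.Println(pvcOrdinalsToDelete("Delete", "Retain", 3, 3, true))  // delete the StatefulSet
}
```

A Pod that is recreated after a node failure matches neither case, so its PVC is retained, as noted above.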

3. Advanced DaemonSet refactor, lifecycle hook

The behavior of Advanced DaemonSet used to be a little different from the upstream controller. For example, it required extra configuration to choose whether not-ready and unschedulable nodes should be handled, which confused users.

In release v1.1, we have refactored Advanced DaemonSet to rebase it on upstream. Now the default behavior of Advanced DaemonSet is the same as the upstream DaemonSet, which means users can conveniently modify the apiVersion field to convert a built-in DaemonSet to Advanced DaemonSet.

Meanwhile, we also added a lifecycle hook for Advanced DaemonSet. Currently it supports the preDelete hook, which allows users to do something (for example, check node resources) before a Pod is deleted.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  ...
  # define with label
  lifecycle:
    preDelete:
      labelsHandler:
        example.io/block-deleting: "true"

When Advanced DaemonSet deletes a Pod (including scale-in and recreate update):

It deletes the Pod directly if there is no lifecycle hook definition or the Pod does not match the preDelete hook.
Otherwise, Advanced DaemonSet first updates the Pod to PreparingDelete state and waits for the user controller to remove the label/finalizer so that the Pod no longer matches the preDelete hook.

4. Improve performance by disabling DeepCopy

By default, when we write an Operator/Controller with controller-runtime and use the Client interface in sigs.k8s.io/controller-runtime/pkg/client to get/list typed objects, the objects always come from the Informer. Most people know this.

What many people don't know is that controller-runtime first deep copies all the objects it gets from the Informer and then returns the copies.

This design aims to prevent developers from directly modifying the objects in the Informer. After DeepCopy, no matter how developers modify the objects returned by get/list, the objects in the Informer do not change; they are only synced by ListWatch from kube-apiserver.

However, in some large-scale clusters, multiple OpenKruise controllers and their workers reconcile concurrently, which can cause a very large number of DeepCopy operations. For example, when there are many application CloneSets and some of them manage thousands of Pods, each worker lists all Pods of its CloneSet during Reconcile, and multiple workers do so at once. This puts CPU and memory pressure on kruise-manager and sometimes even makes it run out of memory.

So I submitted the DisableDeepCopy feature upstream, where it was merged and is included in controller-runtime >= v0.10. It allows developers to specify some resource types that are returned directly from the Informer without DeepCopy during get/list.

For example, we can add cache options when initializing the Manager in main.go to avoid DeepCopy for Pod objects.

    mgr, err := ctrl.NewManager(cfg, ctrl.Options{
        ...
        NewCache: cache.BuilderWithOptions(cache.Options{
            UnsafeDisableDeepCopyByObject: map[client.Object]bool{
                &v1.Pod{}: true,
            },
        }),
    })

But in Kruise v1.1, we re-implement Delegating Client instead of using the feature of controller-runtime. It allows developers to avoid DeepCopy with DisableDeepCopy ListOption in any list places, which is more flexible.

    if err := r.List(context.TODO(), &podList, client.InNamespace("default"), utilclient.DisableDeepCopy); err != nil {
        return nil, nil, err
    }
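The trade-off discussed in this section can be illustrated without controller-runtime at all. The sketch below uses a toy cache to show why a deep copy protects cached objects and what a caller takes on when it is skipped; the types pod and informerCache and the disableDeepCopy parameter are hypothetical stand-ins, not the Kruise or controller-runtime API.

```go
package main

import "fmt"

// pod is a toy object standing in for a cached Kubernetes object.
type pod struct {
	Name   string
	Labels map[string]string
}

// deepCopy returns a copy that shares no memory with the original.
func (p *pod) deepCopy() *pod {
	out := &pod{Name: p.Name, Labels: make(map[string]string, len(p.Labels))}
	for k, v := range p.Labels {
		out.Labels[k] = v
	}
	return out
}

// informerCache is a toy stand-in for the shared Informer cache.
type informerCache struct {
	pods map[string]*pod
}

// get returns a copy by default; with disableDeepCopy it returns the cached
// object itself, which is cheaper but must then be treated as read-only.
func (c *informerCache) get(name string, disableDeepCopy bool) *pod {
	p := c.pods[name]
	if p == nil || disableDeepCopy {
		return p
	}
	return p.deepCopy()
}

func main() {
	c := &informerCache{pods: map[string]*pod{
		"a": {Name: "a", Labels: map[string]string{"version": "1"}},
	}}

	c.get("a", false).Labels["version"] = "2" // mutates a private copy
	fmt.Println("after copy mutation:  ", c.pods["a"].Labels["version"])

	c.get("a", true).Labels["version"] = "3" // mutates the cache itself
	fmt.Println("after shared mutation:", c.pods["a"].Labels["version"])
}
```

This is why objects listed without DeepCopy should be copied explicitly before any modification.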

5. Other changes

For more changes, their authors and commits, you can read the GitHub release.

OpenKruise v1.0, Reaching New Peaks of application automation

December 13, 2021 · 7 min read

Siyu Wang

Maintainer of OpenKruise

We’re pleased to announce the release of OpenKruise 1.0, which is a CNCF Sandbox level project.

[Figure: OpenKruise features]

Overall, OpenKruise currently provides features in these areas:

Application workloads: enhanced deployment and upgrade strategies for stateless, stateful, and daemon applications, such as in-place update and canary/flowing upgrade.
Sidecar container management: define sidecar containers independently, so that Kruise can inject them, upgrade them without affecting application containers, and even hot-upgrade them.
Enhanced operations: restart containers in place, pre-download images on specific nodes, control container launch priority within a Pod, and distribute one resource to multiple namespaces.
Application availability protection: protect the availability of applications deployed in Kubernetes.

What's new?

1. InPlace Update for environments

Author: @FillZpp

OpenKruise has supported in-place update since its very early versions, mostly for workloads like CloneSet and Advanced StatefulSet. Compared with recreating Pods during an upgrade, in-place update only has to modify fields in the existing Pods.

Figure: in-place update comparison

As the picture above shows, only the image field in the Pod is modified during an in-place update. As a result:

It avoids the additional cost of scheduling, allocating IPs, and allocating and mounting volumes.
Image pulling is faster, because most image layers pulled for the old image can be reused and only a few new layers need to be pulled.
While one container is being updated in place, the other containers in the Pod are not affected and keep running.

However, OpenKruise previously supported in-place update only for the image field in a Pod, and had to recreate Pods when other fields needed updating. Over time, more and more users asked for in-place update of additional fields such as env, which is hard to implement because it is restricted by kube-apiserver.

Since v1.0, OpenKruise supports in-place update of environment variables via the Downward API. Take the CloneSet YAML below as an example: the user puts the configuration in an annotation and defines an env entry that reads from it. Afterwards, changing the configuration only requires modifying the annotation value. Kruise then restarts every container in the Pod whose env references that annotation, so that the new configuration takes effect.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... the real env value ..."
    spec:
      containers:
      - name: app
        env:
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
  updateStrategy:
    type: InPlaceIfPossible

At the same time, we have removed the imageID restriction for in-place update, which means you can update to a new image that has the same imageID as the old one.

For more details please read documentation.

2. Distribute resources over multiple namespaces

Author: @veophi

Namespace-scoped resources such as Secret and ConfigMap often need to be distributed or synchronized to multiple namespaces. Native Kubernetes currently only supports doing this manually, one namespace at a time, which is very inconvenient.

Typical examples:

When users want to use the imagePullSecrets capability of SidecarSet, they must repeatedly create corresponding Secrets in relevant namespaces, and ensure the correctness and consistency of these Secret configurations;
When users want to configure some common environment variables, they probably need to distribute ConfigMaps to multiple namespaces, and the subsequent modifications of these ConfigMaps might require synchronization among these namespaces.

Therefore, for scenarios that require distributing resources and continuously synchronizing them across namespaces, we provide a tool named ResourceDistribution that does this automatically.

Currently, ResourceDistribution supports two kinds of resources: Secret and ConfigMap.

apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
  name: sample
spec:
  resource:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: game-demo
    data:
      ...
  targets:
    namespaceLabelSelector:
      ...
    # or includedNamespaces, excludedNamespaces

As shown above, ResourceDistribution is a cluster-scoped CRD that is mainly composed of two fields: resource and targets.

resource is a complete and correct resource structure in YAML style.
targets indicates the target namespaces that the resource should be distributed into.

For more details please read documentation.

3. Container launch priority

Author: @Concurrensee

Containers in the same Pod may depend on each other, meaning the application in one container relies on another container. For example:

Container A has to start first. Container B can start only if A is already running.
Container B has to exit first. Container A can stop only if B has already exited.

Currently, the start and stop order of containers is controlled by Kubelet. Kubernetes used to have a KEP that planned to add a type field to containers to identify their start and stop priority. However, it was rejected because sig-node thought it would bring a huge change to the code.

So OpenKruise provides a feature named Container Launch Priority, which helps users control the start order of containers in a Pod.

Users only have to put the annotation apps.kruise.io/container-launch-priority: Ordered on a Pod, and Kruise will ensure that all containers in this Pod are started in the order of the pod.spec.containers list.
To customize the launch order, add the KRUISE_CONTAINER_PRIORITY environment variable to a container. Its value must be in the range [-2147483647, 2147483647]. A container with higher priority is guaranteed to start before those with lower priority.

For more details please read documentation.
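The ordering rule for KRUISE_CONTAINER_PRIORITY can be sketched in a few lines of Go. This is an illustration only, not Kruise's implementation: the `Container` type is a simplified stand-in, and treating a missing or invalid value as priority 0 and breaking ties by list order are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// Container is a simplified, hypothetical stand-in for a Pod container.
type Container struct {
	Name string
	Env  map[string]string
}

// priorityOf reads KRUISE_CONTAINER_PRIORITY. A missing or invalid value is
// treated as 0, which is an assumption of this sketch.
func priorityOf(c Container) int {
	n, err := strconv.Atoi(c.Env["KRUISE_CONTAINER_PRIORITY"])
	if err != nil {
		return 0
	}
	return n
}

// launchOrder returns container names with the highest priority first. The
// stable sort keeps the original list order among equal priorities.
func launchOrder(containers []Container) []string {
	sorted := append([]Container(nil), containers...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return priorityOf(sorted[i]) > priorityOf(sorted[j])
	})
	names := make([]string, 0, len(sorted))
	for _, c := range sorted {
		names = append(names, c.Name)
	}
	return names
}

func main() {
	pod := []Container{
		{Name: "app"},
		{Name: "proxy", Env: map[string]string{"KRUISE_CONTAINER_PRIORITY": "10"}},
		{Name: "logger", Env: map[string]string{"KRUISE_CONTAINER_PRIORITY": "-5"}},
	}
	fmt.Println(launchOrder(pod))
}
```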

4. `kubectl-kruise` commandline tool

Author: @hantmac

OpenKruise already provides SDKs such as kruise-api and client-java for several programming languages, which can be imported into users' projects. However, some users also need to operate workload resources from the command line in test environments.

The rollout and set image commands of the original kubectl only work for built-in workloads, such as Deployment and StatefulSet.

So OpenKruise now provides a command-line tool named kubectl-kruise, a standard kubectl plugin that works with OpenKruise workload types.

# rollout undo cloneset
$ kubectl kruise rollout undo cloneset/nginx

# rollout status advanced statefulset
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts-demo

# set image of a cloneset
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1

For more details please read documentation.

5. Other changes

CloneSet:

Add maxUnavailable field in scaleStrategy to support rate limiting of scaling up.
Mark a revision as stable once all pods are updated to it, without waiting for all pods to be ready.

WorkloadSpread:

Manage pods that were created before the WorkloadSpread.
Optimize the update and retry logic for webhook injection.

Advanced DaemonSet:

Support in-place update of Daemon Pods.
Support a progressive annotation to control whether pod creation should be limited by partition.

SidecarSet:

Fix SidecarSet filter active pods.
Add SourceContainerNameFrom and EnvNames fields in transferenv to make the container name flexible and the list shorter.

PodUnavailableBudget:

Add no pub-protection annotation to skip validation for the specific Pod.
The PodUnavailableBudget controller watches for changes to workload replicas.

NodeImage:

Add --nodeimage-creation-delay flag to delay NodeImage creation after Node ready.

UnitedDeployment:

Fix pod NodeSelectorTerms length 0 when UnitedDeployment NodeSelectorTerms is nil.

Other optimization:

kruise-daemon list and watch pods using protobuf.
Export the cache resync argument in chart values, defaulting to 0.
Fix http checker reloading after webhook certs updated.
Generate CRDs with original controller-tools and markers.

Get Involved

Join the community on Slack (English).
Join the community on DingTalk: Search GroupID 23330762 (Chinese).
Join the community on WeChat: Search User openkruise and let the robot invite you (Chinese).

WorkloadSpread - Interpretation for OpenKruise v0.10.0 new feature

September 22, 2021 · 12 min read

GuangLei Cao

Contributor of OpenKruise

Weixiang Sun

Member of OpenKruise

Background

Deploying an application across different zones, hardware types, and even clusters and cloud vendors is becoming a very common requirement as cloud native technology develops. Some example cases:

Cases about disaster tolerance:

Application pods are scattered across nodes to avoid stacking.
Application pods are scattered across availability zones.
Different nodes/zones/domains require different numbers of pods.

Cases about cost control:

People deploy an application preferentially to their own resource pool, and then to an elastic resource pool, such as ECI on Aliyun and Fargate on AWS, when their own resources are insufficient. When scaling in, pods on elastic nodes are removed first to save cost.

In most of these cases, people split their application into multiple workloads (such as several Deployments) to deploy it. However, this solution often requires manual management by the SRE team, or a deeply customized PaaS that carefully manages the multiple workloads belonging to one application.

To solve this problem, the WorkloadSpread feature was introduced in OpenKruise v0.10.0. It supports multiple kinds of workloads, such as Deployment, ReplicaSet, Job, and CloneSet, and manages their partitioned deployment and elastic scaling. The application scenarios and implementation principles of WorkloadSpread are introduced in detail below to help users better understand this feature.

Introduction

More details about WorkloadSpread can be found in the official documentation.

In short, WorkloadSpread distributes the pods of a workload to different types of nodes according to certain rules, so as to meet the partitioning and elasticity scenarios above. WorkloadSpread is non-invasive, "plug and play", and also takes effect for existing workloads.

Let's make a simple comparison with some related work in the community.

「1」Pod Topology Spread Constraints

Pod topology spread constraints are a solution provided by the Kubernetes community. They scatter pods horizontally according to a topology key. When users define such a rule, the scheduler selects nodes that satisfy it.

Since Pod Topology Spread distributes pods evenly, it cannot support exact, customized per-partition counts or proportions. Furthermore, the distribution of pods can be broken when scaling down. WorkloadSpread avoids these problems.

「2」UnitedDeployment

UnitedDeployment is a solution provided by the OpenKruise community. It can manage pods in multiple regions by creating and managing multiple workloads.

UnitedDeployment supports the requirements of partitioning and elasticity very well. However, it is a new workload type, so adoption and migration costs are relatively high, whereas WorkloadSpread is a lightweight solution that only needs a simple configuration to be associated with an existing workload.

Use Case

In this section, I list some application scenarios of WorkloadSpread and give the corresponding configurations to help users quickly understand the feature.

「1」Deploy 100 pods to the normal node pool, and the remaining pods to the elastic node pool

case-1

subsets:
- name: subset-normal
  maxReplicas: 100
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic
  # maxReplicas==nil means no limit for replicas
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

When the workload has 100 replicas or fewer, all pods are deployed to the normal node pool; any replicas beyond 100 are deployed to the elastic node pool. When scaling down, the pods on elastic nodes are deleted first.

Because WorkloadSpread only constrains the distribution of the workload and does not intrude on the workload itself, users can also combine it with HPA to adjust the number of replicas dynamically according to resource load.

In this way, pods are automatically scheduled to the elastic node pool when peak traffic arrives, and resources in the elastic pool are released first when the peak has passed.
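The resulting split between the two pools can be worked out with a small Go sketch. It is a simplified model rather than Kruise code: the `Subset` type and `distribute` function are hypothetical, and the sketch only captures "fill subsets in list order up to maxReplicas".

```go
package main

import "fmt"

// Subset mirrors the two fields used here; MaxReplicas == nil means no limit.
type Subset struct {
	Name        string
	MaxReplicas *int
}

// distribute fills subsets in list order and returns the pod count per subset.
func distribute(replicas int, subsets []Subset) []int {
	counts := make([]int, len(subsets))
	remaining := replicas
	for i, s := range subsets {
		n := remaining
		if s.MaxReplicas != nil && n > *s.MaxReplicas {
			n = *s.MaxReplicas
		}
		counts[i] = n
		remaining -= n
	}
	return counts
}

func main() {
	limit := 100
	subsets := []Subset{
		{Name: "subset-normal", MaxReplicas: &limit},
		{Name: "subset-elastic"}, // no limit
	}
	for _, replicas := range []int{80, 150} {
		fmt.Println(replicas, "replicas ->", distribute(replicas, subsets))
	}
}
```

With 80 replicas everything stays in subset-normal; with 150 replicas the 50 pods beyond the limit land in subset-elastic.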

「2」Deploy pods to the normal node pool first, and to the elastic resource pool when the normal node pool is insufficient

case-2

scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
- name: subset-normal
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

Neither subset limits the number of replicas, and the Adaptive rescheduling policy is enabled. The goal is to deploy to the normal node pool preferentially. When normal resources are insufficient, the webhook selects elastic nodes through simulated scheduling. When a pod in the normal node pool stays Pending for longer than the 30s threshold, the WorkloadSpread controller deletes the pod to trigger its recreation, and the new pod is scheduled to the elastic node pool. When scaling down, pods on elastic nodes are also removed first to save costs for users.

「3」Scatter to 3 zones with a ratio of 1:1:3

case-3

subsets:
- name: subset-a
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-a
- name: subset-b
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-b
- name: subset-c
  maxReplicas: 60%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-c

WorkloadSpread ensures that the pods are scheduled according to the defined proportion when scaling up and down.
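As a worked example of the percentages above, the per-subset caps for a given replica count can be computed as follows. The `scaledCap` helper is hypothetical, and rounding up for non-integer results is an assumption of this sketch rather than a documented Kruise rule.

```go
package main

import "fmt"

// scaledCap converts a percentage maxReplicas into a pod count.
// Rounding up is an assumption of this sketch.
func scaledCap(replicas, percent int) int {
	return (replicas*percent + 99) / 100
}

func main() {
	percents := []int{20, 20, 60} // subset-a, subset-b, subset-c
	for _, replicas := range []int{5, 10} {
		caps := make([]int, len(percents))
		for i, p := range percents {
			caps[i] = scaledCap(replicas, p)
		}
		fmt.Println(replicas, "replicas ->", caps)
	}
}
```

For 10 replicas this yields caps of 2, 2, and 6, which matches the 1:1:3 ratio.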

「4」Configure different resource quotas for different CPU architectures

case-4

subsets:
- name: subset-x86-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: x86
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "500m"
            memory: "800Mi"
- name: subset-arm-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: arm
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "300m"
            memory: "600Mi"

In the example above, we patch different labels and container resources onto the pods of the two subsets, which lets us manage the pods at a finer granularity. When workload pods are distributed across nodes with different CPU architectures, configuring different resource quotas makes better use of the hardware.

Implementation

WorkloadSpread is a pure bypass solution for elasticity and topology control. Users only need to create a corresponding WorkloadSpread config for their Deployment/CloneSet/Job/ReplicaSet workloads. There is no need to change the workloads themselves, and using WorkloadSpread adds no extra cost for users.

arch

「1」 How to decide the priority when scaling up?

Multiple subsets are defined in a WorkloadSpread, and each subset represents a logical domain. Users can freely define subsets by node configuration, hardware type, zone, and so on. In particular, we define the priority of subsets as follows:

Priority runs from high to low in list order: subset[i] has higher priority than subset[j] if i < j.
Pods are scheduled to the subsets with higher priority first.

「2」 How to decide the priority when scaling down?

In principle, WorkloadSpread, as a bypass solution, cannot interfere with the scale-down logic inside the workload controller.

However, this problem has recently been solved. Thanks to persistent feedback from users, Kubernetes has supported specifying the "deletion cost" of pods for ReplicaSet (and thus Deployment) since version 1.21, via the annotation controller.kubernetes.io/pod-deletion-cost: the higher the deletion cost, the lower the priority of deletion.

CloneSet has supported the deletion cost feature since OpenKruise v0.9.0.

Therefore, the WorkloadSpread controller controls the scale-down order of pods by adjusting their deletion cost.

For example, suppose a WorkloadSpread is associated with a CloneSet of 10 replicas and is configured as follows:

  subsets:
  - name: subset-a
    maxReplicas: 8
  - name: subset-b

Then the deletion cost values and the deletion order are as follows:

The 8 pods in subset-a have a deletion cost of 200.
The 2 pods in subset-b have a deletion cost of 100, and are deleted first.

If the user modifies the WorkloadSpread to:

  subsets:
  - name: subset-a
    maxReplicas: 5 # 8->5
  - name: subset-b

Then the deletion cost values and the deletion order change as follows:

5 pods in subset-a have a deletion cost of 200.
3 pods in subset-a have a deletion cost of -100, and are deleted first.
The 2 pods in subset-b have a deletion cost of 100.

In this way, the workload preferentially scales down the pods that exceed the subset's maxReplicas limit.
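The cost assignment in this example can be reproduced with a short Go sketch. The formula is an assumption chosen to match the numbers above (100 × (number of subsets − index) for pods within the limit, −100 × (index + 1) for pods beyond it); the real controller may compute the values differently, and the `Subset` and `CostGroup` types are hypothetical.

```go
package main

import "fmt"

// Subset mirrors the two fields used here; MaxReplicas == nil means no limit.
type Subset struct {
	Name        string
	MaxReplicas *int
}

// CostGroup describes how many pods of a subset share one deletion cost.
type CostGroup struct {
	Subset string
	Pods   int
	Cost   int
}

// deletionCostGroups assigns a positive cost to pods within maxReplicas
// (higher for earlier subsets) and a negative cost to pods beyond it.
// The exact formula is an assumption of this sketch.
func deletionCostGroups(subsets []Subset, running []int) []CostGroup {
	var groups []CostGroup
	n := len(subsets)
	for i, s := range subsets {
		kept := running[i]
		if s.MaxReplicas != nil && kept > *s.MaxReplicas {
			kept = *s.MaxReplicas
		}
		if kept > 0 {
			groups = append(groups, CostGroup{Subset: s.Name, Pods: kept, Cost: 100 * (n - i)})
		}
		if excess := running[i] - kept; excess > 0 {
			groups = append(groups, CostGroup{Subset: s.Name, Pods: excess, Cost: -100 * (i + 1)})
		}
	}
	return groups
}

func main() {
	limit := 5 // maxReplicas of subset-a after the 8->5 change
	subsets := []Subset{{Name: "subset-a", MaxReplicas: &limit}, {Name: "subset-b"}}
	for _, g := range deletionCostGroups(subsets, []int{8, 2}) {
		fmt.Printf("%d pods in %s: deletion cost %d\n", g.Pods, g.Subset, g.Cost)
	}
}
```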

「3」 How to solve the counting problems?

A key problem in the implementation of WorkloadSpread is how to ensure that the webhook injects pod rules in strict accordance with the priority order of the subsets and their maxReplicas values.

3.1 Solving the concurrency consistency problem

Since there may be several kruise-controller-manager pods and many webhook Goroutines processing the same WorkloadSpread, concurrency consistency problems are inevitable.

The status of a WorkloadSpread contains a subsetStatuses field with one entry per subset. The missingReplicas field in each entry indicates how many more pods the subset needs, and -1 indicates that there is no limit (subset.maxReplicas == nil).

spec:
  subsets:
  - name: subset-a
    maxReplicas: 1
  - name: subset-b
  # ...
status:
  subsetStatuses:
  - name: subset-a
    missingReplicas: 1
  - name: subset-b
    missingReplicas: -1
  # ...

When the webhook receives a pod create request:

1. Find a suitable subset, in subset order, whose missingReplicas is greater than 0 or equal to -1.
2. After finding a suitable subset, if missingReplicas is greater than 0, subtract 1 first and try to update the WorkloadSpread status.
3. If the update succeeds, inject the rules defined by the subset into the pod.
4. If the update fails, get the WorkloadSpread again to obtain the latest status, and return to step 1 (the number of retries is limited).

Similarly, when the webhook receives a pod delete or eviction request, it adds 1 to missingReplicas and updates the status.
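The create-request steps above can be sketched with an in-memory store that rejects stale updates, imitating resourceVersion-based optimistic concurrency. Everything here (`Store`, `admitPod`, the integer version) is a hypothetical simplification for illustration and not the webhook's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type SubsetStatus struct {
	Name            string
	MissingReplicas int // -1 means no limit
}

type WorkloadSpread struct {
	ResourceVersion int
	Subsets         []SubsetStatus
}

var (
	errConflict   = errors.New("conflict: stale resourceVersion")
	errNoCapacity = errors.New("no subset has capacity")
)

// Store imitates an API server that rejects updates made from a stale version.
type Store struct {
	mu      sync.Mutex
	current WorkloadSpread
}

func copyOf(ws WorkloadSpread) WorkloadSpread {
	ws.Subsets = append([]SubsetStatus(nil), ws.Subsets...)
	return ws
}

func (s *Store) Get() WorkloadSpread {
	s.mu.Lock()
	defer s.mu.Unlock()
	return copyOf(s.current)
}

func (s *Store) Update(ws WorkloadSpread) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ws.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	ws.ResourceVersion++
	s.current = copyOf(ws)
	return nil
}

// admitPod picks the first subset with capacity and reserves one slot in it,
// re-reading the status and retrying when the update conflicts.
func admitPod(store *Store, maxRetries int) (string, error) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		ws := store.Get()
		idx := -1
		for i, sub := range ws.Subsets {
			if sub.MissingReplicas > 0 || sub.MissingReplicas == -1 {
				idx = i
				break
			}
		}
		if idx < 0 {
			return "", errNoCapacity
		}
		if ws.Subsets[idx].MissingReplicas == -1 {
			return ws.Subsets[idx].Name, nil // unlimited: nothing to reserve
		}
		ws.Subsets[idx].MissingReplicas--
		if err := store.Update(ws); err == nil {
			return ws.Subsets[idx].Name, nil
		}
	}
	return "", errConflict
}

func main() {
	store := &Store{current: WorkloadSpread{Subsets: []SubsetStatus{
		{Name: "subset-a", MissingReplicas: 3},
		{Name: "subset-b", MissingReplicas: -1},
	}}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	placed := map[string]int{}
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if name, err := admitPod(store, 10); err == nil {
				mu.Lock()
				placed[name]++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("subset-a:", placed["subset-a"], "subset-b:", placed["subset-b"])
}
```

Even with ten concurrent requests, the version check ensures that exactly three pods reserve a slot in subset-a and the rest fall through to subset-b.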

Clearly, we are using optimistic locking to resolve update conflicts. However, optimistic locking alone is not enough: a workload creates a large number of pods in parallel, so the APIServer sends many pod create requests to the webhook in an instant, and processing them in parallel produces a lot of conflicts. Optimistic locking is a poor fit for high-contention situations, because the cost of retrying after each conflict is very high. To this end, we also added a mutex at the WorkloadSpread level to turn parallel processing into serial processing.

Adding the mutex introduces a new problem: after the current Goroutine obtains the lock, the WorkloadSpread it gets from the informer is very likely not up to date, so the update conflicts anyway. Therefore, after updating the WorkloadSpread, the Goroutine caches the latest WorkloadSpread before releasing the lock, so that the next Goroutine can read the latest WorkloadSpread directly from the cache after obtaining the lock. Of course, when there are multiple webhook instances, we still need the optimistic lock mechanism to resolve conflicts.

3.2 Solving the data consistency problem

So, is the missingReplicas field controlled by the webhook alone? The answer is no, because:

The pod create request received by the webhook may not actually succeed in the end (for example, the pod is invalid or fails a later quota check).
The pod delete/eviction request received by the webhook may not actually succeed in the end (for example, it is intercepted by PDB, PUB, etc.).
In Kubernetes there are always ways for pods to terminate or disappear without going through the webhook (for example, the phase becomes Succeeded/Failed, or etcd data is lost).
Relying on the webhook alone is also not in line with the end-state-oriented design concept.

Therefore, the WorkloadSpread status is maintained by the webhook in collaboration with the controller:

The webhook intercepts pod create/delete/eviction requests and modifies missingReplicas.
At the same time, the controller's reconcile gets all pods under the current workload, classifies them by subset, and updates missingReplicas to the actual missing quantity.
As the analysis above suggests, the controller is likely to see pods from the informer with some delay, so we also added a creatingPods map to the status. When the webhook injects a pod, it records an entry in the map with the pod name as the key and a timestamp as the value, and the controller combines this map with the observed pods to maintain the real missingReplicas. Similarly, there is a deletingPods map that records pod delete/eviction events.
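One way the controller might combine observed pods with the creatingPods map is sketched below. The function, its parameters, and in particular the expiry window for creatingPods entries are assumptions made for illustration; timestamps are plain Unix seconds to keep the example small.

```go
package main

import "fmt"

// missingReplicas recomputes how many pods a subset still needs.
// observed holds pods already visible in the informer; creatingPods holds
// pods the webhook admitted (name -> Unix seconds) that may not be visible
// yet. Entries older than ttlSeconds are ignored, which is an assumption.
func missingReplicas(maxReplicas *int, observed map[string]bool, creatingPods map[string]int64, now, ttlSeconds int64) int {
	if maxReplicas == nil {
		return -1 // no limit
	}
	active := len(observed)
	for name, createdAt := range creatingPods {
		if !observed[name] && now-createdAt < ttlSeconds {
			active++ // admitted by the webhook but not yet seen by the controller
		}
	}
	if missing := *maxReplicas - active; missing > 0 {
		return missing
	}
	return 0
}

func main() {
	limit := 5
	observed := map[string]bool{"pod-1": true, "pod-2": true}
	creating := map[string]int64{
		"pod-2": 100, // already observed, not counted twice
		"pod-3": 100, // recently admitted, counted
		"pod-4": 10,  // stale entry, ignored
	}
	fmt.Println(missingReplicas(&limit, observed, creating, 110, 30))
}
```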

「4」What happens if pod scheduling fails?

WorkloadSpread supports configuring a schedule strategy. By default the type is Fixed, that is, pods are scheduled to the corresponding subsets strictly according to the subset order and the maxReplicas limits.

However, in real scenarios, a subset often cannot actually hold maxReplicas pods, for example because of insufficient resources. Users need a more flexible strategy.

The adaptive capabilities provided by WorkloadSpread are logically divided into two types:

SimulationSchedule: scheduling records exist in the informer, so we simulate the scheduling of pods in the webhook. That is, simple filtering is performed based on nodeSelector/affinity, tolerations, and basic resources. (Not applicable to virtual-kubelet.)
Reschedule: after a pod is scheduled to a subset, if it remains unscheduled for longer than rescheduleCriticalSeconds, the subset is temporarily marked as unschedulable and the pod is deleted to trigger recreation. By default, the unschedulable mark is kept for 5 minutes, that is, pods created within those 5 minutes skip this subset.
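The Reschedule behavior can be sketched as two small checks: whether a pending pod has exceeded the threshold, and whether a subset is still inside its unschedulable window. The types and function names are hypothetical, and timestamps are Unix seconds for simplicity.

```go
package main

import "fmt"

// SubsetState tracks whether a subset was recently marked unschedulable.
type SubsetState struct {
	Name          string
	Unschedulable bool
	MarkedAt      int64 // Unix seconds when the mark was set
}

// shouldReschedule reports whether a pod has been pending for longer than
// rescheduleCriticalSeconds.
func shouldReschedule(pendingSince, now, rescheduleCriticalSeconds int64) bool {
	return now-pendingSince > rescheduleCriticalSeconds
}

// pickSubset returns the first subset, in priority order, that is not inside
// its unschedulable window.
func pickSubset(subsets []SubsetState, now, holdSeconds int64) string {
	for _, s := range subsets {
		if s.Unschedulable && now-s.MarkedAt < holdSeconds {
			continue
		}
		return s.Name
	}
	return ""
}

func main() {
	subsets := []SubsetState{
		{Name: "subset-normal", Unschedulable: true, MarkedAt: 1000},
		{Name: "subset-elastic"},
	}
	const hold = 300 // the 5 minute default from the text
	fmt.Println(pickSubset(subsets, 1100, hold)) // inside the window
	fmt.Println(pickSubset(subsets, 1400, hold)) // window has expired
}
```

During the hold window new pods skip subset-normal and go to subset-elastic; once it expires, subset-normal is tried first again.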

Conclusion

WorkloadSpread combines several existing Kubernetes features to give workloads elastic and multi-domain deployment capabilities in a bypass form. We hope users can reduce the complexity of workload deployment with WorkloadSpread and effectively reduce costs by taking advantage of its elastic scaling.

At present, WorkloadSpread is used in some projects at Alibaba, and adjustments made along the way will be fed back to the community promptly. There are plans for new WorkloadSpread capabilities, such as managing existing pods, supporting batch workloads, and even using labels to match pods across different workloads. Some of these capabilities need to take the actual needs and scenarios of community users into account. We hope you will join the Kruise community, open issues and PRs, help users solve more cloud native deployment problems, and build a better community.

Reference

WorkloadSpread: https://openkruise.io/docs/user-manuals/workloadspread
Pod Topology Spread Constraints: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
UnitedDeployment: https://openkruise.io/docs/user-manuals/uniteddeployment

OpenKruise 0.10.0, New features of multi-domain management, application protection

September 6, 2021 · 5 min read

Siyu Wang

Maintainer of OpenKruise

On Sep 6th, 2021, OpenKruise released the latest version v0.10.0, with new features, such as WorkloadSpread and PodUnavailableBudget. This article provides an overview of this new version.

WorkloadSpread

WorkloadSpread distributes the Pods of a workload to different types of Nodes according to some policies, which gives a single workload the ability to do multi-domain and elastic deployment.

Some common policies include:

fault-tolerant spread (for example, spread evenly across hosts, AZs, etc.)
spread according to a specified ratio (for example, deploy Pods to several specified AZs in proportion)
subset management with priority, such as
- deploy Pods to ecs first, and then deploy to eci when its resources are insufficient.
- deploy a fixed number of Pods to ecs first, and the rest Pods are deployed to eci.
subset management with customization, such as
- control how many pods in a workload are deployed in different cpu arch
- enable pods in different cpu arch to have different resource requirements

WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains called subsets. Each subset may set maxReplicas, the maximum number of Pod replicas it runs. WorkloadSpread injects the domain configuration into Pods via a webhook, and it also controls the order of scaling in and out.

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10 | 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b

A WorkloadSpread is associated with a workload via targetRef. When the workload creates a Pod, Kruise injects topology policies into it according to the rules in the WorkloadSpread.

Note that WorkloadSpread uses Pod Deletion Cost to control the priority of scale down. So:

If the Workload type is CloneSet, it already supports the feature.
If the Workload type is Deployment or ReplicaSet, it requires your Kubernetes version >= 1.22.

You also have to enable the WorkloadSpread feature-gate when you install or upgrade Kruise.

PodUnavailableBudget

Kubernetes offers Pod Disruption Budget to help you run highly available applications even when you introduce frequent voluntary disruptions. PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. However, it can only constrain the voluntary disruption triggered by the Eviction API. For example, when you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service.

In the following voluntary disruption scenarios, business disruption or SLA degradation can still occur:

The application owner updates the Deployment's pod template for a regular upgrade while the cluster administrator drains nodes to scale the cluster down (learn about Cluster Autoscaling).
The middleware team uses SidecarSet to roll out an upgrade of sidecar containers across the cluster, e.g. the ServiceMesh envoy, while HPA triggers a scale-down of business applications.
The application owner and the middleware team release the same Pods at the same time using in-place upgrades of OpenKruise CloneSet and SidecarSet.

apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet | StatefulSet | ...
    name: web-server
  # selector label query over pods managed by the budget
  # selector and targetRef are mutually exclusive; targetRef takes precedence.
  # selector is commonly used in scenarios where applications are deployed using multiple workloads,
  # and targetRef is used for protection against a single workload.
# selector:
#   matchLabels:
#     app: web-server
  # maximum number of Pods unavailable for the current cloneset, the example is cloneset.replicas(5) * 60% = 3
  # maxUnavailable and minAvailable are mutually exclusive; maxUnavailable takes precedence
  maxUnavailable: 60%
  # Minimum number of Pods available for the current cloneset, the example is cloneset.replicas(5) * 40% = 2
# minAvailable: 40%

You have to enable the following feature-gates when you install or upgrade Kruise:

PodUnavailableBudgetDeleteGate: protect Pod deletion or eviction.
PodUnavailableBudgetUpdateGate: protect Pod update operations, such as in-place update.
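The budget arithmetic from the YAML comments (5 replicas × 60% = 3) can be turned into a small admission check. This is a sketch under assumptions: the function names are hypothetical, and rounding a fractional percentage result up is assumed rather than taken from the Kruise documentation.

```go
package main

import "fmt"

// maxUnavailablePods converts a percentage budget into a pod count.
// Rounding up is an assumption of this sketch.
func maxUnavailablePods(replicas, percent int) int {
	return (replicas*percent + 99) / 100
}

// allowDisruption reports whether one more pod may become unavailable
// (through deletion, eviction, or in-place update) without exceeding the budget.
func allowDisruption(replicas, percent, currentlyUnavailable int) bool {
	return currentlyUnavailable+1 <= maxUnavailablePods(replicas, percent)
}

func main() {
	fmt.Println("budget:", maxUnavailablePods(5, 60))
	fmt.Println("2 unavailable, allow one more:", allowDisruption(5, 60, 2))
	fmt.Println("3 unavailable, allow one more:", allowDisruption(5, 60, 3))
}
```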

CloneSet supports scaledown priority by Spread Constraints

When the replicas of a CloneSet are decreased, it uses the following ordering to choose which Pods to delete. In each rule, the Pods on the left are deleted first:

1. Node unassigned < assigned
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. Lower pod-deletion-cost < higher pod-deletion-cost
5. Higher spread rank < lower spread rank
6. Been ready for empty time < less time < more time
7. Pods with containers with higher restart counts < lower restart counts
8. Empty creation time pods < newer pods < older pods

Rule 4 has been provided since Kruise v0.9.0 and is also used by WorkloadSpread to control Pod deletion. Rule 5 was added in Kruise v0.10.0 to sort Pods by their Topology Spread Constraints during scale-down.
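A multi-key comparison like this is usually written as a chain of tie-breakers. The sketch below covers only the first five rules and uses a simplified, hypothetical `Pod` type, so it illustrates the ordering idea rather than CloneSet's actual sorting code.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod carries only the fields needed for the first five rules (hypothetical).
type Pod struct {
	Name         string
	Assigned     bool
	Phase        string // "Pending", "Unknown" or "Running"
	Ready        bool
	DeletionCost int
	SpreadRank   int
}

var phaseOrder = map[string]int{"Pending": 0, "Unknown": 1, "Running": 2}

// deletedBefore reports whether a should be deleted before b.
func deletedBefore(a, b Pod) bool {
	if a.Assigned != b.Assigned {
		return !a.Assigned // unassigned first
	}
	if phaseOrder[a.Phase] != phaseOrder[b.Phase] {
		return phaseOrder[a.Phase] < phaseOrder[b.Phase]
	}
	if a.Ready != b.Ready {
		return !a.Ready // not ready first
	}
	if a.DeletionCost != b.DeletionCost {
		return a.DeletionCost < b.DeletionCost // lower cost first
	}
	return a.SpreadRank > b.SpreadRank // higher spread rank first
}

// deletionOrder returns pod names in the order they would be deleted.
func deletionOrder(pods []Pod) []string {
	sorted := append([]Pod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return deletedBefore(sorted[i], sorted[j])
	})
	names := make([]string, 0, len(sorted))
	for _, p := range sorted {
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "p1", Assigned: true, Phase: "Running", Ready: true, DeletionCost: 200},
		{Name: "p2", Assigned: true, Phase: "Running", Ready: true, DeletionCost: 100},
		{Name: "p3", Assigned: true, Phase: "Running", Ready: false, DeletionCost: 200},
		{Name: "p4", Assigned: false, Phase: "Pending"},
		{Name: "p5", Assigned: true, Phase: "Running", Ready: true, DeletionCost: 100, SpreadRank: 1},
	}
	fmt.Println(deletionOrder(pods))
}
```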

Advanced StatefulSet supports scaleup with rate limit

To avoid a large number of failed Pods after a user creates an incorrect Advanced StatefulSet, Kruise adds a maxUnavailable field to its scaleStrategy.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number

When the field is set, Advanced StatefulSet guarantees that the number of unavailable Pods does not exceed this value during Pod creation.

Note that the feature can only be used in StatefulSet with podManagementPolicy=Parallel.
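The effect of the rate limit can be approximated with a small calculation of how many Pods may be created in one round. The `creatableNow` function and its parameters are hypothetical, and the sketch assumes that newly created Pods count as unavailable until they become ready.

```go
package main

import "fmt"

// creatableNow returns how many Pods may be created in this round so that the
// number of unavailable Pods stays within maxUnavailable.
func creatableNow(replicas, existing, unavailable, maxUnavailable int) int {
	n := replicas - existing
	if budget := maxUnavailable - unavailable; budget < n {
		n = budget
	}
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// replicas: 100 with maxUnavailable: 10% gives a budget of 10 Pods.
	fmt.Println(creatableNow(100, 0, 0, 10))   // first batch
	fmt.Println(creatableNow(100, 10, 10, 10)) // none of the first batch is ready yet
	fmt.Println(creatableNow(100, 10, 4, 10))  // six Pods of the first batch became ready
}
```

If the first batch never becomes ready, for example because of a wrong image, creation stops at 10 Pods instead of producing 100 failed Pods.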

For more changes, please refer to the release page or ChangeLog.

OpenKruise 0.9.0, SidecarSet Helps Mesh Container Hot Upgrade

June 10, 2021 · 8 min read

Mingshan Zhao

Member of OpenKruise

OpenKruise is an open source management suite developed by Alibaba Cloud for cloud native application automation. It is currently a Sandbox project hosted under the Cloud Native Computing Foundation (CNCF). Based on years of Alibaba's experience in container and cloud native technologies, OpenKruise is a Kubernetes-based standard extension component that has been widely used in the Alibaba internal production environment, together with technical concepts and best practices for large-scale Internet scenarios.

OpenKruise released v0.8.0 on March 4, 2021, with enhanced SidecarSet capabilities, especially for log management of Sidecar.

Background - How to Upgrade Mesh Containers Independently

SidecarSet is a workload provided by Kruise to manage sidecar containers. Users can complete automatic injection and independent upgrades conveniently using SidecarSet.

By default, sidecar upgrade will first stop the old container and start a new one. This method is particularly suitable for sidecar containers that do not affect Pod service availability, such as log collection agents. However, for many proxies or sidecar containers for runtime, such as Istio Envoy, this upgrade method does not work. Envoy functions as a Proxy container in the Pod to handle all traffic. If users restart in this scenario to upgrade directly, the service availability of the Pod will be affected. Therefore, the release and capacity of the application should be taken into consideration. The sidecar release cannot be independent of the application.

Figure: how to update the mesh sidecar

Tens of thousands of pods in Alibaba Group communicate with each other based on Service Mesh. Mesh container upgrades may make business pods unavailable. Therefore, the upgrade of the mesh containers hinders the iteration of Service Mesh. To address this scenario, we worked with the Service Mesh team to implement the hot upgrade capability of the mesh container. This article focuses on the important role SidecarSet is playing during the implementation of the hot upgrade capability of mesh containers.

SidecarSet Helps Lossless Hot Upgrade of Mesh Containers

Mesh containers cannot be upgraded in place directly the way log collection containers can. A mesh container must provide services without interruption, but an independent upgrade makes the mesh service unavailable for some time. Some well-known mesh services in the community, such as Envoy and Mosn, provide smooth upgrade capabilities by default. However, these upgrade methods do not integrate well with cloud-native environments, and Kubernetes itself has no upgrade solution for such sidecar containers.

OpenKruise SidecarSet provides the sidecar hot upgrade mechanism for the mesh container. Thus, lossless Mesh container hot upgrade can be implemented in a cloud-native manner.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: hotupgrade-sidecarset
spec:
  selector:
    matchLabels:
      app: hotupgrade
  containers:
  - name: sidecar
    image: openkruise/hotupgrade-sample:sidecarv1
    imagePullPolicy: Always
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - /migrate.sh
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: openkruise/hotupgrade-sample:empty

upgradeType: HotUpgrade indicates that this sidecar container is of the hot upgrade type.
hotUpgradeEmptyImage: When performing hot upgrade on sidecar containers, businesses need to provide an empty container for container switchover. The Empty container has the same configuration as the sidecar container (except for the image address), such as command, lifecycle, and probe.

The SidecarSet hot upgrade mechanism includes two steps: injection of Sidecar containers of the hot upgrade type and Mesh container smooth upgrade.

Inject Sidecar Containers of the Hot Upgrade Type

For Sidecar containers of the hot upgrade type, two containers are injected by the SidecarSet webhook when the Pod is created:

{sidecar.name}-1: As shown in the following figure, envoy-1 represents a running sidecar container, for example, envoy:1.16.0.
{sidecar.name}-2: As shown in the following figure, envoy-2 represents the “hotUpgradeEmptyImage” container provided by the business, for example, empty:1.0.

Figure: inject sidecar

The Empty container does no practical work while the Mesh container is running.

Smooth Mesh Container Upgrade

The hot upgrade process is divided into three steps:

Upgrade: Replace the Empty container with the sidecar container of the latest version, for example, envoy-2.Image = envoy:1.17.0
Migration: Run the “PostStartHook” script of the sidecar container to upgrade the mesh service smoothly
Reset: After the mesh service is upgraded, replace the sidecar container of the earlier version with an Empty container, for example, envoy-1.Image = empty:1.0

Figure: update sidecar

The preceding three steps represent the entire process of the hot upgrade. If multiple hot upgrades on a Pod are required, users only need to repeat the three steps listed above.
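The three steps can be traced with a small simulation. This is only an illustrative sketch: the `pod` type and the `hotUpgrade` method are hypothetical names, not part of the OpenKruise API, and the model tracks nothing more than which image each of the two injected containers holds.

```go
package main

import "fmt"

// emptyImage stands in for the image configured as hotUpgradeEmptyImage.
const emptyImage = "empty:1.0"

// pod tracks only the images of the two injected containers:
// {sidecar.name}-1 at index 0 and {sidecar.name}-2 at index 1.
// It is a hypothetical type used for illustration.
type pod struct {
	images [2]string
}

// working returns the index of the container that runs the real sidecar.
func (p *pod) working() int {
	if p.images[0] != emptyImage {
		return 0
	}
	return 1
}

// hotUpgrade applies the Upgrade, Migration and Reset steps in order.
// migrate stands in for the PostStartHook script of the new container.
func (p *pod) hotUpgrade(newImage string, migrate func(from, to int)) {
	old := p.working()
	next := 1 - old
	// Upgrade: the Empty container is replaced with the new version.
	p.images[next] = newImage
	// Migration: the new container takes over the mesh service.
	migrate(old, next)
	// Reset: the old version is replaced with the Empty container.
	p.images[old] = emptyImage
}

func main() {
	p := &pod{images: [2]string{"envoy:1.16.0", emptyImage}}
	logMigration := func(from, to int) {
		fmt.Printf("migrating from container %d to container %d\n", from+1, to+1)
	}
	p.hotUpgrade("envoy:1.17.0", logMigration)
	fmt.Println(p.images)
	p.hotUpgrade("envoy:1.18.0", logMigration)
	fmt.Println(p.images)
}
```

Running the upgrade twice shows the two containers trading roles: the working sidecar alternates between the `-1` and `-2` slots, which is why repeating the same three steps is enough for any number of upgrades.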

Core Logic of Migration

The SidecarSet hot upgrade mechanism completes the mesh container switching and provides the coordination mechanism (PostStartHook) for containers of old and new versions. However, this is only the first step. The Mesh container also needs to provide the PostStartHook script to upgrade the mesh service smoothly (please see the preceding migration process), such as Envoy hot restart and Mosn lossless restart.

Mesh containers generally provide external services by listening on a fixed port. The migration process of mesh containers can be summarized as: pass ListenFD through UDS (Unix Domain Socket), stop Accept, and start draining existing connections. For mesh containers that do not support hot restart, you can follow this process to modify the mesh containers. The logic is listed below:

Figure: migration

Migration Demo

Different mesh containers provide different services and have different internal implementation logic, so the specific migrations also differ. The preceding logic only presents the important points, in the hope that it benefits everyone in need. We have also provided a hot upgrade Migration Demo on GitHub for reference. Next, we will introduce some of the key code:

Consultation Mechanism: First, the Mesh container must determine at startup whether this is a first start or a hot upgrade smooth migration. To reduce the communication cost of the mesh container, Kruise injects two environment variables, SIDECARSET_VERSION and SIDECARSET_VERSION_ALT, into both sidecar containers. Together, they indicate whether a hot upgrade is in progress and whether the current sidecar container is the new or the old version.

import (
  "os"
  "strconv"
)

// isHotUpgradeProcess returns two values:
// 1. (bool) whether a hot upgrade is in progress
// 2. (bool) when a hot upgrade is in progress, whether the current sidecar is the newer one
func isHotUpgradeProcess() (bool, bool) {
  // Version of the current sidecar container
  version := os.Getenv("SIDECARSET_VERSION")
  // Version of the peer sidecar container
  versionAlt := os.Getenv("SIDECARSET_VERSION_ALT")
  // If the version of the peer sidecar container is "0", hot upgrade is not underway
  if versionAlt == "0" {
    return false, false
  }
  // Hot upgrade is underway
  versionInt, _ := strconv.Atoi(version)
  versionAltInt, _ := strconv.Atoi(versionAlt)
  // version is of int type and monotonically increases, which means the version value of the new-version container will be greater
  return true, versionInt > versionAltInt
}
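A sidecar entrypoint could use the same comparison to pick its startup path. The sketch below restates it as a pure function so that it can be exercised without environment variables and so that malformed values are reported instead of ignored. The function name `startupRole` and the role strings are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// startupRole classifies a sidecar start from the two injected values:
// version is SIDECARSET_VERSION, versionAlt is SIDECARSET_VERSION_ALT.
// The returned role names are illustrative, not defined by OpenKruise.
func startupRole(version, versionAlt string) (string, error) {
	// A peer version of "0" means no hot upgrade is underway.
	if versionAlt == "0" {
		return "first-start", nil
	}
	v, err := strconv.Atoi(version)
	if err != nil {
		return "", fmt.Errorf("invalid version %q: %w", version, err)
	}
	alt, err := strconv.Atoi(versionAlt)
	if err != nil {
		return "", fmt.Errorf("invalid peer version %q: %w", versionAlt, err)
	}
	// Versions increase monotonically, so the larger one is the new sidecar.
	if v > alt {
		return "new-sidecar", nil
	}
	return "old-sidecar", nil
}

func main() {
	for _, pair := range [][2]string{{"1", "0"}, {"2", "1"}, {"1", "2"}} {
		role, err := startupRole(pair[0], pair[1])
		fmt.Println(pair, role, err)
	}
}
```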

ListenFD Migration: Use a Unix Domain Socket to migrate ListenFD between containers. This is a critical step in the hot upgrade. The code example is listed below:

  // For brevity, errors are ignored in this example

  /* The old sidecar migrates ListenFD to the new sidecar through Unix Domain Socket */
  // tcpLn *net.TCPListener
  f, _ := tcpLn.File()
  fdnum := f.Fd()
  data := syscall.UnixRights(int(fdnum))
  // Establish a connection with the new sidecar container through Unix Domain Socket
  raddr, _ := net.ResolveUnixAddr("unix", "/dev/shm/migrate.sock")
  uds, _ := net.DialUnix("unix", nil, raddr)
  // Use UDS to send ListenFD to the new sidecar container
  uds.WriteMsgUnix(nil, data, nil)
  // Stop receiving new requests and start the drainage phase, for example, http2 GOAWAY
  tcpLn.Close()

  /* The new sidecar receives ListenFD and starts to provide external services */
  // Listen to UDS
  addr, _ := net.ResolveUnixAddr("unix", "/dev/shm/migrate.sock")
  unixLn, _ := net.ListenUnix("unix", addr)
  conn, _ := unixLn.AcceptUnix()
  buf := make([]byte, 32)
  oob := make([]byte, 32)
  // Receive ListenFD
  _, oobn, _, _, _ := conn.ReadMsgUnix(buf, oob)
  scms, _ := syscall.ParseSocketControlMessage(oob[:oobn])
  if len(scms) > 0 {
    // Parse FD and convert to *net.TCPListener
    fds, _ := syscall.ParseUnixRights(&(scms[0]))
    f := os.NewFile(uintptr(fds[0]), "")
    ln, _ := net.FileListener(f)
    tcpLn, _ := ln.(*net.TCPListener)
    // Start to provide external services based on the received Listener. The http service is used as an example
    http.Serve(tcpLn, serveMux)
  }
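For readers who want to try the handover locally, the two halves above can be combined into one program with error handling added. This is a sketch under several assumptions: it runs both roles in a single process on a Unix-like system, it uses a temporary socket path instead of /dev/shm/migrate.sock, and the function names `sendListener`, `recvListener`, and `migrate` are hypothetical. It only checks that the received listener is bound to the same address as the original.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// sendListener plays the old sidecar: it sends the listening socket's file
// descriptor over the Unix socket at sockPath, then stops accepting.
func sendListener(tcpLn *net.TCPListener, sockPath string) error {
	f, err := tcpLn.File() // duplicate descriptor of the listening socket
	if err != nil {
		return err
	}
	defer f.Close()
	uds, err := net.DialUnix("unix", nil, &net.UnixAddr{Name: sockPath, Net: "unix"})
	if err != nil {
		return err
	}
	defer uds.Close()
	// One payload byte accompanies the control message that carries the FD.
	rights := syscall.UnixRights(int(f.Fd()))
	if _, _, err := uds.WriteMsgUnix([]byte{0}, rights, nil); err != nil {
		return err
	}
	return tcpLn.Close()
}

// recvListener plays the new sidecar: it accepts one connection on unixLn
// and rebuilds a net.Listener from the received file descriptor.
func recvListener(unixLn *net.UnixListener) (net.Listener, error) {
	conn, err := unixLn.AcceptUnix()
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 4-byte FD
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	scms, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(scms) == 0 {
		return nil, fmt.Errorf("no control message received: %v", err)
	}
	fds, err := syscall.ParseUnixRights(&scms[0])
	if err != nil || len(fds) == 0 {
		return nil, fmt.Errorf("no file descriptor received: %v", err)
	}
	f := os.NewFile(uintptr(fds[0]), "listener")
	defer f.Close() // net.FileListener works on its own duplicate
	return net.FileListener(f)
}

// migrate runs both roles and returns the address before and after handover.
func migrate() (string, string, error) {
	dir, err := os.MkdirTemp("", "migrate")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "migrate.sock")
	unixLn, err := net.ListenUnix("unix", &net.UnixAddr{Name: sock, Net: "unix"})
	if err != nil {
		return "", "", err
	}
	defer unixLn.Close()
	tcpLn, err := net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return "", "", err
	}
	before := tcpLn.Addr().String()
	sendErr := make(chan error, 1)
	go func() { sendErr <- sendListener(tcpLn, sock) }()
	ln, err := recvListener(unixLn)
	if err != nil {
		return "", "", err
	}
	defer ln.Close()
	if err := <-sendErr; err != nil {
		return "", "", err
	}
	return before, ln.Addr().String(), nil
}

func main() {
	before, after, err := migrate()
	fmt.Println(before, after, err)
}
```

The listening socket stays open throughout because the kernel keeps it alive while the descriptor is in flight, so connections queued during the handover are not refused.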

Successful Mesh Container Hot Upgrade Cases

Alibaba Cloud Service Mesh (ASM) provides a fully managed service mesh platform compatible with open-source Istio service mesh from the community. Currently, ASM implements the Sidecar hot upgrade capability (Beta) in the data plane based on the hot upgrade capability of OpenKruise SidecarSet. Users can upgrade the data plane version of service mesh without affecting applications.

In addition to hot upgrades, ASM supports capabilities, such as configuration diagnosis, operation audit, log access, monitoring, and service registration, to improve the user experience of service mesh. You are welcome to try it out!

Summary

The hot upgrade of mesh containers in cloud-native has always been an urgent but thorny problem. The solution in this article is only one exploration of Alibaba Group, giving feedback to the community with hopes of encouraging better ideas. We also welcome everyone to participate in the OpenKruise community. Together, we can build mature Kubernetes application management, delivery, and extension capabilities that can be applied to more large-scale, complex, and high-performance scenarios.

OpenKruise 0.9.0, Supports Pod Restart and Deletion Protection

May 20, 2021 · 13 min read

Siyu Wang

Maintainer of OpenKruise

On May 20, 2021, OpenKruise released the latest version v0.9.0, with new features, such as Pod restart and resource cascading deletion protection. This article provides an overview of this new version.

Pod Restart and Recreation

Restarting a container is a necessity in daily operations and a common recovery technique. Native Kubernetes, however, cannot operate at container granularity: the Pod is the minimum operation unit and can only be created or deleted.

Some may ask: why do users still need to pay attention to operations such as container restarts in the cloud-native era? In the ideal Serverless model, shouldn't services be the only thing users need to focus on?

To answer this question, we need to see the differences between cloud-native architecture and traditional infrastructures. In the era of traditional physical and virtual machines, multiple application instances are deployed and run on one machine, but the lifecycles of the machine and applications are separated. Thus, application instance restart may only require a systemctl or supervisor command but not the restart of the entire machine. However, in the era of containers and cloud-native, the lifecycle of the application is bound to that of the Pod container. In other words, under normal circumstances, one container only runs one application process, and one Pod provides services for only one application instance.

Due to these restrictions, current native Kubernetes provides no API for the container (application) restart for upper-layer services. OpenKruise v0.9.0 supports restarting containers in a single Pod, compatible with standard Kubernetes clusters of version 1.16 or later. After installing or upgrading OpenKruise, users only need to create a ContainerRecreateRequest (CRR) object to initiate a restart process. The simplest YAML file is listed below:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar

The value of namespace must be the same as the namespace of the Pod to be operated. The name can be set as needed. The podName in the spec clause indicates the Pod name. The containers indicate a list that specifies one or more container names in the Pod to restart.

In addition to the required fields above, CRR also provides a variety of optional restart policies:

spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 1800

failurePolicy: Values: Fail or Ignore. Default value: Fail. With Fail, the CRR ends immediately if any container fails to stop or recreate.
orderedRecreate: Default value: false. The value true indicates that, when the list contains multiple containers, each container is recreated only after the previous one has been recreated.
terminationGracePeriodSeconds: The time allowed for the container to exit gracefully. If this parameter is not specified, the time defined for the Pod is used.
unreadyGracePeriodSeconds: Sets the Pod to the unready state before recreation and waits for this period to elapse before executing the recreation.
- Note: This feature requires the feature-gate KruisePodReadinessGate to be enabled, which injects a readinessGate whenever a Pod is created. Otherwise, only Pods created by OpenKruise workloads are injected with the readinessGate by default, which means only these Pods can use the unreadyGracePeriodSeconds parameter during CRR recreation.
minStartedSeconds: The minimum period that the new container must remain running for the recreation to be judged successful.
activeDeadlineSeconds: The deadline for CRR execution. When it is exceeded, the CRR is marked as ended and any unfinished containers are marked as failed.
ttlSecondsAfterFinished: The period after the execution ends, after which the CRR is deleted automatically.

How it works under the hood: After a CRR is created, it is first processed by the kruise-manager and then handed to the kruise-daemon on the node where the Pod resides for execution. The execution process is listed below:

If preStop is specified for a Pod, the kruise-daemon will first call the CRI to run the command specified by preStop in the container.
If no preStop exists or preStop execution is completed, the kruise-daemon will call the CRI to stop the container.
When the kubelet detects the container exiting, it creates a new container with an increasing "serial number" and starts it. postStart will be executed at the same time.
When the kruise-daemon detects the start of the new container, it reports to CRR that the restart is completed.

Figure: ContainerRecreateRequest

The container "serial number" corresponds to the restartCount reported by kubelet in the Pod status. Therefore, the restartCount of the Pod increases after the container is restarted. Temporary files written to the rootfs in the old container will be lost due to the container recreation, but data in the volume mount remains.
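The effect of a recreation on Pod state can be illustrated with a toy model. The types `containerState` and `podState` and the method `recreate` below are hypothetical and greatly simplified. They only mirror the order of the execution steps and the three observable results described above: restartCount goes up by one, rootfs contents are lost, and volume data survives.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified view of one container.
type containerState struct {
	restartCount int
	rootfs       map[string]string // lost when the container is recreated
}

// podState holds one container plus a mounted volume that outlives it.
type podState struct {
	container containerState
	volume    map[string]string
}

// recreate mimics the order of the execution steps and returns it as a trace.
func (p *podState) recreate(hasPreStop bool) []string {
	var trace []string
	if hasPreStop {
		trace = append(trace, "kruise-daemon: run preStop via CRI")
	}
	trace = append(trace, "kruise-daemon: stop container via CRI")
	// The kubelet notices the exit and starts a fresh container with a
	// higher serial number and an empty rootfs.
	p.container = containerState{
		restartCount: p.container.restartCount + 1,
		rootfs:       map[string]string{},
	}
	trace = append(trace, "kubelet: create and start new container, run postStart")
	trace = append(trace, "kruise-daemon: report restart completed to CRR")
	return trace
}

func main() {
	p := &podState{
		container: containerState{rootfs: map[string]string{"/tmp/cache": "x"}},
		volume:    map[string]string{"/data/state": "y"},
	}
	for _, step := range p.recreate(true) {
		fmt.Println(step)
	}
	fmt.Println(p.container.restartCount, len(p.container.rootfs), len(p.volume))
}
```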

Cascading Deletion Protection

The level-triggered automation of Kubernetes is a double-edged sword. It brings declarative deployment capabilities to applications, but it can also amplify the impact of a mistake across the entire final state. For example, with the cascading deletion mechanism, once an owning resource is deleted under normal circumstances (non-orphan deletion), all owned resources associated with it are deleted according to the following rules:

If a CRD is deleted, all its corresponding CR will be cleared.
If a namespace is deleted, all resources in this namespace, including Pods, will be cleared.
If a workload (Deployment, StatefulSet, etc) is deleted, all Pods under it will be cleared.

Due to failures caused by cascading deletion, we have heard many complaints from Kubernetes users and developers in the community. It is unbearable for any enterprise to mistakenly delete objects at such a large scale in the production environment.

Therefore, in OpenKruise v0.9.0, we introduced cascading deletion protection in the hope of ensuring stability for more users in the community. If you want to use this feature in the current version, the feature-gate ResourcesDeletionProtection needs to be explicitly enabled when installing or upgrading OpenKruise.

A label of policy.kruise.io/delete-protection can be set on the resource objects that require protection. Its value can be one of the following:

Always: The object cannot be deleted unless the label is removed.
Cascading: The object cannot be deleted while it still has subordinate resources.

The following table lists the supported resource types and cascading relationships:

Kind	Group	Version	Cascading judgement
`Namespace`	core	v1	whether there are active Pods in this namespace
`CustomResourceDefinition`	apiextensions.k8s.io	v1beta1, v1	whether there are existing CRs of this CRD
`Deployment`	apps	v1	whether the replicas is 0
`StatefulSet`	apps	v1	whether the replicas is 0
`ReplicaSet`	apps	v1	whether the replicas is 0
`CloneSet`	apps.kruise.io	v1alpha1	whether the replicas is 0
`StatefulSet`	apps.kruise.io	v1alpha1, v1beta1	whether the replicas is 0
`UnitedDeployment`	apps.kruise.io	v1alpha1	whether the replicas is 0
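The two label values reduce to a small decision rule. The sketch below is not the OpenKruise webhook. The function `allowDelete` is a hypothetical helper that takes the object's labels and a count of its subordinate resources as defined by the table (active Pods for a Namespace, existing CRs for a CRD, replicas for a workload). Treating a missing or unrecognized label value as unprotected is an assumption.

```go
package main

import "fmt"

const protectionLabel = "policy.kruise.io/delete-protection"

// allowDelete reports whether a delete request should be admitted and,
// if not, why. subordinates is the count relevant to the object's kind.
func allowDelete(labels map[string]string, subordinates int) (bool, string) {
	switch labels[protectionLabel] {
	case "Always":
		return false, "the label must be removed before deletion"
	case "Cascading":
		if subordinates > 0 {
			return false, "the object still has subordinate resources"
		}
	}
	// Assumption: no label or an unknown value means no protection.
	return true, ""
}

func main() {
	always := map[string]string{protectionLabel: "Always"}
	cascading := map[string]string{protectionLabel: "Cascading"}
	fmt.Println(allowDelete(always, 0))
	fmt.Println(allowDelete(cascading, 3))
	fmt.Println(allowDelete(cascading, 0))
	fmt.Println(allowDelete(nil, 5))
}
```

With Cascading, a workload therefore has to be scaled to zero replicas before it can be deleted, which turns an accidental delete into a two-step operation.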

New Features of CloneSet

Deletion Priority

The controller.kubernetes.io/pod-deletion-cost annotation was added in Kubernetes 1.21. ReplicaSet sorts Pods according to this cost value during scale-in. CloneSet has supported the same feature since OpenKruise v0.9.0.

Users can configure this annotation on a Pod. Its integer value indicates the deletion cost of the Pod relative to other Pods under the same CloneSet. Pods with a lower cost have a higher deletion priority. If this annotation is not set, the deletion cost of the Pod is 0 by default.

Note: The deletion order is not determined solely by deletion cost. The actual order is as follows, where the Pod on the left of each comparison is deleted first:

Not scheduled < scheduled
PodPending < PodUnknown < PodRunning
Not ready < ready
Smaller pod-deletion cost < larger pod-deletion cost
Period in the Ready state: short < long
Containers restart: more times < fewer times
Creation time: short < long
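The ordering above is a chain of tie-breakers, which maps naturally onto a comparison function. The following sketch is a simplified illustration and not the CloneSet implementation. The `podInfo` struct and its fields are hypothetical stand-ins for values that the controller would read from the Pod object and its status.

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical summary of the fields that matter for ordering.
type podInfo struct {
	Name         string
	Scheduled    bool
	Phase        string // "Pending", "Unknown" or "Running"
	Ready        bool
	DeletionCost int
	ReadySeconds int // how long the Pod has been Ready
	Restarts     int
	AgeSeconds   int // time since creation
}

var phaseRank = map[string]int{"Pending": 0, "Unknown": 1, "Running": 2}

// deletedBefore reports whether a should be deleted before b. Each rule is
// consulted only when all earlier rules consider the two Pods equal.
func deletedBefore(a, b podInfo) bool {
	if a.Scheduled != b.Scheduled {
		return !a.Scheduled
	}
	if phaseRank[a.Phase] != phaseRank[b.Phase] {
		return phaseRank[a.Phase] < phaseRank[b.Phase]
	}
	if a.Ready != b.Ready {
		return !a.Ready
	}
	if a.DeletionCost != b.DeletionCost {
		return a.DeletionCost < b.DeletionCost
	}
	if a.ReadySeconds != b.ReadySeconds {
		return a.ReadySeconds < b.ReadySeconds
	}
	if a.Restarts != b.Restarts {
		return a.Restarts > b.Restarts
	}
	return a.AgeSeconds < b.AgeSeconds
}

// deletionOrder returns Pod names in the order they would be deleted.
func deletionOrder(pods []podInfo) []string {
	sorted := append([]podInfo(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return deletedBefore(sorted[i], sorted[j])
	})
	names := make([]string, len(sorted))
	for i, p := range sorted {
		names[i] = p.Name
	}
	return names
}

func main() {
	pods := []podInfo{
		{Name: "a", Scheduled: true, Phase: "Running", Ready: true, ReadySeconds: 100, AgeSeconds: 200},
		{Name: "b", Phase: "Pending"},
		{Name: "c", Scheduled: true, Phase: "Running", Ready: true, DeletionCost: -5, ReadySeconds: 100, AgeSeconds: 200},
		{Name: "d", Scheduled: true, Phase: "Running", AgeSeconds: 50},
	}
	fmt.Println(deletionOrder(pods))
}
```

In this example the deletion cost only decides between Pods `a` and `c`, because the unscheduled Pod `b` and the not-ready Pod `d` are already separated by earlier rules.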

Image Pre-Download for In-Place Update

When CloneSet is used for the in-place update of an application, only the container image is updated, while the Pod is not rebuilt. This ensures that the node where the Pod is located will not change. Therefore, if the CloneSet pulls the new image in advance on all the nodes where its Pods run, the Pod in-place update speed will be improved substantially in subsequent batch releases.

If you want to use this feature in the current version, the feature-gate of PreDownloadImageForInPlaceUpdate needs to be explicitly enabled when installing or upgrading OpenKruise. If you update the images in the CloneSet template and the publish policy supports in-place update, CloneSet will create an ImagePullJob object automatically (the batch image pre-download function provided by OpenKruise) to download new images in advance on the node where the Pod is located.

By default, CloneSet sets the parallelism to 1 for ImagePullJob, which means images are pulled on one node at a time. To adjust this, set the parallelism in a CloneSet annotation as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"

Pod Replacement by Scale Out and Scale In

In previous versions, the maxUnavailable and maxSurge policies of CloneSet only take effect during the application release process. In OpenKruise v0.9.0 and later versions, these two policies also function when deleting a specified Pod.

When the user specifies one or more Pods to be deleted through podsToDelete or apps.kruise.io/specified-delete: true, CloneSet will only execute deletion when the number of unavailable Pods (of the total replicas) is less than the value of maxUnavailable. In addition, if the user has configured the maxSurge policy, the CloneSet will possibly create a new Pod first, wait for the new Pod to be ready, and then delete the old specified Pod.

The replacement method depends on the value of maxUnavailable and the number of unavailable Pods. For example:

For a CloneSet, maxUnavailable=2, maxSurge=1 and only pod-a is unavailable. If you specify pod-b to be deleted, CloneSet will delete it promptly and create a new Pod.
For a CloneSet, maxUnavailable=1, maxSurge=1 and only pod-a is unavailable. If you specify pod-b to be deleted, CloneSet will create a new Pod, wait for it to be ready, and then delete the pod-b.
For a CloneSet, maxUnavailable=1, maxSurge=1 and only pod-a is unavailable. If you specify this pod-a to be deleted, CloneSet will delete it promptly and create a new Pod.
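The three cases follow from one rule of thumb, sketched below. This is a simplified reading of the text, not the CloneSet controller logic: the function `planSpecifiedDelete` and the `action` values are hypothetical, maxUnavailable and maxSurge are taken as absolute numbers, and the "wait" outcome for a CloneSet without maxSurge is an inference from the statement that deletion only proceeds while the number of unavailable Pods is below maxUnavailable.

```go
package main

import "fmt"

type action string

const (
	deleteThenCreate action = "delete-then-create"
	createThenDelete action = "create-then-delete"
	wait             action = "wait"
)

// planSpecifiedDelete decides how to replace one Pod marked for deletion.
// unavailable is the current number of unavailable Pods and
// targetUnavailable tells whether the marked Pod is one of them.
func planSpecifiedDelete(maxUnavailable, maxSurge, unavailable int, targetUnavailable bool) action {
	// Deleting a Pod that is already unavailable does not reduce availability.
	if targetUnavailable {
		return deleteThenCreate
	}
	// There is still budget for one more unavailable Pod.
	if unavailable < maxUnavailable {
		return deleteThenCreate
	}
	// No budget left: surge a replacement first if allowed.
	if maxSurge > 0 {
		return createThenDelete
	}
	return wait
}

func main() {
	fmt.Println(planSpecifiedDelete(2, 1, 1, false)) // pod-b, maxUnavailable=2
	fmt.Println(planSpecifiedDelete(1, 1, 1, false)) // pod-b, maxUnavailable=1
	fmt.Println(planSpecifiedDelete(1, 1, 1, true))  // pod-a itself
}
```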

Efficient Rollback Based on Partition Final State

Among the native workloads, Deployment does not support phased release, while StatefulSet provides partition semantics that let users control the pace of a gray-scale (canary) upgrade. OpenKruise workloads, such as CloneSet and Advanced StatefulSet, also provide partitions to support phased release.

For CloneSet, the semantics of Partition is the number or percentage of Pods remaining in the old version. For example, for a CloneSet with 100 replicas, if the partition value is changed in the sequence of 80 ➡️ 60 ➡️ 40 ➡️ 20 ➡️ 0 by steps during the image upgrade, the CloneSet is released in five batches.
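The arithmetic behind the five batches is shown below for absolute partition values. The helper `updatedPods` is a hypothetical name, and percentage partitions are left out because their rounding rules are not covered here.

```go
package main

import "fmt"

// updatedPods returns how many Pods run the new version for a given
// absolute partition, where partition is the number of Pods kept on the
// old version. Out-of-range values are clamped to [0, replicas].
func updatedPods(replicas, partition int) int {
	if partition < 0 {
		partition = 0
	}
	if partition > replicas {
		partition = replicas
	}
	return replicas - partition
}

func main() {
	replicas := 100
	previous := 0
	for _, partition := range []int{80, 60, 40, 20, 0} {
		updated := updatedPods(replicas, partition)
		fmt.Printf("partition=%d updated=%d batch=%d\n", partition, updated, updated-previous)
		previous = updated
	}
}
```

Each step lowers the partition by 20, so every batch upgrades 20 more Pods until all 100 run the new version.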

However, in the past, whether it is Deployment, StatefulSet, or CloneSet, if rollback is required during the release process, the template information (image) must be changed back to the old version. During the phased release of StatefulSet and CloneSet, reducing partition value will trigger the upgrade to a new version. Increasing partition value will not trigger rollback to the old version.

The partition of CloneSet supports the "final state rollback" function after v0.9.0. If the feature-gate CloneSetPartitionRollback is enabled when installing or upgrading OpenKruise, increasing the partition value will trigger CloneSet to roll back the corresponding number of new Pods to the old version.

There is a clear advantage here. During the phased release, only the partition value needs to be adjusted to flexibly control the numbers of old and new versions. However, the "old and new versions" for CloneSet correspond to updateRevision and currentRevision in its status:

updateRevision: The version of the template defined by the current CloneSet.
currentRevision: The template version of CloneSet during the previous successful full release.

Short Hash

By default, the value of controller-revision-hash in Pod label set by CloneSet is the full name of the ControllerRevision. For example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994

The name is the CloneSet name concatenated with the ControllerRevision hash value. Generally, the hash value is 8 to 10 characters in length. In Kubernetes, a label value cannot exceed 63 characters in length. Therefore, the name of a CloneSet cannot exceed 52 characters in length, or the Pod cannot be created.

In v0.9.0, the new feature-gate CloneSetShortHash is introduced. If it is enabled, CloneSet will set the value of controller-revision-hash in the Pod to a hash value only, like 956df7994. Therefore, the length restriction of the CloneSet name is eliminated. (CloneSet can still recognize and manage the Pod with revision labels in the full format, even if this function is enabled.)

New Features of SidecarSet

Sidecar Hot Upgrade Function

SidecarSet is a workload provided by OpenKruise to manage sidecar containers separately. Users can inject and upgrade specified sidecar containers within a certain range of Pods using SidecarSet.

By default, for the independent in-place sidecar upgrade, the container of the old version is stopped first and then a container of the new version is created. This method applies to sidecar containers that do not affect the Pod service availability, such as the log collection agent. However, for sidecar containers acting as a proxy such as Istio Envoy, this upgrade method is defective. Envoy, as a proxy container in the Pod, handles all the traffic. If users restart and upgrade it directly, service availability will be affected. Thus, you need a complex graceful termination and coordination mechanism to upgrade the Envoy sidecar separately. Therefore, we offer a new solution for the upgrade of this kind of sidecar container, namely, hot upgrade:

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0

upgradeType: HotUpgrade indicates that the sidecar container is of the hot upgrade type, so the hot upgrade solution will be executed.
hotUpgradeEmptyImage: When performing a hot upgrade on the sidecar container, an empty container is required to switch services during the upgrade. The empty container has the same configuration as the sidecar container (for example, command, lifecycle, and probe) except for the image address, but it does no actual work.
lifecycle.postStart: State migration. This procedure completes the state migration during the hot upgrade. The script needs to be executed according to business characteristics. For example, NGINX hot upgrade requires shared Listen FD and traffic reloading.

For more changes, please refer to the release page or ChangeLog.
