Overview
Component Overview
Kubernetes' scheduling logic operates on the Pod's Request: a node's schedulable resources are occupied by the Pods' Request amounts and cannot be freed. The native node dedicated scheduler is a scheduling plugin developed by Tencent Kubernetes Engine (TKE) based on the native Kubernetes Kube-scheduler Extender mechanism. It can virtually amplify a node's capacity, resolving the issue of node resources being occupied while the utilization rate remains low.
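For reference, the scheduler counts only the values declared under resources.requests when deciding whether a Pod fits on a node; the containers' actual usage is not considered. Below is a minimal illustrative Pod spec (the name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: request-demo              # placeholder name, for illustration only
spec:
  containers:
  - name: app
    image: nginx                  # placeholder image
    resources:
      requests:
        cpu: "500m"               # the scheduler reserves 0.5 core on the node for this Pod
        memory: "256Mi"           # and 256Mi of memory, regardless of actual usage
      limits:
        cpu: "1"
        memory: "512Mi"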
Kubernetes objects deployed in a cluster
Kubernetes Object Name | Type | Requested Resources | Namespace |
crane-scheduler-controller | Deployment | 200m CPU and 200Mi memory per instance, 1 instance in total | kube-system |
crane-descheduler | Deployment | 200m CPU and 200Mi memory per instance, 1 instance in total | kube-system |
crane-scheduler | Deployment | 200m CPU and 200Mi memory per instance, 3 instances in total | kube-system |
crane-scheduler-controller | Service | - | kube-system |
crane-scheduler | Service | - | kube-system |
crane-scheduler | ClusterRole | - | kube-system |
crane-descheduler | ClusterRole | - | kube-system |
crane-scheduler | ClusterRoleBinding | - | kube-system |
crane-descheduler | ClusterRoleBinding | - | kube-system |
crane-scheduler-policy | ConfigMap | - | kube-system |
crane-descheduler-policy | ConfigMap | - | kube-system |
ClusterNodeResourcePolicy | CRD | - | - |
CraneSchedulerConfiguration | CRD | - | - |
NodeResourcePolicy | CRD | - | - |
crane-scheduler-controller-mutating-webhook | MutatingWebhookConfiguration | - | - |
Application Scenarios
Scenario 1: Resolving the issue of a high node packing rate but low utilization
Note:
The basic concepts are as follows:
Packing rate: the ratio of the sum of the Requests of all Pods on a node to the node's actual specifications.
Utilization: the ratio of the total actual usage of all Pods on a node to the node's actual specifications.
The native Kubernetes scheduler schedules Pods based on their Requests. Therefore, even if a node's actual usage is low, once the sum of the Requests of all Pods on the node approaches the node's actual specifications, no new Pods can be scheduled to it, which wastes a large amount of resources. Moreover, businesses tend to request more resources than they need (that is, a large Request) to ensure service stability, so node resources are occupied and cannot be freed. As a result, the node's packing rate is high while its actual resource utilization stays low.
In this case, you can use the native node dedicated scheduler to virtually amplify a node's CPU and memory specifications, enlarging its schedulable resources so that more Pods can be scheduled to it.
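As a hypothetical worked example: on an 8-core node whose Pods request 7 cores in total but actually use only 2 cores, the packing rate is 7/8 = 87.5% while utilization is 2/8 = 25%, and a new Pod requesting 2 cores cannot be scheduled. With a CPU amplification factor of 1.5, the node is scheduled as if it had 12 cores, leaving 5 cores of schedulable room. The amplification policy is carried by the component's CRDs such as NodeResourcePolicy; the sketch below is illustrative only, and its spec fields (the node selector and the per-resource amplification ratios) as well as the apiVersion suffix are assumptions rather than the published CRD schema. Configure the actual values in the TKE console or according to the CRD installed in your cluster.
# Illustrative sketch only: the spec fields and the apiVersion suffix are assumptions,
# not the published schema of the NodeResourcePolicy CRD.
apiVersion: scheduling.crane.io/v1alpha1
kind: NodeResourcePolicy
metadata:
  name: cpu-amplify-demo                 # hypothetical policy name
spec:
  nodeSelector:                          # hypothetical field: which native nodes to amplify
    kubernetes.io/hostname: 10.0.0.8
  amplification:                         # hypothetical field: virtual amplification ratios
    cpu: "1.5"
    memory: "1.2"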
Scenario 2: Setting node watermarks
Setting a node watermark sets the node's target utilization to ensure node stability:
Watermark control during scheduling: determines the target resource utilization of native nodes at scheduling time to ensure their stability. When Pods are scheduled, nodes whose load is above this watermark are not selected. Among the nodes that meet the watermark requirement, nodes with a lower actual load are preferred, which balances the utilization distribution across cluster nodes.
Watermark control during runtime: determines the target resource utilization of native nodes at runtime to ensure their stability. At runtime, nodes whose load is above this watermark may trigger eviction. Since eviction is a high-risk operation, pay attention to the following notes.
Notes
1. To avoid draining important Pods by mistake, this feature does not evict any Pod by default. For Pods that can be safely drained, you must explicitly mark the workload to which the Pod belongs (for example, a StatefulSet or Deployment) with the following drainable annotation (see the example after this list):
descheduler.alpha.kubernetes.io/evictable: 'true'
2. It is recommended to enable event persistence for the cluster to better monitor component exceptions and troubleshoot issues. When a Pod is evicted, a corresponding event is generated; you can use the Descheduled event to check whether a Pod is being evicted repeatedly.
3. Eviction has requirements on nodes: the cluster must have three or more low-load native nodes, where low-load means that the node's load is below its runtime watermark.
4. After filtering at the node dimension, the workloads on the node are drained. This requires that the replica count of the workload is greater than or equal to 2, or accounts for at least half of the replicas in the workload spec.
5. At the Pod dimension, if a Pod's load exceeds the node's eviction watermark, the Pod is not evicted, to avoid overloading other nodes after it is moved to them.
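For note 1, the following is a minimal sketch of marking a Deployment as drainable. Only the annotation key and value come from this document; the workload name, replica count, and container are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: evictable-demo                                   # placeholder workload name
  annotations:
    descheduler.alpha.kubernetes.io/evictable: 'true'    # marks the workload's Pods as safe to drain
spec:
  replicas: 2                  # keep at least 2 replicas so that draining is allowed (see note 4)
  selector:
    matchLabels:
      app: evictable-demo
  template:
    metadata:
      labels:
        app: evictable-demo
    spec:
      containers:
      - name: app
        image: nginx           # placeholder image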
Scenario 3: Scheduling Pods under specified Namespaces only to native nodes
Native nodes are a new node type launched by the Tencent Cloud TKE team. They are built on the best practices accumulated from operating containers on tens of millions of cores within Tencent Cloud, and provide native, highly stable, and fast-responding K8s node management capabilities. Native nodes support amplifiable node specifications and Request recommendation, so it is recommended to schedule your workloads to native nodes to make full use of these capabilities. When enabling the native node scheduler, you can select Namespaces; Pods under the specified Namespaces will then be scheduled only to native nodes in subsequent scheduling.
Note:
If native node resources are insufficient at this time, Pods will become Pending.
Limits
Make sure that the Kubernetes version is v1.22.5-tke.8, v1.20.6-tke.24, v1.18.4-tke.28, v1.16.3-tke.30, or later. To upgrade the cluster version, see Upgrading a Cluster.
Risk Control
After this component is uninstalled, only the scheduling logic associated with the native node dedicated scheduler is removed; the scheduling capability of the native Kube-Scheduler is not affected. Pods already scheduled to native nodes are not affected, because they have already been scheduled. However, if the kubelet on a native node restarts, Pods may be evicted, because the sum of the Requests of the Pods on the node may exceed the node's actual specifications.
If the amplification coefficient is adjusted downward, Pods already running on native nodes are not affected, because they have already been scheduled. However, if the kubelet on a native node restarts, Pods may be evicted, because the sum of the Requests of the Pods on the node may exceed the node's specifications after the reduced amplification.
Users will notice that the Node resources in the Kubernetes cluster are inconsistent with the resources of the corresponding CVM node.
Issues such as excessive load and instability may arise later.
After node specifications are amplified, modules related to resource QoS at the kubelet layer may be affected. For example, with kubelet CPU core binding, if a 4-core node is scheduled as an 8-core node, Pods that bind cores may be affected.
Component Permission Description
Crane Scheduler Permission
Permission Description
The permissions of this component are the minimal dependencies required for its current features to operate.
Permission Scenarios
Scenario | Resources | Verbs |
Watch node updates and changes, and access node utilization data. | nodes | get/watch/list |
Watch Pod updates and changes, and determine node scheduling priority based on the recent scheduling of Pods in the cluster. | pods/namespaces | get/watch/list |
Update node utilization onto the Node resource to decouple the scheduling logic from the query logic. | nodes/status | patch |
Support multiple replicas to ensure component availability. | leases | create/get/update |
Watch ConfigMap updates and changes to implement scheduling specified Pods to native nodes. | configmaps | get/list/watch |
Permission Definition
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crane-scheduler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - deployments/scale
  verbs:
  - get
  - update
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - update
- apiGroups:
  - "scheduling.crane.io"
  resources:
  - clusternoderesourcepolicies
  - noderesourcepolicies
  - craneschedulerconfigurations
  verbs:
  - get
  - list
  - watch
  - update
  - create
  - patch
Crane Descheduler Permission
Permission Description
The permissions of this component are the minimal dependencies required for its current features to operate.
Permission Scenarios
Scenario | Resources | Verbs |
Watch node updates and changes, and access node utilization data. | nodes | get/watch/list |
Watch Pod updates and changes, and determine which Pods to evict first based on Pod information in the cluster. | pods | get/watch/list |
Drain Pods. | pods/eviction | create |
Determine whether the ready replicas of the workload that a Pod belongs to account for half or more of the required replicas, to decide whether the Pod can be drained. | replicasets/deployments/statefulsets/statefulsetpluses/jobs | get |
Report events when draining Pods. | events | create |
Permission Definition
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: crane-descheduler
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["*"]
  resources: ["replicasets"]
  verbs: ["get"]
- apiGroups: ["*"]
  resources: ["deployments"]
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get"]
- apiGroups: ["platform.stke"]
  resources: ["statefulsetpluses"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create"]
- apiGroups: ["*"]
  resources: ["jobs"]
  verbs: ["get"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create", "get", "update"]