| Item | Description |
|---|---|
| Supported Kubernetes Versions | TKE version ≥ v1.14.x |
| Supported Node Types | Native nodes only. Native nodes, built on FinOps principles and paired with qGPU, can substantially improve GPU/CPU resource utilization. |
| Supported GPU Card Architectures | Volta (e.g., V100), Turing (e.g., T4), and Ampere (e.g., A100, A10) are supported. |
| Supported Driver Versions | The minor version of the NVIDIA driver (the last segment of the version number; for example, in 450.102.04 the minor version is 04) must satisfy the following:<br>450: <= 450.102.04<br>470: <= 470.161.03<br>515: <= 515.65.01<br>525: <= 525.89.02 |
| Shared Granularity | Each qGPU is allocated at least 1 GiB of vRAM, in increments of 1 GiB. Computing capacity is allocated at a minimum of 5 (5% of one card) and a maximum of 100 (a whole card), in increments of 5 (i.e., 5, 10, 15, ..., 100). |
| Complete Card Allocation | Nodes with qGPU enabled can allocate whole cards via `tke.cloud.tencent.com/qgpu-core: 100 \| 200 \| ...` (N * 100, where N is the number of whole cards); see the sketch after this table. It is recommended to use TKE's node pool capability to keep this separate from the native NVIDIA allocation method or to convert nodes to qGPU usage. |
| Quantity Limits | Up to 16 qGPU devices can be created on one GPU card. It is recommended to decide how many qGPUs each card is shared into based on the vRAM requested by the containers you deploy. |
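A minimal sketch of whole-card allocation, assuming a hypothetical Pod name, container name, and image; requesting `tke.cloud.tencent.com/qgpu-core: "200"` binds two whole cards:

```yaml
# Illustrative only: allocates two whole GPU cards (2 * 100) through qGPU.
apiVersion: v1
kind: Pod
metadata:
  name: whole-card-demo              # hypothetical name
spec:
  containers:
  - name: cuda-app                   # hypothetical name
    image: nvidia/cuda:11.4.3-base-ubuntu20.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        tke.cloud.tencent.com/qgpu-core: "200"   # N * 100, here N = 2
      requests:
        tke.cloud.tencent.com/qgpu-core: "200"   # must equal the limit
```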
| Kubernetes Object Name | Type | Requested Resources | Namespace |
|---|---|---|---|
| qgpu-manager | DaemonSet | One per GPU node; Memory: 300 MB, CPU: 0.2 | kube-system |
| qgpu-manager | ClusterRole | - | - |
| qgpu-manager | ServiceAccount | - | kube-system |
| qgpu-manager | ClusterRoleBinding | - | kube-system |
| qgpu-scheduler | Deployment | Single replica; Memory: 800 MB, CPU: 1 | kube-system |
| qgpu-scheduler | ClusterRole | - | - |
| qgpu-scheduler | ClusterRoleBinding | - | kube-system |
| qgpu-scheduler | ServiceAccount | - | kube-system |
| qgpu-scheduler | Service | - | kube-system |
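After installation, a quick way to confirm both components are running is to query the objects listed above with kubectl (names and namespace are taken from the table; this is a sanity check, not an exhaustive health check):

```bash
kubectl get daemonset qgpu-manager -n kube-system
kubectl get deployment qgpu-scheduler -n kube-system
kubectl get service qgpu-scheduler -n kube-system
```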
| Feature | Involved Object | Involved Operation Permission |
|---|---|---|
| qgpu-manager tracks Pod status changes, reads Pod information, and cleans up resources such as qGPU devices when a Pod is deleted. | pods | get/list/watch |
| qgpu-manager watches node status changes, reads node information, and labels nodes based on the GPU card's driver and version information as well as the qGPU version. | nodes | get/list/watch/update |
| qgpu-scheduler is an extender scheduler developed specifically for qGPU resources on top of the Kubernetes scheduler extender mechanism. The permissions it requires match those of other community scheduler components (such as Volcano): tracking and reading Pod information, writing scheduling results to Pod labels and annotations, tracking and reading node information, reading configuration from ConfigMaps, and creating scheduling events. | pods | get/list/update/patch |
| | nodes | get/list/watch |
| | configmaps | get/list/watch |
| | events | create/patch |
| GPU (`gpus` and `gpus/status` in the API group `elasticgpu.io`) is qGPU's proprietary CRD for recording GPU resource information. This feature has been deprecated, but the resource definition must be retained for compatibility with earlier versions. It is managed by qgpu-manager and qgpu-scheduler, which need the full set of operations on it: create, delete, update, and read. | gpus and gpus/status (API group elasticgpu.io) | All permissions |
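These permissions correspond to the following ClusterRole definitions: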
```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: qgpu-manager
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - update
  - get
  - list
  - watch
- apiGroups:
  - "elasticgpu.io"
  resources:
  - gpus
  - gpus/status
  verbs:
  - '*'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: qgpu-scheduler
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - update
  - patch
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - pods/binding
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "elasticgpu.io"
  resources:
  - gpus
  - gpus/status
  verbs:
  - '*'
```
The isolation policy is set through the label `tke.cloud.tencent.com/qgpu-schedule-policy` (example value: `fixed-share`). Either the full name or the abbreviation of the label value can be used; all supported values are listed in the table below.
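For illustration, the label could be applied to a GPU node directly with kubectl; `<node-name>` is a placeholder, and setting the label through the TKE console or node pool configuration is equally valid:

```bash
# Sets the qGPU isolation policy on a node to fixed-share (abbreviation: fs).
kubectl label node <node-name> tke.cloud.tencent.com/qgpu-schedule-policy=fixed-share
```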
Currently, qGPU supports the following isolation policies:

| Label Value | Abbreviation | Name | Meaning |
|---|---|---|---|
| best-effort (default) | be | Best Effort | Default value. Each Pod's computing capacity is not capped; a Pod can use whatever capacity the card has left. If N Pods all run heavy workloads, each eventually settles at roughly 1/N of the card's computing capacity. |
| fixed-share | fs | Fixed Share | Each Pod receives a fixed compute quota that it cannot exceed, even when the GPU still has unused computing capacity. |
| burst-share | bs | Guaranteed Share with Burst | The scheduler guarantees each Pod a minimum compute quota, but a Pod may also use spare GPU capacity (capacity not assigned to other Pods) beyond that quota. Note that once this spare capacity is assigned elsewhere, the Pod falls back to its quota. |
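In workload specs, vRAM is requested through `tke.cloud.tencent.com/qgpu-memory` (in GiB, per the sharing granularity above) and computing capacity through `tke.cloud.tencent.com/qgpu-core` (percent of one card), for example: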
```yaml
spec:
  containers:
  - resources:
      limits:
        tke.cloud.tencent.com/qgpu-memory: "5"
        tke.cloud.tencent.com/qgpu-core: "30"
      requests:
        tke.cloud.tencent.com/qgpu-memory: "5"
        tke.cloud.tencent.com/qgpu-core: "30"
```
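Put together, a complete Pod manifest might look like the following sketch; the Pod name, container name, and image are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qgpu-demo                    # hypothetical name
spec:
  containers:
  - name: cuda-app                   # hypothetical name
    image: nvidia/cuda:11.4.3-base-ubuntu20.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        tke.cloud.tencent.com/qgpu-memory: "5"   # 5 GiB of vRAM
        tke.cloud.tencent.com/qgpu-core: "30"    # 30% of one card
      requests:
        tke.cloud.tencent.com/qgpu-memory: "5"
        tke.cloud.tencent.com/qgpu-core: "30"
```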