
Tencent Kubernetes Engine

Obtaining GPU Monitoring Metrics

Last updated: 2024-12-25 15:00:17

Add-On Overview

The Tencent Kubernetes Engine (TKE) add-on elastic-gpu-exporter collects GPU-related monitoring metrics, including:
GPU utilization
Pod/Container GPU resource utilization

Deployment Mode

elastic-gpu-exporter is deployed to the cluster as a DaemonSet, together with the ServiceAccount, RBAC rules, and Service it requires:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  labels:
    app: elastic-gpu-exporter
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
      app: nano-gpu-exporter
  template:
    metadata:
      name: elastic-gpu-exporter
      labels:
        name: gpu-manager-ds
        app: nano-gpu-exporter
    spec:
      nodeSelector:
        qgpu-device-enable: enable
      serviceAccount: elastic-gpu-exporter
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
        - image: ccr.ccs.tencentyun.com/tkeimages/elastic-gpu-exporter:v1.0.8
          imagePullPolicy: Always
          args:
            - --node=$(NODE_NAME)
          env:
            - name: "PORT"
              value: "5678"
            - name: "NODE_NAME"
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          name: elastic-gpu-exporter
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: cgroup
              readOnly: true
              mountPath: "/host/sys"
      volumes:
        - name: cgroup
          hostPath:
            type: Directory
            path: "/sys"
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-gpu-exporter
subjects:
  - kind: ServiceAccount
    name: elastic-gpu-exporter
    namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: elastic-gpu-exporter
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    app: nano-gpu-exporter
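
As a minimal example, the manifest above can be applied with kubectl; the file name used here is only illustrative:
$ kubectl apply -f elastic-gpu-exporter.yaml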


Checking Running Status

After deployment, an elastic-gpu-exporter DaemonSet is created in the cluster:
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elastic-gpu-exporter   1         1         1       1            1           <none>          3m36s
A running elastic-gpu-exporter Pod will be present on a qualified node:
NAME                         READY   STATUS    RESTARTS   AGE
elastic-gpu-exporter-dblqm   1/1     Running   0          6s
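
For reference, both objects can be listed with kubectl; the namespace and Pod label below follow the manifest in this document:
$ kubectl get daemonset elastic-gpu-exporter -n kube-system
$ kubectl get pods -n kube-system -l app=nano-gpu-exporter
If no Pod is scheduled, check that the node carries the qgpu-device-enable=enable label required by the DaemonSet's nodeSelector; it can be added with kubectl (the node name below is illustrative):
$ kubectl label node 10.0.66.4 qgpu-device-enable=enable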

Obtaining Monitoring Metrics

Each node running the elastic-gpu-exporter service exposes metrics at the /metrics path on port 5678, so you can run the following command to obtain monitoring metrics:
$ curl NodeIP:5678/metrics
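
The output can also be narrowed to the metric families documented below (the gpu_, pod_, and container_ prefixes) with a standard grep; NodeIP is a placeholder for the node's IP address:
$ curl -s NodeIP:5678/metrics | grep -E '^(gpu|pod|container)_'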

GPU Metrics

Metric                             Description
gpu_core_usage                     Actual computing power usage of the GPU
gpu_mem_usage                      Actual video memory usage of the GPU
gpu_core_utilization_percentage    GPU computing power utilization
gpu_mem_utilization_percentage     GPU video memory utilization
The GPU metrics format is as follows: 
gpu_core_usage{card="0",node="10.0.66.4"} 0
Note:
"card" represents the GPU serial number, and "node" represents the node where the GPU is located.

Pod Metrics

Metric                             Description
pod_core_usage                     Actual computing power usage of the Pod
pod_mem_usage                      Actual video memory usage of the Pod
pod_core_utilization_percentage    Percentage of the computing power used by the Pod to the requested computing power
pod_mem_utilization_percentage     Percentage of the video memory used by the Pod to the requested video memory
pod_core_occupy_node_percentage    Percentage of the computing power used by the Pod to the total computing power of the node
pod_mem_occupy_node_percentage     Percentage of the video memory used by the Pod to the total video memory of the node
pod_core_request                   Computing power requested by the Pod
pod_mem_request                    Video memory requested by the Pod
The Pod metrics format is as follows: 
pod_core_usage{namespace="default",node="10.0.66.4",pod="7a2fa737-eef1-4801-8937-493d7efb16b7"} 0
Note:
"namespace" represents the namespace of the Pod, "node" represents the node where the Pod is located, and "pod" represents the name of the Pod.

Container Metrics

Metric                                   Description
container_gpu_utilization                Actual computing power usage of the container
container_gpu_memory_total               Actual video memory usage of the container
container_core_utilization_percentage    Percentage of the computing power used by the container to the requested computing power
container_mem_utilization_percentage     Percentage of the video memory used by the container to the requested video memory
container_request_gpu_memory             Requested video memory of the container
container_request_gpu_utilization        Requested computing power of the container
The container metrics format is as follows: 
container_gpu_utilization{container="cuda",namespace="default",node="10.0.66.4",pod="cuda"} 0
Note:
"container" represents the container name, "namespace" represents the namespace of the container, "node" represents the node where the container is located, and "pod" represents the name of the Pod where the container is located.