
Tencent Kubernetes Engine

Obtaining GPU Monitoring Metrics

Last updated: 2024-12-25 15:00:17

Add-On Overview

The Tencent Kubernetes Engine (TKE) add-on elastic-gpu-exporter collects GPU-related monitoring metrics, including:
GPU utilization
Pod/Container GPU resource utilization

Deployment Mode

elastic-gpu-exporter is deployed to the cluster as a DaemonSet, together with the ServiceAccount, RBAC rules, and Service it requires:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  labels:
    app: elastic-gpu-exporter
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
      app: nano-gpu-exporter
  template:
    metadata:
      name: elastic-gpu-exporter
      labels:
        name: gpu-manager-ds
        app: nano-gpu-exporter
    spec:
      nodeSelector:
        qgpu-device-enable: enable
      serviceAccount: elastic-gpu-exporter
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
        - image: ccr.ccs.tencentyun.com/tkeimages/elastic-gpu-exporter:v1.0.8
          imagePullPolicy: Always
          args:
            - --node=$(NODE_NAME)
          env:
            - name: "PORT"
              value: "5678"
            - name: "NODE_NAME"
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          name: elastic-gpu-exporter
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: cgroup
              readOnly: true
              mountPath: "/host/sys"
      volumes:
        - name: cgroup
          hostPath:
            type: Directory
            path: "/sys"
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-gpu-exporter
subjects:
  - kind: ServiceAccount
    name: elastic-gpu-exporter
    namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: elastic-gpu-exporter
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    app: nano-gpu-exporter
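
As a minimal example, the manifest above can be applied with kubectl; the file name used here is only illustrative:
$ kubectl apply -f elastic-gpu-exporter.yaml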


Checking Running Status

After deployment, an elastic-gpu-exporter DaemonSet is created in the cluster:
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elastic-gpu-exporter   1         1         1       1            1           <none>          3m36s
A running elastic-gpu-exporter Pod will be present on a qualified node:
NAME                         READY   STATUS    RESTARTS   AGE
elastic-gpu-exporter-dblqm   1/1     Running   0          6s
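
For reference, both objects can be listed with kubectl; the namespace and Pod label below follow the manifest in this document:
$ kubectl get daemonset elastic-gpu-exporter -n kube-system
$ kubectl get pods -n kube-system -l app=nano-gpu-exporter
If no Pod is scheduled, check that the node carries the qgpu-device-enable=enable label required by the DaemonSet's nodeSelector; it can be added with kubectl (the node name below is illustrative):
$ kubectl label node 10.0.66.4 qgpu-device-enable=enable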

Obtaining Monitoring Metrics

Each node running the elastic-gpu-exporter service exposes metrics at the /metrics path on port 5678, so you can run the following command to obtain monitoring metrics:
$ curl NodeIP:5678/metrics
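
The output can also be narrowed to the metric families documented below (the gpu_, pod_, and container_ prefixes) with a standard grep; NodeIP is a placeholder for the node's IP address:
$ curl -s NodeIP:5678/metrics | grep -E '^(gpu|pod|container)_'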

GPU Metrics

Metric                             Description
gpu_core_usage                     Actual computing power usage of the GPU
gpu_mem_usage                      Actual video memory usage of the GPU
gpu_core_utilization_percentage    GPU computing power utilization
gpu_mem_utilization_percentage     GPU video memory utilization
The GPU metrics format is as follows: 
gpu_core_usage{card="0",node="10.0.66.4"} 0
Note:
"card" represents the GPU serial number, and "node" represents the node where the GPU is located.

Pod Metrics

Metric                             Description
pod_core_usage                     Actual computing power usage of the Pod
pod_mem_usage                      Actual video memory usage of the Pod
pod_core_utilization_percentage    Percentage of the computing power used by the Pod to the requested computing power
pod_mem_utilization_percentage     Percentage of the video memory used by the Pod to the requested video memory
pod_core_occupy_node_percentage    Percentage of the computing power used by the Pod to the total computing power of the node
pod_mem_occupy_node_percentage     Percentage of the video memory used by the Pod to the total video memory of the node
pod_core_request                   Computing power requested by the Pod
pod_mem_request                    Video memory requested by the Pod
The Pod metrics format is as follows: 
pod_core_usage{namespace="default",node="10.0.66.4",pod="7a2fa737-eef1-4801-8937-493d7efb16b7"} 0
Note:
"namespace" represents the namespace of the Pod, "node" represents the node where the Pod is located, and "pod" represents the name of the Pod.

Container Metrics

Metric                                   Description
container_gpu_utilization                Actual computing power usage of the container
container_gpu_memory_total               Actual video memory usage of the container
container_core_utilization_percentage    Percentage of the computing power used by the container to the requested computing power
container_mem_utilization_percentage     Percentage of the video memory used by the container to the requested video memory
container_request_gpu_memory             Requested video memory of the container
container_request_gpu_utilization        Requested computing power of the container
The container metrics format is as follows: 
container_gpu_utilization{container="cuda",namespace="default",node="10.0.66.4",pod="cuda"} 0
Note:
"container" represents the container name, "namespace" represents the namespace of the container, "node" represents the node where the container is located, and "pod" represents the name of the Pod where the container is located.