elastic-gpu-exporter
elastic-gpu-exporter has been developed for obtaining GPU-related monitoring metrics at the GPU, Pod, and container levels (detailed in the tables below).

elastic-gpu-exporter is deployed to a cluster as a DaemonSet:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  labels:
    app: elastic-gpu-exporter
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
      app: nano-gpu-exporter
  template:
    metadata:
      name: elastic-gpu-exporter
      labels:
        name: gpu-manager-ds
        app: nano-gpu-exporter
    spec:
      nodeSelector:
        qgpu-device-enable: enable
      serviceAccount: elastic-gpu-exporter
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
      - image: ccr.ccs.tencentyun.com/tkeimages/elastic-gpu-exporter:v1.0.8
        imagePullPolicy: Always
        args:
        - --node=$(NODE_NAME)
        env:
        - name: "PORT"
          value: "5678"
        - name: "NODE_NAME"
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        name: elastic-gpu-exporter
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: cgroup
          readOnly: true
          mountPath: "/host/sys"
      volumes:
      - name: cgroup
        hostPath:
          type: Directory
          path: "/sys"
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - update
  - patch
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - pods/binding
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-gpu-exporter
subjects:
- kind: ServiceAccount
  name: elastic-gpu-exporter
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
  - name: elastic-gpu-exporter
    port: 5678
    protocol: TCP
    targetPort: 5678
  selector:
    app: nano-gpu-exporter
```
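A minimal way to roll this out, assuming the manifest above is saved locally as `elastic-gpu-exporter.yaml` (the filename is an assumption for this example):

```bash
# Apply all objects in the manifest (DaemonSet, RBAC, ServiceAccount, Service).
kubectl apply -f elastic-gpu-exporter.yaml

# Check the DaemonSet and its Pods in kube-system; the label selector below
# matches the Pod template labels in the manifest.
kubectl -n kube-system get ds elastic-gpu-exporter
kubectl -n kube-system get pods -l app=nano-gpu-exporter
```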
If the deployment succeeded, the DaemonSet and its Pod report Ready:

```
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elastic-gpu-exporter   1         1         1       1            1           <none>          3m36s
```

```
NAME                         READY   STATUS    RESTARTS   AGE
elastic-gpu-exporter-dblqm   1/1     Running   0          6s
```
elastic-gpu-exporter exposes its monitoring metrics through the `/metrics` path on port 5678 (set via the `PORT` environment variable above), so you can run the following command to obtain them:

```
$ curl NodeIP:5678/metrics
```
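The Service in the manifest carries the `prometheus.io/scrape: "true"` annotation, so a self-managed Prometheus using Kubernetes service discovery can pick the exporter up automatically. A minimal sketch of such a scrape job (the job name and relabeling choices are assumptions, not part of elastic-gpu-exporter itself):

```yaml
scrape_configs:
  - job_name: elastic-gpu-exporter
    kubernetes_sd_configs:
      - role: endpoints   # discover the endpoints behind the headless Service
    relabel_configs:
      # Keep only endpoints whose Service is annotated prometheus.io/scrape="true",
      # which the elastic-gpu-exporter Service above sets.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```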
| gpu_xxx | GPU Metrics |
| --- | --- |
| gpu_core_usage | Actual computing power usage of the GPU |
| gpu_mem_usage | Actual video memory usage of the GPU |
| gpu_core_utilization_percentage | GPU computing power utilization |
| gpu_mem_utilization_percentage | GPU video memory utilization |

Sample:

```
gpu_core_usage{card="0",node="10.0.66.4"} 0
```
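Because the endpoint returns plain Prometheus text format, a quick way to inspect a single metric family on a node is to filter the raw output; for example, to look at per-card computing power usage only (`NodeIP` as in the curl example above):

```bash
curl -s NodeIP:5678/metrics | grep '^gpu_core_usage'
```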
| pod_xxx | Pod Metrics |
| --- | --- |
| pod_core_usage | Actual computing power usage of the Pod |
| pod_mem_usage | Actual video memory usage of the Pod |
| pod_core_utilization_percentage | Percentage of the computing power used by the Pod relative to the computing power it requested |
| pod_mem_utilization_percentage | Percentage of the video memory used by the Pod relative to the video memory it requested |
| pod_core_occupy_node_percentage | Percentage of the computing power used by the Pod relative to the total computing power of the node |
| pod_mem_occupy_node_percentage | Percentage of the video memory used by the Pod relative to the total video memory of the node |
| pod_core_request | Computing power requested by the Pod |
| pod_mem_request | Video memory requested by the Pod |

Sample:

```
pod_core_usage{namespace="default",node="10.0.66.4",pod="7a2fa737-eef1-4801-8937-493d7efb16b7"} 0
```
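Once these series reach Prometheus, the percentage metrics translate directly into alerting rules. A hypothetical rule (the alert name and threshold are illustrative, not shipped with the exporter) that fires when a Pod stays above 90% of its requested video memory for five minutes:

```yaml
groups:
  - name: elastic-gpu-exporter.alerts
    rules:
      - alert: PodGpuMemoryNearRequest   # hypothetical alert name
        expr: pod_mem_utilization_percentage > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Pod {{ $labels.pod }} is using over 90% of its requested GPU memory'
```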
| container_xxx | Container Metrics |
| --- | --- |
| container_gpu_utilization | Actual computing power usage of the container |
| container_gpu_memory_total | Actual video memory usage of the container |
| container_core_utilization_percentage | Percentage of the computing power used by the container relative to the computing power it requested |
| container_mem_utilization_percentage | Percentage of the video memory used by the container relative to the video memory it requested |
| container_request_gpu_memory | Video memory requested by the container |
| container_request_gpu_utilization | Computing power requested by the container |

Sample:

```
container_gpu_utilization{container="cuda",namespace="default",node="10.0.66.4",pod="cuda"} 0
```
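Container-level series can be aggregated the same way as the Pod-level ones. A sketch of a Prometheus recording rule (the rule name is an assumption) that pre-computes total container video memory usage per node:

```yaml
groups:
  - name: elastic-gpu-exporter.recording
    rules:
      # Hypothetical recording rule: sum container video memory usage by node.
      - record: node:container_gpu_memory_total:sum
        expr: sum by (node) (container_gpu_memory_total)
```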