Obtaining GPU Monitoring Metrics

Last updated: 2024-12-25 15:00:17

    Add-On Overview

    The Tencent Kubernetes Engine (TKE) add-on elastic-gpu-exporter collects GPU-related monitoring metrics, including:
    GPU utilization
    Pod/Container GPU resource utilization

    Deployment Mode

    elastic-gpu-exporter is deployed to the cluster as a DaemonSet:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: elastic-gpu-exporter
      namespace: kube-system
      labels:
        app: elastic-gpu-exporter
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          name: gpu-manager-ds
          app: nano-gpu-exporter
      template:
        metadata:
          name: elastic-gpu-exporter
          labels:
            name: gpu-manager-ds
            app: nano-gpu-exporter
        spec:
          nodeSelector:
            qgpu-device-enable: enable
          serviceAccount: elastic-gpu-exporter
          hostNetwork: true
          hostPID: true
          hostIPC: true
          containers:
          - image: ccr.ccs.tencentyun.com/tkeimages/elastic-gpu-exporter:v1.0.8
            imagePullPolicy: Always
            args:
            - --node=$(NODE_NAME)
            env:
            - name: "PORT"
              value: "5678"
            - name: "NODE_NAME"
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            name: elastic-gpu-exporter
            securityContext:
              capabilities:
                add: ["SYS_ADMIN"]
            volumeMounts:
            - name: cgroup
              readOnly: true
              mountPath: "/host/sys"
          volumes:
          - name: cgroup
            hostPath:
              type: Directory
              path: "/sys"
    ---
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: elastic-gpu-exporter
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - patch
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - update
      - patch
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - bindings
      - pods/binding
      verbs:
      - create
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: elastic-gpu-exporter
      namespace: kube-system
    ---
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: elastic-gpu-exporter
      namespace: kube-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: elastic-gpu-exporter
    subjects:
    - kind: ServiceAccount
      name: elastic-gpu-exporter
      namespace: kube-system
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: elastic-gpu-exporter
      namespace: kube-system
      annotations:
        prometheus.io/scrape: "true"
      labels:
        kubernetes.io/cluster-service: "true"
    spec:
      clusterIP: None
      ports:
      - name: elastic-gpu-exporter
        port: 5678
        protocol: TCP
        targetPort: 5678
      selector:
        app: nano-gpu-exporter

    Checking Running Status

    After deployment, an elastic-gpu-exporter DaemonSet is created in the cluster:
    NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    elastic-gpu-exporter   1         1         1       1            1           <none>          3m36s
    A running elastic-gpu-exporter Pod will be present on each qualified node:
    NAME                         READY   STATUS    RESTARTS   AGE
    elastic-gpu-exporter-dblqm   1/1     Running   0          6s

    Obtaining Monitoring Metrics

    The elastic-gpu-exporter service exposes its metrics on each node at the /metrics path, so you can run the following command against the node IP to obtain monitoring metrics:
    $ curl NodeIP:5678/metrics
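For programmatic collection, the same endpoint can be read over plain HTTP. The sketch below is illustrative, not part of the add-on; the node IP is a placeholder, and the default port 5678 comes from the PORT environment variable in the manifest.

```python
import urllib.request

def fetch_metrics(node_ip: str, port: int = 5678) -> str:
    """Read the Prometheus text exposition from elastic-gpu-exporter on one node."""
    url = f"http://{node_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

The returned body is standard Prometheus text format, so it can be fed to any Prometheus-compatible parser.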

    GPU Metrics

    The GPU metrics (gpu_xxx) are as follows:
    gpu_core_usage: actual computing power usage of the GPU
    gpu_mem_usage: actual video memory usage of the GPU
    gpu_core_utilization_percentage: GPU computing power utilization
    gpu_mem_utilization_percentage: GPU video memory utilization
    The GPU metrics format is as follows: 
    gpu_core_usage{card="0",node="10.0.66.4"} 0
    Note:
    "card" represents the GPU serial number, and "node" represents the node where the GPU is located.

    Pod Metrics

    The Pod metrics (pod_xxx) are as follows:
    pod_core_usage: actual computing power usage of the Pod
    pod_mem_usage: actual video memory usage of the Pod
    pod_core_utilization_percentage: computing power used by the Pod as a percentage of its requested computing power
    pod_mem_utilization_percentage: video memory used by the Pod as a percentage of its requested video memory
    pod_core_occupy_node_percentage: computing power used by the Pod as a percentage of the node's total computing power
    pod_mem_occupy_node_percentage: video memory used by the Pod as a percentage of the node's total video memory
    pod_core_request: computing power requested by the Pod
    pod_mem_request: video memory requested by the Pod
    The Pod metrics format is as follows: 
    pod_core_usage{namespace="default",node="10.0.66.4",pod="7a2fa737-eef1-4801-8937-493d7efb16b7"} 0
    Note:
    "namespace" represents the namespace of the Pod, "node" represents the node where the Pod is located, and "pod" represents the name of the Pod.

    Container Metrics

    The container metrics (container_xxx) are as follows:
    container_gpu_utilization: actual computing power usage of the container
    container_gpu_memory_total: actual video memory usage of the container
    container_core_utilization_percentage: computing power used by the container as a percentage of its requested computing power
    container_mem_utilization_percentage: video memory used by the container as a percentage of its requested video memory
    container_request_gpu_memory: video memory requested by the container
    container_request_gpu_utilization: computing power requested by the container
    The container metrics format is as follows: 
    container_gpu_utilization{container="cuda",namespace="default",node="10.0.66.4",pod="cuda"} 0
    Note:
    "container" represents the container name, "namespace" represents the namespace of the container, "node" represents the node where the container is located, and "pod" represents the name of the Pod where the container is located.