tencent cloud

文档反馈

节点内存碎片化排障处理

最后更新时间:2024-12-13 11:32:48
    本文档介绍如何判断 TKE 集群中存在问题是否由内存碎片化引起,并给出解决方法,请按照以下步骤进行排查并解决。

    问题分析

    内存页分配失败,内核日志将会出现以下报错:
    mysqld: page allocation failure. order:4, mode:0x10c0d0
    mysqld:为被分配内存的程序。
    order:表示需要分配连续页的数量(2^order),本例中4则表示 2^4 = 16 个连续的页。
    mode:内存分配模式的标识,在内核源码文件 include/linux/gfp.h 中定义,通常是多个标识项与运算的结果。不同版本内核有一定区别。例如。在新版内核中 GFP_KERNEL__GFP_RECLAIM | __GFP_IO | __GFP_FS 的运算结果,而 __GFP_RECLAIM___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM 的运算结果。
    注意:
    当 order 值为0时,说明系统已经完全没有可用内存。
    当 order 值较大时,说明内存已碎片化,无法分配连续的大页内存。

    现象描述

    容器启动失败

    K8S 会为每个 Pod 创建 netns 来隔离 network namespace,内核初始化 netns 时会为其创建 nf_conntrack 表的 cache,需要申请大页内存,如果此时系统内存已经碎片化,无法分配到足够的大页内存内核就会出现如下报错(v2.6.33 - v4.6):
    runc:[1:CHILD]: page allocation failure: order:6, mode:0x10c0d0
    Pod 将会一直处于 ContainerCreating 状态,dockerd 启动容器也将失败。日志报错如下:
    Jan 23 14:15:31 dc05 dockerd: time="2019-01-23T14:15:31.288446233+08:00" level=error msg="containerd: start container" error="oci runtime error: container_linux.go:247: starting container process caused \\"process_linux.go:245: running exec setns process for init caused \\\\\\"exit status 6\\\\\\"\\"\\n" id=5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c
    Jan 23 14:15:31 dc05 dockerd: time="2019-01-23T14:15:31.317965799+08:00" level=error msg="Create container failed with error: invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""
    Kubelet 日志报错如下:
    Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352386 26037 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
    Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352496 26037 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
    Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352518 26037 kuberuntime_manager.go:618] createPodSandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
    Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352580 26037 pod_workers.go:182] Error syncing pod 485fd485-1ed6-11e9-8661-0a587f8021ea ("matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)"), skipping: failed to "CreatePodSandbox" for "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" with CreatePodSandboxError: "CreatePodSandbox for pod \\"matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)\\" failed: rpc error: code = 2 desc = failed to start sandbox container for pod \\"matchdataserver-1255064836-t4b2w\\": Error response from daemon: {\\"message\\":\\"invalid header field value \\\\\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\\\\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\"\\\\\\\\n\\\\\\"\\"}"
    Jan 23 14:15:31 dc05 kubelet: I0123 14:15:31.372181 26037 kubelet.go:1916] SyncLoop (PLEG): "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)", event: &pleg.PodLifecycleEvent{ID:"485fd485-1ed6-11e9-8661-0a587f8021ea", Type:"ContainerDied", Data:"5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c"}
    Jan 23 14:15:31 dc05 kubelet: W0123 14:15:31.372225 26037 pod_container_deletor.go:77] Container "5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c" not found in pod's containers
    Jan 23 14:15:31 dc05 kubelet: I0123 14:15:31.678211 26037 kuberuntime_manager.go:383] No ready sandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" can be found. Need to start a new one
    执行命令 cat /proc/buddyinfo 查看 slab,系统不存在大块内存时返回0数量较多。如下所示:
    $ cat /proc/buddyinfo
    Node 0, zone DMA 1 0 1 0 2 1 1 0 1 1 3
    Node 0, zone DMA32 2725 624 489 178 0 0 0 0 0 0 0
    Node 0, zone Normal 1163 1101 932 222 0 0 0 0 0 0 0

    系统 OOM

    内存碎片化严重,系统缺少大页内存,即使是当前系统总内存较多的情况下,也会出现给程序分配内存失败的现象。此时系统会做出内存不够用的误判,认为需要停止一些进程来释放内存,从而导致系统 OOM。

    处理步骤

    1. 周期性地或者在发现大块内存不足时,先进行 drop_cache 操作。
    echo 3 > /proc/sys/vm/drop_caches
    2. 必要时执行以下操作进行内存整理。
    注意:
    请谨慎执行此步骤,该操作开销较大,会造成业务卡顿。
    echo 1 > /proc/sys/vm/compact_memory
    联系我们

    联系我们,为您的业务提供专属服务。

    技术支持

    如果你想寻求进一步的帮助,通过工单与我们进行联络。我们提供7x24的工单服务。

    7x24 电话支持