节点内存碎片化排障处理

Recent Pages

节点内存碎片化排障处理

最后更新时间：2024-12-13 11:32:48

本文档介绍如何判断 TKE 集群中存在问题是否由内存碎片化引起，并给出解决方法，请按照以下步骤进行排查并解决。
问题分析
内存页分配失败，内核日志将会出现以下报错：
mysqld: page allocation failure. order:4, mode:0x10c0d0
mysqld：为被分配内存的程序。
order：表示需要分配连续页的数量（2^order），本例中4则表示 2^4 = 16 个连续的页。
mode：内存分配模式的标识，在内核源码文件 include/linux/gfp.h 中定义，通常是多个标识项与运算的结果。不同版本内核有一定区别。例如。在新版内核中 GFP_KERNEL 是 __GFP_RECLAIM | __GFP_IO | __GFP_FS 的运算结果，而 __GFP_RECLAIM 是 ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM 的运算结果。
注意：
当 order 值为0时，说明系统已经完全没有可用内存。
当 order 值较大时，说明内存已碎片化，无法分配连续的大页内存。
现象描述
容器启动失败
K8S 会为每个 Pod 创建 netns 来隔离 network namespace，内核初始化 netns 时会为其创建 nf_conntrack 表的 cache，需要申请大页内存，如果此时系统内存已经碎片化，无法分配到足够的大页内存内核就会出现如下报错（v2.6.33 - v4.6）：
runc:[1:CHILD]: page allocation failure: order:6, mode:0x10c0d0
Pod 将会一直处于 ContainerCreating 状态，dockerd 启动容器也将失败。日志报错如下：
Jan 23 14:15:31 dc05 dockerd: time="2019-01-23T14:15:31.288446233+08:00" level=error msg="containerd: start container" error="oci runtime error: container_linux.go:247: starting container process caused \\"process_linux.go:245: running exec setns process for init caused \\\\\\"exit status 6\\\\\\"\\"\\n" id=5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c
Jan 23 14:15:31 dc05 dockerd: time="2019-01-23T14:15:31.317965799+08:00" level=error msg="Create container failed with error: invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""
Kubelet 日志报错如下：
Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352386   26037 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352496   26037 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352518   26037 kuberuntime_manager.go:618] createPodSandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" failed: rpc error: code = 2 desc = failed to start sandbox container for pod "matchdataserver-1255064836-t4b2w": Error response from daemon: {"message":"invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\"\\\\\\"\\\\n\\""}
Jan 23 14:15:31 dc05 kubelet: E0123 14:15:31.352580   26037 pod_workers.go:182] Error syncing pod 485fd485-1ed6-11e9-8661-0a587f8021ea ("matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)"), skipping: failed to "CreatePodSandbox" for "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" with CreatePodSandboxError: "CreatePodSandbox for pod \\"matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)\\" failed: rpc error: code = 2 desc = failed to start sandbox container for pod \\"matchdataserver-1255064836-t4b2w\\": Error response from daemon: {\\"message\\":\\"invalid header field value \\\\\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\\\\\\\\"process_linux.go:245: running exec setns process for init caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"exit status 6\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\"\\\\\\\\n\\\\\\"\\"}"
Jan 23 14:15:31 dc05 kubelet: I0123 14:15:31.372181   26037 kubelet.go:1916] SyncLoop (PLEG): "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)", event: &pleg.PodLifecycleEvent{ID:"485fd485-1ed6-11e9-8661-0a587f8021ea", Type:"ContainerDied", Data:"5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c"}
Jan 23 14:15:31 dc05 kubelet: W0123 14:15:31.372225   26037 pod_container_deletor.go:77] Container "5b9be8c5bb121264899fac8d9d36b02150269d41ce96ba6ad36d70b8640cb01c" not found in pod's containers
Jan 23 14:15:31 dc05 kubelet: I0123 14:15:31.678211   26037 kuberuntime_manager.go:383] No ready sandbox for pod "matchdataserver-1255064836-t4b2w_basic(485fd485-1ed6-11e9-8661-0a587f8021ea)" can be found. Need to start a new one
执行命令 cat /proc/buddyinfo 查看 slab，系统不存在大块内存时返回0数量较多。如下所示：
$ cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   2725    624    489    178      0      0      0      0      0      0      0
Node 0, zone   Normal   1163   1101    932    222      0      0      0      0      0      0      0
系统 OOM
内存碎片化严重，系统缺少大页内存，即使是当前系统总内存较多的情况下，也会出现给程序分配内存失败的现象。此时系统会做出内存不够用的误判，认为需要停止一些进程来释放内存，从而导致系统 OOM。
处理步骤
1. 周期性地或者在发现大块内存不足时，先进行 drop_cache 操作。
echo 3 > /proc/sys/vm/drop_caches
2. 必要时执行以下操作进行内存整理。
注意：
请谨慎执行此步骤，该操作开销较大，会造成业务卡顿。
echo 1 > /proc/sys/vm/compact_memory

联系我们

联系我们，为您的业务提供专属服务。

技术支持

如果你想寻求进一步的帮助，通过工单与我们进行联络。我们提供7x24的工单服务。

7x24 电话支持

tencent cloud

Recent Pages

节点内存碎片化排障处理

问题分析

现象描述

容器启动失败

系统 OOM

处理步骤

本页内容是否解决了您的问题？

本页内容是否解决了您的问题？

tencent cloud

注册

登录

Recent Pages

节点内存碎片化排障处理

问题分析

现象描述

容器启动失败

系统 OOM

处理步骤

本页内容是否解决了您的问题？

本页内容是否解决了您的问题？