Problem Description
In Docker 19 and later, when excessive system memory usage causes containerd to be killed in an Out of Memory (OOM) situation, Docker may stop and not restart automatically. To reproduce the issue, run the following command:
pkill -9 containerd; systemctl is-active dockerd containerd
At this point, systemd stops dockerd.
In the worst case, ordinary nodes become NotReady after the OOM event, and a failure on the primary node of an independent cluster can trigger a cascading (avalanche) failure.
Problem Analysis
Originally, the Docker community defined the dependency between dockerd and containerd as dockerd.service BindsTo containerd.service. Because of this dependency, systemd actively stops dockerd when containerd is forcibly terminated with kill -9. Even if Restart is configured for Docker, it does not recover. For more information, see:
Fixing Incremental Nodes
Incremental (newly added) nodes have included the fix since April 20, 2023.
Fixing Legacy Nodes
For legacy nodes, you can fix the problem with the following script:
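#!/bin/bash

# Insert a KEY=VALUE line after ExecStart= in containerd.service,
# but only if no line starting with KEY= is already present.
insert_if_absent() {
    line="${1}"
    lead="$(echo "${line}" | cut -f1 -d=)""="
    if ! grep "^${lead}" /usr/lib/systemd/system/containerd.service > /dev/null 2>&1; then
        sed -i "/^ExecStart=/a${line}" /usr/lib/systemd/system/containerd.service
    fi
}

# Make containerd an unlikely OOM victim and restart it whenever it exits.
insert_if_absent OOMScoreAdjust=-999
insert_if_absent RestartSec=5
insert_if_absent Restart=always

# Replace the hard BindsTo dependency with a weaker Wants dependency.
sed -i '/BindsTo/d' /usr/lib/systemd/system/dockerd.service
sed -i 's/^Wants.*/Wants=network-online.target containerd.service/' /usr/lib/systemd/system/dockerd.service

# Reload systemd so that the edited unit files take effect.
systemctl daemon-reload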
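To check whether a node is affected, or whether the script above has taken effect, one option is to look for the BindsTo line in the dockerd unit file. The following is a minimal sketch; unit_has_bindsto is a hypothetical helper name, and the unit file path is the same one the script edits.

```shell
#!/bin/bash
# unit_has_bindsto is a hypothetical helper, not part of Docker or systemd.
# It exits 0 if the given unit file still binds the unit to containerd.service.
unit_has_bindsto() {
    local unit_file="${1}"
    # Match an uncommented BindsTo= line that names containerd.service.
    grep -Eq '^BindsTo=.*containerd\.service' "${unit_file}"
}
```

For example, unit_has_bindsto /usr/lib/systemd/system/dockerd.service && echo "still bound to containerd" prints the message only on a node that has not been fixed. On a live node, systemctl show -p BindsTo dockerd.service should report the same dependency as systemd currently sees it.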
To verify that Docker now recovers after containerd is forcibly terminated, run the command below. As a further check, start a container with the docker run command.
pkill -9 containerd;systemctl is-active dockerd containerd
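Because the fix sets RestartSec=5, containerd may not report active in the first few seconds after it is killed, so a single immediate check can be misleading. The sketch below polls until both units are active or a timeout expires. wait_units_active is a hypothetical helper name, and the timeout value is an assumption to adjust for your environment.

```shell
#!/bin/bash
# wait_units_active is a hypothetical helper, not part of Docker or systemd.
# Usage: wait_units_active TIMEOUT_SECONDS UNIT [UNIT...]
# Exits 0 once every unit is active, or 1 if the timeout expires first.
wait_units_active() {
    local timeout="${1}"; shift
    local elapsed=0 unit all_active
    while [ "${elapsed}" -le "${timeout}" ]; do
        all_active=1
        for unit in "$@"; do
            # "systemctl is-active --quiet" exits 0 only when the unit is active.
            systemctl is-active --quiet "${unit}" || all_active=0
        done
        [ "${all_active}" -eq 1 ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A possible invocation is pkill -9 containerd; wait_units_active 30 dockerd containerd && echo "recovered". If the function returns a nonzero status, inspect the units with systemctl status dockerd containerd.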