Self-Heal Rules

Last updated: 2023-05-05 11:05:32

Overview

Unstable infrastructure and unpredictable environments often trigger system failures at different levels. To reduce the Ops workload, the Tencent Kubernetes Engine (TKE) team has developed a self-heal feature for the Node-Problem-Detector-Plus add-on. The feature helps Ops engineers locate system exceptions and takes minimal self-heal actions for various check items, based on preset rules drawn from Ops experience.
Characteristics of the self-heal feature:
The system detects, in real time, persistent faults that require human intervention.
Detection covers dozens of check items across the operating system, the Kubernetes environment, and the runtime.
The feature responds quickly to faults based on the preset rules, for example by executing a fix script or restarting an add-on.

Check Items

| Check Item | Description | Risk Level | Self-Heal Action |
|---|---|---|---|
| FDPressure | Too many files opened. Checks whether the number of file descriptors on the server has reached 90% of the maximum value. | low | - |
| RuntimeUnhealthy | Listing containerd tasks failed | low | RestartRuntime |
| KubeletUnhealthy | Calling kubelet healthz failed | low | RestartKubelet |
| ReadonlyFilesystem | Filesystem is read-only | high | - |
| OOMKilling | Process has been OOM-killed | high | - |
| TaskHung | Task blocked for longer than the threshold | high | - |
| UnregisterNetDevice | Net device unregister | high | - |
| KernelOopsDivideError | Kernel oops with divide error | high | - |
| KernelOopsNULLPointer | Kernel oops with NULL pointer | high | - |
| Ext4Error | Ext4 filesystem error | high | - |
| Ext4Warning | Ext4 filesystem warning | high | - |
| IOError | I/O error | high | - |
| MemoryError | Memory error | high | - |
| DockerHung | Task blocked for longer than the threshold | high | - |
| KubeletRestart | Kubelet restart | low | - |
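
The FDPressure description can be made concrete with a short sketch. It assumes a Linux node where /proc/sys/fs/file-nr reports three numbers (allocated handles, unused handles, system-wide maximum). The function names and the choice of this file are illustrative assumptions, not the add-on's actual implementation.

```python
from pathlib import Path

# 90% of the maximum, as stated in the FDPressure description.
FD_PRESSURE_PERCENT = 90


def parse_file_nr(text: str) -> tuple[int, int]:
    """Return (allocated, maximum) from the contents of /proc/sys/fs/file-nr."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fd_pressure(allocated: int, maximum: int, percent: int = FD_PRESSURE_PERCENT) -> bool:
    """True when file-handle usage has reached `percent` of the maximum."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return allocated * 100 >= maximum * percent


def check_node(path: str = "/proc/sys/fs/file-nr") -> bool:
    """Read the live counters on a Linux node and evaluate the check."""
    return fd_pressure(*parse_file_nr(Path(path).read_text()))
```

With a maximum of 10240 handles, the check would start reporting pressure once 9216 handles are allocated.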

Enabling the Self-Heal Feature for Nodes

Enabling the feature in the TKE console

1. Log in to the TKE console and select Cluster in the left sidebar.
2. On the cluster list page, click the ID of the target cluster to go to the details page.
3. Choose Node management > Fault self-heal rule in the left sidebar to go to the Fault self-heal rule list page.
4. Click Create rule to create a new self-heal rule.
5. Return to the node pool list page.
6. Click the ID of the target node pool to go to the details page of the node pool.
7. In the Ops information section of the details page, click Edit to enable the self-heal feature for the node pool.
8. View the details of real-time fault detection in the Ops records section. A status of Failed means that the node did not pass that check item.

Enabling the feature by using YAML

1. Create self-heal rules.

Prepare the YAML configuration file as follows, then run the kubectl create -f demo-HealthCheckPolicy.yaml command to create self-heal rules for a cluster:
apiVersion: config.tke.cloud.tencent.com/v1
kind: HealthCheckPolicy
metadata:
  name: test-all
  namespace: cls-xxxxxxxx  # the ID of the cluster
spec:
  machineSetSelector:
    matchLabels:
      key: fake-label
  rules:
    - action: RestartKubelet
      enabled: true
      name: FDPressure
    # autoRepairEnabled appears only on the two rules below.
    - action: RestartKubelet
      autoRepairEnabled: true
      enabled: true
      name: RuntimeUnhealthy
    - action: RestartKubelet
      autoRepairEnabled: true
      enabled: true
      name: KubeletUnhealthy
    - action: RestartKubelet
      enabled: true
      name: ReadonlyFilesystem
    - action: RestartKubelet
      enabled: true
      name: OOMKilling
    - action: RestartKubelet
      enabled: true
      name: TaskHung
    - action: RestartKubelet
      enabled: true
      name: UnregisterNetDevice
    - action: RestartKubelet
      enabled: true
      name: KernelOopsDivideError
    - action: RestartKubelet
      enabled: true
      name: KernelOopsNULLPointer
    - action: RestartKubelet
      enabled: true
      name: Ext4Error
    - action: RestartKubelet
      enabled: true
      name: Ext4Warning
    - action: RestartKubelet
      enabled: true
      name: IOError
    - action: RestartKubelet
      enabled: true
      name: MemoryError
    - action: RestartKubelet
      enabled: true
      name: DockerHung
    - action: RestartKubelet
      enabled: true
      name: KubeletRestart
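
Writing fifteen near-identical rule entries by hand is error-prone. As a hedged sketch, the same manifest could be generated programmatically and written out as JSON, which kubectl accepts in place of YAML. The helper name build_policy and the AUTO_REPAIR set are illustrative. The field names and values mirror the sample above, including its use of RestartKubelet as the action on every rule.

```python
import json

# Check item names, in the order used by the sample manifest.
CHECK_ITEMS = [
    "FDPressure", "RuntimeUnhealthy", "KubeletUnhealthy", "ReadonlyFilesystem",
    "OOMKilling", "TaskHung", "UnregisterNetDevice", "KernelOopsDivideError",
    "KernelOopsNULLPointer", "Ext4Error", "Ext4Warning", "IOError",
    "MemoryError", "DockerHung", "KubeletRestart",
]

# Rules that carry autoRepairEnabled: true in the sample manifest.
AUTO_REPAIR = {"RuntimeUnhealthy", "KubeletUnhealthy"}


def build_policy(name: str, cluster_id: str) -> dict:
    """Build a HealthCheckPolicy manifest shaped like the sample above."""
    rules = []
    for item in CHECK_ITEMS:
        rule = {"action": "RestartKubelet", "enabled": True, "name": item}
        if item in AUTO_REPAIR:
            rule["autoRepairEnabled"] = True
        rules.append(rule)
    return {
        "apiVersion": "config.tke.cloud.tencent.com/v1",
        "kind": "HealthCheckPolicy",
        "metadata": {"name": name, "namespace": cluster_id},
        "spec": {
            "machineSetSelector": {"matchLabels": {"key": "fake-label"}},
            "rules": rules,
        },
    }


if __name__ == "__main__":
    # cls-xxxxxxxx is the placeholder cluster ID from the sample.
    print(json.dumps(build_policy("test-all", "cls-xxxxxxxx"), indent=2))
```

Redirecting the output to a file and passing that file to kubectl create -f would then be equivalent to applying the hand-written YAML.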


2. Enable the self-heal feature.

In the YAML configuration file of the MachineSet, set healthCheckPolicyName to test-all, the name of the HealthCheckPolicy created in step 1:
apiVersion: node.tke.cloud.tencent.com/v1beta1
kind: MachineSet
spec:
  type: Hosted
  displayName: demo-machineset
  replicas: 2
  autoRepair: true
  deletePolicy: Random
  healthCheckPolicyName: test-all
  instanceTypes:
    - C3.LARGE8
  subnetIDs:
    - subnet-xxxxxxxx
    - subnet-yyyyyyyy
  ......
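
The two manifests are linked only by the policy name, so a typo there is easy to miss. The following pre-flight check is a minimal sketch, assuming both manifests have already been loaded into plain dictionaries (for example from JSON). The function name is hypothetical, and treating autoRepair: true as required simply mirrors the sample rather than a documented rule.

```python
def self_heal_problems(machine_set: dict, policy: dict) -> list[str]:
    """Return human-readable mismatches between a MachineSet and a policy."""
    problems = []
    spec = machine_set.get("spec", {})
    policy_name = policy.get("metadata", {}).get("name")

    # The sample MachineSet sets autoRepair: true alongside the policy name.
    if spec.get("autoRepair") is not True:
        problems.append("spec.autoRepair is not true")

    # healthCheckPolicyName must match the HealthCheckPolicy's metadata.name.
    referenced = spec.get("healthCheckPolicyName")
    if referenced != policy_name:
        problems.append(
            f"spec.healthCheckPolicyName is {referenced!r}, "
            f"but the policy is named {policy_name!r}"
        )
    return problems
```

An empty list means the MachineSet references the policy the same way the sample does.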

