Self-Heal Rules

Overview
Unstable infrastructure and unpredictable environments often trigger system failures at different levels. To reduce the Ops workload, the Tencent Kubernetes Engine (TKE) team developed the self-heal feature for the Node-Problem-Detector-Plus add-on. The feature helps Ops engineers locate system exceptions and applies minimal self-heal actions to individual check items, based on preset rules drawn from Ops experience.
Characteristics of the self-heal feature:
The system detects, in real time, persistent faults that require human intervention.
Detection covers dozens of check items across the operating system, the Kubernetes environment, and the runtime.
The feature responds to faults quickly based on preset experiential rules, for example by executing a fix script or restarting an add-on.
Check Items
Check Item	Description	Risk Level	Self-Heal Action
FDPressure	Too many open files. Checks whether the number of file descriptors on the server has reached 90% of the maximum value.	low	-
RuntimeUnhealthy	Listing containerd tasks failed	low	RestartRuntime
KubeletUnhealthy	Calling kubelet healthz failed	low	RestartKubelet
ReadonlyFilesystem	Filesystem is read-only	high	-
OOMKilling	Process has been OOM-killed	high	-
TaskHung	Task blocked for longer than the threshold	high	-
UnregisterNetDevice	Net device unregister	high	-
KernelOopsDivideError	Kernel oops with divide error	high	-
KernelOopsNULLPointer	Kernel oops with NULL pointer	high	-
Ext4Error	Ext4 filesystem error	high	-
Ext4Warning	Ext4 filesystem warning	high	-
IOError	IOError	high	-
MemoryError	MemoryError	high	-
DockerHung	Task blocked for longer than the threshold	high	-
KubeletRestart	Kubelet restart	low	-
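The FDPressure item compares file descriptor usage against 90% of the maximum value. The following is a minimal Python sketch of that comparison, not the add-on's actual implementation. It assumes a Linux node where /proc/sys/fs/file-nr reports the allocated, unused, and maximum file handle counts; the function names parse_file_nr and fd_pressure are hypothetical.

```python
from pathlib import Path
from typing import Tuple

# 90% of the maximum, taken from the FDPressure description.
FD_PRESSURE_PERCENT = 90


def parse_file_nr(text: str) -> Tuple[int, int]:
    """Parse the content of /proc/sys/fs/file-nr: '<allocated> <unused> <max>'."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fd_pressure(allocated: int, maximum: int, percent: int = FD_PRESSURE_PERCENT) -> bool:
    """Return True when usage has reached `percent` of the maximum."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return allocated * 100 >= maximum * percent


if __name__ == "__main__":
    path = Path("/proc/sys/fs/file-nr")  # assumption: Linux procfs is available
    if path.exists():
        allocated, maximum = parse_file_nr(path.read_text())
        state = fd_pressure(allocated, maximum)
        print(f"{allocated}/{maximum} file handles allocated; FDPressure={state}")
```

Running the script on a node prints the current usage and whether it would cross the 90% line under these assumptions.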
Enabling the Self-Heal Feature for Nodes
Enabling the feature in the TKE console
1. Log in to the TKE console and select Cluster in the left sidebar.
2. On the cluster list page, click the ID of the target cluster to go to the details page.
3. Choose Node management > Fault self-heal rule in the left sidebar to go to the Fault self-heal rule list page.
4. Click Create rule to create a new self-heal rule. See the figure below:
5. Return to the node pool list page.
6. Click the ID of the target node pool to go to the details page of the node pool.
7. In the Ops information section of the details page, click Edit to enable the self-heal feature for the node pool.
8. View the details of real-time fault detection in the Ops records section. A check item whose status is Failed did not pass detection.
Enabling the feature by using YAML
1. Create self-heal rules.
Save the following YAML configuration as demo-HealthCheckPolicy.yaml and run the kubectl create -f demo-HealthCheckPolicy.yaml command to create self-heal rules for the cluster:
apiVersion: config.tke.cloud.tencent.com/v1
kind: HealthCheckPolicy
metadata:
  name: test-all
  namespace: cls-xxxxxxxx  # the ID of the cluster
spec:
  machineSetSelector:
    matchLabels:
      key: fake-label
  rules:
  - action: RestartKubelet
    enabled: true
    name: FDPressure
  - action: RestartKubelet
    autoRepairEnabled: true
    enabled: true
    name: RuntimeUnhealthy
  - action: RestartKubelet
    autoRepairEnabled: true
    enabled: true
    name: KubeletUnhealthy
  - action: RestartKubelet
    enabled: true
    name: ReadonlyFilesystem
  - action: RestartKubelet
    enabled: true
    name: OOMKilling
  - action: RestartKubelet
    enabled: true
    name: TaskHung
  - action: RestartKubelet
    enabled: true
    name: UnregisterNetDevice
  - action: RestartKubelet
    enabled: true
    name: KernelOopsDivideError
  - action: RestartKubelet
    enabled: true
    name: KernelOopsNULLPointer
  - action: RestartKubelet
    enabled: true
    name: Ext4Error
  - action: RestartKubelet
    enabled: true
    name: Ext4Warning
  - action: RestartKubelet
    enabled: true
    name: IOError
  - action: RestartKubelet
    enabled: true
    name: MemoryError
  - action: RestartKubelet
    enabled: true
    name: DockerHung
  - action: RestartKubelet
    enabled: true
    name: KubeletRestart
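Because the rule list is repetitive, it can also be generated. The following Python sketch rebuilds the same HealthCheckPolicy as a dictionary and prints it as JSON, which kubectl create -f accepts in the same way as YAML. The field names and values are copied from the sample above; the helper build_policy and the constants are hypothetical, and setting autoRepairEnabled only on RuntimeUnhealthy and KubeletUnhealthy simply mirrors the sample rather than documented semantics.

```python
import json

# Check items in the order used by the sample policy.
CHECK_ITEMS = [
    "FDPressure", "RuntimeUnhealthy", "KubeletUnhealthy", "ReadonlyFilesystem",
    "OOMKilling", "TaskHung", "UnregisterNetDevice", "KernelOopsDivideError",
    "KernelOopsNULLPointer", "Ext4Error", "Ext4Warning", "IOError",
    "MemoryError", "DockerHung", "KubeletRestart",
]

# Items for which the sample sets autoRepairEnabled: true.
AUTO_REPAIR_ITEMS = {"RuntimeUnhealthy", "KubeletUnhealthy"}


def build_policy(name, cluster_id, match_labels, action="RestartKubelet"):
    """Build a HealthCheckPolicy manifest shaped like the sample above."""
    rules = []
    for item in CHECK_ITEMS:
        rule = {"action": action, "enabled": True, "name": item}
        if item in AUTO_REPAIR_ITEMS:
            rule["autoRepairEnabled"] = True
        rules.append(rule)
    return {
        "apiVersion": "config.tke.cloud.tencent.com/v1",
        "kind": "HealthCheckPolicy",
        "metadata": {"name": name, "namespace": cluster_id},
        "spec": {
            "machineSetSelector": {"matchLabels": match_labels},
            "rules": rules,
        },
    }


if __name__ == "__main__":
    policy = build_policy("test-all", "cls-xxxxxxxx", {"key": "fake-label"})
    print(json.dumps(policy, indent=2))
```

Redirect the output to a file and pass that file to kubectl create -f after replacing cls-xxxxxxxx with the real cluster ID.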
2. Enable the self-heal feature.
In the MachineSet YAML configuration file, set healthCheckPolicyName to test-all, the name of the HealthCheckPolicy created in step 1:
apiVersion: node.tke.cloud.tencent.com/v1beta1
kind: MachineSet
spec:
  type: Hosted
  displayName: demo-machineset
  replicas: 2
  autoRepair: true
  deletePolicy: Random
  healthCheckPolicyName: test-all
  instanceTypes:
  - C3.LARGE8
  subnetIDs:
  - subnet-xxxxxxxx
  - subnet-yyyyyyyy
......
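A mismatch between the MachineSet and the policy name is easy to miss. The following Python sketch checks two manifests, already loaded as dictionaries, for that mismatch. The helper check_self_heal_binding is hypothetical, and treating autoRepair: true as a requirement is an assumption inferred from the sample, not a documented rule.

```python
def check_self_heal_binding(machine_set: dict, policy: dict) -> list:
    """Return a list of problems found; an empty list means the two manifests agree."""
    problems = []
    spec = machine_set.get("spec", {})
    policy_name = policy.get("metadata", {}).get("name")

    # The MachineSet must reference the policy by its metadata.name.
    referenced = spec.get("healthCheckPolicyName")
    if referenced != policy_name:
        problems.append(
            f"healthCheckPolicyName is {referenced!r}, but the policy is named {policy_name!r}"
        )

    # Assumption: the sample enables autoRepair alongside the policy reference.
    if spec.get("autoRepair") is not True:
        problems.append("spec.autoRepair is not true")

    return problems


if __name__ == "__main__":
    machine_set = {"spec": {"autoRepair": True, "healthCheckPolicyName": "test-all"}}
    policy = {"metadata": {"name": "test-all"}}
    print(check_self_heal_binding(machine_set, policy) or "binding looks consistent")
```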
