High availability (HA) refers to the ability of an application system to maintain uninterrupted operation, which is usually achieved by improving the fault tolerance of the system. In general, you can improve an application's fault tolerance by configuring replicas to create multiple replicas of the application, but this alone does not guarantee high availability. This document describes best practices for deploying applications with high availability. Choose the ones that fit your situation.
Distributing and Scheduling Business Workloads
1. Using anti-affinity to prevent single-point failures
Kubernetes assumes that nodes are unreliable: the more nodes there are, the higher the probability that some node becomes unavailable due to a software or hardware failure. Therefore, we usually deploy multiple replicas of an application and adjust the replicas value based on the actual situation. If the value is 1, the application is a single point of failure. Even if the value is greater than 1, the risk remains if all replicas are scheduled to the same node.
To prevent single-point failures, we need an appropriate number of replicas, and we also need to make sure that different replicas are scheduled to different nodes. We can do so with anti-affinity. See the example below:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname
The relevant configurations in this example are described below:
requiredDuringSchedulingIgnoredDuringExecution
This sets anti-affinity as a required condition that must be met when Pods are scheduled. If no node meets the condition, the Pod is not scheduled to any node and stays pending. If you do not want anti-affinity to be a required condition, use preferredDuringSchedulingIgnoredDuringExecution to instruct the scheduler to try to meet the anti-affinity condition. If no node meets the condition, the Pod can still be scheduled to a node.
labelSelector.matchExpressions
This specifies the keys and values of the labels on the Pods that belong to the service.
topologyKey
This example uses kubernetes.io/hostname to prevent Pods from being scheduled to the same node. If you have higher requirements, such as preventing Pods from being scheduled to nodes in the same availability zone to achieve multi-site active-active disaster tolerance, you can use failure-domain.beta.kubernetes.io/zone. Generally, all the nodes in a cluster are in one region. If there are cross-region nodes, there will be considerable latency even if Direct Connect is used. If Pods must not be scheduled to nodes in the same region, you can use failure-domain.beta.kubernetes.io/region. In Kubernetes v1.17 and later, these two labels are deprecated in favor of topology.kubernetes.io/zone and topology.kubernetes.io/region.
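Where strict anti-affinity would leave Pods pending, the soft variant described above can be sketched as follows. This is a minimal illustration rather than a recommended setting: the weight of 100 is an arbitrary choice, and the k8s-app: kube-dns label is carried over from the earlier example. Unlike a required term, each preferred term carries a weight and nests its selector under podAffinityTerm.

```yaml
affinity:
  podAntiAffinity:
    # Soft rule: the scheduler tries to honor it, but still places the Pod
    # when no node satisfies it.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100              # illustrative value; the valid range is 1-100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname
```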
2. Using topologySpreadConstraints
The topologySpreadConstraints feature is enabled by default starting with Kubernetes v1.18. In clusters of v1.18 or later, it is recommended that you use topologySpreadConstraints to distribute Pods and improve service availability.
Widely distribute and schedule Pods to each node:
For example, distribute all Pods of nginx across different nodes as evenly as possible. The maximum allowed difference in the number of nginx replicas between nodes is 1. If no more Pods can be scheduled to a node for reasons such as insufficient resources, the remaining nginx replicas stay pending.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: nginx
    qcloud-app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 3  # more than one replica is needed for spreading to have any effect
  selector:
    matchLabels:
      k8s-app: nginx
      qcloud-app: nginx
  template:
    metadata:
      labels:
        k8s-app: nginx
        qcloud-app: nginx
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
        topologyKey: kubernetes.io/hostname  # each node is one topology domain
        labelSelector:
          matchLabels:
            k8s-app: nginx
      containers:
      - image: nginx
        name: nginx
        resources:
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 256Mi
      dnsPolicy: ClusterFirst
topologyKey: Similar to the topologyKey setting in podAntiAffinity.
labelSelector: Similar to the labelSelector setting in podAntiAffinity. It supports selecting the labels of multiple Pods.
maxSkew: An integer greater than 0 that indicates the maximum allowed difference in the number of Pods between topology domains. A value of 1 means that the Pod counts of any two domains may differ by at most one.
whenUnsatisfiable: Indicates what to do when the constraint cannot be met. DoNotSchedule means the Pod is not scheduled (it stays pending), which is similar to strong anti-affinity. ScheduleAnyway means the scheduler still places the Pod while spreading Pods across nodes as evenly as possible, which is similar to weak anti-affinity. To use it, change DoNotSchedule to ScheduleAnyway:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    whenUnsatisfiable: ScheduleAnyway
    topologyKey: kubernetes.io/hostname
    labelSelector:
      matchLabels:
        k8s-app: nginx
If the cluster has nodes in multiple availability zones (AZs), you can distribute Pods across the AZs as evenly as possible to achieve a higher level of availability. To do so, change topologyKey to topology.kubernetes.io/zone:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        k8s-app: nginx
Moreover, you can spread the Pods across the nodes within each AZ while also spreading them across the AZs:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    whenUnsatisfiable: ScheduleAnyway
    topologyKey: topology.kubernetes.io/zone
    labelSelector:
      matchLabels:
        k8s-app: nginx
  - maxSkew: 1
    whenUnsatisfiable: ScheduleAnyway
    topologyKey: kubernetes.io/hostname
    labelSelector:
      matchLabels:
        k8s-app: nginx
Using a Placement Group to Achieve Disaster Recovery in the Physical Layer
When the underlying hardware or software of a CVM fails, multiple nodes may be affected at the same time. Even if anti-affinity is used to distribute Pods to different nodes, service disruption may still be unavoidable. You can use a placement group to spread nodes across a physical layer, such as the CPM, switch, or rack layer, so that an underlying hardware or software fault does not affect multiple nodes at once. The steps are as follows:
1. Create a placement group.
Note:
The placement group and the TKE self-deployed cluster need to be in the same region.
2. Add a batch of nodes. In Advanced configuration, select Add the instance to a placement group and choose the placement group you created. For more information, see Adding Nodes.
3. On the "Node list" page, assign the same label to this batch of nodes to mark them. The nodes in a batch are added to the placement group together.
Note:
The placement group policy takes effect only for nodes of the same batch. Therefore, you need to add a label to each batch of nodes and use a different value for each batch.
4. Specify node affinity for the Pods of the workloads to be deployed so that the Pods are deployed on the same batch of nodes. In addition, specify Pod anti-affinity so that the Pods are distributed across the nodes in that batch. The YAML sample is as follows:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "placement-set-uniq"
          operator: In
          values:
          - "rack1"
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: kubernetes.io/hostname
Using PodDisruptionBudget to Avoid Service Unavailability Caused by Node Draining
Draining a node can disrupt the services running on it. The following describes the process of draining a node:
1. Cordon the node by setting it as unschedulable to prevent new Pods from being scheduled to it.
2. Delete Pods from the node.
3. Once the ReplicaSet controller detects that the number of Pods has decreased, it creates new Pods, which are scheduled to other nodes.
This process deletes the Pods first and creates the replacements afterwards, rather than performing a rolling update. Therefore, if all replicas of a service are on the drained node, the service may become unavailable during the drain. Normally, the service may become unavailable for two reasons:
1. All replicas of the service are on the same node, so draining that node deletes all of them. In such a case, use anti-affinity as described above to spread the replicas across nodes.
2. The service is deployed on multiple nodes, but these nodes are drained at the same time. All the replicas of the service are deleted simultaneously, which may cause the service to become unavailable. In such a case, you can configure a PDB (PodDisruptionBudget) to prevent the simultaneous deletion of all replicas. See the examples below:
Ensure that zookeeper has at least two available replicas at the time of node draining:
apiVersion: policy/v1  # use policy/v1beta1 on clusters older than v1.21
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
Ensure that zookeeper has no more than one unavailable replica at the time of node draining, which means that only one replica is deleted at a time and is recreated on another node:
apiVersion: policy/v1  # use policy/v1beta1 on clusters older than v1.21
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper
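Both minAvailable and maxUnavailable also accept a percentage of the selected Pods instead of an absolute number, which can be convenient when the replica count changes over time. The following is only a sketch: the name nginx-pdb and the 25% value are hypothetical, and the k8s-app: nginx label is borrowed from the nginx Deployment example earlier in this document.

```yaml
apiVersion: policy/v1          # policy/v1beta1 on clusters older than v1.21
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb              # hypothetical name
spec:
  # Illustrative value: at most a quarter of the matching Pods may be
  # unavailable at once due to voluntary disruptions such as node draining.
  maxUnavailable: "25%"
  selector:
    matchLabels:
      k8s-app: nginx
```

Note that a budget only helps when the workload has enough replicas. If minAvailable equals the replica count, a drain cannot evict any of the Pods.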
Using preStopHook and readinessProbe to Ensure Smooth and Uninterrupted Service Update
If a service is not configured appropriately, some requests may fail during a service update with the default configuration. Refer to the following guidance when deploying.
Service update scenarios
Some service update scenarios include:
Manually adjusting the number of service replicas.
Manually deleting Pods to trigger re-scheduling.
Draining nodes voluntarily or involuntarily, where Pods are deleted from the drained nodes and then recreated on other nodes.
Triggering rolling update, such as modifying the image tag to upgrade the program version.
HPA (HorizontalPodAutoscaler) automatically scaling services out or in.
VPA (VerticalPodAutoscaler) automatically scaling services up or down.
Reasons for connection errors during service update
During a rolling update, the Pods of the service being updated are created or terminated, and the Pod IP:Port entries for those Pods are added to or removed from the endpoints of the service. kube-proxy then updates the forwarding rules according to the updated Pod IP:Port list, but these rules are not updated immediately.
The forwarding rules are not updated immediately because Kubernetes components are decoupled from each other. Each component follows the controller pattern: it lists and watches (ListAndWatch) the resources it is interested in and responds with actions. Therefore, all the steps in the process, including Pod creation or termination, endpoint update, and forwarding rules update, happen asynchronously.
Because the forwarding rules lag behind, some connection errors can occur during the service update. The following two scenarios explain the reasons behind the connection errors:
Scenario 1: Pods have been created but have not fully started yet. The endpoint controller adds the Pods to the Pod IP:Port list of the service, and kube-proxy watches the update and updates the service forwarding rules (iptables/ipvs). A request made at this point could be forwarded to a Pod that has not fully started yet, and a connection error may occur because the Pod cannot process the request properly.
Scenario 2: Pods have been terminated, but because all the steps in the process are asynchronous, the forwarding rules have not yet been updated by the time the Pods are fully terminated. In such a case, new requests can still be forwarded to the terminated Pods, leading to connection errors.
Smooth update
To address the problem in scenario 1, add a readinessProbe to the containers in the Pods. After a container fully starts, it listens on an HTTP port, and kubelet sends readiness probes to that port. If the container responds normally, it is considered ready, and its status is set to Ready. Only when all the containers in a Pod are ready does the endpoint controller add the Pod to the IP:Port list in the corresponding endpoint of the Service. kube-proxy then updates the forwarding rules. In this way, even if a request is forwarded to the new Pod immediately, the Pod can process it normally, which avoids connection errors.
To address the problem in scenario 2, add a preStop hook to the containers in the Pods so that, before the Pods are fully terminated, they sleep for some time during which the endpoint controller and kube-proxy can update the endpoints and the forwarding rules. During that time, the Pods are in the Terminating status. Even if a request is forwarded to a terminating Pod before the forwarding rules are fully updated, the Pod can still process the request normally because it has not been terminated yet. Below is a YAML sample:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      component: nginx
  template:
    metadata:
      labels:
        component: nginx
    spec:
      containers:
      - name: nginx
        image: "nginx"
        ports:
        - name: http
          hostPort: 80
          containerPort: 80
          protocol: TCP
        # The Pod is added to the Service endpoints only after this probe succeeds.
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
            httpHeaders:
            - name: X-Custom-Header
              value: Awesome
          initialDelaySeconds: 15
          timeoutSeconds: 1
        # Keep serving for a while so that endpoints and forwarding rules can be updated.
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash", "-c", "sleep 30"]
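One detail worth checking when applying the sample above: the preStop sleep counts against the Pod's termination grace period, which defaults to 30 seconds. The fragment below is a sketch of how the two values could be paired so that the application still has time to shut down after the sleep. The 60-second grace period is an illustrative assumption, not a value recommended by this document.

```yaml
spec:
  template:
    spec:
      # Assumed value: it should be longer than the preStop sleep, otherwise
      # the container may be force-killed before it finishes shutting down.
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: "nginx"
        lifecycle:
          preStop:
            exec:
              # Keep serving while endpoints and forwarding rules are updated.
              command: ["/bin/bash", "-c", "sleep 30"]
```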