Background
An Elasticsearch cluster comprises multiple nodes that work together to process client requests. In production environments, nodes can fail because of hardware faults, network problems, or software defects. A faulty node can degrade overall cluster performance and even disrupt normal business operations. For this reason, CFG provides node fault simulation.
Node fault simulation helps you understand how the Elasticsearch cluster behaves under various fault scenarios. For example, by simulating node downtime, network partitions, disk damage, and other faults, you can observe the cluster's recovery process and assess risks such as data loss and query latency. Continuous fault simulation helps identify and fix potential issues, optimize cluster configuration, and enhance cluster robustness. Node fault simulation can also be used for training and experiments: by working through realistic fault scenarios, team members become familiar with troubleshooting processes and improve their ability to respond to faults. Fault simulation can additionally serve as a stress testing tool to verify the cluster's stability under high-load conditions.
Conducting node fault simulations for Elasticsearch is a crucial method for ensuring cluster stability and reliability. By simulating various fault scenarios, you can proactively discover and resolve issues, improve the cluster's fault tolerance and availability, and ensure the smooth operation of the business.
Experiment Preparation
Prepare an ES cluster instance for experiments.
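Before injecting a fault, it can be useful to confirm that the prepared cluster is healthy enough to survive the loss of one node. The following is a minimal sketch, not part of CFG, that reads the standard Elasticsearch `_cluster/health` API. The function names, the `min_nodes` threshold, and the endpoint shown in the comment are assumptions for illustration; a cluster with security enabled would also need authentication, which is omitted here.

```python
import json
import urllib.request


def fetch_cluster_health(base_url, timeout=5.0):
    """GET <base_url>/_cluster/health and return the decoded JSON body."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def readiness_problems(health, min_nodes=2):
    """Return reasons why the cluster is a poor target for a node-down test.

    `health` is the dict returned by the _cluster/health API.
    `min_nodes` is an assumed threshold: with fewer nodes, taking one
    down leaves nothing to fail over to.
    """
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status!r}, expected 'green'")
    nodes = health.get("number_of_nodes", 0)
    if nodes < min_nodes:
        problems.append(f"only {nodes} node(s), need at least {min_nodes}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shard(s) already unassigned")
    return problems


# Example usage (hypothetical endpoint, replace with your cluster address):
#   health = fetch_cluster_health("http://localhost:9200")
#   for problem in readiness_problems(health):
#       print("not ready:", problem)
```

An empty list means none of the checks found an issue; it does not guarantee that the experiment is safe for a production workload.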
Step 1: Create an experiment
2. In the left sidebar, select the Experiment Management page, and click Create a New Experiment.
3. Click Skip and create a blank experiment.
4. After filling in the basic information, configure the experiment object. Select Big Data as the resource type and Elasticsearch Cluster as the resource object, then click Add Instance. A list of all Elasticsearch cluster instances in the current region appears. You can filter instances by cluster name, cluster ID, or private IP address.
5. After selecting the target instance, click Add Now to add the ES Node down experiment action, then click Next.
6. Set the action parameters. This document selects Random Node Downtime; choose the specific fault parameters based on the objectives of your experiment. Click Confirm.
7. Click Next to go to Global Configuration. See Quick Start for Global Configuration.
8. After confirmation, click Submit.
9. After creating the experiment, click Experiment Details in the pop-up dialog box to enter the Experiment Details page.
Step 2: Execute the experiment
1. Observe the instance monitoring data before the experiment, focusing on the advanced monitoring metrics. To view them, go to the ES console and click Elasticsearch Cluster > Cluster ID/Name > Node Monitoring.
2. On the Experiment Details page, click Execute to initiate the fault actions.
3. After the fault injection is successful, click the Fault Action panel to view the results and the executed nodes.
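The before-and-after observation in the steps above can also be cross-checked from the cluster side. The sketch below is a hedged illustration, independent of the CFG console: it records the node names reported by the standard Elasticsearch `_cat/nodes` API before the experiment and compares them with the names reported after fault injection. The function names and the endpoint in the comment are assumptions, and authentication is omitted.

```python
import json
import urllib.request


def fetch_node_names(base_url, timeout=5.0):
    """Return the set of node names from GET /_cat/nodes?format=json&h=name."""
    url = f"{base_url}/_cat/nodes?format=json&h=name"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # With format=json the API returns a list of objects, one per node.
        return {row["name"] for row in json.load(resp)}


def diff_nodes(before, after):
    """Compare two node-name sets taken before and after fault injection."""
    return {
        "left": sorted(before - after),    # nodes that disappeared
        "joined": sorted(after - before),  # nodes that appeared
    }


# Example usage (hypothetical endpoint, replace with your cluster address):
#   before = fetch_node_names("http://localhost:9200")
#   ... click Execute on the Experiment Details page ...
#   after = fetch_node_names("http://localhost:9200")
#   print(diff_nodes(before, after))
```

The names listed under `left` should match the nodes shown in the Fault Action panel; a mismatch suggests the snapshot was taken before the fault took effect or after the node had already recovered.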