Tencent Cloud Smart Advisor (TSA) is a cloud governance platform that provides multiple vertical applications in the IT operations management (ITOM) field. Drawing on Tencent Cloud's experience in large-scale operations and maintenance, it optimizes cloud infrastructure through governance solutions such as Cloud Risk Assessment and Chaotic Fault Generator to improve system security and service reliability.
Out-of-the-box cloud resource risk assessment discovers business risks and provides online optimization suggestions based on actual needs, improving business continuity. Combined with efficient and safe fault experiment services, it helps you promptly discover business disaster recovery risks and verify the effectiveness of high-availability plans, thereby improving system availability and resilience.
TSA - Cloud Risk Assessment
Cloud Risk Assessment is an out-of-the-box product that assesses risks for Tencent Cloud resources. After you grant Cloud Risk Assessment a CAM service role, it can quickly assess and analyze risks in cloud resources, application architecture, business performance, and security, and then offer optimization suggestions online based on actual business usage, helping improve system security, business stability, and service reliability.
List of supported products
Cloud Risk Assessment provides a wide variety of assessment items, flexible assessment configurations, and system optimization suggestions to help you improve business continuity.
It offers risk assessment items across multiple dimensions, such as security, reliability, cost, service restriction, and performance, for different Tencent Cloud products. For the cloud products currently supported, see Assessment Settings. More Tencent Cloud products and services will be supported, and more risk assessment items will be available.
TSA - Chaotic Fault Generator
Chaotic Fault Generator (CFG) provides efficient, convenient, safe, and reliable fault injection services. It also provides industry templates, monitoring guardrails, and other core functions to help users promptly discover business disaster recovery risks and verify the effectiveness of high-availability plans, thereby improving system availability and resilience.
Basic Concepts
Understanding the following concepts before you use CFG will help you get started with the product faster.
Concept | Description | Example |
Chaos engineering | Chaos engineering is the discipline of experimenting on distributed systems. Through practice, it refines the understanding of a system and uncovers its unknown weaknesses. The goal is to build the system's ability to withstand out-of-control conditions in the production environment, and confidence in that ability. | - |
Experiment | The process of verifying and improving system availability by injecting specified faults into specified locations of the system and observing the experimental results. | - |
Action | An atomic fault injected into the system during an experiment, covering fault injection scenarios for IaaS, PaaS, and SaaS. Within an experiment, users can freely combine and orchestrate multiple actions. An action group is a collection of actions. | High CPU usage, CVM shutdown, and database primary/secondary switch |
Object | The instance object that the action acts on. | CVM and MySQL |
Template | A valuable or frequently used experiment or scenario saved for quick reuse. A template includes the basic experiment information and the action orchestration, so you only need to specify the experiment object when you reuse it. | Cross-AZ disaster recovery experiment template and network fault template |
Monitoring metrics | System steady-state metrics configured in advance to determine whether the system is running stably and whether the fault injection succeeded. Observing changes in these metrics during an experiment shows system changes in real time. | Disk usage (%) |
Guardrail policy | A set of alarm metrics and trigger policies. When an alarm metric reaches its trigger threshold, the system can automatically stop the experiment and roll back the actions to control the impact scope of the experiment. | If the disk usage (%) reaches 90%, the experiment will automatically stop. |
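To make the relationship between these concepts concrete, the following is a minimal Python sketch of how an experiment, its actions, its objects, and a guardrail policy might fit together. It is an illustration only and does not use the CFG API: the class names (Action, Guardrail, Experiment), the instance ID "cvm-example-1", and the metric samples are all hypothetical, and the fault itself is simulated by updating a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Action:
    """An atomic fault (hypothetical model, not the CFG API)."""
    name: str
    inject: Callable[[str], None]    # called with the object (instance) ID
    rollback: Callable[[str], None]  # undoes the fault on that object


@dataclass
class Guardrail:
    """Trips once the monitored metric reaches the trigger threshold."""
    metric: str
    threshold: float

    def triggered(self, value: float) -> bool:
        return value >= self.threshold


@dataclass
class Experiment:
    actions: List[Action]
    objects: List[str]               # instance IDs the actions act on
    guardrail: Guardrail
    log: List[str] = field(default_factory=list)

    def run(self, metric_samples: Iterable[float]) -> str:
        # Inject every action into every object.
        for action in self.actions:
            for obj in self.objects:
                action.inject(obj)
                self.log.append(f"inject {action.name} on {obj}")

        # Observe the steady-state metric; stop early if the guardrail trips.
        status = "completed"
        for value in metric_samples:
            if self.guardrail.triggered(value):
                self.log.append(f"guardrail: {self.guardrail.metric}={value}")
                status = "stopped_by_guardrail"
                break

        # Roll back in reverse order, whether or not the guardrail tripped.
        for action in reversed(self.actions):
            for obj in self.objects:
                action.rollback(obj)
                self.log.append(f"rollback {action.name} on {obj}")
        return status


# Simulated target state; a real fault would act on a cloud instance.
state: Dict[str, str] = {}
cpu_stress = Action(
    name="High CPU usage",
    inject=lambda obj: state.__setitem__(obj, "stressed"),
    rollback=lambda obj: state.__setitem__(obj, "normal"),
)
exp = Experiment(
    actions=[cpu_stress],
    objects=["cvm-example-1"],       # hypothetical instance ID
    guardrail=Guardrail(metric="Disk usage (%)", threshold=90.0),
)
result = exp.run([72.0, 85.0, 91.5, 60.0])  # made-up metric samples
print(result)
print(exp.log)
```

In this sketch the third sample (91.5) reaches the 90% threshold, so the run stops early and the action is rolled back, mirroring the guardrail example in the table. A template, in these terms, would be an Experiment definition saved without its objects list.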