Tencent Cloud Observability Platform
Alarm Suppression
Last updated: 2024-08-22 16:20:00

Foreword

A single underlying issue can trigger hundreds of similar alarms, creating unnecessary Ops workload. The alarm suppression feature addresses this: when an alarm of a certain type is triggered, other related alarms are suppressed. For example, if an alarm reports that a cluster is inaccessible, you can configure an inhibition rule to silence all other alarms related to that cluster.

Directions

1. Log in to TMP Console.
2. In the Prometheus instance list, click the ID or name of the target instance.
3. On the Prometheus management center page, select Alarm Management > Inhibit Rules in the top navigation bar, then click Create.
4. On the Create page, configure the inhibition rule as prompted, then click Save.

Parameter Description

Source Matcher: Matches the alarm that triggers suppression (the source alarm). Select a label name, condition, and label value.
Target Matcher: Matches the alarms to be silenced (the target alarms). Select a label name, condition, and label value.
Equal: The label names whose values must be identical in the source and target alarms for suppression to take effect (the matching criteria). Select the label name(s).
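To make the matcher semantics concrete, here is a minimal Python sketch of one matcher row. It assumes that the Condition options correspond to the Prometheus matcher operators =, !=, =~ and !~, that a label missing from an alarm is compared as an empty string, and that multiple rows are combined with AND, as in the open-source Prometheus Alertmanager. The Matcher class and match_all function are hypothetical and not part of any TMP API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    """One matcher row: label name, condition, label value (hypothetical model)."""
    name: str
    op: str     # assumed to be one of "=", "!=", "=~", "!~"
    value: str

    def matches(self, labels: dict[str, str]) -> bool:
        # A label missing from the alarm is compared as an empty string.
        actual = labels.get(self.name, "")
        if self.op == "=":
            return actual == self.value
        if self.op == "!=":
            return actual != self.value
        if self.op == "=~":
            # Regex conditions are anchored at both ends, as in Prometheus.
            return re.fullmatch(self.value, actual) is not None
        if self.op == "!~":
            return re.fullmatch(self.value, actual) is None
        raise ValueError(f"unsupported condition: {self.op!r}")


def match_all(matchers: list[Matcher], labels: dict[str, str]) -> bool:
    """An alarm satisfies a list of matcher rows only if every row matches."""
    return all(m.matches(labels) for m in matchers)


if __name__ == "__main__":
    alarm = {"alertname": "HighCPUUsage", "instance": "instanceX", "severity": "critical"}
    print(match_all([Matcher("alertname", "=", "HighCPUUsage")], alarm))  # True
    print(match_all([Matcher("severity", "=~", "crit.*"),
                     Matcher("instance", "!=", "instanceX")], alarm))      # False
```

Because regex conditions are anchored here, the value crit matches only the exact label value crit; use crit.* to also match critical.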
Note:
How inhibition works: while an alarm that matches the source matcher (the source alarm) is firing, the rule silences every alarm that matches the target matcher (the target alarm), provided that both alarms have the same value for each label name listed in Equal.
To prevent an alarm from suppressing itself, an alarm that matches both the source and target matchers cannot be suppressed by any alarm that also matches both (including itself). We therefore recommend designing the source and target matchers so that no alarm can match both at the same time.
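The decision described in the note can be sketched in Python as follows. This is a minimal model that assumes TMP applies the same inhibition semantics as the open-source Prometheus Alertmanager; it uses exact-value matchers for brevity, and InhibitRule and is_suppressed are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class InhibitRule:
    """Hypothetical model of one inhibition rule (exact-value matchers only)."""
    source: dict[str, str]                          # Source Matcher: label -> value
    target: dict[str, str]                          # Target Matcher: label -> value
    equal: list[str] = field(default_factory=list)  # Equal: labels that must agree

    @staticmethod
    def _matches(matcher: dict[str, str], labels: dict[str, str]) -> bool:
        return all(labels.get(k, "") == v for k, v in matcher.items())

    def is_suppressed(self, alarm: dict[str, str], firing: list[dict[str, str]]) -> bool:
        """Return True if `alarm` is silenced given the currently firing alarms."""
        if not self._matches(self.target, alarm):
            return False  # only target alarms can be suppressed
        alarm_is_both = self._matches(self.source, alarm)
        for src in firing:
            if not self._matches(self.source, src):
                continue
            # Self-suppression guard: an alarm matching both sides cannot be
            # suppressed by an alarm that also matches both (including itself).
            if alarm_is_both and self._matches(self.target, src):
                continue
            # Every label listed in Equal must have the same value on both alarms.
            if all(src.get(n, "") == alarm.get(n, "") for n in self.equal):
                return True
        return False


if __name__ == "__main__":
    a = {"alertname": "DiskFull", "team": "db", "instance": "db-1"}
    b = {"alertname": "SlowQuery", "team": "db", "instance": "db-1"}

    # Poorly designed: every alarm with team=db matches BOTH sides.
    bad = InhibitRule(source={"team": "db"}, target={"team": "db"}, equal=["instance"])
    print(bad.is_suppressed(a, [a, b]), bad.is_suppressed(b, [a, b]))    # False False

    # Disjoint sides: DiskFull silences SlowQuery on the same instance.
    good = InhibitRule(source={"alertname": "DiskFull"},
                       target={"alertname": "SlowQuery"}, equal=["instance"])
    print(good.is_suppressed(b, [a, b]), good.is_suppressed(a, [a, b]))  # True False
```

With the first rule, the guard keeps every team=db alarm from being silenced, so the rule has no effect; the second rule keeps the two sides disjoint and behaves as intended.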

Example

Use Case: Alarm on High Server CPU Load

Scenario Description:

In a monitoring system, two alarms are configured:
Alarm A: CPU load exceeds 90%.
Alarm B: System response time exceeds 500 ms.
Both alarms share the same root cause: high CPU load on the server degrades system performance.
The policy rule for Alarm A is as follows:
  alert: HighCPUUsage
  expr: avg(rate(cpu_usage_seconds_total[5m])) by (instance) > 0.9
The policy rule for Alarm B is as follows:
  alert: HighResponseTime
  expr: avg(response_time_seconds) by (instance) > 0.5
The inhibition rule is configured as follows:
Source: alertname=HighCPUUsage
Target: alertname=HighResponseTime
Matching criteria (Equal): instance

Overall Effect:

Suppose the 5-minute average rate of the cpu_usage_seconds_total metric for instance=instanceX is 0.95 (95%). This exceeds the 0.9 threshold, so Alarm A is triggered and an alarm notification is sent.
Suppose the average value of the response_time_seconds metric for instance=instanceX is 0.8 s. This exceeds the 0.5 s threshold, so Alarm B is triggered. However, no notification is sent for Alarm B: Alarm A is firing with the same instance label value, so the inhibition rule matches.
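The example can be replayed with a short Python sketch. It hard-codes the sample values from the text (0.95 and 0.8 s for instanceX) instead of querying Prometheus, assumes each firing alarm carries the rule name in the alertname label (as Prometheus does) plus the instance label produced by the by (instance) aggregation, and uses hypothetical helper names.

```python
# Hypothetical replay of the example: thresholds come from the two rule
# expressions; metric values are the sample values from the text.
THRESHOLDS = {"HighCPUUsage": 0.9, "HighResponseTime": 0.5}
observed = {
    ("HighCPUUsage", "instanceX"): 0.95,     # 5m avg rate of cpu_usage_seconds_total
    ("HighResponseTime", "instanceX"): 0.8,  # avg response_time_seconds, in seconds
}

# Inhibition rule from the example: source -> target, Equal on "instance".
SOURCE, TARGET, EQUAL = "HighCPUUsage", "HighResponseTime", ["instance"]


def firing_alarms() -> list[dict[str, str]]:
    """Evaluate each `expr > threshold` and return the label sets of firing alarms."""
    alarms = []
    for (name, instance), value in observed.items():
        if value > THRESHOLDS[name]:
            # Prometheus attaches the rule name as the alertname label.
            alarms.append({"alertname": name, "instance": instance})
    return alarms


def notifications(alarms: list[dict[str, str]]) -> list[str]:
    """Return the alarms that still notify after applying the inhibition rule."""
    sources = [a for a in alarms if a["alertname"] == SOURCE]
    sent = []
    for alarm in alarms:
        suppressed = alarm["alertname"] == TARGET and any(
            all(src.get(k) == alarm.get(k) for k in EQUAL) for src in sources
        )
        if not suppressed:
            sent.append(f'{alarm["alertname"]}@{alarm["instance"]}')
    return sent


if __name__ == "__main__":
    fired = firing_alarms()
    print([a["alertname"] for a in fired])  # ['HighCPUUsage', 'HighResponseTime']
    print(notifications(fired))             # ['HighCPUUsage@instanceX']
```

Because no alarm in this example matches both the source and target rules, the self-suppression guard described in the note never comes into play. If the CPU alarm fired for a different instance, Alarm B on instanceX would still notify, since the instance values would not match.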
