Basic Questions

Last updated: 2024-08-15 17:58:46

    How to use Prometheus's native default UI and other features?

    TencentCloud Managed Service for Prometheus (TMP) is different from open source, stand-alone Prometheus. TMP uses an architecture that separates collection from storage and does not provide the native default UI. Use Tencent Cloud Managed Service for Grafana as an alternative for query features.

    When an alarm is recovered, the $value in the notification template is incorrect. How can this be handled?

    The $value at alarm recovery is the last value that satisfied the alarm expression; a value that no longer satisfies the condition cannot be obtained. By design, the alarm expression is evaluated as a whole and is no different from an ordinary PromQL query. If a series in the query result keeps matching for the configured duration, the alarm fires; when a later query result no longer contains that series, the alarm corresponding to the series is recovered. Prometheus cannot deconstruct and interpret the alarm expression on its own, because some expressions contain no comparison relationship such as a threshold, for example: a and b and 123456789.
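
    For example, with a hypothetical threshold-style alarm expression (the metric name below is only illustrative), the $value reported at recovery is simply the last sample that still satisfied the expression:

        # Hypothetical alarm expression: fires while memory usage stays above 80
        node_memory_usage_percent > 80
        # If the last matching sample was 81.3 and the next evaluation returns no series,
        # the alarm recovers and $value stays at 81.3; the value that fell below 80 is not available.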

    How to view the alarm records?

    Prometheus itself has no concept of alarm records, so this feature cannot be fully supported. However, related alarms and their status can be viewed through the ALERTS or ALERTS_FOR_STATE metrics. In addition, all Prometheus alarms are currently delivered to the Tencent Cloud Observability Platform alarm service and then sent to users through the platform's notification channels. The alarm records feature of the Tencent Cloud Observability Platform can compensate for this to a certain extent; due to conceptual differences in design it is not a complete replacement, but it can be used for troubleshooting and similar needs. You can view records through the Alarm Records feature.
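
    For example, the built-in ALERTS metric can be queried like any other series to see which alarms are currently pending or firing (the alert name below is illustrative):

        # All alarms currently firing
        ALERTS{alertstate="firing"}
        # A specific alarm, whether pending or firing
        ALERTS{alertname="InstanceDown", alertstate=~"pending|firing"}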

    Why is the alarm repeat interval inconsistent with the configured value?

    The Prometheus alarm repeat interval (repeat_interval) is set not for a single alarm but for an entire group. When a new alarm appears in the group or an alarm in the group is recovered, the whole group is notified, and notifications are then resent at the repeat interval. Currently, TencentCloud Managed Service for Prometheus groups alarms by a single alarm policy. For example, a single alarm policy may cover restarts of instances A/B/C. When A/B restart, the alarm fires and notifications are sent at the repeat interval. If C also restarts after a while, the interval is interrupted and reset, and A/B/C are notified to the user together as one group. Due to implementation limitations, alarms delivered through the Tencent Cloud Observability Platform currently cannot be grouped and aggregated into a single notification message. If you need this, you can use a custom Alertmanager or Tencent Cloud Managed Service for Grafana.

    Why doesn't the rate/irate function produce any data when the original metrics exist?

    The rate/irate functions require at least two data points, so the time range passed to rate/irate must cover at least two data points. To allow for data points lost due to network exceptions, the official recommendation is to use a time range of at least four times the collection interval.
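
    As a sketch, assuming a 30-second collection interval and an illustrative counter metric, the range selector should cover at least four scrapes:

        # Collection interval 30s -> use a range of at least 4 x 30s = 2m
        rate(http_requests_total[2m])
        # A [30s] range often contains fewer than two data points and returns nothing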

    Why does the rate/irate function calculate a very large outlier?

    The rate/irate functions can only be used on Counter metrics, which are defined as monotonically increasing. Prometheus queries handle the case of a Counter being reset to 0, for example after a server restart. Normally this does not affect the result unless data points arrive out of order. For example, if two adjacent per-second data points 9999 and 10000 arrive out of order, the reset-handling logic computes (10000 + 9999) - 10000 = 9999 instead of the expected 1. The typical scenarios are as follows:
    Multiple collection components collect the same metric and report it to the same Prometheus instance. This can cause out-of-order data points, and Prometheus's reset-handling logic then produces a large outlier. Solution: if collecting the same metric more than once is not intended, troubleshoot and remove the duplicate collection. If the collection design intentionally duplicates scraping for high availability, each collection component must add a replica or similar label to distinguish all the metrics it collects (see the query sketch after these scenarios).
    Data is reported directly to TencentCloud Managed Service for Prometheus through Pushgateway or a similar method without timestamp information, so the TMP side stores the arrival time as the metric's timestamp. If the network jitters, data points can arrive out of order, or different processes/threads handling different data points can assign out-of-order timestamps, leading to calculation errors. The solution is to include timestamp information when reporting. The Pushgateway-style reporting built into TencentCloud Managed Service for Prometheus is only a supplement to remote write and does not implement the complete feature; it is recommended to use the agent (remote write) for collection and reporting unless Pushgateway is strictly necessary.
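
    If each collection component adds a distinguishing label such as replica (the label name is an assumption; use whatever your setup defines), the duplicated series no longer interleave, and they can be deduplicated at query time, for example:

        # Compute the rate per replica first, then collapse the duplicated series
        max without (replica) (rate(http_requests_total[2m]))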

    Why is the interval of data points returned by a query different from the collection interval?

    The interval of data points returned by a Prometheus query is determined by the query parameter interval/step. Each data point is filled and aligned strictly according to interval/step, and there is no one-to-one correspondence with the collection interval. If too many data points are missing or the collection interval is too large, data is not filled in. The storage side does not keep any information about the collection configuration, so users need to manage their collection configuration themselves.
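
    The same idea appears inside PromQL subqueries, where the resolution after the colon, not the collection interval, determines how many points are evaluated (the metric name is illustrative):

        # Evaluates node_load1 at a fixed 1m resolution over the last hour,
        # regardless of how often the metric is actually collected
        max_over_time(node_load1[1h:1m])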

    Why does the query return extra data for the last five minutes?

    By default, Prometheus fills in data for certain queries. Even if there is only one data point in the last five minutes, a query may return multiple data points over those five minutes (depending on the query's step/interval parameters). This is the default behavior of open source Prometheus and cannot be adjusted currently; generally it does not affect normal use.

    How do time-related functions in PromQL handle local time zones?

    All times in Prometheus are UTC, without exception; time zones were not considered in the design. This can make the day_of_* family of functions inconvenient to use, and official support cannot be expected in the short term. A temporary workaround for the UTC+8 time zone is, for example: day_of_week(vector(time() + 8*3600)).
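
    A minimal sketch of that workaround, which simply shifts the evaluation timestamp before applying the day_of_* functions:

        # Day of week in UTC+8 (0 = Sunday, 6 = Saturday)
        day_of_week(vector(time() + 8*3600))
        # The same shift works for other offsets, e.g. UTC-5: time() - 5*3600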

    Can Prometheus-related use limits be adjusted?

    Prometheus-related usage limits can be adjusted. In most cases, however, exceeding these limits may degrade user experience and performance, so raising a limit does not guarantee the query and write performance benchmarks that applied before the adjustment, and the service level agreement may no longer apply. Users should be prepared for the related risks.

    The chart shows that the alarm condition is met for the configured duration, but the alarm is not actually triggered. Why?

    The basic principle of Prometheus alarm detection is to run an instant query every minute and trigger the alarm if the condition is (continuously) met. In some cases, Prometheus range queries fill in missing data, so a timeline that looks continuous on a chart may actually be discontinuous under the alarm component's evaluation logic. In addition, because alarm checks run on a schedule, the time at which an alarm is detected may be slightly delayed. You can check the alarm status by querying the ALERTS or ALERTS_FOR_STATE metrics.
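
    For example, ALERTS_FOR_STATE records the timestamp at which each alarm's condition first became active, so you can check how long the condition has actually been met (the alert name is illustrative):

        # Current state of a specific alarm (pending or firing)
        ALERTS{alertname="InstanceDown"}
        # Seconds since the condition for this alarm first became active
        time() - ALERTS_FOR_STATE{alertname="InstanceDown"}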