Instance Diagnosis

Last updated: 2024-07-30 18:14:31

    Background

    To improve the experience of the Prometheus monitoring collection side, a new collection architecture is now available. The new architecture supports instance diagnosis and system health checks, and improves both the resource utilization of the collection agent and the stability of metric collection. This document describes how to upgrade from the old collection architecture to the new one on the Instance Diagnosis page of the console, and how to view detailed collection and storage information for the current instance.

    Directions

    Upgrading to the New Architecture

    1. Log in to the TMP Console.
    2. In the Prometheus instance list, click the Instance ID/Name.
    3. In the Prometheus management center, click Instance Diagnosis on the top navigation bar and select the corresponding collection cluster.
    4. Click OK to upgrade to the new architecture.
    
    
    
    Note:
    The upgrade is expected to take about 5 minutes, and metric collection may be interrupted for 1-2 minutes while it runs.
    If the managed cluster of the instance has fewer than 10 available IPs, a risk warning appears. To prevent component upgrade failures caused by insufficient IPs, increase the number of available IPs as described below.
    
    
    

    Adding Subnet

    1. On the Instance Diagnosis page, click Subnet Information to enter the Managed Cluster Subnet Management page.
    
    
    
    2. The page lists the subnets in the current VPC, both those already added to the managed cluster and those not yet added. Click Enable Subnet to enable subnets that have enough remaining IPs for your plan.
    3. If no subnet is available, Add Subnet first and then enable it on this page.
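
    When planning which subnets to enable, it can help to estimate the remaining IPs ahead of time. The following is a minimal sketch, not the console's actual check: the function names, the input mapping of CIDR to used IP count, and the number of reserved addresses per subnet are all assumptions made for illustration. Only the threshold of 10 comes from the note above.

```python
import ipaddress
from typing import Mapping

# Threshold taken from the note above: fewer than 10 available IPs
# in the managed cluster triggers a risk warning.
MIN_AVAILABLE_IPS = 10

# Assumption: how many addresses the platform reserves in each subnet.
# Check your VPC documentation for the real value.
RESERVED_PER_SUBNET = 3


def remaining_ips(cidr: str, used: int, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Estimate the free IPs in one subnet (never negative)."""
    total = ipaddress.ip_network(cidr, strict=True).num_addresses
    return max(total - reserved - used, 0)


def has_enough_ips(subnets: Mapping[str, int]) -> bool:
    """subnets maps a CIDR string to its used IP count (hypothetical input)."""
    free = sum(remaining_ips(cidr, used) for cidr, used in subnets.items())
    return free >= MIN_AVAILABLE_IPS
```

    For example, a /28 subnet with 5 addresses in use leaves 16 - 3 - 5 = 8 free addresses under these assumptions, which is below the threshold, so another subnet would need to be enabled.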

    Instance Diagnosis

    After the architecture is upgraded, the instance diagnosis page will include content on both collection and storage, helping users understand the operating status of Prometheus collection and storage to locate issues more quickly.

    Collection Diagnosis

    As shown in the following figure, collection diagnosis covers the resource utilization of the collection, collection configuration, target allocation status, target status, agent status, component version, and the collection architecture diagram.
    
    
    

    Resource Utilization

    This section displays the resource usage of each collection shard, including CPU and memory limits and usage, and inbound and outbound traffic. Click View logs to see the Pod logs, which helps you review runtime details and troubleshoot exceptions.
    
    
    
    
    
    

    Collection Configuration

    It displays the detailed configuration of the current collection.
    
    
    

    Target Allocation Status

    It shows the URL of each collection target, the collection job the target belongs to, and the collection shard that is currently collecting it.
    
    
    

    Target Status

    You can filter active collection targets by collection job and view each target's status, Labels, and Discovered Labels. Under normal conditions the target status is "Healthy". If the status is "abnormal" and a last scrape time is already recorded, troubleshoot based on the error message on the far right. Common causes include an unhealthy target, incorrect permission configuration, and network errors.
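
    The same triage can be sketched in code. The example below assumes a payload shaped like the response of the open-source Prometheus HTTP API endpoint `/api/v1/targets` (fields `activeTargets`, `health`, `lastScrape`, `lastError`, `scrapeUrl`, `labels`). Whether and how TMP exposes this endpoint is not stated in this document, and the mapping of "Healthy" to `up` is an assumption. The function name is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# Open-source Prometheus reports this zero time for targets never scraped.
NEVER_SCRAPED = "0001-01-01T00:00:00Z"


def failing_targets(
    payload: Dict[str, Any], job: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (scrape URL, last error) for targets that are not healthy
    but have been scraped at least once, optionally filtered by job."""
    results = []
    for target in payload.get("data", {}).get("activeTargets", []):
        # Optional filter by collection job, as in the console.
        if job is not None and target.get("labels", {}).get("job") != job:
            continue
        # Assumption: "Healthy" in the console corresponds to health == "up".
        if target.get("health") == "up":
            continue
        # Skip targets without a recorded last scrape time.
        if target.get("lastScrape", NEVER_SCRAPED) == NEVER_SCRAPED:
            continue
        results.append((target.get("scrapeUrl", ""), target.get("lastError", "")))
    return results
```

    The returned error strings correspond to the error message column: messages such as connection failures point to network or security group issues, while authorization errors point to permission configuration.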
    
    
    

    Agent Status

    It shows the running status of the collection shard agent. Click View logs to see the logs of the Pod to understand running details and troubleshoot exceptions.
    
    
    

    Component Version

    It shows the component version information of the current collection. The Component version page displays the IP quantity check and, for each component, the current version, latest version, component description, and upgrade description.
    
    
    
    Note:
    Keep the collection components on the latest version where possible. To upgrade a component, click Upgrade for that component.
    Under normal circumstances, the tmp-operator, tmp-agent, and proxy-agent components can be upgraded without impact.
    Upgrading the proxy-server component interrupts collection. The interruption lasts for the time EKS takes to start the component Pod, and it affects collection for the entire Prometheus instance (including the integration center and container clusters). Proceed with caution.

    Data Collection Architecture Diagram

    The collection architecture diagram provides information about the current collection architecture.
    
    
    
    Collection Component Managed Cluster:
    Available IP count. When the number of IPs is insufficient, an Insufficient IP Count alert appears. Click it to enter the Managed Cluster Subnet Management page. For specific operations, see Adding Subnet.
    Security group of the managed cluster and the traffic it must allow during normal operation. Some network issues during collection may be caused by the security group not allowing the required traffic.
    The number of Pods and resource utilization related to the current collection
    Running status of the collection scheduling component tmp-operator
    Collection target allocation status
    Running status of the collection shard component tmp-agent
    Metric collection rate and inbound bandwidth
    Running status of the proxy component proxy-server
    Metric write rate and outbound bandwidth
    Write target status, including data overwrite of the current instance's corresponding storage and user configuration.
    User Cluster:
    Status of the public network proxy CLB (if enabled)
    Running status of the proxy component proxy-agent
    Status of the collection target within the cluster

    Storage Diagnosis

    As shown in the figure, the storage diagnosis includes storage-related status and limitations.
    
    
    

    Parameter Description

    | Parameter | Description |
    |---|---|
    | Total rate of metric reporting | Includes free metrics and paid metrics. |
    | Instance series storage limit | If the number of active series of the instance exceeds this value, the excess series are discarded for exceeding the limit. |
    | Storage limit for a single metric series | Different labels under the same metric name constitute different series. Exceeding this value results in a discard for exceeding the limit. |
    | Maximum length of a metric label name | Exceeding this value results in a discard due to invalid length. |
    | Maximum number of labels for metrics | Exceeding this value results in a discard due to invalid length. |
    | Oldest allowed range for metric timestamps | The oldest timestamp accepted in a single series (out-of-order samples are not allowed). |
    | Maximum allowed range for metric timestamps | The latest timestamp accepted in a single series (out-of-order samples are not allowed). |
    | Maximum number of series per single query | The maximum number of series a data query can involve. It is recommended to shorten the time range of range queries or to use instant queries where suitable. |
    | Maximum number of alarms per unit of time | The maximum number of alarms triggered within 5 minutes. |
    | Maximum byte size limit for alarms per unit of time | The total size limit of fields (such as labels and annotations) for alarms triggered within 5 minutes. |
    | Top 10 metrics by number of series | Series with the same metric name but different labels are counted as different series. Storage is subject to the upper limit for a single metric's series, and a large number of series can cause high-cardinality issues. |
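
    To make the series-counting rule concrete, here is a small sketch of how a "top N metrics by series count" ranking can be computed from label sets. It illustrates the rule that each distinct label combination under one metric name is its own series. It is not the console's implementation, and the function name and input shape (a list of label dictionaries with the Prometheus `__name__` label) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_metrics_by_series(
    samples: Iterable[Dict[str, str]], n: int = 10
) -> List[Tuple[str, int]]:
    """Rank metric names by the number of distinct series they own.

    Each sample is a label dictionary that includes "__name__".
    """
    series_per_metric = defaultdict(set)
    for labels in samples:
        name = labels.get("__name__", "")
        # A series is identified by its full label set, so identical
        # label sets collapse into one entry.
        series_per_metric[name].add(frozenset(labels.items()))
    counts = [(name, len(series)) for name, series in series_per_metric.items()]
    # Highest series count first; ties broken by metric name.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:n]
```

    A label whose values are unbounded (for example a user ID or request ID) multiplies the series count for its metric, which is the usual cause of the high-cardinality issue mentioned in the table.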