Reducing APM Usage Costs Through Sampling Policies

Overview
Application Performance Management (APM) provides distributed tracing. It automatically reconstructs the complete path of each request and supports one-stop, full-trace problem analysis, which makes problems faster to locate. After an application is connected to APM, it reports trace information to the APM system in the form of Traces and Spans as it handles user requests. As the business grows, the volume of trace information reported to APM grows, and so does the cost of using APM.
A Trace represents one complete request path. It describes the path that a user request takes across multiple distributed applications.
A Span represents one stage of a trace. It can be an RPC call, an HTTP call, a message send, or a local function call within an application.
In practice, not every Trace needs to be recorded by the APM system, because calls between distributed systems generate a large amount of repetitive trace information. When the system is running normally in particular, repetitive and redundant trace information does little to help users analyze performance problems. Sampling reduces this repetitive trace information, which helps users focus on more valuable data and lowers the cost of using APM.
Common Sampling Solutions
The essence of sampling is to select part of a large data set for observation and analysis in order to understand the overall behavior or characteristics of the system. In an APM system, sampling is applied to trace data. Unselected trace data is either not collected at all, or collected and then discarded. Which trace data is selected varies greatly between sampling solutions. Given the characteristics of an APM system, a sampling policy needs to take the following aspects into account:
Trace Integrity. A complete Trace contains multiple Spans. If sampling discards some of the Spans in a trace, the integrity of the trace is broken and the entire trace loses its value.
Correctness of Metrics. Statistical metrics, including throughput, response time, error rate, and application health status, must not be biased by the sampling policy.
Preserve High-Value Data. In real distributed systems, some trace data is of particular interest, and the sampling policy needs to be able to preserve it selectively. The most typical examples are error calls and slow calls. When a trace contains an error call or a slow call, the sampling policy must not discard it.
Against these requirements, two typical sampling solutions have emerged in the APM domain. They differ in which data they select and in when the sampling decision is made.
Head-Based Sampling
Basic Concepts
Head-based sampling means making the sampling decision at the point where a request enters the distributed system. As the request passes through different applications, the sampling decision is propagated with it, so every stage behaves consistently. Once the entry point decides to sample a request, the entire trace is retained.
How It Works
Head-based sampling is relatively simple to implement because it builds on common trace propagation protocols. For example, the W3C Trace Context standard, which OpenTelemetry uses by default, supports head-based sampling. Based on such a protocol, an APM system can implement head-based sampling with minimal processing in the agents or SDKs it provides for application access. However, head-based sampling may miss important events or exceptions, because the sampling decision is made at the start of a request and cannot be adjusted according to what actually happens while the request is processed.
For example, when a user request enters a distributed system, the preset sampling rules decide that this trace is not sampled. During subsequent processing, however, a slow database call occurs in the trace, with SQL execution taking more than 3 seconds. This is an important event that needs to be analyzed, but because of the sampling decision, the entire trace is discarded.
In addition, trace data and statistical metrics are correlated in an APM system. For example, the 99th percentile response time of an API is often computed by the APM server from the traces it receives. Head-based sampling can therefore cause metric data to deviate significantly from the actual situation.
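The head-based flow described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular agent or SDK: the function names, the 10% ratio, and the choice to derive the decision from the low bits of the trace ID are all assumptions made for the example. Only the traceparent layout (version, trace ID, parent span ID, flags, with the lowest flag bit meaning "sampled") comes from the W3C Trace Context standard.

```python
import secrets
from typing import Tuple

SAMPLE_RATIO = 0.10  # hypothetical setting: keep roughly 10% of traces


def should_sample(trace_id: str, ratio: float = SAMPLE_RATIO) -> bool:
    """Make the decision once, at the entry, from the trace ID itself."""
    # Treat the low 56 bits of the 128-bit trace ID as a uniform random value.
    value = int(trace_id[-14:], 16)
    return value < int(ratio * (1 << 56))


def start_trace(ratio: float = SAMPLE_RATIO) -> str:
    """Entry service: create a trace and record the decision in traceparent."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def continue_trace(traceparent: str) -> Tuple[str, bool]:
    """Downstream service: reuse the upstream decision, never decide again."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    sampled = bool(int(flags, 16) & 0x01)
    child_span_id = secrets.token_hex(8)
    # The flags travel unchanged, so every hop keeps or drops the same trace.
    return f"{version}-{trace_id}-{child_span_id}-{flags}", sampled
```

Because the flags are fixed before any work is done, a request marked "00" at the entry stays unsampled even if a later span turns out to be slow or fails, which is the limitation described above.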
Tail-Based Sampling
Tail-based sampling means making the sampling decision after a request is completed. The APM system first temporarily collects all trace data and then, once the request is completed, decides which traces to retain based on actual conditions (such as the response time and error status of the request).
The advantage of tail-based sampling is that it captures important events and exceptions more reliably and keeps statistical metrics accurate. However, tail-based sampling requires the APM system to provide additional storage and compute resources on the server side to temporarily hold the trace data of all requests, which makes it more complicated to implement. Common open-source APM projects do not provide a complete implementation of tail-based sampling.
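The tail-based decision described above can be sketched as follows. This is a simplified, in-memory illustration and not the Tencent Cloud APM implementation: the class and field names, the 1000 ms threshold, and the single request counter standing in for "metrics" are all assumptions chosen for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class TailSampler:
    """Buffers every span, then decides per trace once the request is done."""

    def __init__(self, sample_ratio: float = 0.10,
                 slow_threshold_ms: float = 1000.0,
                 keep_errors: bool = True,
                 rng: Callable[[], float] = random.random) -> None:
        self.sample_ratio = sample_ratio
        self.slow_threshold_ms = slow_threshold_ms
        self.keep_errors = keep_errors
        self.rng = rng
        self.buffer: Dict[str, List[Span]] = defaultdict(list)
        self.stored: Dict[str, List[Span]] = {}
        self.request_count = 0  # metric computed from ALL reported traces

    def report(self, span: Span) -> None:
        # All spans are reported and held temporarily, sampled or not.
        self.buffer[span.trace_id].append(span)

    def finish(self, trace_id: str) -> bool:
        """Call when the request has completed; returns True if stored."""
        spans = self.buffer.pop(trace_id, [])
        self.request_count += 1  # metrics are updated before any discard
        has_error = self.keep_errors and any(s.error for s in spans)
        is_slow = any(s.duration_ms >= self.slow_threshold_ms for s in spans)
        if has_error or is_slow or self.rng() < self.sample_ratio:
            self.stored[trace_id] = spans  # keep the whole trace, never part
            return True
        return False
```

The sketch also shows where the extra cost comes from: the buffer has to hold every span of every in-flight request until the decision can be made.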
Summary of Sampling Solutions
Category	Head-based Sampling	Tail-based Sampling
Decision-Making Time Point	The decision is made at the request entry.	The decision is made after the request is completed.
Decision-Maker	The APM client, that is, the application side.	The APM server side.
Implementation Method	Unsampled data is not reported to the APM server.	All data is reported, and the APM server decides whether to save the data based on the characteristics of the trace.
Trace Integrity	Yes	Yes
Preserve Error and Slow Requests	No	Yes
Metric Correctness	In most cases, it is not satisfied, especially in the scenarios where the APM server calculates metric statistics through trace data.	Yes
Complexity of Implementation	Low	High
In summary, head-based sampling and tail-based sampling each have advantages and disadvantages, and you need to select a sampling scheme according to your business requirements and system conditions. From the perspective of APM users, however, tail-based sampling is recommended because it is more helpful for analyzing application performance and troubleshooting problems.
Enabling the Sampling Policy in Tencent Cloud APM
Because tail-based sampling is better suited to analyzing application performance and troubleshooting problems, Tencent Cloud Application Performance Management (APM) has implemented a complete tail-based sampling solution. This solution reduces data storage while ensuring that error and slow traces are saved in full and that all metric data remains accurate. Used reasonably, sampling policies reduce the cost of using APM without noticeably affecting the APM user experience. In addition, the sampling policies provided by APM apply to applications written in any language and to all access solutions.
Note:
In pay-as-you-go mode, the billing items of APM include the reporting fee and trace storage fee. For details, please refer to Introduction to Pay-As-You-Go. After the sampling policy is enabled in the APM console, the trace storage fee can be reduced by up to 90%. However, since the tail-based sampling solution requires full data reporting to ensure the correctness of metrics and the complete preservation of error and slow traces, enabling the sampling policy cannot reduce the reporting fee.
In package (prepaid) mode, the sampling rate will be fixed at 10%. For details, please refer to Introduction to Packages.
Currently, sampling policies are made available through an allowlist. Please submit a ticket to apply. After the application is completed, Sampling Configuration will be displayed in Application Performance Monitoring > System Configuration.
Configuring the Global Sampling Ratio
1. Log in to the TCOP console.
2. In the left sidebar, select APM > System Configuration to enter the Sampling configuration page.
3. In the Global sampling configuration module, click Edit.
4. Fill in the sampling percentage. The valid sampling percentage is between 10% and 100%.
Note:
Configure whether to save error traces and the threshold for saving slow calls according to actual needs. If there are error calls or slow calls in a trace, the entire trace will be completely saved. It is recommended to enable error trace saving and set the threshold for saving slow calls between 500ms and 2000ms.
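To get a rough feel for how the global ratio and the must-keep rules interact, the share of traces that ends up stored can be estimated with a short calculation. This is only a back-of-the-envelope sketch under stated assumptions: the function name is hypothetical, it counts traces rather than bytes, and it assumes that error traces, slow traces, and full-sampling traces are always stored while the remaining traces are kept at the global ratio. Actual fees follow the billing documentation, not this estimate.

```python
def retained_fraction(sample_ratio: float, must_keep_fraction: float) -> float:
    """Estimate the share of traces stored under tail-based sampling.

    sample_ratio:       global sampling ratio, 0.10 to 1.0 (10% to 100%)
    must_keep_fraction: assumed share of traces that contain an error call,
                        a slow call, or a full-sampling API (always stored)
    """
    if not 0.10 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0.10 and 1.0")
    if not 0.0 <= must_keep_fraction <= 1.0:
        raise ValueError("must_keep_fraction must be between 0.0 and 1.0")
    # Must-keep traces are stored in full; the rest is sampled at the ratio.
    return must_keep_fraction + (1.0 - must_keep_fraction) * sample_ratio
```

For example, with a 10% global ratio and an assumed 2% of traces being errors or slow calls, about 11.8% of traces would be stored. A lower slow-call threshold raises the must-keep share and therefore the stored share, which is why the storage reduction is stated as "up to" 90%.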
Configuring Full Sampling APIs
When the global sampling ratio is less than 100%, you can specify APIs that require full sampling. If a trace passes through any of these APIs, the entire trace is saved in full.
1. Click Add interface for full sampling.
2. In the dialog box, enter the policy name and specify the APIs that need full sampling.
If a specific application is specified in the policy, you can leave the API matching rule empty. This means that all traces passing through this application will be saved in full. You can specify APIs using exact matching, prefix matching, or suffix matching. In practice, you can define important, closely watched APIs as full sampling APIs to ensure that no calls to them are missed.
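The matching behavior described above can be illustrated with a small sketch. The class, field names, and example values are hypothetical and do not reflect the console's actual data model. The sketch only mirrors the rules stated in the text: exact, prefix, or suffix matching on the API, and an empty API rule meaning "every trace through the specified application".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullSamplingPolicy:
    name: str
    application: Optional[str] = None  # None: the policy is not tied to an app
    match_type: Optional[str] = None   # "exact", "prefix", or "suffix"
    pattern: Optional[str] = None      # empty: match every API of the app

    def __post_init__(self) -> None:
        # An empty API rule only makes sense when an application is given.
        if not self.pattern and self.application is None:
            raise ValueError("specify an application, an API rule, or both")
        if self.pattern and self.match_type not in ("exact", "prefix", "suffix"):
            raise ValueError("match_type must be exact, prefix, or suffix")

    def matches(self, application: str, api: str) -> bool:
        """Return True if a span on this application and API forces a full save."""
        if self.application is not None and application != self.application:
            return False
        if not self.pattern:
            return True
        if self.match_type == "exact":
            return api == self.pattern
        if self.match_type == "prefix":
            return api.startswith(self.pattern)
        return api.endswith(self.pattern)
```

A trace would then be saved in full as soon as any one of its spans matches any configured policy, regardless of the global sampling ratio.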
Note:
Updates to the global sampling ratio and full sampling APIs can take effect immediately.


