The EdgeOne data analysis module helps you understand traffic characteristics by analyzing the massive volume of log data that EdgeOne continuously records. To keep queries accurate and timely even when large amounts of data are processed, EdgeOne data analysis uses sampling statistics.
What Is Sampling Statistics?
In data analysis, sampling means selecting a representative subset of the full dataset and analyzing that subset to extract valuable information. For example, researchers running a social survey cannot question everyone, so they select a portion of the population as a representative sample and use its answers to reflect the tendencies of the entire population.
When Will EdgeOne Apply Sampling Statistics?
EdgeOne uses dynamic sampling to adapt to each user's log data volume and to keep data analysis both accurate and efficient. In the following query scenarios, the data displayed on the related EdgeOne pages may be sampled.
When you query L7 access-related metrics on the Metric Analysis page with filters such as status code, ISP, province, TLS version, URL path, Referer, resource type, device type, browser type, system type, IP version, and client IP address. For queries on overall traffic, EdgeOne serves results from pre-aggregated statistical tables, which return accurate results quickly. Drill-down analysis on specific dimensions, however, switches the query to a massive multidimensional statistical table. At this point, sampling is needed to reduce the volume of underlying data scanned and keep queries fast.
When you query L7 protection-related metrics on the Metric Analysis page, or run Statistical Analysis or view Sample Log on the Web Security Analysis page. If a large-scale CC attack occurs within the query time range, the data you see may also be sampled. In this case, the log for a specific request ID may not be retrievable.
Note:
EdgeOne will continuously optimize and adjust the sampling policy based on the scale of platform log data and users' actual needs. If you have any questions about the query results provided by EdgeOne data analysis, feel free to contact us at any time.
Does It Affect the Use of EdgeOne?
Sampling statistics applies only to the data analysis module and does not affect other services or configurations such as site acceleration, L4 proxy, or security protection. With sampling, EdgeOne can return statistical analysis results on the page more quickly, so query response speed and accuracy are maintained even with massive data volumes.
How Do I Query Full Data?
If your business requires in-depth analysis of full log data, we recommend EdgeOne's real-time log push feature. Real-time log push delivers detailed and complete log data to the log analysis system you designate (such as Tencent Cloud CLS, a third-party log solution, or a self-built ELK stack). With the full data, you can perform precise data processing and obtain more accurate analysis results in scenarios that require higher data precision, which in turn gives your business decisions more accurate data support.
Learning More
Working Principles of Sampling Statistics
Sampling Policy
EdgeOne adopts a dynamic grading policy. This policy periodically analyzes your domain name's request volume and the corresponding query performance to determine whether the domain name meets the sampling conditions. When it does, the sampling system selects a sampling grade from four sampling ratios (10%, 1%, 0.1%, and 0.01%) based on the request volume during the determination period. The trigger rules for each sampling ratio are as follows:
10%: The daily average request volume exceeds 10 million requests;
1%: The daily average request volume exceeds 100 million requests;
0.1%: The daily average request volume exceeds 1 billion requests;
0.01%: The daily average request volume exceeds 10 billion requests.
After sampling is triggered, your sampling grade is not fixed. If your domain name request volume continues to increase, EdgeOne will accordingly upgrade your sampling grade and use a lower sampling ratio; if your domain name request volume continues to decline, EdgeOne will accordingly downgrade your sampling grade, use a higher sampling ratio, or even cancel the sampling mechanism for you.
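The trigger rules above can be sketched as a simple threshold lookup. This is only an illustration, not EdgeOne's implementation: the function name `select_sampling_ratio` is hypothetical, the sketch assumes that the highest threshold exceeded determines the grade, and it ignores the query performance factor that the real policy also considers.

```python
# Thresholds from the trigger rules above, ordered from the strictest grade down:
# (daily average request volume that must be exceeded, sampling ratio).
SAMPLING_GRADES = [
    (10_000_000_000, 0.0001),  # more than 10 billion  -> 0.01%
    (1_000_000_000, 0.001),    # more than 1 billion   -> 0.1%
    (100_000_000, 0.01),       # more than 100 million -> 1%
    (10_000_000, 0.1),         # more than 10 million  -> 10%
]


def select_sampling_ratio(daily_avg_requests: int) -> float:
    """Return the sampling ratio for a daily average request volume.

    Hypothetical helper: assumes the highest threshold exceeded wins.
    A return value of 1.0 means no sampling (full data).
    """
    for threshold, ratio in SAMPLING_GRADES:
        if daily_avg_requests > threshold:
            return ratio
    return 1.0
```

Re-running such a lookup for each determination period would reproduce the behavior described above: the ratio drops as request volume grows and returns to full data once the volume falls to 10 million or below.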
Data Representativeness
EdgeOne assigns a unique identifier (Request ID) to each of your request logs. The sampling system samples your data based on this unique identifier to ensure the randomness of the sampling factor. Based on our tests, when the characteristic you analyze accounts for a high proportion of the overall data, sampling analysis provides fast and accurate results. However, when the characteristic accounts for a low proportion of the overall data, the sample is small, and the results of the sampling analysis may overestimate or underestimate the true value.
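One common way to sample on a unique identifier is to hash it and keep the records whose hash falls into a fixed share of buckets. The sketch below shows that general technique only. How EdgeOne actually derives the sampling decision from the Request ID is not documented here, so the hash function, the bucket count, and the name `is_sampled` are all assumptions.

```python
import hashlib


def is_sampled(request_id: str, ratio: float, buckets: int = 10_000) -> bool:
    """Decide deterministically whether a request log is kept in the sample.

    Hypothetical scheme: hash the request ID, map it to one of `buckets`
    buckets, and keep the log if its bucket is in the lowest `ratio` share.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return bucket < round(ratio * buckets)
```

Because the decision depends only on the identifier, the same request is always either kept or dropped, and under this scheme any log kept at a lower ratio is also kept at every higher ratio.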
For example, suppose you have a dataset of 10,000 records that contains three URL paths: A, B, and C, with 7,000 (70%), 2,900 (29%), and 100 (1%) records respectively. Ideally, after 10% sampling, the sample sizes for URL paths A, B, and C will be 700, 290, and 10 respectively. Because the sample for URL path C is so small, an estimate of its overall volume based on this sample is much less accurate. At this point, the results of your drill-down analysis on URL path C may not meet expectations.
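The loss of accuracy for URL path C can be quantified with a rough calculation. The sketch below assumes each record is kept independently with probability equal to the sampling ratio (a simplifying statistical model, not a description of EdgeOne's internals) and that the overall count is estimated as the sampled count divided by the ratio. The function name `relative_standard_error` is illustrative.

```python
import math


def relative_standard_error(true_count: int, ratio: float) -> float:
    """Relative standard error of the estimate `sampled_count / ratio`.

    Under independent sampling the sampled count is Binomial(true_count, ratio),
    so the estimate has standard deviation sqrt(true_count * (1 - ratio) / ratio);
    dividing by true_count gives sqrt((1 - ratio) / (true_count * ratio)).
    """
    return math.sqrt((1 - ratio) / (true_count * ratio))


# Counts from the example above, sampled at 10%.
for path, count in {"A": 7000, "B": 2900, "C": 100}.items():
    rse = relative_standard_error(count, 0.1)
    print(f"URL path {path}: expected sample {count * 0.1:.0f}, "
          f"relative standard error {rse:.1%}")
```

Under these assumptions the estimates for A and B carry a relative standard error of about 3.6% and 5.6%, while the estimate for C carries about 30%, which is why drill-down results on a rare characteristic can deviate noticeably from the true value.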