Field | Details |
Template Type | Two template types are currently supported: table-level and field-level. Supports filtering |
Template Name | Name of the template |
Template Description | Detailed description of the execution logic and formulas used by the template rule |
Dimension | Accuracy, Uniqueness, Integrity, Consistency, Timeliness, or Validity. Supports filtering |
Applicable Engine | Engine types the template applies to. Currently supported: Hive, Spark, DLC, TCHouse-D, and Doris. Supports filtering |
Reference Count | Number of rules currently associated with the template. Supports filtering |
Monitored Object | Rule Dimension | Compute Item | Calculation Sub-item | Description | Numeric Type: Fixed Value | Numeric Type: Value Range | Numeric - Volatility Type: Previous Cycle | Numeric - Volatility Type: 1 day ago | Numeric - Volatility Type: 7 days ago | Numeric - Volatility Type: 30 days ago | Numeric - Standard Score Type: 7 days | Numeric - Standard Score Type: 30 days | Other: Empty/Unique/Duplicate | Other: Format Matching | Other: Enumerated range | Other: Value size |
Table-level | Accuracy | Number of table rows | | Calculates the number of data rows | ✅ | - | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| | Table size (bytes) | | Calculates the size of the data table (supports only Hive tables) | ✅ | - | - | ✅ | ✅ | - | - | - | - | - | - | - |
| Timeliness | Timeliness of data output | | Calculates the number of data rows. If the number of rows is 0, it is considered that no data is produced | ✅ = 0 | - | - | - | - | - | - | - | - | - | - | - |
Field-level | Accuracy | Field value | Average value | Calculates the average value | ✅ | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| | | Total value | Calculates the total value of numerical data | ✅ | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| | | Median | Calculates the median of numerical data | ✅ | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| | | Minimum value | Calculates the minimum value of numerical data | ✅ | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| | | Maximum value | Calculates the maximum value of numerical data | ✅ | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| Uniqueness | Field unique values | Number of unique values | Verify unique values | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| | | Number of unique values/Total rows | | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| | Field duplicate values | Number of duplicate values | Verify duplicate values | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| | | Number of duplicate values/Total rows | | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| Integrity | Field null values | Number of null values | Verify null values | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| | | Number of null values/Total rows | | - | - | - | - | - | - | - | - | ✅ | - | - | - |
| Validity | Mobile number format | Number of invalid entries | Validates values with a regular expression for the Mainland China mobile phone number format | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| | | Number of invalid entries/Total rows | | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| | Email format | Number of invalid entries | Validates values with a regular expression for the email format | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| | | Number of invalid entries/Total rows | | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| | ID card format | Number of invalid entries | Validates values with a regular expression for the Chinese Mainland ID card format | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| | | Number of invalid entries/Total rows | | - | - | - | - | - | - | - | - | - | ✅ | - | - |
| Consistency | Field Data Range | Value Range | Checks whether the value is within the numeric range | - | ✅ | - | - | - | - | - | - | - | - | - | - |
| | | Enumerated range | Checks whether the character value is among the enumerated values | - | - | - | - | - | - | - | - | - | - | ✅ | - |
| | Field Data Correlation | | Compares the field against a field in another database table | - | - | - | - | - | - | - | - | - | - | - | ✅ |
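
To make the count and count/total-rows sub-items in the table above more concrete, the following minimal sketch computes null, unique, and duplicate metrics for a single column held in memory. The helper name `field_metrics` and the exact definitions are assumptions made for illustration only: unique values are taken to be distinct non-null values, and duplicate values are taken to be non-null rows beyond the first occurrence of each value. The built-in templates may define these differently.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def field_metrics(values: Sequence[Optional[str]]) -> Dict[str, float]:
    """Hypothetical helper: count and ratio metrics for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = Counter(non_null)

    null_count = total - len(non_null)
    # Assumption: "unique values" means distinct non-null values.
    unique_count = len(distinct)
    # Assumption: "duplicate values" means repeats beyond the first occurrence.
    duplicate_count = len(non_null) - unique_count

    def ratio(count: int) -> float:
        # Guard against an empty table so the ratio is always defined.
        return count / total if total else 0.0

    return {
        "null_count": null_count,
        "null_ratio": ratio(null_count),
        "unique_count": unique_count,
        "unique_ratio": ratio(unique_count),
        "duplicate_count": duplicate_count,
        "duplicate_ratio": ratio(duplicate_count),
    }


sample = ["a", "b", "a", None, "c", "a"]
metrics = field_metrics(sample)
print(metrics)
```

In a real rule, the resulting count or ratio would then be compared with the expected value configured for the rule.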
Terminology | Type | Explanation |
Monitored Object | Table-level | When the monitored object is table-level, you can monitor the number of table rows, table size, and timeliness of data output (equivalent to the number of table rows). |
| Field-level | When the monitored object is field-level, you can monitor the field's values (average, maximum, minimum, median, and total value), the field value format (mobile number, email, ID card number), and whether the field is empty. |
Rule Dimension | - | Rule dimensions are used to calculate the quality score and to show the share of quality contributed by each type of rule. The system has six built-in rule dimensions: Accuracy, Uniqueness, Integrity, Consistency, Timeliness, and Validity. |
Validation Method | Numeric Type | Covers comparison against a fixed value and comparison against a value range. |
| Volatility Type | Term Explanation: The volatility type reflects how much a value fluctuates, that is, how far it has risen or fallen relative to a baseline time point. Calculation Formula: Volatility = (Current scan result - Scan result at the baseline time point) / Scan result at the baseline time point * 100%. Note: The volatility result is a percentage. A partition must be specified when using a volatility template. Example 1, 7-day cyclical volatility: the partition is specified and the baseline is the data from 7 days ago. A result of 100% means that the current partition data has doubled compared to the partition data from 7 days ago. Example 2, previous-period volatility: the partition is specified, the baseline is the previous run, and the rule is associated with a production scheduling task (for example, an offline development task). A result of 100% means that the statistics after the current run of the task have doubled compared to the statistics after the previous run. Example 3, cyclical volatility with a default period: the rule uses the cyclical volatility template with a default period such as 7 days ago, and is not associated with a production scheduling task. A result of 100% means that the current partition data has doubled compared to the partition data from 7 days ago. |
| Standard Score Type (Variance Fluctuation) | Term Explanation: The standard score is a statistical measure of whether a value lies within a credible range. A result that is too large or too small is very likely an abnormal value. Calculation Formula: Standard score = (Current scan result - Mean of the historical scan results) / Standard deviation of the historical scan results. Note: The standard score is a unitless decimal that indicates whether a value is abnormal within the dataset. A value whose standard score has an absolute value greater than 3 is generally considered abnormal, because under a normal distribution this occurs with a probability of only 0.27%. Within [-1,1]: probability 68.27%. Within [-2,2]: probability 95.45%. Within [-3,3]: probability 99.73%. Outside [-3,3]: probability 0.27%. |
| Other | The type of the field being validated is not restricted. Null/Unique/Duplicate: count or proportion of null, unique, and duplicate values. Format Matching: count or proportion of values that do not match the format. Enumerated Range: count of values outside the enumerated range. Note: Fill in the expected values here; an alarm is triggered when a field value falls outside the range. Field Data Correlation: checks how the field compares with the value of a field in another database table. Comparison relationship: greater than, less than, or equal to. Target data: database table, field, and filter criteria. Association conditions: the fields that join the two tables. Note: The rows of the comparison table must correspond one-to-one with the rows of the table being checked. |
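
The volatility and standard score validation methods described in the table above can be sketched in a few lines. This is an illustration under stated assumptions, not the product's implementation: volatility is taken to be the relative change against the baseline (the reading under which a result of 100% means the value has doubled, as in the examples), the standard score is the usual (value - mean) / standard deviation using the population standard deviation of a history window, and the function names and sample numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def volatility_pct(current: float, baseline: float) -> float:
    """Relative change of current against baseline, in percent (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def standard_score(current: float, history: Sequence[float]) -> float:
    """(current - mean) / population standard deviation of the history window."""
    sigma = pstdev(history)
    if sigma == 0:
        raise ValueError("history has zero variance")
    return (current - mean(history)) / sigma


def is_abnormal(score: float, threshold: float = 3.0) -> bool:
    """Flag scores whose absolute value exceeds the threshold (3 in the text above)."""
    return abs(score) > threshold


# Hypothetical row counts: today's partition versus the partition from 7 days ago.
change = volatility_pct(current=200, baseline=100)  # 100.0, i.e. doubled

# Hypothetical history of scan results used as the comparison window.
history = [2, 4, 4, 4, 5, 5, 7, 9]
score = standard_score(11, history)
print(change, score, is_abnormal(score))
```

The zero checks are a design choice for the sketch: a zero baseline or a constant history makes the ratio undefined, so raising an error is safer than returning a misleading number.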