Data Quality Fundamentals

Last updated: 2024-11-01 15:58:21
    Note:
    This article walks you through a complete configuration process to quickly introduce the core capabilities of the Data Quality module.

    Data Table Preparation

    1. Create a database named emall in the emr-hive data source, and then create a data table named dq_test in it.
    2. Add the following columns to the table (a minimal DDL sketch follows the column list):
    Field name | Column Type | Partitioning
    pt_date    | string      | Yes
    id         | int         | No
    name       | string      | No
    gender     | int         | No
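    For reference only, a minimal Hive DDL sketch that matches the columns above might look like the following; the storage format and table properties are left out and depend on your environment, not on this guide:
    -- Minimal sketch, assuming a plain Hive managed table; adjust format and properties as needed.
    CREATE DATABASE IF NOT EXISTS emall;
    CREATE TABLE IF NOT EXISTS emall.dq_test (
        id     INT,
        name   STRING,
        gender INT
    )
    PARTITIONED BY (pt_date STRING);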

    Data Development Task Preparation

    1. In the Data Development > Orchestration Space interface, click Create Workflow.
    2. Create a task node of type Hive SQL in the workflow, with the following SQL:
    INSERT INTO TABLE emall.dq_test PARTITION (pt_date = '2024-05-01')
    VALUES (1, 'Zhang San', 1);
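    Optionally, to confirm that the partition contains the inserted row before configuring quality rules, you could run a quick check such as the following (an extra step, not part of the original walkthrough):
    -- Optional verification; should return the single row written to partition 2024-05-01.
    SELECT * FROM emall.dq_test WHERE pt_date = '2024-05-01';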

    Configure Data Quality Task

    Create Quality Task

    Step 1: Enter Data Monitoring Page

    Enter the Data Quality > Data Monitoring > EMR interface and click Add Rules for a Single Table to start adding a new quality rule.
    
    
    

    Step 2: Select Monitoring Object

    In the Add Rules for a Single Table interface, select the data source, database, and monitoring table in sequence, and then click Add Monitoring Rule.
    
    
    

    Step 3: Add Monitoring Rule

    In the pop-up for adding a new monitoring rule, fill in the following information and click Save.
    
    
    
    Explanation:
    Rule Type
    Select System Template.
    You can choose a System Template, a Custom Template, or Custom SQL here:
    System Template: WeData has 56 built-in rule templates that you can try for free. Detailed descriptions of each template can be found in the System Template Description.
    Custom Template: You can add rules suited to your own business in the rule template menu for easy reuse. Detailed operation instructions can be found in the Custom Template Description.
    Custom SQL: You can write SQL statements directly as detection rules. Detailed operation instructions can be found in Adding Quality Rules.
    Monitoring Object
    Select Table (table).
    Monitored objects are divided into table-level and field-level:
    Table-level: monitors the number of table rows and the table size (table size is supported only for Hive tables).
    Field-level: monitors whether a field is empty or duplicated, as well as its average, maximum, minimum, and other values.
    Select Template
    Select Number of table rows.
    WeData has 56 built-in rule templates that you can try for free. Detailed descriptions of each template can be found in the System Template Description.
    Detection Range
    Select Conditional Scan and enter the following WHERE condition:
    pt_date='${yyyy-MM-dd-1d}'
    Note:
    Partition fields are typically specified here so that each quality task does not scan the full table and waste computing resources.
    In the SQL, ${yyyy-MM-dd-1d} is a date variable representing the day before the execution date. It is replaced with the concrete date when the quality task runs.
    For example, when the quality task executes at 2024-05-02 00:00:00, ${yyyy-MM-dd-1d} is replaced with 2024-05-01.
    The detailed replacement logic of time variables can be found in the Time Parameter Description.
    Trigger Condition
    Comparator: select Less than.
    Comparison Value: enter 1.
    "Number of table rows less than 1", combined with the time variable entered for the detection range, means: trigger an alert when no new data was added yesterday (see the conceptual SQL sketch after this explanation).
    Note:
    The trigger condition entered here is the abnormal value, that is, the condition that triggers the alert.
    Trigger Level
    Select Medium.
    Trigger levels are categorized as High, Medium, and Low:
    High: When an alert is triggered, downstream task execution is blocked immediately (effective only when linked with a production task).
    Medium: Only triggers an alert.
    Low: Does not trigger an alert; only displays the result as abnormal.
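    For illustration only, the Number of table rows rule with the conditional scan above behaves like the count query sketched below; the actual SQL generated by WeData may differ, and the date shown assumes an execution time of 2024-05-02 00:00:00:
    -- Conceptual sketch only; WeData generates the real detection SQL internally.
    -- ${yyyy-MM-dd-1d} is assumed to have resolved to 2024-05-01 here.
    SELECT COUNT(*) AS table_row_count
    FROM emall.dq_test
    WHERE pt_date = '2024-05-01';
    -- The rule raises a Medium-level alert when table_row_count is less than 1.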

    Step 4: Set Execution Strategy

    Click Batch Set Execution Policies to enter the Execution Policies setting page. After setting up the batch execution policies, click Save.
    
    
    
    Explanation:
    Execution Method
    Select Associate Production Scheduling.
    You can choose between Associate Production Scheduling and Offline Periodic Detection here:
    Associate Production Scheduling: That is, associated scheduling. Quality tasks are linked with production tasks (data synchronization tasks or data development tasks), and the quality rule task runs after the production task completes. If an anomaly is found, the handler is notified immediately to address the issue, and, depending on the trigger level, downstream tasks may be blocked to stop the problem data from spreading.
    Note:
    The same quality inspection task can be associated with multiple production tasks; likewise, the same production task can be associated with multiple quality inspection tasks.
    Offline Periodic Detection: That is, independent scheduling. For the selected database tables, core business fields are checked periodically at a self-defined frequency such as daily, hourly, or by minute. Quality tasks are executed at the set period, and if anomalies are detected, subscribers are notified immediately.
    Execution Engine
    Select Hive.
    You can choose between Hive and Spark here, depending on the EMR resources purchased. Generally, Hive tables can directly use the Hive engine.
    Computing Resources
    Select default.
    You can select a resource group within the EMR cluster here. Generally, you can directly select default.
    Execute Resources
    The execution resources here refer to the scheduling resource group that the project has already bound.
    Select Task
    Select the Hive SQL task created in the preparation work.

    Step 5: Set up subscription

    The subscription configuration determines how and to whom a notification is sent when the running result of the quality inspection task meets the trigger condition (that is, the result is unexpected and the detection fails).
    Enter the Subscription Configuration interface, check the Subscription Method, set up the Recipient, and click Save.
    
    
    

    Step 6: Generate rules and view details

    1. In the Add Rules for a Single Table interface, click Generate Rules. The quality inspection rules are produced only after you click Generate Rules.
    2. You can click Details in the upper right corner of the interface to enter the Rule List.
    
    
    

    Step 7: Trial run

    1. Go to the rule list at the table dimension and click Trial run.
    2. Modify the scheduling time and click Start trial run. After the trial run ends, you can click Click to view running results.
    Note:
    Since the partition written during data insertion is 2024-05-01 and ${yyyy-MM-dd-1d} resolves to the day before the scheduling time, you can enter 2024-05-02 here.

    Step 8: View trial run results

    1. After entering the Execute Instances and Results page, you can click the dropdown triangle on the left to expand the rule details.
    2. Click the Execution history of a rule to view its historical running results.

    Step 9: Enable Monitoring

    If the trial run results are satisfactory, you can return to the quality rule list at the table dimension and enable monitoring in the Monitoring Status column.
    Note:
    Only monitoring rules in the Enabled state will run automatically.
    
    
    

    View Job Execution Results

    When a quality task completes, if the detection result is abnormal, the system automatically sends an alert and, for rules with a High trigger level, blocks downstream tasks.
    The database table owners and data development engineers need to troubleshoot and resolve the anomalies promptly.
    1. On the Data Quality > Operations Management > Execute Instances and Results page, select the data source and database in turn to view the task's execution results.
    2. On this page, you can view the task's detection status and expand each rule's results and logs using the dropdown triangle on the left.
    Note:
    Detection status is categorized into two main types:
    Task Process Status:
    Issuing: The detection task has been generated but is queued for deployment; deployment might be delayed when computing resources are constrained.
    Link Failure: The task could not be deployed to the computing resource (e.g., EMR), possibly because of a network disconnection between the scheduling resource group and the computing resource.
    SQL Detection Status:
    Under detection: The SQL is running on the computing resource (e.g., EMR); detection may take longer when computing resources are constrained.
    Normal: The SQL has been executed and the results meet expectations (the trigger conditions of the detection rules are not reached).
    Abnormal: The SQL has been executed and the results do not meet expectations (the trigger conditions of the detection rules are reached, at which point the task alert is sent).
    
    
    