Data Deletion

Last updated: 2024-06-27 11:07:47

Batch Deletion
Doris supports several import methods, such as Broker Load (HDFS data), Routine Load (Kafka data), and Stream Load (local files). Data, however, can only be deleted with the delete statement. Each execution of delete generates a new data version, so frequent deletions can severely degrade query performance. A deletion is implemented by generating an empty rowset that records the deletion conditions. Every read must then filter through these conditions, which also hurts performance when there are many of them.
Among other systems, Greenplum's implementation is closer to that of traditional database products, while Snowflake implements deletion through merge syntax.
In scenarios such as CDC data import, insert and delete operations are usually interleaved in the data. The existing import methods cannot handle these scenarios: even if insert and delete operations are separated to solve the import problem, the deletion problem remains. The batch delete feature addresses these scenarios. Data import supports three merge types:
1. APPEND: Append all data to the existing data.
2. DELETE: Delete all rows where the key column values match the imported data.
3. MERGE: Decide whether to APPEND or DELETE based on DELETE ON.
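The three merge types can be illustrated with a small in-memory model. This is only a sketch of the semantics described above, not how Doris implements them: the function apply_load, the dict-based table, and the delete_on callback are all hypothetical names introduced for illustration.

```python
def apply_load(table, rows, key, merge_type="APPEND", delete_on=None):
    """Apply imported rows to a Unique-model table modeled as {key value: row}.

    merge_type follows the three types in the list above; delete_on is a
    hypothetical stand-in for the DELETE ON condition (a function of one row).
    """
    if merge_type not in ("APPEND", "DELETE", "MERGE"):
        raise ValueError(f"unknown merge type: {merge_type}")
    if merge_type == "MERGE" and delete_on is None:
        raise ValueError("this sketch needs a delete_on condition for MERGE")
    result = dict(table)  # leave the caller's table untouched
    for row in rows:
        if merge_type == "DELETE":
            is_delete = True            # every imported key is deleted
        elif merge_type == "MERGE":
            is_delete = delete_on(row)  # decided row by row
        else:
            is_delete = False           # APPEND: every row is written
        if is_delete:
            result.pop(row[key], None)
        else:
            result[row[key]] = row      # same key: the newer row replaces the older one
    return result


table = {
    4: {"siteid": 4, "username": "bush"},
    5: {"siteid": 5, "username": "helen"},
    1: {"siteid": 1, "username": "jim"},
}
rows = [
    {"siteid": 2, "username": "grace"},
    {"siteid": 3, "username": "tom"},
    {"siteid": 1, "username": "jim"},
]
after = apply_load(table, rows, "siteid", "MERGE", lambda r: r["siteid"] == 1)
print(sorted(after))  # [2, 3, 4, 5]
```

With MERGE, the row whose siteid is 1 removes the existing key 1, and the other two rows are appended.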
Basic Principles
Batch deletion is implemented by adding a hidden column named __DORIS_DELETE_SIGN__. Because batch deletion is supported only on the unique model, it is sufficient to add a hidden Boolean column whose aggregate function is replace. In BE, the aggregate write process is the same as for a normal column. There are two read schemes:
1. When FE expands select * and similar expressions, it excludes the __DORIS_DELETE_SIGN__ column and adds the condition __DORIS_DELETE_SIGN__ != true by default.
2. BE always adds the column during reading and uses the condition to determine whether a row is deleted.
Import
During import, FE sets the value of the hidden column to the value of the DELETE ON expression at parse time. Other aggregation behavior is the same as for any column with the aggregate function replace.
Read
During reading, FE adds the condition __DORIS_DELETE_SIGN__ != true to all olapScanNodes that have the hidden column. BE is unaware of this process and operates normally.
Cumulative Compaction
Treat the hidden column as a normal column during Cumulative Compaction. The Compaction logic remains unchanged.
Base Compaction
During Base Compaction, rows marked for deletion are removed to reduce the storage space used by the data.
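The import, read, and Base Compaction behavior of the hidden column can be sketched with a toy model. This is an illustration under simplifying assumptions, not Doris code: storage is a plain dict keyed by a single hypothetical key column k, and the functions write, read, and base_compaction are names invented for this sketch.

```python
DELETE_SIGN = "__DORIS_DELETE_SIGN__"


def write(storage, row, deleted=False):
    """Import: the hidden column is written like any REPLACE column,
    so the newest version of a key wins, delete sign included."""
    storage[row["k"]] = {**row, DELETE_SIGN: deleted}


def read(storage):
    """Read: apply __DORIS_DELETE_SIGN__ != true and hide the column."""
    return [
        {col: val for col, val in row.items() if col != DELETE_SIGN}
        for row in storage.values()
        if not row[DELETE_SIGN]
    ]


def base_compaction(storage):
    """Base Compaction: physically drop rows marked for deletion."""
    return {k: row for k, row in storage.items() if not row[DELETE_SIGN]}


storage = {}
write(storage, {"k": 1, "v": "a"})
write(storage, {"k": 2, "v": "b"})
write(storage, {"k": 1, "v": "a"}, deleted=True)  # batch delete of key 1

print(read(storage))                  # [{'k': 2, 'v': 'b'}]
print(len(storage))                   # 2: the deleted row still occupies space
print(len(base_compaction(storage)))  # 1: space is reclaimed
```

The deleted row disappears from query results immediately, but its storage is released only after Base Compaction.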
Enabling Batch Deletion
There are two ways to enable batch deletion:
1. Add enable_batch_delete_by_default=true to the FE configuration file. After FE is restarted, all newly created tables support batch deletion. This option is set to false by default.
2. For tables that do not yet support batch deletion, or when the FE configuration has not been modified, use the following statement: ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE". This is essentially a schema change operation. The statement returns immediately, and you can confirm that the operation has completed with the show alter table column command.
To determine whether a table supports batch deletion, enable the display of hidden columns with the session variable SET show_hidden_columns=true, and then run the desc tablename command. If the output includes the column __DORIS_DELETE_SIGN__, batch deletion is supported; otherwise, it is not.
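If this check is automated from a client program, it reduces to looking for the hidden column in the desc output. The helper below is a minimal sketch that assumes the rows have already been fetched with any MySQL-compatible client after running SET show_hidden_columns=true, and that the first element of each row is the Field value. The function name supports_batch_delete is hypothetical.

```python
def supports_batch_delete(desc_rows):
    """Return True if the desc output lists the hidden delete-sign column.

    desc_rows: rows of `desc tablename`, fetched in a session where
    show_hidden_columns=true; row[0] is assumed to be the Field name.
    """
    return any(row[0] == "__DORIS_DELETE_SIGN__" for row in desc_rows)


rows = [
    ("name", "VARCHAR(100)", "No", "true", None, ""),
    ("gender", "VARCHAR(10)", "Yes", "false", None, "REPLACE"),
    ("__DORIS_DELETE_SIGN__", "TINYINT", "No", "false", "0", "REPLACE"),
]
print(supports_batch_delete(rows))      # True
print(supports_batch_delete(rows[:2]))  # False
```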
Syntax Description
The import syntax mainly adds a column mapping that specifies the delete mark column, and the imported data must contain that column. The syntax for each import method is as follows.
Stream Load
Stream Load sets the delete mark column by adding a field to the columns field in the header, for example: -H "columns: k1, k2, label_c3" -H "merge_type: [MERGE|APPEND|DELETE]" -H "delete: label_c3=1".
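When the headers are generated by a program, it can help to validate them before sending the request. The sketch below builds only the three headers shown above (columns, merge_type, delete) and rejects a delete condition without MERGE, following the rule in this document that DELETE ON can only be used together with MERGE. The function stream_load_headers is a hypothetical helper, and sending the request is left to whichever HTTP client you use.

```python
def stream_load_headers(columns, merge_type="APPEND", delete_on=None):
    """Build the Stream Load headers related to batch deletion.

    columns:    column names in file order, including the delete mark column
    merge_type: APPEND, DELETE, or MERGE
    delete_on:  condition on the delete mark column, only valid with MERGE
    """
    if merge_type not in ("APPEND", "DELETE", "MERGE"):
        raise ValueError(f"unknown merge_type: {merge_type}")
    if delete_on is not None and merge_type != "MERGE":
        raise ValueError("a delete condition can only be used with MERGE")
    headers = {"columns": ", ".join(columns), "merge_type": merge_type}
    if delete_on is not None:
        headers["delete"] = delete_on
    return headers


print(stream_load_headers(["k1", "k2", "label_c3"], "MERGE", "label_c3=1"))
# {'columns': 'k1, k2, label_c3', 'merge_type': 'MERGE', 'delete': 'label_c3=1'}
```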
Broker Load
Broker Load sets the delete mark column with a DELETE ON clause in the data description. The syntax is as follows:
LOAD LABEL db1.label1
(
    [MERGE|APPEND|DELETE] DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
    INTO TABLE tbl1
    COLUMNS TERMINATED BY ","
    (tmp_c1,tmp_c2, label_c3)
    SET
    (
        id=tmp_c2,
        name=tmp_c1
    )
    [DELETE ON label_c3=true]
)
WITH BROKER 'broker'
(
    "username"="user",
    "password"="pass"
)
PROPERTIES
(
    "timeout" = "3600"
);
Routine Load
Routine Load adds a mapping to the columns field. The mapping follows the same syntax as above:
CREATE ROUTINE LOAD example_db.test1 ON example_tbl 
 [WITH MERGE|APPEND|DELETE]
 COLUMNS(k1, k2, k3, v1, v2, label),
 WHERE k1 > 100 and k2 like "%doris%"
 [DELETE ON label=true]
 PROPERTIES
 (
     "desired_concurrent_number"="3",
     "max_batch_interval" = "20",
     "max_batch_rows" = "300000",
     "max_batch_size" = "209715200",
     "strict_mode" = "false"
 )
 FROM KAFKA
 (
     "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
     "kafka_topic" = "my_topic",
     "kafka_partitions" = "0,1,2,3",
     "kafka_offsets" = "101,0,0,200"
 );
Notes
1. Import operations other than Stream Load may be executed out of order inside Doris. When you use MERGE with any import method other than Stream Load, you must therefore use it together with a load sequence. For the specific syntax, see Sequence Column in the documentation.
2. DELETE ON can only be used together with MERGE.
Usage Examples
Check whether batch deletion is enabled
mysql> SET show_hidden_columns=true;
Query OK, 0 rows affected (0.00 sec)
mysql> DESC test;
+-----------------------+--------------+------+-------+---------+---------+
| Field                 | Type         | Null | Key   | Default | Extra   |
+-----------------------+--------------+------+-------+---------+---------+
| name                  | VARCHAR(100) | No   | true  | NULL    |         |
| gender                | VARCHAR(10)  | Yes  | false | NULL    | REPLACE |
| age                   | INT          | Yes  | false | NULL    | REPLACE |
| __DORIS_DELETE_SIGN__ | TINYINT      | No   | false | 0       | REPLACE |
+-----------------------+--------------+------+-------+---------+---------+
4 rows in set (0.00 sec)
Example of Using Stream Load
1. Normal data import:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: APPEND"  -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
The merge_type: APPEND header can be omitted. The following statement has the same effect:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
2. Delete all data with the same key as the imported data:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: DELETE"  -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
Assume that the original data in the imported table is:
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      3 |        2 | tom      |    2 |
|      4 |        3 | bush     |    3 |
|      5 |        3 | helen    |    3 |
+--------+----------+----------+------+
The data imported is:
3,2,tom,0
After import, the data becomes:
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      4 |        3 | bush     |    3 |
|      5 |        3 | helen    |    3 |
+--------+----------+----------+------+
3. Import the data with MERGE: rows in the imported data with siteid=1 delete the existing rows that have the same key, and all other rows are appended.
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: MERGE" -H "delete: siteid=1"  -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
Assume that the data before being imported is:
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      4 |        3 | bush     |    3 |
|      5 |        3 | helen    |    3 |
|      1 |        1 | jim      |    2 |
+--------+----------+----------+------+
The data imported is:
2,1,grace,2
3,2,tom,2
1,1,jim,2
After import, it becomes:
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      4 |        3 | bush     |    3 |
|      2 |        1 | grace    |    2 |
|      3 |        2 | tom      |    2 |
|      5 |        3 | helen    |    3 |
+--------+----------+----------+------+
SQL Delete Operation
Delete differs from other import methods. It is a synchronous process, similar to Insert into. Each Delete operation in Doris is an independent import job. Generally, a Delete statement specifies the table and partition, together with the conditions that filter the data to be deleted. It deletes data from the base table and the rollup tables at the same time. For the syntax of the Delete operation, see DELETE syntax.
Returned Result
The Delete command is a SQL command. The result is returned synchronously and falls into one of the following types:
1. Execution is successful
If Delete is executed successfully and the result is visible, the following result is returned. Query OK indicates success.
mysql> delete from test_tbl PARTITION p1 where k1 = 1;
Query OK, 0 rows affected (0.04 sec)
{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'VISIBLE', 'txnId':'4005'}
2. Submission is successful, but it is not yet visible.
Transaction submission in Doris is divided into two steps: submission and version release. The result becomes visible to users only after the version release step is completed. Once a transaction has been submitted successfully, it can be assumed that it will eventually be released. After submission, Doris waits for a while for the release to complete. If the wait times out, Doris returns to the user first, even though the version release is not yet complete, to indicate that the submission has succeeded. If Delete has been committed but the version has not yet been released and the result is not yet visible, the following result is returned:
mysql> delete from test_tbl PARTITION p1 where k1 = 1;
Query OK, 0 rows affected (0.04 sec)
{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'COMMITTED', 'txnId':'4005', 'err':'delete job is committed but may be taking effect later' }
The result also contains a JSON string. The returned fields are as follows:
affected rows: Indicates the number of rows affected by this deletion. Since Doris's deletion is currently logical, this value is always 0.
label: An automatically generated label, which is the identifier of the import job. Each import job has a label that is unique within a single database.
status: Indicates whether the data deletion is visible. If visible, it displays VISIBLE; if not, it shows COMMITTED.
txnId: Transaction ID corresponding to this Delete job.
err: This field displays detailed information about this deletion.
3. Submission fails and the transaction is canceled.
If the Delete statement has not been successfully submitted, Doris will automatically abort it and return the following result:
mysql> delete from test_tbl partition p1 where k1 > 80;
ERROR 1064 (HY000): errCode = 2, detailMessage = {Error reason}
Example:
For instance, a deletion that times out returns the timeout and the unfinished replicas in the form (tablet=replica).
mysql> delete from test_tbl partition p1 where k1 > 80;
ERROR 1064 (HY000): errCode = 2, detailMessage = failed to delete replicas from job: 4005, Unfinished replicas:10000=60000, 10001=60000, 10002=60000
In summary, the correct processing logic for the returned result of the Delete operation is as follows:
1. If the returned result is ERROR 1064 (HY000), it means the deletion fails.
2. If the returned result is Query OK, it means the deletion is successful.
If the status is COMMITTED, the data is not yet visible. Wait a while, and then use the show delete command to view the result.
If the status is VISIBLE, it means the data deletion is successful.
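The processing logic above can be sketched for a client program. The sketch assumes the JSON-like string shown in the examples has already been obtained from the client as text (how a given driver exposes it is not covered here), and that an ERROR 1064 surfaces as an exception raised by the driver before this function is reached. The function name delete_outcome and its return values are hypothetical.

```python
import ast


def delete_outcome(info):
    """Classify the info string returned after 'Query OK' for a Delete.

    Returns "done" when the deletion is visible, or "wait" when it is
    committed but not yet visible (check later with show delete).
    """
    # The examples use single-quoted keys and values, which json.loads
    # would reject, so the string is parsed as a Python literal instead.
    fields = ast.literal_eval(info)
    status = fields.get("status")
    if status == "VISIBLE":
        return "done"
    if status == "COMMITTED":
        return "wait"
    raise ValueError(f"unexpected status: {status!r}")


visible = "{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'VISIBLE', 'txnId':'4005'}"
print(delete_outcome(visible))  # done
```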
FE Configurations Related to Delete Operations
TIMEOUT configuration
In general, the timeout limit for Doris's deletion jobs is between 30 seconds and 5 minutes, and specific times can be adjusted through the following configuration items:
tablet_delete_timeout_second
The timeout of a delete operation scales with the number of tablets in the specified partitions. This configuration item is the average timeout contributed by each tablet, in seconds. The default value is 2.
For example, if the partitions specified for a deletion contain 5 tablets, the computed timeout is 10 seconds. Because this is lower than the minimum timeout of 30 seconds, the final timeout is 30 seconds.
load_straggler_wait_second
If the user anticipates a large volume of data and the upper limit of 5 minutes is not sufficient, they can adjust the timeout limit using this configuration. The default value is 300.
TIMEOUT is calculated as follows, in seconds:
TIMEOUT = MIN(load_straggler_wait_second, MAX(30, tablet_delete_timeout_second * tablet_num))
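The formula can be checked with a few values. The function below is a direct transcription of the rule, using the default values stated above as defaults. The function name delete_timeout_seconds is only for illustration.

```python
def delete_timeout_seconds(tablet_num,
                           tablet_delete_timeout_second=2,
                           load_straggler_wait_second=300):
    """TIMEOUT = MIN(load_straggler_wait_second,
                     MAX(30, tablet_delete_timeout_second * tablet_num))"""
    return min(load_straggler_wait_second,
               max(30, tablet_delete_timeout_second * tablet_num))


print(delete_timeout_seconds(5))     # 30  (10 is raised to the 30-second minimum)
print(delete_timeout_seconds(100))   # 200
print(delete_timeout_seconds(1000))  # 300 (capped by load_straggler_wait_second)
```

With the defaults, any deletion touching 150 tablets or more already reaches the 300-second cap, so larger jobs need a higher load_straggler_wait_second.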
query_timeout
Because delete is a SQL command, the delete statement is also subject to session limits, and its timeout is affected by the query_timeout value of the session. You can increase the timeout with the SET query_timeout = xxx command. The unit is seconds.
IN Predicate Configuration
max_allowed_in_element_num_of_delete
If a delete statement needs more elements in an IN predicate, use this setting to raise the maximum number of elements allowed. The default value is 1024.
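Where changing the FE configuration is not an option, a client can instead keep each IN predicate within the limit by splitting the values across several delete statements. This alternative is not described above and is offered only as a sketch. Keep in mind that each delete generates a new data version, so many small deletions have their own performance cost. The function in_predicate_batches is a hypothetical helper.

```python
def in_predicate_batches(values, limit=1024):
    """Split values into consecutive batches of at most `limit` elements,
    one batch per delete statement."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    values = list(values)
    return [values[i:i + limit] for i in range(0, len(values), limit)]


batches = in_predicate_batches(range(2500))
print([len(b) for b in batches])  # [1024, 1024, 452]
```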
Viewing Historical Records
You can use the show delete command to view historical deletion records.
The syntax is as follows:
SHOW DELETE [FROM db_name]
Example:
mysql> show delete from test_db;
+-----------+---------------+---------------------+-----------------+----------+
| TableName | PartitionName | CreateTime          | DeleteCondition | State    |
+-----------+---------------+---------------------+-----------------+----------+
| empty_tbl | p3            | 2020-04-15 23:09:35 | k1 EQ "1"       | FINISHED |
| test_tbl  | p4            | 2020-04-15 23:09:53 | k1 GT "80"      | FINISHED |
+-----------+---------------+---------------------+-----------------+----------+
2 rows in set (0.00 sec)
Notes
Unlike the Insert into command, the Delete command does not allow a label to be specified manually. For related concepts, see the INSERT INTO document.
More Help
For more detailed syntax, refer to the Delete command manual, or enter HELP DELETE in the command line of the MySQL client for more help information.
