Batch Deletion
Currently, Doris supports various import methods such as Broker Load (HDFS Data), Routine Load (Kafka Data), and Stream Load (Local Files). For data deletion, it can only be carried out using the delete statement. When you are using the delete statement, each execution of delete generates a new data version. Frequent deletions can severely affect query performance, and the deletion is achieved by generating an empty rowset to record the deletion criteria. Reading requires filtering through the deletion criteria, which can also adversely impact performance when there are many conditions.
Compared to other systems, Greenplum is more like that of traditional database products, while Snowflake realizes it through merge syntax. For scenarios like CDC data import, insert and delete operations are usually interlaced in the data. Our current import methods cannot satisfy these scenarios. Even if we separate insert and delete operations to solve the import issue, we still cannot solve the deletion issue. The batch delete feature can address these specific scenarios' needs. There are three Merge methods for data import:
1. APPEND: Append all data to the existing data.
2. DELETE: Delete all rows where the key column values match the imported data.
3. MERGE: Decide whether to APPEND or DELETE based on DELETE ON.
Basic Principles
Add a hidden column named __DORIS_DELETE_SIGN__
to implement this. Since we are only doing batch deletion on the unique model, we just need to add a hidden column with a Boolean type of aggregate function replace. In BE, all aggregate write procedures are the same as those of a normal column. There are two read schemes:
When FE encounters select * etc. expansion, it excludes the __DORIS_DELETE_SIGN__
column and defaults to add a __DORIS_DELETE_SIGN__ != true
condition. BE always adds a column during reading, to determine whether to delete it according to the condition.
Import
During import, when FE parses, it sets the hidden column's value to the value of DELETE ON
expression. Other aggregate behaviors are the same as the column with the aggregate function replace.
Read
During reading, add a condition of __DORIS_DELETE_SIGN__ != true
to all olapScanNodes that have the hidden column. BE is unaware of this process and operates normally.
Cumulative Compaction
Treat the hidden column as a normal column during Cumulative Compaction. The Compaction logic remains unchanged.
Base Compaction
During Base Compaction, lines marked for deletion are removed to reduce space used by data.
Enabling Batch Deletion
There are two ways to enable batch deletion:
1. You can add enable_batch_delete_by_default=true
in the FE configuration file. After FE is rebooted, all new tables will support batch deletion. This option is set to false by default.
2. To enable batch deletion on tables that do not support the batch deletion feature or have no FE configuration modified, you can use the following statement: ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE"
. This operation is essentially a schema change operation, which returns immediately, and the completion of the operation can be confirmed with the show alter table column
command.
To determine whether a table supports batch deletion, you can enable the display of hidden columns by setting a session variable SET show_hidden_columns=true
, and then use the desc tablename
command. If the output includes the column __DORIS_DELETE_SIGN__
, batch deletion is supported; if not, it is not supported.
Syntax Description
The syntax for import mainly adds a column map field that specifies the delete mark column, and requires the addition of a column in the data being imported. The syntax for setting this for various import methods are as follows.
Stream Load
The Stream Load
method adds a column field in the header to set the delete mark column, as shown in the example -H "columns: k1, k2, label_c3" -H "merge_type: [MERGE|APPEND|DELETE]" -H "delete: label_c3=1"
.
Broker Load
The method for Broker Load
involves setting the delete mark column in the PROPERTIES
section. The syntax is as follows:
LOAD LABEL db1.label1
(
[MERGE|APPEND|DELETE] DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
INTO TABLE tbl1
COLUMNS TERMINATED BY ","
(tmp_c1,tmp_c2, label_c3)
SET
(
id=tmp_c2,
name=tmp_c1,
)
[DELETE ON label_c3=true]
)
WITH BROKER 'broker'
(
"username"="user",
"password"="pass"
)
PROPERTIES
(
"timeout" = "3600"
);
Routine Load
Routine Load
adds a map to the columns
field, where the map follows the same syntax as above
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
[WITH MERGE|APPEND|DELETE]
COLUMNS(k1, k2, k3, v1, v2, label),
WHERE k1 > 100 and k2 like "%doris%"
[DELETE ON label=true]
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,0,200"
);
Notes
1. As import operations other than Stream Load
might be executed out of order in Doris, it is necessary to use MERGE
along with load sequence when importing if it is not Stream Load
. You can refer to Sequence Column in the documentation for the specific syntax. 2. DELETE ON
can only be used together with MERGE.
Usage Examples
Check whether batch deletion is enabled
mysql> SET show_hidden_columns=true;
Query OK, 0 rows affected (0.00 sec)
mysql> DESC test;
+
| Field | Type | Null | Key | Default | Extra |
+
| name | VARCHAR(100) | No | true | NULL | |
| gender | VARCHAR(10) | Yes | false | NULL | REPLACE |
| age | INT | Yes | false | NULL | REPLACE |
| __DORIS_DELETE_SIGN__ | TINYINT | No | false | 0 | REPLACE |
+
4 rows in set (0.00 sec)
Example of Using Stream Load
1. Normal data import:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: APPEND" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
The APPEND condition can be omitted, and the effect is the same as the following statement:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
2. Delete all data with the same key as the imported data:
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: DELETE" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
Assume that the original data in the imported table is:
+--------+----------+----------+------+
| siteid | citycode | username | pv |
+--------+----------+----------+------+
| 3 | 2 | tom | 2 |
| 4 | 3 | bush | 3 |
| 5 | 3 | helen | 3 |
+--------+----------+----------+------+
The data imported is:
After import, the data becomes:
+--------+----------+----------+------+
| siteid | citycode | username | pv |
+--------+----------+----------+------+
| 4 | 3 | bush | 3 |
| 5 | 3 | helen | 3 |
+--------+----------+----------+------+
3. Import lines in the data that are the same as the key column of site_id=1
.
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: MERGE" -H "delete: siteid=1" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
Assume that the data before being imported is:
+--------+----------+----------+------+
| siteid | citycode | username | pv |
+--------+----------+----------+------+
| 4 | 3 | bush | 3 |
| 5 | 3 | helen | 3 |
| 1 | 1 | jim | 2 |
+--------+----------+----------+------+
The data imported is:
2,1,grace,2
3,2,tom,2
1,1,jim,2
After import, it becomes:
+--------+----------+----------+------+
| siteid | citycode | username | pv |
+--------+----------+----------+------+
| 4 | 3 | bush | 3 |
| 2 | 1 | grace | 2 |
| 3 | 2 | tom | 2 |
| 5 | 3 | helen | 3 |
+--------+----------+----------+------+
Sql Delete Operation
Delete is different from other import methods. It is a synchronous process, similar to Insert into. All Delete operations in Doris are independent import jobs. Generally, the Delete statement needs to specify the table and partition as well as the deletion conditions to filter the data to be deleted. It will delete the data of base table and rollup table simultaneously. For the syntax of Delete operation, see DELETE syntax. Returned Result
The Delete command is a SQL command. The returned result is synchronous and is divided into several types:
1. Execution is successful
If Delete is successfully executed and visible, the following results will be returned, Query OK
indicates success.
mysql> delete from test_tbl PARTITION p1 where k1 = 1;
Query OK, 0 rows affected (0.04 sec)
{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'VISIBLE', 'txnId':'4005'}
2. Submission is successful, but it is not yet visible.
The transaction submission in Doris is divided into two steps: submission and version release. Only when the version release step is completed, the result will be visible to users. If it has been successfully submitted, it can be assumed that it will eventually be successfully released. Doris will try to wait for a while to release after the submission is completed. If it times out, even if the version release is not yet completed, it will give priority to return to users, to inform users that the submission has been completed. If Delete has been committed and executed, but the version has not yet been released and is still not visible, the following results will be returned:
mysql> delete from test_tbl PARTITION p1 where k1 = 1;
Query OK, 0 rows affected (0.04 sec)
{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'COMMITTED', 'txnId':'4005', 'err':'delete job is committed but may be taking effect later' }
The result will also return a json character string:
affected rows
: Indicates the number of rows affected by this deletion. Since Doris's deletion is currently logical, this value is always 0.
label
: An automatically generated label, which is the identifier of the import job. Each import job has a label that is unique within a single database.
status
: Indicates whether the data deletion is visible. If visible, it displays VISIBLE
; if not, it shows COMMITTED
.
txnId
: Transaction ID corresponding to this Delete job.
err
: This field displays detailed information about this deletion.
3. Submission fails, and Transaction is canceled.
If the Delete statement has not been successfully submitted, Doris will automatically abort it and return the following result:
mysql> delete from test_tbl partition p1 where k1 > 80;
ERROR 1064 (HY000): errCode = 2, detailMessage = {Error reason}
Example:
For instance, a timeout deletion will return the timeout time and the unfinished (tablet=replica)
.
mysql> delete from test_tbl partition p1 where k1 > 80;
ERROR 1064 (HY000): errCode = 2, detailMessage = failed to delete replicas from job: 4005, Unfinished replicas:10000=60000, 10001=60000, 10002=60000
In summary, the correct processing logic for the returned result of the Delete operation is as follows:
1. If the returned result is ERROR 1064 (HY000)
, it means the deletion fails.
2. If the returned result is Query OK
, it means the deletion is successful.
If the status
is COMMITTED
, data is still not visible and you may want to wait a while before using the show delete
command to view the result.
If the status
is VISIBLE
, it means the data deletion is successful.
FE Configurations Related to Delete Operations
TIMEOUT configuration
In general, the timeout limit for Doris's deletion jobs is between 30 seconds and 5 minutes, and specific times can be adjusted through the following configuration items:
tablet_delete_timeout_second
The timeout of the delete operation itself can change according to the number of tablets under a given partition, and this configuration setting contributes to the average timeout time of one tablet. The default value is 2.
If the specified partition contains 5 tablets for this deletion, the timeout for deletion is 10 seconds. Since this is lower than the minimum timeout of 30 seconds, the final timeout is 30 seconds.
load_straggler_wait_second
If the user anticipates a large volume of data and the upper limit of 5 minutes is not sufficient, they can adjust the timeout limit using this configuration. The default value is 300.
The calculation rule of TIMEOUT is in seconds
TIMEOUT = MIN(load_straggler_wait_second, MAX(30, tablet_delete_timeout_second * tablet_num))
query_timeout
Since delete is a SQL command, the deletion statement would be session limited and the timeout would be affected by the query_timeout
value in the session. We can increase the timeout by using the SET query_timeout = xxx
command. The time unit is in seconds.
IN Predicate Configuration
max_allowed_in_element_num_of_delete
If users require more elements while using an IN predicate, this setting can be used to adjust the maximum number of allowable elements. The default value is 1024.
Viewing Historical Records
You can use the show delete command to view historical deletion records.
The syntax is as follows:
SHOW DELETE [FROM db_name]
Example:
mysql> show delete from test_db;
+
| TableName | PartitionName | CreateTime | DeleteCondition | State |
+
| empty_tbl | p3 | 2020-04-15 23:09:35 | k1 EQ "1" | FINISHED |
| test_tbl | p4 | 2020-04-15 23:09:53 | k1 GT "80" | FINISHED |
+
2 rows in set (0.00 sec)
Notes
Unlike the Insert into command, delete cannot manually specify a label
. For relevant concepts, view the INSERT INTO document. More Help
For more detailed syntax, refer to the Delete command manual, or enter HELP DELETE
in the command line of the Mysql client to access more help information.
Was this page helpful?