Users can modify the schema of an existing table via Schema Change operation. Currently, Doris supports the following changes:
Column addition or deletion
Column type modification
Column order adjustment
Bloom Filter addition or modification
bitmap index addition or deletion
Here we mainly describe how to create a Schema Change Job, as well as precautions and common issues in the Schema Change process.
Definitions
Base Table: Whenever a table is created, it corresponds to a base table.
Rollup: The rollup table created based on the base table or other rollups.
Index: Materialized index. Both the rollup and base table are referred to as materialized indexes.
Transaction: Each import task is a Transaction, and each Transaction has a unique incrementing Transaction ID.
How It Works
The basic process of executing Schema Change involves generating new index data in accordance with the original index's data. This mainly involves converting two sets of data: the existing historical data and the newly arrived import data during the Schema Change execution.
+----------+
| Load Job |
+----+-----+
|
| Load job generates both origin and new index data
|
| +------------------+ +---------------+
| | Origin Index | | Origin Index |
+------> New Incoming Data| | History Data |
| +------------------+ +------+--------+
| |
| | Convert history data
| |
| +------------------+ +------v--------+
| | New Index | | New Index |
+------> New Incoming Data| | History Data |
+------------------+ +---------------+
Before converting historical data, Doris will access the latest Transaction ID and wait for all import transactions prior to this Transaction ID. This Transaction ID become the watershed, meaning that Doris guarantees that all import tasks after the watershed will simultaneously generate data for the original index and the new index. Thus, after the historical data conversion is complete, it is ensured that the data in the new index is complete.
Creating a Job
Refer to HELP ALTER TABLE
for the specifics of the Schema Change syntax.
Creating a Schema Change is an asynchronous process. Once a job is successfully submitted, users need to use SHOW ALTER TABLE COLUMN
command to view the progress of the job.
Viewing Job
SHOW ALTER TABLE COLUMN
can view the Schema Change job that is currently running or already completed. If a Schema Change Job involves multiple indexes, this command will display multiple lines, and each corresponding to one Index. Example as follows:
JobId: 20021
TableName: tbl1
CreateTime: 2019-08-05 23:03:13
FinishTime: 2019-08-05 23:03:42
IndexName: tbl1
IndexId: 20022
OriginIndexId: 20017
SchemaVersion: 2:792557838
TransactionId: 10023
State: FINISHED
Msg:
Progress: N/A
Timeout: 86400
JobId: Unique ID of each Schema Change job.
TableName: Name of the base table corresponding to the Schema Change.
CreateTime: Job creation time.
FinishedTime: Job completion time. If not finished, it shows "N/A".
IndexName: The name of one of the Indexes involved in this change.
IndexId: Unique ID of the new Index.
OriginIndexId: Unique ID of the old Index.
SchemaVersion: Shown in M:N format, where M stands for the version changed this time of Schema Change, and N stands for the corresponding Hash value. Schema Change version increments each time.
TransactionId: Watershed transaction ID that converts historic data.
State: Current stage of the job.
PENDING: The job is waiting in the queue to be scheduled.
WAITING_TXN: Wait for the import job before the watershed transaction ID to be completed.
RUNNING: Convert historical data.
FINISHED: The job is successful.
CANCELLED: The job fails.
Msg: If the job fails, this will display the error message.
Progress: Job progress. Only shows progress in the RUNNING state. The progress is displayed in the form of M/N. Where N is the total number of replicas involved in Schema Change, and M is the number of replicas that have completed the historical data conversion.
Timeout: Job timeout, expressed in seconds.
Canceling Job
When the job state is not FINISHED or CANCELLED, the following command can be used to cancel the Schema Change Job:
CANCEL ALTER TABLE COLUMN FROM tbl_name;
Best Practice
Schema Change can make different modifications to multiple indexes in one Job. Here's an example:
Source Schema:
+-----------+-------+------+------+------+---------+-------+
| IndexName | Field | Type | Null | Key | Default | Extra |
+-----------+-------+------+------+------+---------+-------+
| tbl1 | k1 | INT | No | true | N/A | |
| | k2 | INT | No | true | N/A | |
| | k3 | INT | No | true | N/A | |
| | | | | | | |
| rollup2 | k2 | INT | No | true | N/A | |
| | | | | | | |
| rollup1 | k1 | INT | No | true | N/A | |
| | k2 | INT | No | true | N/A | |
+-----------+-------+------+------+------+---------+-------+
The following command can be used to add column k4 to both rollup1 and rollup2, and also add column k5 to rollup2:
ALTER TABLE tbl1
ADD COLUMN k4 INT default "1" to rollup1,
ADD COLUMN k4 INT default "1" to rollup2,
ADD COLUMN k5 INT default "1" to rollup2;
After the completion, the Schema becomes:
+-----------+-------+------+------+------+---------+-------+
| IndexName | Field | Type | Null | Key | Default | Extra |
+-----------+-------+------+------+------+---------+-------+
| tbl1 | k1 | INT | No | true | N/A | |
| | k2 | INT | No | true | N/A | |
| | k3 | INT | No | true | N/A | |
| | k4 | INT | No | true | 1 | |
| | k5 | INT | No | true | 1 | |
| | | | | | | |
| rollup2 | k2 | INT | No | true | N/A | |
| | k4 | INT | No | true | 1 | |
| | k5 | INT | No | true | 1 | |
| | | | | | | |
| rollup1 | k1 | INT | No | true | N/A | |
| | k2 | INT | No | true | N/A | |
| | k4 | INT | No | true | 1 | |
+-----------+-------+------+------+------+---------+-------+
As you can see, the Base table tbl1 has also automatically added the columns k4 and k5. Any column added to any rollup will automatically be added to the Base table.
At the same time, it is not allowed to add columns already present in the Base table to the Rollup. If users need to do this, they can establish a new Rollup that includes the new column and then delete the original Rollup.
Notes
Only one Schema Change Job can run at the same time on a table.
Schema Change operations do not block import and query operations.
Partition column and bucket column cannot be modified.
If there is a value column of REPLACE aggregate in the Schema, it is not allowed to delete the Key column.
If the Key column is deleted, Doris cannot determine the value of the REPLACE column.
All non-Key columns in the Unique data model table are REPLACE Aggregate type.
When adding a value column of SUM or REPLACE type in Aggregate, the default value of the column has no meaning for historical data.
Because the historical data has already lost detailed information, the default value does not actually reflect the value after Aggregate.
When you are modifying the column type, all fields except Type need to be completed according to the information on the original column.
To modify the column k1 INT SUM NULL DEFAULT "1"
to BIGINT, execute the command as follows:
ALTER TABLE tbl1 MODIFY COLUMN
k1 BIGINT SUM NULL DEFAULT "1";
Note
Apart from new column type, such as Aggregate mode, Nullable property and default value, all need to be completed according to the original information.
Modifying column name, Aggregate type, Nullable property, default value and column comment are not supported.
FAQs
The execution speed of Schema Change.
The execution speed of Schema Change is estimated to be about 10MB/s at worst. For safety's sake, users can set the timeout time of the job to this rate.
Submit job error Table xxx is not stable. ...
.
Schema Change can only start when the table data is complete and in an unbalanced state. If some data tablet replicas of the table are incomplete, or some replicas are balancing, the submission will be rejected.
Whether the data tablet replica is complete can be viewed through the following command: ADMIN SHOW REPLICA STATUS FROM tbl WHERE STATUS != "OK";
.
If there are returned results, it means that there are problems with the replica. Usually, the system will automatically repair these problems. Users can prioritize repairing this table through the following command: ADMIN REPAIR TABLE tbl1;
.
Users can view whether there is a balance task running through the following command: SHOW PROC "/cluster_balance/pending_tablets";
.
You can wait for the balance task to complete, or disable the balance operation temporarily with the following command: ADMIN SET FRONTEND CONFIG ("disable_balance" = "true");
.
Relevant Configuration
FE configuration
alter_table_timeout_second
: job timeout time, which defaults to 86400 seconds.
BE configuration
alter_tablet_worker_count
:BE side thread count for executing historical data conversion. It defaults to 3. If you want to speed up the Schema Change job, you can increase this parameter appropriately and restart BE. However, too many conversion threads might increase IO pressure, affecting other operations. This thread is shared with Rollup job.
Was this page helpful?