Data Transformation, Column Mapping and Filtering

Last updated: 2024-06-27 11:05:37

    Supported import methods

    BROKER LOAD:
    LOAD LABEL example_db.label1
    (
        DATA INFILE("bos://bucket/input/file")
        INTO TABLE my_table
        (k1, k2, tmpk3)
        PRECEDING FILTER k1 = 1
        SET (
            k3 = tmpk3 + 1
        )
        WHERE k1 > k2
    )
    WITH BROKER bos
    (
        ...
    );

    STREAM LOAD:
    curl
    --location-trusted
    -u user:passwd
    -H "columns: k1, k2, tmpk3, k3 = tmpk3 + 1"
    -H "where: k1 > k2"
    -T file.txt
    http://host:port/api/testDb/testTbl/_stream_load

    ROUTINE LOAD:
    CREATE ROUTINE LOAD example_db.label1 ON my_table
    COLUMNS(k1, k2, tmpk3, k3 = tmpk3 + 1),
    PRECEDING FILTER k1 = 1,
    WHERE k1 > k2
    ...
    The import methods above all support column mapping, transformation, and filtering on the source data, applied in the following order:
    1. Pre-filtering: filter the raw rows as they are read. In the example above:
    PRECEDING FILTER k1 = 1
    2. Mapping: define a name for each column in the source data. If a defined name matches a column in the table, it is mapped directly to that column; otherwise, the defined column can be used in subsequent transformations. In the example above:
    (k1, k2, tmpk3)
    3. Transformation: transform the mapped columns using built-in expressions, functions, or user-defined functions, and remap the results to the corresponding table columns. In the example above:
    k3 = tmpk3 + 1
    4. Post-filtering: filter the mapped and transformed rows by expression. Rows that fail the filter are not imported. In the example above:
    WHERE k1 > k2
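    The processing order can be sketched in plain Python (an illustration of the semantics only, not Doris code; the sample rows are hypothetical):

```python
# Illustrative simulation of Doris load-time row processing order:
# pre-filter -> column mapping -> transformation -> post-filter.
# Mirrors the example above: PRECEDING FILTER k1 = 1,
# mapping (k1, k2, tmpk3), SET (k3 = tmpk3 + 1), WHERE k1 > k2.

raw_rows = [
    (1, 0, 9),   # passes pre-filter (k1 == 1) and post-filter (1 > 0)
    (1, 5, 9),   # passes pre-filter, dropped by post-filter (1 > 5 is False)
    (2, 0, 9),   # dropped by pre-filter (k1 != 1)
]

loaded = []
for k1, k2, tmpk3 in raw_rows:           # mapping: (k1, k2, tmpk3)
    if not (k1 == 1):                    # PRECEDING FILTER k1 = 1
        continue
    k3 = tmpk3 + 1                       # SET (k3 = tmpk3 + 1)
    if not (k1 > k2):                    # WHERE k1 > k2
        continue
    loaded.append((k1, k2, k3))

print(loaded)  # [(1, 0, 10)]
```

    Note that the post-filter sees k3, the transformed value, while the pre-filter only sees the raw input columns.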

    Column Mapping

    The main purpose of columns mapping is to describe the information of each column in the imported file, which is equivalent to defining the name of the column in the source file. By describing the column mapping relationship, we can import source files with different column order and different number of columns into Doris. Below is an example: Suppose the source file has four columns, the content is as follows (the header column names are only for convenience, and there are no actual headers):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    4        | \N       | chongqing | 1.4
    Note
    \N represents null in the source file.
    1. Adjusting the mapping order. Suppose the table has four columns: k1, k2, k3, k4. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    Column 4 -> k4
    The column mapping should be written as:
    (k1, k3, k2, k4)
    2. The source file has more columns than the table. Assume the table has three columns: k1, k2, k3. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    The column mapping should be written as:
    (k1, k3, k2, tmpk4)
    tmpk4 is a custom column name that does not exist in the table. Doris ignores column names that do not exist in the table, so the fourth source column is discarded.
    3. The source file has fewer columns than the table; the missing columns are filled with default values. Assume the table has five columns: k1, k2, k3, k4, k5. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    Here, only the first three columns of the source file are used, and columns k4 and k5 are to be filled with default values. The column mapping should be written as:
    (k1, k3, k2)
    If columns k4 and k5 have default values, the defaults are used. Otherwise, if the column is nullable, it is filled with null; if neither, the import job reports an error.
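    The three mapping cases above can be sketched in plain Python (an illustration of the semantics only, not Doris code; the error case for a non-nullable column without a default is not modeled):

```python
# Illustration of Doris column-mapping semantics: a source row is bound to
# the names in the mapping list, in order; names that exist in the table are
# kept, extra names are dropped, and unmapped table columns fall back to a
# default value, or None (NULL) if no default exists.

def map_row(row, mapping, table_columns, defaults):
    bound = dict(zip(mapping, row))       # bind source values to column names
    out = {}
    for col in table_columns:
        if col in bound:
            out[col] = bound[col]         # mapped directly from the source
        else:
            out[col] = defaults.get(col)  # default value, or None (NULL)
    return out

# Case 1: reorder -- mapping (k1, k3, k2, k4)
print(map_row((1, 100, "beijing", 1.1), ["k1", "k3", "k2", "k4"],
              ["k1", "k2", "k3", "k4"], {}))
# Case 2: extra source column dropped via a name not in the table (tmpk4)
print(map_row((1, 100, "beijing", 1.1), ["k1", "k3", "k2", "tmpk4"],
              ["k1", "k2", "k3"], {}))
# Case 3: fewer source columns; k4 filled from a default, k5 from NULL
print(map_row((1, 100, "beijing"), ["k1", "k3", "k2"],
              ["k1", "k2", "k3", "k4", "k5"], {"k4": 0}))
```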

    Column Pre-Filtering

    Pre-filtering filters the raw data once, as it is read. Currently, only BROKER LOAD and ROUTINE LOAD support it. Typical scenarios for pre-filtering include:
    1. Filtering before transformation: when data should be filtered before column mapping and transformation, unwanted rows can be dropped up front.
    2. The filter column does not exist in the table and serves only as a filter marker. For example, the source data contains rows belonging to multiple tables (or rows for multiple tables are written to the same Kafka message queue), and each row carries a column identifying which table it belongs to. A pre-filter condition can then select only the rows for the target table.
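    The second scenario can be sketched in plain Python (an illustration only; the marker column name "table_name" and the table names are hypothetical):

```python
# Rows for several tables share one stream; a marker column selects the
# rows for the target table before any mapping or transformation happens.
stream = [
    ("orders", 1, 100),
    ("users",  2, 200),
    ("orders", 3, 300),
]

# Semantics of a pre-filter like: PRECEDING FILTER table_name = 'orders'
# (the marker column itself is not imported into the table)
orders_rows = [row[1:] for row in stream if row[0] == "orders"]
print(orders_rows)  # [(1, 100), (3, 300)]
```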

    Column Transformation

    The column transformation feature allows users to transform column values in the source file. Currently, Doris supports most built-in functions as well as user-defined functions for transformation.
    Note
    User-defined functions belong to a specific database. To use them for transformation, the user needs read permission on that database.
    Transformation operations are usually defined together with the column mapping: columns are first mapped, then transformed. Below is an illustrative example. Assume the source file has four columns with the following content (the header column names are only for convenience; there is no actual header):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    \N       | 400      | chongqing | 1.4
    1. Transform a column value in the source file and import it after transformation. Assume the table has four columns: k1, k2, k3, k4. The desired mapping and transformation is as follows:
    Column 1 -> k1
    Column 2 * 100 -> k3
    Column 3 -> k2
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (k1, tmpk3, k2, k4, k3 = tmpk3 * 100)
    Here, the second column of the source file is named tmpk3, and the value of table column k3 is specified as tmpk3 * 100. The data in the final table is as follows:
    k1   | k2        | k3    | k4
    1    | beijing   | 10000 | 1.1
    2    | shanghai  | 20000 | 1.2
    3    | guangzhou | 30000 | 1.3
    null | chongqing | 40000 | 1.4
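    The mapping plus transformation in this example can be sketched in plain Python (semantics only, not Doris code):

```python
# (k1, tmpk3, k2, k4, k3 = tmpk3 * 100): source column 2 is bound to the
# temporary name tmpk3, and table column k3 is derived from it.
source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),   # \N in the file reads as NULL/None
]

table_rows = []
for k1, tmpk3, k2, k4 in source_rows:    # mapping step
    k3 = tmpk3 * 100                     # transformation step
    table_rows.append((k1, k2, k3, k4))  # final table column order

print(table_rows[0])  # (1, 'beijing', 10000, 1.1)
```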
    2. Conditional column transformation with the case when function. Assume the table has four columns: k1, k2, k3, k4. We want beijing, shanghai, guangzhou, and chongqing in the source data to be transformed into their corresponding region IDs and imported:
    Column 1 -> k1
    Column 2 -> k2
    Column 3 converted to region id -> k3
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    The data in the final table is as follows:
    k1   | k2  | k3 | k4
    1    | 100 | 1  | 1.1
    2    | 200 | 2  | 1.2
    3    | 300 | 3  | 1.3
    null | 400 | 4  | 1.4
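    The case when expression above can be sketched with a plain Python lookup (semantics only; the region IDs follow the example above):

```python
# case tmpk3 when "beijing" then 1 when "shanghai" then 2
# when "guangzhou" then 3 when "chongqing" then 4 else null end
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

def to_region_id(city):
    # dict.get returns None for unknown keys, matching ELSE NULL
    return REGION_IDS.get(city)

print(to_region_id("guangzhou"))  # 3
print(to_region_id("tianjin"))    # None (falls into the ELSE NULL branch)
```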
    3. Convert null values in the source file to 0 at import time, while also performing the region ID transformation from example 2. Assume the table has four columns: k1, k2, k3, k4. While converting the region ID, we also want null values in the first source column to be converted to 0:
    Column 1, converted to 0 if null -> k1
    Column 2 -> k2
    Column 3 converted to region id -> k3
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (tmpk1, k2, tmpk3, k4, k1 = ifnull(tmpk1, 0), k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    The data in the final table is as follows:
    k1 | k2  | k3 | k4
    1  | 100 | 1  | 1.1
    2  | 200 | 2  | 1.2
    3  | 300 | 3  | 1.3
    0  | 400 | 4  | 1.4
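    The null handling in this example can be sketched in plain Python (semantics only, not Doris code):

```python
# k1 = ifnull(tmpk1, 0): substitute a default when the value is NULL.
def ifnull(value, default):
    return default if value is None else value

print(ifnull(None, 0))  # 0 -- the \N row's k1 becomes 0
print(ifnull(3, 0))     # 3 -- non-null values pass through unchanged
```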

    Column Filtering

    After column mapping and transformation, filter conditions let us exclude data that we do not want to import into Doris. Below is an example. Suppose the source file has four columns with the following content (the header column names are only for convenience; there is no actual header):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    \N       | 400      | chongqing | 1.4
    1. Filtering with the default column mapping and transformation. Assume the table has four columns: k1, k2, k3, k4. With the default column mapping and transformation, we can define filter conditions directly. To import only the rows whose fourth source column is greater than 1.2, the filter condition is as follows:
    where k4 > 1.2
    The data in the final table is as follows:
    k1   | k2  | k3        | k4
    3    | 300 | guangzhou | 1.3
    null | 400 | chongqing | 1.4
    By default, Doris will map columns in sequence, so the fourth column in the source file is automatically mapped to the k4 column in the table.
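    This default-mapping filter can be sketched in plain Python (semantics only, not Doris code):

```python
# where k4 > 1.2, with default sequential mapping: source columns 1..4
# map to table columns k1..k4 in order.
source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),   # \N reads as NULL/None
]

loaded = [(k1, k2, k3, k4) for k1, k2, k3, k4 in source_rows if k4 > 1.2]
print(loaded)  # [(3, 300, 'guangzhou', 1.3), (None, 400, 'chongqing', 1.4)]
```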
    2. Filtering column-transformed data. Assume the table has four columns: k1, k2, k3, k4. In the column transformation example, we converted province names into IDs. Here we want to filter out the rows whose ID is 3. The transformation and filter conditions are as follows:
    (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    where k3 != 3
    The data in the final table is as follows:
    k1   | k2  | k3 | k4
    1    | 100 | 1  | 1.1
    2    | 200 | 2  | 1.2
    null | 400 | 4  | 1.4
    Note that the values used in the filter are the final column values after mapping and transformation, not the original source values.
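    That ordering can be sketched in plain Python (semantics only, not Doris code): the filter runs on the transformed k3, not on the source string.

```python
# Mapping (k1, k2, tmpk3, k4), case-when transformation to k3,
# then post-filter: where k3 != 3.
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),
]

loaded = []
for k1, k2, tmpk3, k4 in source_rows:    # mapping (k1, k2, tmpk3, k4)
    k3 = REGION_IDS.get(tmpk3)           # case ... when ... transformation
    if k3 != 3:                          # where k3 != 3 (post-transform value)
        loaded.append((k1, k2, k3, k4))

print(loaded)  # [(1, 100, 1, 1.1), (2, 200, 2, 1.2), (None, 400, 4, 1.4)]
```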
    3. Multi-condition filtering. Assume the table has four columns: k1, k2, k3, k4. We want to filter out rows whose k1 is null, and also filter out rows whose k4 is less than 1.2 (the region ID transformation from the previous example still applies). Because the WHERE clause specifies the rows to keep, the filter condition is as follows:
    where k1 is not null and k4 >= 1.2
    The data in the final table is as follows:
    k1 | k2  | k3 | k4
    2  | 200 | 2  | 1.2
    3  | 300 | 3  | 1.3
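    The multi-condition filter can be sketched in plain Python (semantics only, not Doris code; the region ID transformation is carried over from the earlier example):

```python
# where k1 is not null and k4 >= 1.2, applied after transformation.
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),
]

loaded = [(k1, k2, REGION_IDS.get(tmpk3), k4)
          for k1, k2, tmpk3, k4 in source_rows
          if k1 is not None and k4 >= 1.2]   # both conditions must hold
print(loaded)  # [(2, 200, 2, 1.2), (3, 300, 3, 1.3)]
```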

    Data Quality Problems and Filtering Thresholds

    Rows processed by an import job can be divided into the following three types:
    1. Filtered Rows: rows filtered out due to unqualified data quality, such as type errors, precision errors, strings exceeding the column length, a mismatched number of file columns, or rows for which no corresponding partition exists.
    2. Unselected Rows: rows filtered out by the PRECEDING FILTER or WHERE conditions.
    3. Loaded Rows: rows that are imported correctly.
    Doris import jobs let users set a maximum error rate (max_filter_ratio). If the error rate of the imported data is below this threshold, the erroneous rows are ignored and the remaining correct rows are imported. The error rate is calculated as follows:
    #Filtered Rows / (#Filtered Rows + #Loaded Rows)
    In other words, Unselected Rows are not counted toward the error rate.
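    The threshold check can be sketched in plain Python (an illustration of the formula only; the row counts are hypothetical):

```python
# error_rate = filtered_rows / (filtered_rows + loaded_rows);
# unselected rows are excluded from the calculation entirely.
def exceeds_filter_ratio(filtered_rows, loaded_rows, max_filter_ratio):
    if filtered_rows + loaded_rows == 0:
        return False
    error_rate = filtered_rows / (filtered_rows + loaded_rows)
    return error_rate > max_filter_ratio

# 5 bad rows out of 100 counted rows, with max_filter_ratio = 0.1:
print(exceeds_filter_ratio(5, 95, 0.1))   # False -> the job can succeed
print(exceeds_filter_ratio(15, 85, 0.1))  # True  -> the job fails
```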