Data Transformation, Column Mapping and Filtering

Last updated: 2024-06-27 11:05:37

    Supported import methods

    BROKER LOAD:
    LOAD LABEL example_db.label1
    (
        DATA INFILE("bos://bucket/input/file")
        INTO TABLE my_table
        (k1, k2, tmpk3)
        PRECEDING FILTER k1 = 1
        SET (
            k3 = tmpk3 + 1
        )
        WHERE k1 > k2
    )
    WITH BROKER bos
    (
        ...
    );

    STREAM LOAD:
    curl
    --location-trusted
    -u user:passwd
    -H "columns: k1, k2, tmpk3, k3 = tmpk3 + 1"
    -H "where: k1 > k2"
    -T file.txt
    http://host:port/api/testDb/testTbl/_stream_load

    ROUTINE LOAD:
    CREATE ROUTINE LOAD example_db.label1 ON my_table
    COLUMNS(k1, k2, tmpk3, k3 = tmpk3 + 1),
    PRECEDING FILTER k1 = 1,
    WHERE k1 > k2
    ...
    The import methods above all support column mapping, transformation, and filtering on the source data, applied in the following order:
    1. Pre-filtering: filter the raw rows as they are read. In the example above:
    PRECEDING FILTER k1 = 1
    2. Mapping: define a name for each column in the source data. If a defined name matches a column in the table, it is mapped directly to that column; otherwise, the defined column can be used in subsequent transformations. In the example above:
    (k1, k2, tmpk3)
    3. Transformation: transform the mapped columns using built-in expressions, functions, or user-defined functions, and remap the results to the corresponding table columns. In the example above:
    k3 = tmpk3 + 1
    4. Post-filtering: filter the mapped and transformed rows by expression. Rows that fail the filter are not imported. In the example above:
    WHERE k1 > k2
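    The processing order can be sketched in plain Python (an illustration of the semantics only, not Doris code; the sample rows are hypothetical):

```python
# Illustrative simulation of Doris load-time row processing order:
# pre-filter -> column mapping -> transformation -> post-filter.
# Mirrors the example above: PRECEDING FILTER k1 = 1,
# mapping (k1, k2, tmpk3), SET (k3 = tmpk3 + 1), WHERE k1 > k2.

raw_rows = [
    (1, 0, 9),   # passes pre-filter (k1 == 1) and post-filter (1 > 0)
    (1, 5, 9),   # passes pre-filter, dropped by post-filter (1 > 5 is False)
    (2, 0, 9),   # dropped by pre-filter (k1 != 1)
]

loaded = []
for k1, k2, tmpk3 in raw_rows:           # mapping: (k1, k2, tmpk3)
    if not (k1 == 1):                    # PRECEDING FILTER k1 = 1
        continue
    k3 = tmpk3 + 1                       # SET (k3 = tmpk3 + 1)
    if not (k1 > k2):                    # WHERE k1 > k2
        continue
    loaded.append((k1, k2, k3))

print(loaded)  # [(1, 0, 10)]
```

    Note that the post-filter sees k3, the transformed value, while the pre-filter only sees the raw input columns.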

    Column Mapping

    The main purpose of columns mapping is to describe the information of each column in the imported file, which is equivalent to defining the name of the column in the source file. By describing the column mapping relationship, we can import source files with different column order and different number of columns into Doris. Below is an example: Suppose the source file has four columns, the content is as follows (the header column names are only for convenience, and there are no actual headers):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    4        | \N       | chongqing | 1.4
    Note
    \N represents null in the source file.
    1. Adjusting the mapping order. Suppose the table has four columns: k1, k2, k3, k4. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    Column 4 -> k4
    The column mapping should be written as:
    (k1, k3, k2, k4)
    2. The source file has more columns than the table. Assume the table has three columns: k1, k2, k3. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    The column mapping should be written as:
    (k1, k3, k2, tmpk4)
    tmpk4 is a custom column name that does not exist in the table. Doris ignores column names that do not exist in the table, so the fourth source column is discarded.
    3. The source file has fewer columns than the table; the missing columns are filled with default values. Assume the table has five columns: k1, k2, k3, k4, k5. The desired mapping is as follows:
    Column 1 -> k1
    Column 2 -> k3
    Column 3 -> k2
    Here, only the first three columns of the source file are used, and columns k4 and k5 are to be filled with default values. The column mapping should be written as:
    (k1, k3, k2)
    If columns k4 and k5 have default values, the defaults are used. Otherwise, if the column is nullable, it is filled with null; if neither, the import job reports an error.
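    The three mapping cases above can be sketched in plain Python (an illustration of the semantics only, not Doris code; the error case for a non-nullable column without a default is not modeled):

```python
# Illustration of Doris column-mapping semantics: a source row is bound to
# the names in the mapping list, in order; names that exist in the table are
# kept, extra names are dropped, and unmapped table columns fall back to a
# default value, or None (NULL) if no default exists.

def map_row(row, mapping, table_columns, defaults):
    bound = dict(zip(mapping, row))       # bind source values to column names
    out = {}
    for col in table_columns:
        if col in bound:
            out[col] = bound[col]         # mapped directly from the source
        else:
            out[col] = defaults.get(col)  # default value, or None (NULL)
    return out

# Case 1: reorder -- mapping (k1, k3, k2, k4)
print(map_row((1, 100, "beijing", 1.1), ["k1", "k3", "k2", "k4"],
              ["k1", "k2", "k3", "k4"], {}))
# Case 2: extra source column dropped via a name not in the table (tmpk4)
print(map_row((1, 100, "beijing", 1.1), ["k1", "k3", "k2", "tmpk4"],
              ["k1", "k2", "k3"], {}))
# Case 3: fewer source columns; k4 filled from a default, k5 from NULL
print(map_row((1, 100, "beijing"), ["k1", "k3", "k2"],
              ["k1", "k2", "k3", "k4", "k5"], {"k4": 0}))
```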

    Column Pre-Filtering

    Pre-filtering filters the raw data once, as it is read. Currently, only BROKER LOAD and ROUTINE LOAD support it. Typical scenarios for pre-filtering include:
    1. Filtering before transformation: when data should be filtered before column mapping and transformation, unwanted rows can be dropped up front.
    2. The filter column does not exist in the table and serves only as a filter marker. For example, the source data contains rows belonging to multiple tables (or rows for multiple tables are written to the same Kafka message queue), and each row carries a column identifying which table it belongs to. A pre-filter condition can then select only the rows for the target table.
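    The second scenario can be sketched in plain Python (an illustration only; the marker column name "table_name" and the table names are hypothetical):

```python
# Rows for several tables share one stream; a marker column selects the
# rows for the target table before any mapping or transformation happens.
stream = [
    ("orders", 1, 100),
    ("users",  2, 200),
    ("orders", 3, 300),
]

# Semantics of a pre-filter like: PRECEDING FILTER table_name = 'orders'
# (the marker column itself is not imported into the table)
orders_rows = [row[1:] for row in stream if row[0] == "orders"]
print(orders_rows)  # [(1, 100), (3, 300)]
```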

    Column Transformation

    The column transformation feature allows users to transform column values in the source file. Currently, Doris supports most built-in functions as well as user-defined functions for transformation.
    Note
    User-defined functions belong to a specific database. To use them for transformation, the user needs read permission on that database.
    Transformation operations are usually defined together with the column mapping: columns are first mapped, then transformed. Below is an illustrative example. Assume the source file has four columns with the following content (the header column names are only for convenience; there is no actual header):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    \N       | 400      | chongqing | 1.4
    1. Transform a column value in the source file and import it after transformation. Assume the table has four columns: k1, k2, k3, k4. The desired mapping and transformation is as follows:
    Column 1 -> k1
    Column 2 * 100 -> k3
    Column 3 -> k2
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (k1, tmpk3, k2, k4, k3 = tmpk3 * 100)
    Here, the second column of the source file is named tmpk3, and the value of table column k3 is specified as tmpk3 * 100. The data in the final table is as follows:
    k1   | k2        | k3    | k4
    1    | beijing   | 10000 | 1.1
    2    | shanghai  | 20000 | 1.2
    3    | guangzhou | 30000 | 1.3
    null | chongqing | 40000 | 1.4
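    The mapping plus transformation in this example can be sketched in plain Python (semantics only, not Doris code):

```python
# (k1, tmpk3, k2, k4, k3 = tmpk3 * 100): source column 2 is bound to the
# temporary name tmpk3, and table column k3 is derived from it.
source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),   # \N in the file reads as NULL/None
]

table_rows = []
for k1, tmpk3, k2, k4 in source_rows:    # mapping step
    k3 = tmpk3 * 100                     # transformation step
    table_rows.append((k1, k2, k3, k4))  # final table column order

print(table_rows[0])  # (1, 'beijing', 10000, 1.1)
```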
    2. Conditional column transformation with the case when function. Assume the table has four columns: k1, k2, k3, k4. We want beijing, shanghai, guangzhou, and chongqing in the source data to be transformed into their corresponding region IDs and imported:
    Column 1 -> k1
    Column 2 -> k2
    Column 3 converted to region id -> k3
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    The data in the final table is as follows:
    k1   | k2  | k3 | k4
    1    | 100 | 1  | 1.1
    2    | 200 | 2  | 1.2
    3    | 300 | 3  | 1.3
    null | 400 | 4  | 1.4
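    The case when expression above can be sketched with a plain Python lookup (semantics only; the region IDs follow the example above):

```python
# case tmpk3 when "beijing" then 1 when "shanghai" then 2
# when "guangzhou" then 3 when "chongqing" then 4 else null end
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

def to_region_id(city):
    # dict.get returns None for unknown keys, matching ELSE NULL
    return REGION_IDS.get(city)

print(to_region_id("guangzhou"))  # 3
print(to_region_id("tianjin"))    # None (falls into the ELSE NULL branch)
```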
    3. Convert null values in the source file to 0 at import time, while also performing the region ID transformation from example 2. Assume the table has four columns: k1, k2, k3, k4. While converting the region ID, we also want null values in the first source column to be converted to 0:
    Column 1, converted to 0 if null -> k1
    Column 2 -> k2
    Column 3 converted to region id -> k3
    Column 4 -> k4
    The column mapping and transformation should be written as:
    (tmpk1, k2, tmpk3, k4, k1 = ifnull(tmpk1, 0), k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    The data in the final table is as follows:
    k1 | k2  | k3 | k4
    1  | 100 | 1  | 1.1
    2  | 200 | 2  | 1.2
    3  | 300 | 3  | 1.3
    0  | 400 | 4  | 1.4
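    The null handling in this example can be sketched in plain Python (semantics only, not Doris code):

```python
# k1 = ifnull(tmpk1, 0): substitute a default when the value is NULL.
def ifnull(value, default):
    return default if value is None else value

print(ifnull(None, 0))  # 0 -- the \N row's k1 becomes 0
print(ifnull(3, 0))     # 3 -- non-null values pass through unchanged
```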

    Column Filtering

    After column mapping and transformation, filter conditions let us exclude data that we do not want to import into Doris. Below is an example. Suppose the source file has four columns with the following content (the header column names are only for convenience; there is no actual header):
    Column 1 | Column 2 | Column 3  | Column 4
    1        | 100      | beijing   | 1.1
    2        | 200      | shanghai  | 1.2
    3        | 300      | guangzhou | 1.3
    \N       | 400      | chongqing | 1.4
    1. Filtering with the default column mapping and transformation. Assume the table has four columns: k1, k2, k3, k4. With the default column mapping and transformation, we can define filter conditions directly. To import only the rows whose fourth source column is greater than 1.2, the filter condition is as follows:
    where k4 > 1.2
    The data in the final table is as follows:
    k1   | k2  | k3        | k4
    3    | 300 | guangzhou | 1.3
    null | 400 | chongqing | 1.4
    By default, Doris will map columns in sequence, so the fourth column in the source file is automatically mapped to the k4 column in the table.
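    This default-mapping filter can be sketched in plain Python (semantics only, not Doris code):

```python
# where k4 > 1.2, with default sequential mapping: source columns 1..4
# map to table columns k1..k4 in order.
source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),   # \N reads as NULL/None
]

loaded = [(k1, k2, k3, k4) for k1, k2, k3, k4 in source_rows if k4 > 1.2]
print(loaded)  # [(3, 300, 'guangzhou', 1.3), (None, 400, 'chongqing', 1.4)]
```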
    2. Filtering column-transformed data. Assume the table has four columns: k1, k2, k3, k4. In the column transformation example, we converted province names into IDs. Here we want to filter out the rows whose ID is 3. The transformation and filter conditions are as follows:
    (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
    where k3 != 3
    The data in the final table is as follows:
    k1   | k2  | k3 | k4
    1    | 100 | 1  | 1.1
    2    | 200 | 2  | 1.2
    null | 400 | 4  | 1.4
    Note that the values used in the filter are the final column values after mapping and transformation, not the original source values.
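    That ordering can be sketched in plain Python (semantics only, not Doris code): the filter runs on the transformed k3, not on the source string.

```python
# Mapping (k1, k2, tmpk3, k4), case-when transformation to k3,
# then post-filter: where k3 != 3.
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),
]

loaded = []
for k1, k2, tmpk3, k4 in source_rows:    # mapping (k1, k2, tmpk3, k4)
    k3 = REGION_IDS.get(tmpk3)           # case ... when ... transformation
    if k3 != 3:                          # where k3 != 3 (post-transform value)
        loaded.append((k1, k2, k3, k4))

print(loaded)  # [(1, 100, 1, 1.1), (2, 200, 2, 1.2), (None, 400, 4, 1.4)]
```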
    3. Multi-condition filtering. Assume the table has four columns: k1, k2, k3, k4. We want to filter out rows whose k1 is null, and also filter out rows whose k4 is less than 1.2 (the region ID transformation from the previous example still applies). Because the WHERE clause specifies the rows to keep, the filter condition is as follows:
    where k1 is not null and k4 >= 1.2
    The data in the final table is as follows:
    k1 | k2  | k3 | k4
    2  | 200 | 2  | 1.2
    3  | 300 | 3  | 1.3
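    The multi-condition filter can be sketched in plain Python (semantics only, not Doris code; the region ID transformation is carried over from the earlier example):

```python
# where k1 is not null and k4 >= 1.2, applied after transformation.
REGION_IDS = {"beijing": 1, "shanghai": 2, "guangzhou": 3, "chongqing": 4}

source_rows = [
    (1, 100, "beijing", 1.1),
    (2, 200, "shanghai", 1.2),
    (3, 300, "guangzhou", 1.3),
    (None, 400, "chongqing", 1.4),
]

loaded = [(k1, k2, REGION_IDS.get(tmpk3), k4)
          for k1, k2, tmpk3, k4 in source_rows
          if k1 is not None and k4 >= 1.2]   # both conditions must hold
print(loaded)  # [(2, 200, 2, 1.2), (3, 300, 3, 1.3)]
```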

    Data Quality Problems and Filtering Thresholds

    Rows processed by an import job can be divided into the following three types:
    1. Filtered Rows: rows filtered out due to unqualified data quality, such as type errors, precision errors, strings exceeding the column length, a mismatched number of file columns, or rows for which no corresponding partition exists.
    2. Unselected Rows: rows filtered out by the PRECEDING FILTER or WHERE conditions.
    3. Loaded Rows: rows that are imported correctly.
    Doris import jobs let users set a maximum error rate (max_filter_ratio). If the error rate of the imported data is below this threshold, the erroneous rows are ignored and the remaining correct rows are imported. The error rate is calculated as follows:
    #Filtered Rows / (#Filtered Rows + #Loaded Rows)
    In other words, Unselected Rows are not counted toward the error rate.
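    The threshold check can be sketched in plain Python (an illustration of the formula only; the row counts are hypothetical):

```python
# error_rate = filtered_rows / (filtered_rows + loaded_rows);
# unselected rows are excluded from the calculation entirely.
def exceeds_filter_ratio(filtered_rows, loaded_rows, max_filter_ratio):
    if filtered_rows + loaded_rows == 0:
        return False
    error_rate = filtered_rows / (filtered_rows + loaded_rows)
    return error_rate > max_filter_ratio

# 5 bad rows out of 100 counted rows, with max_filter_ratio = 0.1:
print(exceeds_filter_ratio(5, 95, 0.1))   # False -> the job can succeed
print(exceeds_filter_ratio(15, 85, 0.1))  # True  -> the job fails
```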