tencent cloud

Feedback

Importing with DataX

Last updated: 2024-06-27 11:05:02

    About DataX

    As a universal offline data import tool open-sourced by Alibaba, DataX is widely used in the industry. DataX enables efficient data synchronization features between various heterogeneous data sources such as MySQL, Oracle, SQLServer, Postgre, HDFS, Hive, ADS, HBase, TableStore(OTS), MaxCompute(ODPS), Hologres, and DRDS, etc. Importing data into Doris using DataX requires the DataX doriswriter plugin. This plugin uses the Stream Load feature of Doris for data import and works together with DataX service.

    User Manual

    For the code for the DataX doriswriter plugin, see here. This includes the plugin code and the development environment of the DataX project. The doriswriter plugin depends on some modules in the DataX code, which are not in the official Maven repository. Thus, when developing the doriswriter plugin, it is necessary to download the complete DataX code library to compile and develop the plugin.

    Directory Structure

    doriswriter/

    This directory is the code directory of the doriswriter plugin. All the code in this directory is hosted in the Apache Doris repository. For the doriswriter plugin documentation, see:doriswriter/doc.

    init-env.sh

    This script is primarily used to set up the DataX development environment, which includes the following operations:
    1. Clone the DataX code library to local.
    2. Connect doriswriter/ directory to DataX/doriswriter directory through soft link.
    3. Add the <module>doriswriter</module> module to the DataX/pom.xml file.
    4. Change the httpclient version in the DataX/core/pom.xml file from 4.5 to 4.5.13.
    Note
    Httpclient v4.5 has a bug in handling 307 redirects.
    After running this script, developers may start developing or compiling in the DataX/ directory. Due to the soft link, any modifications in the DataX/doriswriter directory will be reflected in the doriswriter/ directory, which facilitates code submission for developers.

    Compilation

    1. Run init-env.sh.
    2. Modify the code in DataX/doriswriter as needed.
    3. Compile doriswriter:
    3.1 Compile the doriswriter plugin separately:
    mvn clean install -pl plugin-rdbms-util,doriswriter -DskipTests
    3.2 Compile the entire DataX project:
    mvn package assembly:assembly -Dmaven.test.skip=true
    The output is in target/datax/datax/.
    Note
    The hdfsreader, hdfswriter, and oscarwriter plugins require additional jar packages. If you don't need these plugins, you can remove their modules from DataX/pom.xml.
    3.3 Compilation error If you encounter the following compilation error:
    Could not find artifact com.alibaba.datax:datax-all:pom:0.0.1-SNAPSHOT ...
    You can try the following ways to solve it:
    3.3.2 After decompression, copy the directory alibaba/datax/ obtained to the corresponding .m2/repository/com/alibaba/ of the maven in use.
    3.3.3 Try compiling again.
    3.3.4 Submit modifications as needed.

    Sample code

    Stream reads data and imports it into Doris

    For instructions on how to use this sample plugin, please see here.

    Mysql reads and imports data into Doris

    1. Mysql table structure
    CREATE TABLE t_test(
    idbigint(30) NOT NULL,
    order_code varchar(30) DEFAULT NULL COMMENT '',
    line_code varchar(30) DEFAULT NULL COMMENT '',
    remark varchar(30) DEFAULT NULL COMMENT '',
    unit_no varchar(30) DEFAULT NULL COMMENT '',
    unit_name varchar(30) DEFAULT NULL COMMENT '',
    price decimal(12,2) DEFAULT NULL COMMENT '',
    PRIMARY KEY(id) USING BTREE
    )ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC COMMENT='';
    2. Doris table structure
    CREATE TABLE ods_t_test (
    idbigint(30) NOT NULL,
    order_code varchar(30) DEFAULT NULL COMMENT '',
    line_code varchar(30) DEFAULT NULL COMMENT '',
    remark varchar(30) DEFAULT NULL COMMENT '',
    unit_no varchar(30) DEFAULT NULL COMMENT '',
    unit_name varchar(30) DEFAULT NULL COMMENT '',
    price decimal(12,2) DEFAULT NULL COMMENT ''
    )ENGINE=OLAP
    UNIQUE KEY(id, order_code)
    DISTRIBUTED BY HASH(order_code) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "in_memory" = "false",
    "storage_format" = "V2"
    );
    3. Create datax scriptimport_t_test.json
    {
    "job": {
    "setting": {
    "speed": {
    "channel": 1
    },
    "errorLimit": {
    "record": 0,
    "percentage": 0
    }
    },
    "content": [
    {
    "reader": {
    "name": "mysqlreader",
    "parameter": {
    "username": "xxx",
    "password": "xxx",
    "column": ["id","order_code","line_code","remark","unit_no","unit_name","price"],
    "connection": [ { "table": [ "t_test" ], "jdbcUrl": [ "jdbc:mysql://10.10.10.1:3306/demo" ] } ] }
    },
    "writer": {
    "name": "doriswriter",
    "parameter": {
    "feLoadUrl": ["127.0.0.1:8030","127.0.0.2:8030"],
    "beLoadUrl": ["127.0.0.3:8040","127.0.0.4:8040","127.0.0.5:8040"],
    "jdbcUrl": "jdbc:mysql://127.0.0.1:9030/",
    "database": "demo",
    "table": "ods_t_test",
    "column": ["id","order_code","line_code","remark","unit_no","unit_name","price"],
    "username": "xxx",
    "password": "xxx",
    "postSql": [],
    "preSql": [],
    "loadProps": {
    },
    "maxBatchRows" : 300000,
    "maxBatchByteSize" : 20971520
    }
    }
    }
    ]
    }
    }
    4. Execute datax task: python datax.py import_t_test.json. For specific usage, see DataX official website.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support