Exporting via the EXPORT Statement

Last updated: 2024-06-27 11:06:34
    Data Export is a feature provided by Doris for exporting data. It allows the data of a specified table or partition to be exported, in text format, through a Broker process to remote storage such as HDFS or BOS. This document introduces the basics, usage, best practices, and precautions of Export.

    Definitions

    FE: Frontend, the frontend node of Doris, responsible for Metadata management and request access.
    BE: Backend, the backend node of Doris, responsible for query execution and data storage.
    Broker: A process through which Doris can operate on files in remote storage.
    Tablet: A data shard; a table is divided into multiple tablets.

    Principles

    After a user submits an Export Job, Doris will calculate all the tablets involved in the job, group them, and generate a special query plan for each group. Each query plan reads the data in its tablets and then either writes the data through the Broker to a specified path on remote storage, or exports it directly to remote storage that supports the S3 protocol. The overall dispatch process works as follows:
    +--------+
    | Client |
    +---+----+
        |  1. Submit Job
        |
    +---v--------------------+
    | FE                     |
    |                        |
    | +-------------------+  |
    | | ExportPendingTask |  |
    | +-------------------+  |
    |           | 2. Generate Tasks
    | +--------------------+ |
    | | ExportExportingTask| |
    | +--------------------+ |
    |           |            |
    | +-----------+          |    +----+   +------+   +---------+
    | | QueryPlan +---------------> BE +--->Broker+--->         |
    | +-----------+          |    +----+   +------+   | Remote  |
    | +-----------+          |    +----+   +------+   | Storage |
    | | QueryPlan +---------------> BE +--->Broker+--->         |
    | +-----------+          |    +----+   +------+   +---------+
    +------------------------+         3. Execute Tasks
    
    1. The user submits an Export Job to the FE.
    2. The Export scheduler of FE executes an Export Job in two stages:
    PENDING: FE generates an ExportPendingTask, sends a snapshot command to the BEs to take snapshots of all involved tablets, and generates several query plans.
    EXPORTING: FE generates an ExportExportingTask and starts executing the query plans.

    Query Plan Splitting

    Each Export Job generates multiple query plans, each responsible for scanning a portion of the tablets. The number of tablets scanned by each plan is specified by the FE configuration parameter export_tablet_num_per_task, which defaults to 5. For example, 100 tablets yield 20 query plans. Users can also specify this value with the job property tablet_num_per_task when submitting a job. The multiple query plans of one job are executed sequentially.

    Query Plan Execution

    A query plan scans multiple tablets, organizes the data it reads into rows, batches them every 1024 rows, and writes them to remote storage through the Broker.
    If a query plan encounters an error, it automatically retries as a whole 3 times. If it still fails after 3 retries, the entire job fails.
    Doris will first create a temporary directory named __doris_export_tmp_12345 in the specified remote storage path (where 12345 is the job id). The exported data is first written to this temporary directory. Each query plan generates one file, with a name such as export-data-c69fcf2b6db5420f-a96b94c1ff8bccef-1561453713822, where c69fcf2b6db5420f-a96b94c1ff8bccef is the query id of the query plan and 1561453713822 is the timestamp at which the file was created.
    When all data has been exported, Doris renames these files to the user-specified path.
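    For illustration (the combination below is hypothetical, reusing the job id and file name mentioned above): a job with id 12345 exporting to hdfs://host/path/to/export/ first writes files such as hdfs://host/path/to/export/__doris_export_tmp_12345/export-data-c69fcf2b6db5420f-a96b94c1ff8bccef-1561453713822, and only after all query plans succeed are the files renamed into hdfs://host/path/to/export/.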

    Broker Parameters

    Export requires the Broker process to access remote storage, and different Brokers require different parameters.

    Starting Export

    Exporting to HDFS

    EXPORT TABLE db1.tbl1
    PARTITION (p1,p2)
    [WHERE [expr]]
    TO "hdfs://host/path/to/export/"
    PROPERTIES
    (
        "label" = "mylabel",
        "column_separator" = ",",
        "columns" = "col1,col2",
        "exec_mem_limit" = "2147483648",
        "timeout" = "3600"
    )
    WITH BROKER "hdfs"
    (
        "username" = "user",
        "password" = "passwd"
    );
    label: The identifier of this export job. You can use this identifier to view the job status later.
    column_separator: Column separator. The default is \t. Invisible characters are supported, such as '\x07'.
    columns: The columns to export, comma-separated. If this parameter is not specified, all columns of the table are exported by default.
    line_delimiter: Row delimiter. The default is \n. Invisible characters are supported, such as '\x07'.
    exec_mem_limit: The memory usage limit of a single query plan on a single BE in the Export Job. The default is 2 GB. The unit is bytes.
    timeout: Job timeout. The default is 2 hours. The unit is seconds.
    tablet_num_per_task: The maximum number of tablets allocated to one query plan. The default is 5.
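    The [WHERE [expr]] clause in the syntax above is optional. As a minimal sketch (the filter column k1 and its value are hypothetical), an export that writes only the rows matching a filter might look like this:
    EXPORT TABLE db1.tbl1
    PARTITION (p1,p2)
    WHERE k1 > 100
    TO "hdfs://host/path/to/export/"
    PROPERTIES
    (
        "label" = "mylabel_filtered",
        "column_separator" = ","
    )
    WITH BROKER "hdfs"
    (
        "username" = "user",
        "password" = "passwd"
    );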

    Exporting to Cloud Object Storage

    Data can be exported directly to cloud object storage, without going through a Broker.
    EXPORT TABLE db.table
    TO "s3://your_bucket/xx_path"
    PROPERTIES
    (
        "label" = "your_label",
        "line_delimiter" = "\n",
        "column_separator" = "\001",
        "tablet_num_per_task" = "10"
    )
    WITH S3
    (
        "AWS_ENDPOINT" = "http://cos.ap-beijing.myqcloud.com",
        "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
        "AWS_SECRET_KEY" = "AWS_SECRET_KEY",
        "AWS_REGION" = "AWS_REGION"
    );
    AWS_ACCESS_KEY/AWS_SECRET_KEY: The keys used to access the object storage API.
    AWS_ENDPOINT: The endpoint (domain name) used to access the object storage service.
    AWS_REGION: The region in which the bucket is located.

    Viewing the Export Status

    After submitting the job, you can query the status of the export job using the SHOW EXPORT command. Example results are as follows:
    mysql> show EXPORT\G;
    *************************** 1. row ***************************
    JobId: 14008
    State: FINISHED
    Progress: 100%
    TaskInfo: {"partitions":["*"],"exec mem limit":2147483648,"column separator":",","line delimiter":"\n","tablet num":1,"broker":"hdfs","coord num":1,"db":"default_cluster:db1","tbl":"tbl3"}
    Path: hdfs://host/path/to/export/
    CreateTime: 2019-06-25 17:08:24
    StartTime: 2019-06-25 17:08:28
    FinishTime: 2019-06-25 17:08:34
    Timeout: 3600
    ErrorMsg: NULL
    1 row in set (0.01 sec)
    JobId: The unique ID of the Job.
    Label: The identifier of the Job.
    State: The state of the Job:
    PENDING: The Job is waiting to be scheduled.
    EXPORTING: Data is being exported.
    FINISHED: The Job was successful.
    CANCELLED: The Job failed.
    Progress: The progress of the Job. This progress is counted on the basis of query plans. Assuming there are 10 query plans in total, and 3 have been completed, the progress would be 30%.
    TaskInfo: Job information displayed in Json format:
    db: Database name.
    tbl: Table name.
    partitions: Specifies the partition to export. * implies all partitions.
    exec mem limit: Memory usage limit for a query plan. The unit is bytes.
    column separator: Column separator for export file.
    line delimiter: Row separator for export file.
    tablet num: Total number of tablets involved.
    broker: Name of the broker in use.
    coord num: Number of query plans.
    Path: Export path on the remote storage.
    CreateTime/StartTime/FinishTime: Creation time, start scheduling time, and finish time of the job.
    Timeout: Job timeout, in seconds. The timeout is counted from CreateTime.
    ErrorMsg: If the job encounters an error, it will show the cause of the error here.
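    SHOW EXPORT also accepts filters (a sketch; the exact filter support may vary by Doris version), for example to look up a job by the label set when it was submitted:
    SHOW EXPORT FROM db1 WHERE LABEL = "mylabel";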

    Best Practice

    Splitting the Query Plan

    The number of query plans that need to be executed for an export job depends on the total number of tablets and on how many tablets are allocated to a single query plan. Multiple query plans are executed serially, so assigning more tablets to one query plan reduces the job's runtime. However, if a query plan fails (for example, because an RPC call to the Broker fails, or remote storage fluctuates), too many tablets increase the retry cost of that query plan. Therefore, the number of query plans and the number of tablets each query plan scans need to be balanced so as to trade off execution time against success rate. It is generally recommended that the amount of data scanned by one query plan stay within 3-5 GB (the size and number of a table's tablets can be viewed with the SHOW TABLET FROM tbl_name; command), as in the sketch below.
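    As an illustrative calculation (the table and numbers are hypothetical): if SHOW TABLET FROM tbl1; shows 200 tablets of roughly 1 GB each, setting tablet_num_per_task to 4 keeps each query plan's scan at about 4 GB, within the recommended 3-5 GB range:
    EXPORT TABLE db1.tbl1
    TO "hdfs://host/path/to/export/"
    PROPERTIES
    (
        "label" = "mylabel_sized",
        "tablet_num_per_task" = "4"
    )
    WITH BROKER "hdfs"
    (
        "username" = "user",
        "password" = "passwd"
    );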

    exec_mem_limit

    Generally, the query plan of an Export Job has only two parts, scan and export, and does not involve computation logic that requires much memory. Therefore, the default memory limit of 2 GB usually meets the requirement. However, in some scenarios, for example when a query plan needs to scan too many tablets on the same BE, or when a tablet has too many data versions, memory may be insufficient. In such cases, use this parameter to set a larger memory limit, such as 4 GB or 8 GB.
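    A minimal sketch of raising the limit to 8 GB (8589934592 bytes), reusing the hypothetical HDFS example from above:
    EXPORT TABLE db1.tbl1
    TO "hdfs://host/path/to/export/"
    PROPERTIES
    (
        "label" = "mylabel_bigmem",
        "exec_mem_limit" = "8589934592"
    )
    WITH BROKER "hdfs"
    (
        "username" = "user",
        "password" = "passwd"
    );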

    Exporting to Tencent Cloud COS

    Accessing COS Domain Name

    COS buckets have a region attribute, and the region ID must be included when specifying the COS upload address. The region ID can be found in Regions and Access Endpoints, or on the bucket list page of the COS console.
    
    
    
    When using the EXPORT statement, the AWS_ENDPOINT parameter must be specified. This is the domain name of the COS service, formatted as:
    https://cos.[region_id].myqcloud.com
    where region_id is the region in which the bucket is located, for example ap-guangzhou. The other fields remain unchanged.

    Obtaining the COS SecretId and SecretKey

    The COS SecretId corresponds to the S3 protocol's ACCESS_KEY, and the COS SecretKey corresponds to the S3 protocol's SECRET_KEY. They can be obtained from Tencent Cloud API Key Management. If keys already exist, they can be used directly; otherwise, create a new key.
    
    
    

    Starting the Export Task

    Based on the information obtained above, an EXPORT task can be created. The following example exports the 62 GB lineitem_ep table, which contains 600 million rows of data.
    EXPORT TABLE lineitem TO "s3://doris-1301087413/doris-export/export_test"
    WITH S3 (
        "AWS_ENDPOINT" = "https://cos.ap-guangzhou.myqcloud.com",
        "AWS_ACCESS_KEY" = "xxxSecretId",
        "AWS_SECRET_KEY" = "xxxSecretKey",
        "AWS_REGION" = "ap-guangzhou"
    );
    In s3://doris-1301087413/doris-export/export_test:
    s3 is the fixed prefix, indicating to Doris that the export target uses the S3 protocol.
    doris-1301087413 is the COS bucket.
    /doris-export/export_test is the designated export path. If the path does not exist, Doris will create it automatically.
    "AWS_ENDPOINT" = "https://cos.ap-guangzhou.myqcloud.com" is the COS server domain name, filled in according to the region of the bucket.
    "AWS_REGION" = "ap-guangzhou" is the region of the bucket.

    Viewing Export Task Progress and Information

    Using the show export; command, you can view the progress and error information of the export task.
    
    
    
    Pay attention to the 'State' and 'Progress' fields. If the task fails, the 'ErrorMsg' field will have the corresponding error information.

    Importing COS Data Files into DORIS

    Import statement:
    LOAD LABEL test_db.example_label_4
    (
        DATA INFILE("s3://doris-1301087413/doris-export/export_test/*")
        INTO TABLE test_tb
        COLUMNS TERMINATED BY "\t"
    )
    WITH S3
    (
        "AWS_ENDPOINT" = "https://cos.ap-guangzhou.myqcloud.com",
        "AWS_ACCESS_KEY" = "xxxxx",
        "AWS_SECRET_KEY" = "xxxx",
        "AWS_REGION" = "ap-guangzhou"
    )
    PROPERTIES
    (
        "timeout" = "7600"
    );
    Where:
    In LABEL test_db.example_label_4, test_db is the name of the database to import into, and example_label_4 is the label of this import, uniquely identifying one import task. Its status can be checked as shown below.
    In DATA INFILE("s3://doris-1301087413/doris-export/export_test/*"), doris-1301087413/doris-export/export_test/ is the import file path, and * indicates that all files under this path are imported; a specific file name can also be given.
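    After submitting the import, its status can be checked with SHOW LOAD (a sketch filtering by the label used above; filter support may vary by Doris version):
    SHOW LOAD FROM test_db WHERE LABEL = "example_label_4"\G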

    Exporting to Tencent Cloud EMR's HDFS

    Network Confirmation

    Confirm that the Doris cluster and the EMR cluster are in the same VPC network.

    Creating the Export Task

    EXPORT TABLE orders TO "hdfs://hdfs_ip:hdfs_port/your_path"
    PROPERTIES (
        "column_separator" = "\t",
        "line_delimiter" = "\n"
    )
    WITH BROKER "Broker_Doris" (
        "username" = "-",
        "password" = "-"
    );
    Where:
    Broker_Doris: The default Broker name in Tencent Cloud Doris; it does not need to be modified.
    username: The username to access HDFS.
    password: The password to access HDFS.

    Viewing the Export Task Status

    Using the show export; command, you can view the progress and error information of the export task.
    
    
    
    Pay attention to the 'State' and 'Progress' fields. If the task fails, the 'ErrorMsg' field will have the corresponding error information.

    Importing Data into Doris from Tencent Cloud EMR's HDFS

    Import statement:
    LOAD LABEL tpchdb.load_orders_recover
    (
        DATA INFILE("hdfs://hdfs_ip:hdfs_port/your_path/*")
        INTO TABLE orders_recover
        COLUMNS TERMINATED BY "\t"
    )
    WITH BROKER "Broker_Doris"
    (
        "username" = "-",
        "password" = "-"
    )
    PROPERTIES
    (
        "timeout" = "1200",
        "max_filter_ratio" = "0.1"
    );
    Where:
    In LABEL tpchdb.load_orders_recover, tpchdb is the name of the database to import into, and load_orders_recover is the label of this import.
    In DATA INFILE("hdfs://hdfs_ip:hdfs_port/your_path/*"), hdfs://hdfs_ip:hdfs_port/your_path/ is the import file path, and * indicates that all files under this path are imported; a specific file name can also be given.

    FAQs

    Import error: Scan byte per file scanner exceeds limit:xxxxx

    Findings: The show load command can be used to view the state of the import task; the ErrorMsg field shows the error message.
    
    
    
    Reason: The size of a single import file exceeds the maximum value allowed by the cluster configuration.
    Resolution: Set the FE configuration max_broker_concurrency to the number of BEs (in fe.conf or as shown below). The amount of data handled by a single BE in the current import task = original file size / max_broker_concurrency, and max_bytes_per_broker_scanner must be greater than or equal to this amount.
    Configuration Method:
    ADMIN SET FRONTEND CONFIG ("max_bytes_per_broker_scanner" = "52949672960");
    ADMIN SET FRONTEND CONFIG ("max_broker_concurrency" = "3");
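    As an illustrative calculation (the file size and BE count here are hypothetical): with a 120 GB source file and max_broker_concurrency = 3, each BE handles about 40 GB, so max_bytes_per_broker_scanner must be at least 42949672960 bytes (40 GB).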

    Notes

    It is not recommended to export a large amount of data at once. The recommended maximum export data volume for an Export Job is a few dozen GB. Oversized exports lead to more garbage files and higher retry costs.
    If the table data volume is too large, it is recommended to export partition by partition.
    During the running of an Export Job, if FE restarts or master switches occur, the Export Job will fail and users need to resubmit.
    If the Export Job fails, the temporary directory __doris_export_tmp_xxx generated at the remote storage and the already generated files will not be deleted, users need to manually delete them.
    If the Export Job runs successfully, the __doris_export_tmp_xxx directory generated at the remote storage, depending on the file system semantics of the remote storage, may be retained or may be erased. For example, in Baidu Cloud Object Storage (BOS), by removing the last file out of a directory through a rename operation, the directory will also be deleted. If the directory is not deleted, users can delete it manually.
    When the Export is completed (successful or failed), if FE restarts or master switches occur, some information of the Job displayed by SHOW EXPORT will be lost and cannot be viewed.
    Export Job will only export data from the Base table, not the data from the Rollup Index.
    The Export Job will scan data, occupy IO resources, and may affect the system's query latency.

    Relevant Configuration

    FE

    export_checker_interval_second: The scheduling interval of the Export Job scheduler. The default is 5 seconds. Restarting FE is required for this parameter to take effect.
    export_running_job_num_limit: The limit for the number of running Export Jobs. If exceeded, the Job will wait and be in PENDING status. The default limit is 5, which can be adjusted during runtime.
    export_task_default_timeout_second: The default timeout for the Export Job. The default is 2 hours. It can be adjusted during runtime.
    export_tablet_num_per_task: The maximum number of tablets a query plan is responsible for. Default value is 5.
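    For the parameters above that can be adjusted at runtime, the ADMIN SET FRONTEND CONFIG statement shown in the FAQs applies; for example (the value is hypothetical):
    ADMIN SET FRONTEND CONFIG ("export_running_job_num_limit" = "10");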