Export Query Results

Last updated: 2024-06-27 11:07:10
    This document explains how to use the OUTFILE command to export query results.

    Syntax

    The SELECT INTO OUTFILE statement exports query results to a file. Currently, it supports exporting to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud) through the Broker process, through the S3 protocol, or directly through the HDFS protocol. The syntax is as follows:
    query_stmt
    INTO OUTFILE "file_path"
    [format_as]
    [properties]
    file_path
    file_path points to the path and file name prefix on the remote storage, such as hdfs://path/to/my_file_. The final file name is composed of my_file_, a file sequence number, and the file format suffix. The sequence number starts from 0, and the count equals the number of files the result is split into, for example:
    my_file_abcdefg_0.csv
    my_file_abcdefg_1.csv
    my_file_abcdefg_2.csv
    [format_as]
    FORMAT AS CSV
    Specifies the export format. The default is CSV.
    [properties]
    Specifies related properties. Currently, exporting through the Broker process, the HDFS protocol, or the S3 protocol is supported.
    Broker-related properties must be prefixed with broker..
    HDFS-related properties must be prefixed with hdfs..
    For the S3 protocol, the configuration properties can be specified directly.
    ("broker.prop_key" = "broker.prop_val", ...)
    or
    ("hdfs.fs.defaultFS" = "xxx", "hdfs.hdfs_user" = "xxx")
    or
    ("AWS_ENDPOINT" = "xxx", ...)
    Other properties:
    ("key1" = "val1", "key2" = "val2", ...)
    The following properties are currently supported:
    column_separator: Column separator, only applicable to the CSV format. The default is \t.
    line_delimiter: Line delimiter, only applicable to the CSV format. The default is \n.
    max_file_size: Maximum size of a single file. The default is 1GB, and the valid range is 5MB to 2GB. Files exceeding this size will be split.
    schema: PARQUET file schema information, only applicable to the PARQUET format. When exporting in the PARQUET format, you must specify the schema.
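    Putting these pieces together, here is a minimal sketch of a complete statement using the S3 protocol; the table name, bucket, and credential values are placeholders:
    SELECT k1, k2 FROM example_tbl
    INTO OUTFILE "s3://example_bucket/export/result_"
    FORMAT AS CSV
    PROPERTIES
    (
    "AWS_ENDPOINT" = "http://cos.<REGION>.myqcloud.com",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxxx",
    "AWS_REGION" = "<REGION>",
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "512MB"
    );
    With this statement, a result set under 512MB produces a single file such as result_0.csv; larger result sets are split into result_0.csv, result_1.csv, and so on.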

    Concurrent Export

    By default, the export of query result sets is non-concurrent, that is, single-point export. If users wish to export query result sets concurrently, the following conditions must be met:
    1. Concurrent export is enabled through the session variable enable_parallel_outfile: set enable_parallel_outfile = true;
    2. The export goes through the S3 or HDFS protocol, not through a broker.
    3. The query meets the requirements for concurrent export; for example, the top level does not include a single-point node such as sort. (Examples are given later to illustrate which queries cannot export result sets concurrently.)
    Queries that satisfy all three conditions trigger concurrent export, and the degree of concurrency is be_instance_num * parallel_fragment_exec_instance_num, as sketched below.
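    For example, a minimal sketch of the session settings involved (the value 2 is illustrative; parallel_fragment_exec_instance_num is assumed to default to 1):
    -- enable concurrent export for this session
    set enable_parallel_outfile = true;
    -- instances per BE; with N BE nodes, the degree of concurrency becomes N * 2
    set parallel_fragment_exec_instance_num = 2;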

    How to Verify Whether a Result Set Has Been Exported Concurrently

    After enabling concurrent export through the session variable, you can verify whether the current query can be exported concurrently using the following method:
    explain select xxx from xxx where xxx into outfile "s3://xxx" format as csv properties ("AWS_ENDPOINT" = "xxx", ...);
    When you EXPLAIN the query, Doris returns the plan for it. If RESULT FILE SINK appears in PLAN FRAGMENT 1, the result set will be exported concurrently. If RESULT FILE SINK appears in PLAN FRAGMENT 0, the current query cannot be exported concurrently (the query does not satisfy all three conditions of concurrent export).
    Concurrent Export Planning Example:
    +-----------------------------------------------------------------------------+
    | Explain String                                                              |
    +-----------------------------------------------------------------------------+
    | PLAN FRAGMENT 0                                                             |
    |  OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5>                     |
    |   PARTITION: UNPARTITIONED                                                  |
    |                                                                             |
    |   RESULT SINK                                                               |
    |                                                                             |
    |   1:EXCHANGE                                                                |
    |                                                                             |
    | PLAN FRAGMENT 1                                                             |
    |  OUTPUT EXPRS:`k1` + `k2`                                                   |
    |   PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1`   |
    |                                                                             |
    |   RESULT FILE SINK                                                          |
    |   FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_                        |
    |   STORAGE TYPE: S3                                                          |
    |                                                                             |
    |   0:OlapScanNode                                                            |
    |     TABLE: multi_tablet                                                     |
    +-----------------------------------------------------------------------------+

    Usage Examples

    1. Example 1: Export using a broker. Export the simple query result to the file hdfs://path/to/result.txt. Specify the export format as CSV. Use Broker_Doris and set the Kerberos authentication information. Specify the column separator as , and the line delimiter as \n.
    SELECT * FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
    "broker.name" = "Broker_Doris",
    "broker.hadoop.security.authentication" = "kerberos",
    "broker.kerberos_principal" = "doris@YOUR.COM",
    "broker.kerberos_keytab" = "/home/doris/my.keytab",
    "column_separator" = ",",
    "line_delimiter" = "\\n",
    "max_file_size" = "100MB"
    );
    If the final file generated is no larger than 100MB, it will be: result_0.csv. If it is larger than 100MB, it may be: result_0.csv, result_1.csv, ....
    2. Example 2: Export the simple query result to the file hdfs://path/to/result.parquet. Specify the export format as PARQUET. Use Broker_Doris and set the Kerberos authentication information.
    SELECT c1, c2, c3 FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS PARQUET
    PROPERTIES
    (
    "broker.name" = "Broker_Doris",
    "broker.hadoop.security.authentication" = "kerberos",
    "broker.kerberos_principal" = "doris@YOUR.COM",
    "broker.kerberos_keytab" = "/home/doris/my.keytab",
    "schema"="required,int32,c1;required,byte_array,c2;required,byte_array,c2"
    );
    Exporting query results to a parquet file requires explicitly specifying the schema.
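    The schema string describes each exported column as repetition_type,data_type,column_name, with column descriptions separated by semicolons. A minimal sketch with hypothetical column names, using only the type names shown in the examples:
    "schema" = "required,int32,order_id;required,byte_array,remark"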
    3. Example 3: Export the query result of the CTE statement to the file hdfs://path/to/result.txt. The default export format is CSV. Use Broker_Doris and set the HDFS high availability information. Use the default row and column separators.
    WITH
    x1 AS
    (SELECT k1, k2 FROM tbl1),
    x2 AS
    (SELECT k3 FROM tbl2)
    SELECT k1 FROM x1 UNION SELECT k3 FROM x2
    INTO OUTFILE "hdfs://path/to/result_"
    PROPERTIES
    (
    "broker.name" = "Broker_Doris",
    "broker.username"="user",
    "broker.password"="passwd",
    "broker.dfs.nameservices" = "my_ha",
    "broker.dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
    "broker.dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
    "broker.dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
    "broker.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    );
    If the final file generated is no larger than 1GB, it will be: result_0.csv. If it is larger than 1GB, it may be: result_0.csv, result_1.csv, ....
    4. Example 4: Export the query results of the UNION statement to the file bos://bucket/result.txt. Specify the export format as PARQUET. Use Broker_Doris and set the BOS connection information. The PARQUET format does not need a column separator, and the schema must be specified explicitly.
    SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
    INTO OUTFILE "bos://bucket/result_"
    FORMAT AS PARQUET
    PROPERTIES
    (
    "broker.name" = "Broker_Doris",
    "broker.bos_endpoint" = "http://bj.bcebos.com",
    "broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy",
    "schema"="required,int32,k1;required,byte_array,k2"
    );
    5. Example 5: Export the query results of the SELECT statement to the file cosn://my_bucket/export/my_file_. Specify the export format as CSV. After the export is completed, an identification file will be generated.
    select k1,k2,v1 from tbl1 limit 100000
    into outfile "cosn://my_bucket/export/my_file_"
    FORMAT AS CSV
    PROPERTIES
    (
    "broker.name" = "Broker_Doris",
    "broker.fs.cosn.userinfo.secretId" = "xx", "broker.fs.cosn.userinfo.secretKey" = "xx", "broker.fs.cosn.bucket.endpoint_suffix" = "cos.<REGION>.myqcloud.com",
    "column_separator" = ",",
    "line_delimiter" = "\\n",
    "max_file_size" = "1024MB",
    "success_file_name" = "SUCCESS"
    );
    If the final file is no larger than 1GB, it will be: my_file_0.csv. If it is larger than 1GB, it may be: my_file_0.csv, my_file_1.csv, ....
    Paths that do not exist will be automatically created.
    Confirm the access key, secret key, and endpoint values with COS support. In particular, the endpoint value does not need to include the bucket_name.
    6. Example 6: Export using the S3 protocol, and enable concurrent export.
    set enable_parallel_outfile = true;
    select k1 from tb1 limit 1000
    into outfile "s3://my_bucket/export/my_file_"
    format as csv
    properties
    (
    "AWS_ENDPOINT" = "http://cos.<REGION>.mycloud.com",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "<REGION>"
    );
    The prefix of the final generated file is my_file_{fragment_instance_id}_.
    7. Example 7: Export using the S3 protocol, with the concurrent export session variable enabled.
    set enable_parallel_outfile = true;
    select k1 from tb1 order by k1 limit 1000
    into outfile "s3://my_bucket/export/my_file_"
    format as csv
    properties
    (
    "AWS_ENDPOINT" = "http://cos.<REGION>.mycloud.com",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "<REGION>"
    );
    However, since the query statement has a top-level sort node, this query cannot be exported concurrently even though the concurrent export session variable is enabled.
    8. Example 8: Export using the HDFS protocol. Export a simple query result to the file hdfs://path/to/result.txt. Specify the export format as CSV and set the Kerberos authentication information.
    SELECT * FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
    "hdfs.fs.defaultFS" = "hdfs://namenode:port",
    "hdfs.hadoop.security.authentication" = "kerberos",
    "hdfs.kerberos_principal" = "doris@YOUR.COM",
    "hdfs.kerberos_keytab" = "/home/doris/my.keytab"
    );

Returned Result

    The export command is synchronous. When the command returns, the operation has finished, and one row of results is returned to show the outcome of the export. If the export succeeds, the result is as follows:
    mysql> select * from tbl1 limit 10 into outfile "file:///home/work/path/result_";
    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    | 1          | 2         | 8        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    1 row in set (0.05 sec)
    FileNumber: The total number of files eventually generated.
    TotalRows: Number of rows in the result set.
    FileSize: The total size of the exported files. Unit: bytes.
    URL: If it is exported to a local disk, it shows specifically which Compute Node it was exported to.
    If a concurrent export was performed, multiple rows of data will be returned.
    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    | 1          | 3         | 7        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    | 1          | 2         | 4        | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    2 rows in set (2.218 sec)
    If an error occurs, an error message is returned, such as:
    mysql> SELECT * FROM tbl INTO OUTFILE ...
    ERROR 1064 (HY000): errCode = 2, detailMessage = Open broker writer failed ...

    Notes

    If concurrency is not enabled for exporting, the results are exported by a single BE node in a single thread. Thus, exporting time is directly proportional to the size of the result set. Enabling concurrency for export can reduce the required time.
    The export command will not check whether the files or file paths exist. Whether paths will be created automatically or existing files will be overwritten depends entirely on the semantics of the remote storage system.
    If there are errors during the export process, some residual export files may be left on the remote storage system. Doris will not clean these files. Users need to clean them manually.
    The timeout of the export command is the same as the query timeout. It can be set using SET query_timeout=xxx.
    For queries where the result set is empty, a file of size zero will still be generated.
    File splitting ensures that a row of data is completely stored in a single file, so the file size doesn't strictly equal max_file_size.
    For some functions that output non-printable characters, such as the BITMAP and HLL types, the output is \N, that is, NULL.
    Currently, some geographic information functions, such as ST_Point, have an output type of VARCHAR, but the actual output values are encoded binary characters, which may produce garbled output. For geographic functions, please use ST_AsText for output, as sketched below.
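    For instance, a minimal sketch of wrapping a geographic value with ST_AsText before export; the table, columns, and connection values are hypothetical:
    -- ST_AsText converts the binary point value into readable WKT text such as POINT(x y)
    SELECT ST_AsText(ST_Point(lng, lat)) FROM geo_tbl
    INTO OUTFILE "s3://example_bucket/export/geo_"
    FORMAT AS CSV
    PROPERTIES
    (
    "AWS_ENDPOINT" = "xxxx",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxxx",
    "AWS_REGION" = "xxxx"
    );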