Importing Data in JSON Format

Last updated: 2024-06-27 11:06:15
    Tencent Cloud TCHouse-D has supported the import of data in JSON format since version 0.12.

    Supported Import Methods

    Currently, only the following methods support the import of data in JSON format:
    Stream Load
    Routine Load
    For details on the above import methods, see the related documentation. This document mainly introduces how to use JSON data in these import methods.

    Supported JSON Formats

    Currently, only the following three JSON formats are supported:
    1. Multi-row data represented by an Array. JSON with an Array as the root node. Each element in the Array represents a row of data to be imported, usually an Object. Examples are as follows:
    [
    { "id": 123, "city" : "beijing"},
    { "id": 456, "city" : "shanghai"},
    ...
    ]
    [
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian"}},
    { "id": 456, "city" : { "name" : "beijing", "region" : "chaoyang"}},
    ...
    ]
    This method is commonly used for the Stream Load import method to represent multiple rows of data in one batch. It must be used in conjunction with the setting strip_outer_array=true. Doris will expand the array during parsing and then parse each Object in it as a row of data.
    2. Single-row data represented by an Object. JSON with an Object as the root node. The entire Object represents a row of data to be imported. Examples are as follows:
    { "id": 123, "city" : "beijing" }
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
    This method is commonly used in the Routine Load import method; for example, each message in Kafka represents one row of data.
    3. Multi-row Object data separated by a fixed separator. Each line is an Object representing one row of data to be imported, as shown in the following example:
    { "id": 123, "city" : "beijing" }
    { "id": 456, "city" : "shanghai"}
    ...
    This method is commonly used for the Stream Load import method to represent multiple rows of data in one batch. It must be used in conjunction with the setting read_json_by_line=true. A custom separator can be specified through the line_delimiter parameter; the default is \n. Doris will split the data by this separator during parsing and then parse each line as an Object representing one row of data. A minimal sketch follows.
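    A minimal Stream Load sketch for this line-separated format, assuming a data.json file containing one Object per line (the credentials, file, database, and table names are placeholders):
    # Placeholder credentials, file, and endpoint; adjust to your environment.
    curl --location-trusted -u user:passwd -H "format: json" -H "read_json_by_line: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load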

    fuzzy_parse parameter

    In Stream Load, you can add the fuzzy_parse parameter to speed up the import of JSON data. This parameter is typically used when importing multi-row data represented as an Array, so it is usually used in conjunction with strip_outer_array=true. This feature requires that every row of data in the Array has the same order of fields. Doris will parse only the first row to determine the field order and then access the subsequent data by subscripts. This method can improve the import efficiency by 3-5X.
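    A minimal sketch, assuming an Array-of-Objects file data.json in which every Object lists its fields in the same order (the credentials, file, and endpoint are placeholders):
    # Placeholder credentials, file, and endpoint; fuzzy_parse requires identical field order per row.
    curl --location-trusted -u user:passwd -H "format: json" -H "strip_outer_array: true" -H "fuzzy_parse: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load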

    JSON Path

    Doris supports extracting data specified in JSON through JSON Path.
    Note
    For Array-type data, Doris first expands the array and then processes each element as a single row in Object format. Therefore, the examples in this document are all explained with JSON data in the form of a single Object.
    Not specifying JSON Path.
    If JSON Path is not specified, Doris will by default use the column names in the table to find the elements in the Object. An example is as follows: the table contains two columns, id and city. The JSON data is as follows:
    { "id": 123, "city" : "beijing" }
    So Doris uses id, city for matching, and gets the final data 123 and beijing. If the JSON data is as follows:
    { "id": 123, "name" : "beijing"}
    Then matching with id and city yields the final data 123 and null.
    Specifying JSON Path. Specify a set of JSON Paths in the form of a JSON array. Each element in the array represents a column to extract. Examples are as follows:
    ["$.id", "$.name"]
    ["$.id.sub_id", "$.name[0]", "$.city[0]"]
    Doris will use the specified JSON Path for data matching and extraction.
    Matching non-primitive types. The values matched in the preceding examples are all primitive types, such as integers and strings. Doris currently does not support composite types such as Array and Map, so when a non-primitive type is matched, Doris converts it to a JSON-format string and imports it as a string type. An example is as follows: the JSON data is:
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
    The JSON Path is ["$.city"]. The matched element is:
    { "name" : "beijing", "region" : "haidian" }
    This element will be converted into a string for subsequent import operations:
    "{'name':'beijing','region':'haidian'}"
    Match failure. When a match fails, null will be returned. An example is as follows: the JSON data is:
    { "id": 123, "name" : "beijing"}
    The JSON Path is ["$.id", "$.info"]. The matched elements are 123 and null. At present, Doris does not distinguish between the null value represented in the JSON data and the null value produced when the matching fails. Suppose the JSON data is:
    { "id": 123, "name" : null }
    Then using the following two JSON Paths will result in the same result: 123 and null.
    ["$.id", "$.name"]
    ["$.id", "$.info"]
    Complete match failure. To prevent mistakes caused by incorrect parameter settings, when Doris tries to match a row of data and all columns fail to match, the row is treated as an error row. Suppose the JSON data is:
    { "id": 123, "city" : "beijing" }
    If the JSON Path is incorrectly written (or when the JSON Path is not specified, the columns in the table do not include id and city):
    ["$.ad", "$.infa"]
    This would cause all matches to fail, and the row would be marked as an error row instead of yielding null, null.

    JSON Path and Columns

    JSON Path specifies how to extract data from the JSON document, while Columns specifies the mapping and conversion relationship of the columns. The two can be used together: JSON Path first rearranges the source data into columns in the order given in the JSON Path array, and Columns then maps the rearranged source data to the columns of the table. An example is as follows. Data content:
    {"k1" : 1, "k2": 2}
    Table structure: k2 int, k1 int. Import Statement 1 (taking Stream Load as an example):
    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    In Import Statement 1, only JSON Path is specified, and Columns is not specified. The function of JSON Path is to extract JSON data in the order of the fields in JSON Path and then write it in the order of the table structure. The final imported data result is as follows:
    +------+------+
    | k1   | k2   |
    +------+------+
    | 2    | 1    |
    +------+------+
    You will see that the actual k1 column imported the value of the "k2" column in the JSON data. This is because the field names in JSON are not equivalent to the field names in the table structure. We need to explicitly specify the mapping between the two. Import Statement 2:
    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, k1" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    Compared with Import Statement 1, the Columns field has been added here to describe the column mapping in the order k2, k1. That is, after the fields are extracted in JSON Path order, the first value is assigned to column k2 of the table, and the second value to column k1. The final imported data result is as follows:
    +------+------+
    | k1   | k2   |
    +------+------+
    | 1    | 2    |
    +------+------+
    Of course, as with other import methods, column transformations can be performed in Columns. An example is shown below:
    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    The above example will import the value of k1 multiplied by 100. The final imported data result is as follows:
    +------+------+
    | k1   | k2   |
    +------+------+
    | 100  | 2    |
    +------+------+

    NULL and Default Value

    Example data is as follows:
    [
    {"k1": 1, "k2": "a"},
    {"k1": 2},
    {"k1": 3, "k2": "c"},
    ]
    The table structure is: k1 int null, k2 varchar(32) null default "x". The import statement is as follows:
    curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    The import result users may expect is as follows, that is, missing columns are filled with their default values.
    +------+------+
    | k1   | k2   |
    +------+------+
    | 1    | a    |
    | 2    | x    |
    | 3    | c    |
    +------+------+
    But the actual import result is as follows, that is, missing columns are filled with NULL.
    +------+------+
    | k1   | k2   |
    +------+------+
    | 1    | a    |
    | 2    | NULL |
    | 3    | c    |
    +------+------+
    This is because Doris cannot tell from the import statement alone that "the missing column is the k2 column in the table". If you want to import the above data according to the expected result, the import statement is as follows:
    curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -H "jsonpaths: [\"$.k1\", \"$.k2\"]" -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    Here the missing k2 field is matched to null through the JSON Path and mapped to the intermediate column tmp_k2, and ifnull(tmp_k2, 'x') then fills in the default value x, producing the expected result above.

    LargeInt and Decimal

    Doris supports data types with a wider range and higher precision, such as LargeInt and Decimal. However, since Doris uses the RapidJSON library, which can only parse numbers within the range of Int64 and Double, importing LargeInt or Decimal via JSON may lead to precision loss, data transformation errors, and so on. The sample data is as follows:
    [
    {"k1": 1, "k2":9999999999999.999999 }
    ]
    When importing into the k2 column of type Decimal(16, 9), the expected value is 9999999999999.999999. However, when importing via JSON, the precision lost in the double conversion turns the value into 10000000000000.0002, causing an import error. To solve this problem, Doris provides the num_as_string switch: when parsing JSON data, Doris converts numeric types to strings and then imports them without loss of precision.
    curl -v --location-trusted -u root: -H "format: json" -H "num_as_string: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
    However, turning this switch on may have some unexpected side effects. At present, Doris does not support composite types such as Array and Map, so when a non-primitive type is matched, Doris converts it to a JSON-format string, and num_as_string will also convert the numeric values inside composite types into strings. For example, the JSON data is:
    { "id": 123, "city" : { "name" : "beijing", "city_id" : 1 }}
    With num_as_string disabled, the imported city column data is:
    { "name" : "beijing", "city_id" : 1 }
    But with num_as_string enabled, the imported city column data is:
    { "name" : "beijing", "city_id" : "1" }
    Note
    This causes the originally numeric city_id inside the composite type to be processed as a string and quoted, changing it compared with the original data.
    Therefore, when using JSON import, try to avoid importing LargeInt and Decimal columns together with composite types. If this cannot be avoided, make sure you fully understand the side effects of num_as_string on the import of composite types.

    Application Example

    Stream Load

    Because the JSON format cannot be split, when using Stream Load to import a JSON file, the entire file content is loaded into memory before processing. Therefore, an overly large file may consume excessive memory. Suppose the table structure is:
    id INT NOT NULL,
    city VARCHAR NULL,
    code INT NULL
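    A minimal CREATE TABLE sketch matching this structure; the key model, VARCHAR length, bucket count, and replication settings are assumptions for illustration:
    -- Hypothetical DDL; adjust names, length, buckets, and replication to your environment.
    CREATE TABLE db1.tbl1
    (
        id   INT         NOT NULL,
        city VARCHAR(64) NULL,
        code INT         NULL
    )
    DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 1
    PROPERTIES ("replication_num" = "1");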
    1. Import single-line data 1:
    {"id": 100, "city": "beijing", "code" : 1}
    Do not specify JSON Path:
    curl --location-trusted -u user:passwd -H "format: json" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    Import result:
    100 beijing 1
    Specifying JSON Path:
    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    Import result:
    100 beijing 1
    2. Importing single-line data 2:
    {"id": 100, "content": {"city": "beijing", "code" : 1}}
    Specifying JSON Path:
    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.content.city\",\"$.content.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    Import result:
    100 beijing 1
    3. Importing multi-line data:
    [
    {"id": 100, "city": "beijing", "code" : 1},
    {"id": 101, "city": "shanghai"},
    {"id": 102, "city": "tianjin", "code" : 3},
    {"id": 103, "city": "chongqing", "code" : 4},
    {"id": 104, "city": ["zhejiang", "guangzhou"], "code" : 5},
    {
    "id": 105,
    "city": {
    "order1": ["guangzhou"]
    },
    "code" : 6
    }
    ]
    Specifying JSON Path:
    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    Import result:
    100 beijing 1
    101 shanghai NULL
    102 tianjin 3
    103 chongqing 4
    104 ["zhejiang","guangzhou"] 5
    105 {"order1":["guangzhou"]} 6
    4. Transforming the imported data. The data is still the multi-row data from example 3, and now the code column needs 1 added to each value during the import.
    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -H "columns: id, city, tmpc, code=tmpc+1" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    Import result:
    100 beijing 2
    101 shanghai NULL
    102 tianjin 4
    103 chongqing 5
    104 ["zhejiang","guangzhou"] 6
    105 {"order1":["guangzhou"]} 7

    Routine Load

    The principle of Routine Load processing JSON data is the same as for Stream Load, so it is not repeated here. For a Kafka data source, the content of each message is treated as a complete JSON document. If a message represents multiple rows of data in Array format, multiple rows will be imported, but the Kafka offset will only increase by 1. If an Array-format JSON represents multiple rows of data but JSON parsing fails due to a format error, the error row count will only increase by 1 (because parsing failed, Doris cannot determine how many rows of data it contains and can only record it as one row of error data). A job sketch is shown below.
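    A minimal CREATE ROUTINE LOAD sketch for consuming Array-format JSON messages from Kafka, reusing the table from the Stream Load example; the job name, broker list, and topic are placeholders:
    -- Hypothetical job and Kafka settings for illustration.
    CREATE ROUTINE LOAD db1.kafka_json_job ON tbl1
    COLUMNS(id, city, code)
    PROPERTIES
    (
        "format" = "json",
        "jsonpaths" = "[\"$.id\",\"$.city\",\"$.code\"]",
        "strip_outer_array" = "true"
    )
    FROM KAFKA
    (
        "kafka_broker_list" = "broker1:9092",
        "kafka_topic" = "json_topic"
    );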