Parquet Shipping

Last updated: 2024-01-20 17:44:35

    Overview

    From the CLS console, you can ship log data to COS in Parquet format. The shipped Parquet files can then be loaded into Hive for big data computing and analysis. This document describes how to create a Parquet shipping task.
    Note:
    Parquet files are mainly used on big data platforms. Because Parquet has built-in compression and can be compressed further with Snappy, lzop, or GZIP, the raw file to be shipped needs to be large. We recommend you set the file size to more than 200 MB (which becomes around 50 MB once shipped to COS).

    Prerequisites

    1. You have activated CLS, created a logset and a log topic, and successfully collected the log data.
    2. You have activated COS and created a bucket in the target region for log topic shipping. For more information, see Creating Bucket.
    3. Sub-accounts and collaborators need to be authorized by the root account. For more information on granting permissions, see CAM Access Management. For more information on copying authorization policies, see Examples of Custom Access Policies.
    4. You have authorized the CLS service role to access COS. If you perform operations in the console, the system will guide you through the authorization process. If you directly call APIs, manual authorization will be required. For more information, see Viewing and Configuring Shipping Permissions.

    Directions

    1. Log in to the CLS console.
    2. Click Log Topic on the left sidebar.
    3. Click the desired log topic ID/name to go to the log topic management page.
    4. Select the Ship to COS tab, click Add Shipping Configuration, and finish the configuration.
    The parameters are described as follows:
    | Configuration Item | Description | Limit | Required |
    | --- | --- | --- | --- |
    | Shipping Task Name | Name of the shipping task. | Can contain letters, numbers, underscores (_), and hyphens (-). | Yes |
    | COS Bucket | Target bucket for shipping. It must be in the same region as the current log topic. | A value selected from the drop-down list. | Yes |
    | COS Path | Path in the COS bucket under which shipped log files are stored. The default format is `/year/month/day/hour/`, such as `/2022/7/31/14/`. The strftime syntax is supported. For example, if a log was shipped at 14:00 on July 31, 2022, the `/%Y/%m/%d/` path generates `/2022/07/31/`, and the `/%Y%m%d/%H/` path generates `/20220731/14/`. Note that `%m` is the month and `%M` is the minute. | Cannot start with / | No |
    | Filename | Option 1 (recommended): Use the shipping time as the name. For example, `202208251645_000_132612782.gz` follows the pattern shipping time_log topic partition_offset. Files named this way can be loaded in Hive. Option 2: Use a random number as the name. This is the legacy naming scheme. Hive cannot load these files because it does not recognize filenames starting with "_". You can add a custom prefix to the COS path, such as `/%Y%m%d/%H/Yourname`. | / | Yes |
    | Compression Format | To reduce read traffic fees, log files are compressed before being shipped to COS. | GZIP, Snappy, and lzop | Yes |
    | File Size | Size of the raw log file to be shipped. It is used together with Shipping Interval: when either condition is met, the file is compressed and shipped. For example, if the size is set to 256 MB and the interval is set to 15 minutes, and the file reaches 256 MB within five minutes, the size condition is met first and triggers shipping. | A number from 5 to 256, in MB. | Yes |
    | Shipping Interval | Interval for shipping. It is used together with File Size: when either condition is met, the file is compressed and shipped. For example, if the size is set to 256 MB and the interval is set to 15 minutes, and the file reaches only 200 MB in 15 minutes, the interval condition is met first and triggers shipping. | Value range: 300-900s | Yes |
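
The COS path, filename, and trigger rules above can be checked locally before a task is saved. The following is a minimal Python sketch, not CLS code: the function names `render_cos_path`, `parse_shipped_filename`, and `should_ship` are hypothetical, and it assumes that CLS expands strftime directives the same way Python's `datetime.strftime` does (zero-padded `%m`, `%d`, and `%H`) and that the shipping time in the filename is `YYYYMMDDhhmm`, which is inferred from the example name only.

```python
from datetime import datetime


def render_cos_path(template: str, shipped_at: datetime) -> str:
    """Expand strftime directives in a COS path template (Python semantics)."""
    return shipped_at.strftime(template)


def parse_shipped_filename(name: str) -> dict:
    """Split a time-based filename into shipping time, partition, and offset.

    Assumes the pattern <shipping time>_<log topic partition>_<offset>.<ext>
    with the time written as YYYYMMDDhhmm (an inference from the example).
    """
    stem = name.split(".", 1)[0]
    shipped, partition, offset = stem.split("_")
    return {
        "shipping_time": datetime.strptime(shipped, "%Y%m%d%H%M"),
        "partition": int(partition),
        "offset": int(offset),
    }


def should_ship(buffered_mb: float, elapsed_s: float,
                file_size_mb: int = 256, interval_s: int = 900) -> bool:
    """Return True as soon as either the size or the interval condition is met."""
    if not 5 <= file_size_mb <= 256:
        raise ValueError("File Size must be between 5 and 256 MB")
    if not 300 <= interval_s <= 900:
        raise ValueError("Shipping Interval must be between 300 and 900 s")
    return buffered_mb >= file_size_mb or elapsed_s >= interval_s


shipped_at = datetime(2022, 7, 31, 14, 0)
print(render_cos_path("/%Y/%m/%d/", shipped_at))    # /2022/07/31/
print(render_cos_path("/%Y%m%d/%H/", shipped_at))   # /20220731/14/
print(should_ship(256, 300))   # True: size reached after five minutes
print(should_ship(200, 900))   # True: interval elapsed at only 200 MB
```

Using `%M` where `%m` is intended inserts the minute instead of the month, so rendering a template against a known timestamp in this way is a quick sanity check.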
    5. Click Next to enter the Advanced Configuration page and set Shipping Format to Parquet. `__SOURCE__`, `__FILENAME__`, and `__HOSTNAME__` are CLS metadata fields and can be deleted if you don't need them.
    The configuration items are described as follows:
    | Configuration Item | Description | Limit | Required |
    | --- | --- | --- | --- |
    | Key | The key field to write into the Parquet file. The system automatically pulls keys from the log for you to choose from. To add other fields, click Add. A new key cannot have the same name as an existing key. Field names can contain letters, digits, underscores, and hyphens. | A value selected from the drop-down list | Yes |
    | Data Type | Data type of the field in the Parquet file, such as string, boolean, int32, int64, float, and double. | A value selected from the drop-down list | Yes |
    | Assigned Value for Parsing Failure | Custom value to write when data type parsing (conversion) fails. For the string type, an empty string means "" and `NULL` means unknown. You can also assign a custom value for the boolean, integer, or float type. | A value selected from the drop-down list | Yes |
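
To illustrate how Data Type and Assigned Value for Parsing Failure interact, here is a minimal Python sketch of a cast with a fallback. It is an illustration under stated assumptions, not the CLS implementation: the names `CASTERS` and `cast_field` are hypothetical, the accepted boolean spellings are an assumption, and the integer range checks simply follow the usual signed 32-bit and 64-bit limits.

```python
from typing import Any, Callable, Dict


def _to_bool(raw: str) -> bool:
    # Assumed spellings; the values CLS actually accepts may differ.
    lowered = raw.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def _to_int(bits: int) -> Callable[[str], int]:
    limit = 2 ** (bits - 1)

    def convert(raw: str) -> int:
        value = int(raw)
        if not -limit <= value < limit:
            raise ValueError(f"out of range for int{bits}: {raw!r}")
        return value

    return convert


# One converter per data type listed in the table above.
CASTERS: Dict[str, Callable[[str], Any]] = {
    "string": str,
    "boolean": _to_bool,
    "int32": _to_int(32),
    "int64": _to_int(64),
    "float": float,
    "double": float,
}


def cast_field(raw: str, data_type: str, fallback: Any) -> Any:
    """Convert a raw log value, returning the assigned fallback on failure."""
    try:
        return CASTERS[data_type](raw)
    except (ValueError, TypeError):
        return fallback


print(cast_field("42", "int32", -1))          # 42
print(cast_field("abc", "int32", -1))         # -1
print(cast_field("3000000000", "int32", 0))   # 0
```

Choosing a fallback that cannot occur in real data (for example `-1` for a counter that is never negative) makes failed conversions easy to find later in Hive.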
    