tencent cloud

Feedback

Direct Query of DLC Internal Storage with StarRocks

Last updated: 2024-09-18 18:03:58

    Introduction of StarRocks

    StarRocks is a new-generation, high-performance, and all-scene MPP database. It leverages the best research outcomes of relational OLAP databases and distributed storage systems in the era of big data. Building on industry practices, StarRocks optimizes and upgrades its architecture, and introduces numerous new features, resulting in a cutting-edge enterprise-level product. StarRocks is committed to delivering a fast and unified analysis experience, meeting enterprises' various data analysis needs. It supports multiple data models (detailed, aggregate, and update models) and diverse data import methods (batch and real-time). StarRocks can handle data of up to 10,000 columns and integrate with various existing systems such as Spark, Flink, Hive, and ElasticSearch.
    In Tencent Cloud EMR, we offer a fully out-of-the-box StarRocks service. For more details, see StarRocks Introduction.

    StarRocks + DLC Lakehouse Unified Query Acceleration

    Tencent Cloud Data Lake Compute (DLC) supports a lakehouse unified query acceleration mode based on EMR StarRocks. This allows you to seamlessly analyze DLC data sources and perform complex SQL queries without the need to import DLC data into StarRocks or create external tables of DLC in StarRocks. Leveraging StarRocks' MPP vectorized query capabilities improves data analysis efficiency while reducing ops complexity and costs.
    Next, this document will guide you through how to enable the DLC + (EMR) StarRocks lakehouse unified query acceleration.

    Prerequisites

    1. You have purchased an EMR StarRocks cluster.
    2. You have enabled the DLC service.
    Note:
    1. Currently, cross-region federated DLC is not supported. Plan your environment accordingly to ensure that EMR StarRocks and DLC are in the same region.
    2. Queries on DLC native tables created before June 12, 2024, are not supported (these tables have storage paths in the format of lakefs://, which is currently incompatible). However, DLC native tables created after June 12, 2024 (with storage paths in the format of cosn://) and all DLC external tables are supported.
    3. StarRocks only supports querying DLC data; it does not support writing to or deleting DLC data.
    To enable query acceleration of DLC + StarRocks lakehouse unification, you first need to enable DLC external access, allowing the StarRocks cluster to access data stored in DLC's internal managed storage. Then, create a DLC External Catalog in the StarRocks cluster, allowing you to use the StarRocks compute engine to directly analyze data stored in DLC.

    Enabling DLC External Access

    Go to the DLC console, click the Storage Configuration menu, and select Enable External Access.
    Step 1: Click "enable external access to managed storage". Once it is enabled, the EMR StarRocks cluster under the same Tencent Cloud account will be able to access DLC's internal managed storage. Subsequently, you can use this switch to enable or disable external access to DLC's internal managed storage.
    
    The account enabling the external access should have DLC management permissions. Use the root account (or a sub-account with DLC administrator permissions) to perform this operation.
    Note:
    1. If you already have DLC administrator permissions, you can skip this step.
    2. If you're using a sub-account without DLC administrator permissions, see Sub-account Permission Management to request authorization from an account with DLC administrator permissions.
    Step 2: To ensure that your EMR StarRocks cluster can correctly access the DLC metadata Catalog service, you need to bind the VPC of the EMR StarRocks cluster to the DLC network.
    1. Click Bind VPC, select EMR StarRocks as the type in the dialog box, and select the EMR StarRocks cluster instance ID you want to bind from the EMR instance dropdown list.
    2. The VPC of the StarRocks cluster will be automatically filled. You can enter a recognizable alias in the remarks field for easier identification.
    
    Step 3: After completing the VPC binding, you can connect to the DLC metadata service from the StarRocks cluster using the URI connection string displayed in the Catalog access address. For example:
    Example of URI Connection String:
    Catalog access address: thrift://172.17.1.18:7004
    At this point, your EMR StarRocks cluster is ready to directly analyze internal managed data of DLC. Before you start the analysis, you need to create a DLC External Catalog in StarRocks.

    Creating DLC External Catalog

    Log in to StarRocks and create a DLC Catalog in StarRocks. For more details about External Catalogs, see CREATE EXTERNAL CATALOG | Overview.

    Syntax

    CREATE EXTERNAL CATALOG dlc_iceberg_cos_catalog PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hive", "iceberg.catalog.hive.metastore.uris" = "thrift://169.254.0.171:8007", "aws.s3.endpoint" = "cos.ap-chongqing.myqcloud.com", "aws.s3.access_key" = "Tencent Cloud secret id", "aws.s3.secret_key" = "Tencent Cloud secret_key" );

    Parameter Description

    Parameter
    Description
    type
    The type of data source, which is set to Iceberg by default.
    iceberg.catalog.type
    The type of metadata service used by the Iceberg cluster, which is set to hive.
    iceberg.catalog.hive.metastore.uris
    The URI for the DLC metadata.
    aws.s3.endpoint
    The endpoint for accessing S3-compatible object storage. For Tencent Cloud COS, the format is cos.<region>.myqcloud.com, where <region> can be ap-beijing, ap-shanghai, ap-guangzhou, etc.
    aws.s3.access_key
    The SecretID from the Tencent Cloud account keys.
    aws.s3.secret_key
    The SecretKey from the Tencent Cloud account keys.
    Note:
    1. For security reasons, you need to use the SecretID and SecretKey of the root account in the above configuration to properly access the DLC internal storage. You can obtain the SecretID and SecretKey by logging into Tencent Cloud with the root account and navigating to the CAM Control Console > log in to - Tencent Cloud page.
    2. If you need to use the SecretID and SecretKey of a sub-account to access DLC internal managed storage, see the final section of this document Using Sub-account SecretID/SecretKey.

    Example

    The following example demonstrates how to create a DLC Catalog named DLC_catalog:
    CREATE EXTERNAL CATALOG dlc_hive_cos_catalog PROPERTIES ( "type" = "Iceberg", "iceberg.catalog.type" = "hive", "hive.metastore.uris" = "thrift://169.254.0.171:8007", "aws.s3.endpoint" = "cos.ap-chongqing.myqcloud.com", "aws.s3.access_key" = "AKIDEH5iCawJfjmlfIrIf1EE**********", "aws.s3.secret_key" = "RgkdMrGZhZDhy3ymU6VfbN**********"
    )
    If you need to query Hive tables stored in DLC managed storage directly through StarRocks, you will need to create a separate DLC Hive catalog, setting the type to hive. An example is provided below:
    CREATE EXTERNAL CATALOG dlc_hive_cos_catalog PROPERTIES ( "type" = "hive", "iceberg.catalog.type" = "hive", "hive.metastore.uris" = "thrift://169.254.0.171:8007", "aws.s3.endpoint" = "cos.ap-chongqing.myqcloud.com", "aws.s3.access_key" = "AKIDEH5iCawJfjmlfIrIf1EELH6lDakrlT2a*****", "aws.s3.secret_key" = "RgkdMrGZhZDhy3ymU6VfbN3UhfOvEYjk*****" )

    Querying DLC Data

    Preparing Tables and Data in DLC

    Here is an example of creating an Iceberg table:
    CREATE TABLE test_sr_ofs.`customer`( `c_custkey` bigint, `c_name` string, `c_address` string, `c_nationkey` int, `c_phone` string, `c_acctbal` double, `c_mktsegment` string, `c_comment` string) using iceberg;
    )
    Querying Data in StarRocks
    # Log in to the StarRocks Node. mysql -h 172.30.0.xxx -P9030 -u root -p #Specify the Iceberg Catalog. set catalog dlc_iceberg_cos_catalog; #Specify the Database. use test_sr_ofs; #Query the customer table data. select * from customer limit 5;
    The query results are as follows:
    

    Using Sub-account SecretID/SecretKey for Access (Optional)

    For data security reasons, after DLC external access is enabled, it is required by default to use the root account's SecretID and SecretKey to access DLC internal storage from EMR StarRocks. However, if your business scene requires the use of a sub-account's SecretID and SecretKey, you will need to use the root account to create a custom policy in CAM and bind it to the corresponding sub-account.

    Explanation of the Principle

    After you follow the instructions in this document to enable DLC external access, the system essentially grants your root account permissions to access DLC's internal managed storage. However, for data security reasons, sub-accounts under your root account do not have this access by default. To grant a sub-account access to DLC's internal managed storage, you need to create a custom policy in CAM by Tencent Cloud and bind it to the sub-account. The specific steps are as follows:

    Step 1: Generating a Custom Policy

    Log in to Tencent Cloud using the root account and go to the DLC console, click the Storage Configuration menu and select Enable External Access. In the Sub-account Authorization section, click "Click to Handle". In the pop-up dialog box, click Copy to obtain the custom CAM policy that you need to create.
    

    Step 2: Creating a Custom Policy to Allow Sub-account Access to DLC Internal Storage

    1. Log in to the Policy page of the CAM console using the root account.
    2. Select Create Your Own Definition Policy , create it based on the policy syntax, select the blank template and click Next to proceed.
    Note:
    To grant sub-accounts data access permissions, the root account can only do so via custom policies. Predefined policies do not support this authorization.
    3. Fill out the form as follows:
    Policy name: Create a unique and meaningful policy name, such as cos-child-account.
    Remarks: Optional, you can write your own description.
    Policy content: Paste the custom policy copied from Step 1. For example:
    { "statement": [ { "action": [ "cos:*" ], "effect": "allow", "resource": [ "qcs::cos:ap-shanghai:uid/1305424723:dlc03ff-100018379117-1647867281-100017307912-1304028854/*", "qcs::cos:ap-shanghai:uid/1305424723:dlc0a65-100018379117-1680005779-100017307912-1304028854/*" ] } ], "version": "2.0" }
    Note:
    The above policy grants the sub-account permissions to operate the DLC managed storage for which the root account has operation permissions. In this example, uid/1305424723 refers to the APPID of the root account (A), and dlc0a65-100018379117-1680005779-100017307912-1304028854/* represents the DLC internal managed storage that you are authorized to operate.
    4. Click Complete to finish the creation of the policy.

    Step 3: Authorizing the Sub-account to Access DLC Internal Storage

    1. In the Policy list, find the policy you just created and click Associate User/Group/Role on the right.
    2. In the pop-up window, select the sub-account that needs access to the DLC internal managed storage, and click OK.
    3. Once the authorization is completed, the sub-account will be able to access the DLC internal managed storage using its SecretID and SecretKey.
    
    
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support