tencent cloud

Feedback

Overview

Last updated: 2024-06-27 11:14:36
    Multi-Catalog is a feature supported by Doris since version 1.2.0, designed to make it easier to connect to external data catalogs to enhance Doris's data lake analysis and federated data querying capabilities.
    In previous versions of Doris, user data is in a two-tiered structure: database and table.
    Thus, connections to external catalogs could only be done at the database or table level. For example, users could create a mapping to table in an external catalog through create external table, or to a database through create external database. If there were large amounts of databases or tables in the external data catalog, users would need to create mappings to them one by one, which could be a heavy workload.
    With the advent of Multi-Catalog, Doris now has a new three-tiered metadata hierarchy (catalog > database > table), which means users can connect to external data at the catalog level. The currently supported external catalogs include: Hive, Iceberg, Hudi, Elasticsearch, and data to access various databases via a standard JDBC.
    Note:
    This feature is applicable to TCHouse-D 1.2 and subsequent versions, compared to the external table connection, it is recommended to use the Multi-Catalog to realize multi-data catalog federated queries.

    Basic Concepts

    Internal Catalog
    Existing databases and tables in Doris are all under the Internal Catalog, which is the default catalog in Doris and cannot be modified or deleted.
    External Catalog
    You can create an External Catalog using the CREATE CATALOG command, and view the existing Catalogs via the SHOW CATALOGS command.
    Switching Catalog
    After login, you will enter the Internal Catalog by default. Then, you can view or switch to your target database using command like SHOW DATABASES and USE DB. Users can switch Catalogs using the SWITCH command. For example:
    SWITCH internal;
    SWITCH hive_catalog;
    After switching catalog, you can directly view and switch to your target database in the corresponding catalog using commands like SHOW DATABASES and USE DB. Doris will automatically access the database and table in the catalog. You can view and access data in External Catalogs the same way as doing that in the Internal Catalog. Doris only supports read-only access to data in External Catalogs currently.
    Deleting Catalog
    Databases and tables in External Catalogs are for read only, but you can delete Catalogs (Internal Catalog cannot be deleted) using the DROP CATALOG command. This operation will only delete the mapping in Doris to the corresponding catalog. It doesn't change the external catalog itself by all means.
    Resource
    A Resource is a collection of configurations. Users can create a Resource using the CREATE RESOURCE command, which can then be used when creating Catalogs. A Resource can be used by multiple Catalogs to reuse the configurations.

    Connection Sample

    Connecting to Hive

    The followings are the instruction on how to connect to a Hive catalog using the Catalog feature.
    For more information on Hive, please see: Hive Catalog.
    1. Creating Catalog
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
    );
    2. Viewing Catalog
    After creating, you can view the existing Catalog using the SHOW CATALOGS command:
    mysql> SHOW CATALOGS;
    +-----------+-------------+----------+
    | CatalogId | CatalogName | Type |
    +-----------+-------------+----------+
    | 10024 | hive | hms |
    | 0 | internal | internal |
    +-----------+-------------+----------+
    3. Switching Catalog
    Switch to the Hive Catalog using the SWITCH command, and view the databases in it:
    mysql> SWITCH hive;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> SHOW DATABASES;
    +-----------+
    | Database |
    +-----------+
    | default |
    | random |
    | ssb100 |
    | tpch1 |
    | tpch100 |
    | tpch1_orc |
    +-----------+
    4. Using Catalog
    After switching to the catalog, you can normally use the features of the internal data source. For example, you can switch to database tpch100, and view the tables in it:
    mysql> USE tpch100;
    Database changed
    
    mysql> SHOW TABLES;
    +-------------------+
    | Tables_in_tpch100 |
    +-------------------+
    | customer |
    | lineitem |
    | nation |
    | orders |
    | part |
    | partsupp |
    | region |
    | supplier |
    +-------------------+
    You can view the schema of Table lineitem:
    mysql> DESC lineitem;
    +-----------------+---------------+------+------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-----------------+---------------+------+------+---------+-------+
    | l_shipdate | DATE | Yes | true | NULL | |
    | l_orderkey | BIGINT | Yes | true | NULL | |
    | l_linenumber | INT | Yes | true | NULL | |
    | l_partkey | INT | Yes | true | NULL | |
    | l_suppkey | INT | Yes | true | NULL | |
    | l_quantity | DECIMAL(15,2) | Yes | true | NULL | |
    | l_extendedprice | DECIMAL(15,2) | Yes | true | NULL | |
    | l_discount | DECIMAL(15,2) | Yes | true | NULL | |
    | l_tax | DECIMAL(15,2) | Yes | true | NULL | |
    | l_returnflag | TEXT | Yes | true | NULL | |
    | l_linestatus | TEXT | Yes | true | NULL | |
    | l_commitdate | DATE | Yes | true | NULL | |
    | l_receiptdate | DATE | Yes | true | NULL | |
    | l_shipinstruct | TEXT | Yes | true | NULL | |
    | l_shipmode | TEXT | Yes | true | NULL | |
    | l_comment | TEXT | Yes | true | NULL | |
    +-----------------+---------------+------+------+---------+-------+
    Query Example:
    mysql> SELECT l_shipdate, l_orderkey, l_partkey FROM lineitem limit 10;
    +------------+------------+-----------+
    | l_shipdate | l_orderkey | l_partkey |
    +------------+------------+-----------+
    | 1998-01-21 | 66374304 | 270146 |
    | 1997-11-17 | 66374304 | 340557 |
    | 1997-06-17 | 66374400 | 6839498 |
    | 1997-08-21 | 66374400 | 11436870 |
    | 1997-08-07 | 66374400 | 19473325 |
    | 1997-06-16 | 66374400 | 8157699 |
    | 1998-09-21 | 66374496 | 19892278 |
    | 1998-08-07 | 66374496 | 9509408 |
    | 1998-10-27 | 66374496 | 4608731 |
    | 1998-07-14 | 66374592 | 13555929 |
    +------------+------------+-----------+
    You can also perform associated queries with tables in other data catalogs:
    mysql> SELECT l.l_shipdate FROM hive.tpch100.lineitem l WHERE l.l_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;
    +------------+
    | l_shipdate |
    +------------+
    | 1993-02-16 |
    | 1995-06-26 |
    | 1995-08-19 |
    | 1992-07-23 |
    | 1998-05-23 |
    | 1997-07-12 |
    | 1994-03-06 |
    | 1996-02-07 |
    | 1997-06-01 |
    | 1996-08-23 |
    +------------+
    The table is identified in the format of catalog.database.table. For example internal.db1.partin the above snippet.
    If the target table is in the current Database of the current Catalog, catalog and database in the format can be omitted.
    You can use the INSERT INTO command to insert the table data from the Hive Catalog into a table in the Internal Catalog to achieve the effect of importing data from External Catalog to Internal Catalog:
    mysql> SWITCH internal;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> USE db1;
    Database changed
    
    mysql> INSERT INTO part SELECT * FROM hive.tpch100.part limit 1000;
    Query OK, 1000 rows affected (0.28 sec)
    {'label':'insert_212f67420c6444d5_9bfc184bf2e7edb8', 'status':'VISIBLE', 'txnId':'4'}

    Connecting to Iceberg

    For more details, see Iceberg Catalog.

    Connecting to Hudi

    For more details, see Hudi Catalog.

    Connecting to Elasticsearch

    For more details, see ES Catalog.

    Connecting with JDBC

    For more details, see JDBC Catalog.

    Column type map

    After a user creates a Catalog, Doris will automatically synchronize the databases and tables from data directories. The following shows how Doris maps different types of catalogs and tables. For external table types that currently cannot be mapped to Doris column types, such as UNION and INTERVAL, etc., Doris maps them to an UNSUPPORTED type. Here are examples of queries in a table containing UNSUPPORTED types:
    Suppose the table is of the following schema:
    k1 INT,
    k2 INT,
    k3 UNSUPPORTED,
    k4 INT
    select * from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
    select * except(k3) from table; // Query OK.
    select k1, k3 from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
    select k1, k4 from table; // Query OK.
    For column mapping rules of different data sources, see the documentation of different data sources.

    Permission Management

    Access from Doris to databases and tables in an External Catalog is not under the permission control of the external catalog itself, but is authorized by Doris.
    Along with the new multi-catalog feature, we also added permission management at the Catalog level (See Authority Management ).

    Specifying the Database to be Synchronized

    You can specify the databases to be synchronized by setting include_database_list and exclude_database_list in the Catalog configuration.
    include_database_list: Only synchronize the specified databases, separated by ','. Default value is '', meaning to synchronize all databases. db name is case sensitive.
    exclude_database_list: Specify databases that do not need to be synchronized, separated by ','. Default value is '', meaning no filters takes effect, and synchronizing all databases. db name is case sensitive.
    Note:
    When the include_database_list and exclude_database_list configurations overlap, theexclude_database_listwill take effect first.
    To connect to JDBC, the above 2 configurations should work with the configuration only_specified_database. For more information, see JDBC Catalog.

    Metadata Update

    By default, changes in metadata of external data sources, including creation or deletion of tables and columns etc., will not be synchronized into Doris.

    Manual Update

    Users need to manually update the metadata using the REFRESH CATALOG command.

    Automatic Update

    Hive Metastore

    Currently, Doris only supports automatic update of metadata in Hive Metastore. It perceives changes in metadata by the FE node which regularly reads the notification event from HMS. The supported events are as follows:
    Event
    Event Behavior and Corresponding Operation
    CREATE DATABASE
    Create a database in the corresponding catalog.
    DROP DATABASE
    Delete a database in the corresponding catalog.
    ALTER DATABASE
    The impact of such alterations mainly includes changes in properties, comments, or default storage location of the database. These changes do not affect Doris' queries in External Catalogs so they will not be synchronized.
    CREATE TABLE
    Create a table in the corresponding database.
    DROP TABLE
    Delete a table in the corresponding database, and invalidate the cache o that table.
    ALTER TABLE
    If it is a renaming, delete the table of the old name, and then create a new table with the new name. Otherwise, invalidate the cache of that table.
    ADD PARTITION
    Add a partition to the cached partition list of the corresponding table.
    DROP PARTITION
    Delete a partition from the cached partition list of the corresponding table, and invalidate the cache of that partition.
    ALTER PARTITION
    If it is a renaming, delete the partition of the old name, and then create a new partition with the new name. Otherwise, invalidate the cache of that partition.
    Note:
    After data ingestion, changes in partition tables will follow the ALTER PARTITION event logic, while those in non-partition tables will follow the ALTER TABLE event logic.
    If changes are conducted on the file system directly instead of through the HMS, the HMS will not generate an event. As a result, such changes will not be perceived by Doris.
    The feature has the following parameters in fe.conf:
    1. enable_hms_events_incremental_sync: This specifies whether to enable the automatic incremental synchronization for metadata, which is disabled by default.
    2. hms_events_polling_interval_ms: This specifies the interval between two reading events, which is set to 10000 by default. (Unit: millisecond).
    3. hms_events_batch_size_per_rpc: This specifies the maximum number of events that are read at a time, which is set to 500 by default.
    To enable automatic update, you need to modify the hive-site.xml of HMS and then restart HMS:
    <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
    </property>
    <property>
    <name>hive.metastore.dml.events</name>
    <value>true</value>
    </property>
    <property>
    <name>hive.metastore.transactional.event.listeners</name>
    <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
    </property>
    
    Note: To enable automatic update, whether for existing Catalogs or newly created Catalogs, all you need is to set enable_hms_events_incremental_sync to true, and then restart the FE node. You do not need to manually update the metadata before or after the restart.

    Timing Refresh

    When creating a Catalog, specify the refresh time parameter metadata_refresh_interval_sec in properties, in seconds. If this parameter is set when creating a Catalog, the master node of FE will refresh the Catalog regularly according to the parameter value. Currently, three types are supported:
    HMS:Hive MetaStore
    ES:Elasticsearch
    JDBC: Standard interface for database access (JDBC)

    Example

    -- Set catalog refresh interval to 20 seconds
    CREATE CATALOG es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "metadata_refresh_interval_sec"="20"
    );
    
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support