tencent cloud

Feedback

ES Catalog

Last updated: 2024-07-04 10:17:13
    Elasticsearch Catalog (ES) supports auto-mapping of ES metadata. Users can utilize the full-text search capalibity of ES in combibation of the distributed query planning capability of Doris to provide a full-fledged OLAP analysis scenario solution.
    1. Multi-index distributed Join query in ES.
    2. Joint query across Doris and ES as well as full-text search and filter.
    Note:
    This feature applies to Tencent Cloud TCHouse-D 1.2 and subsequent versions and supports Elasticsearch 5.x and above versions.

    Create Catalog

    CREATE CATALOG es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200"
    );
    Since there is no concept of database in Elasticsearch, after connecting to ES, Doris will automatically generate a unique Database:default_db. After switching to the ES Catalog, you will be in the default_db, so you don't need to execute the USE default_db command.

    Parameter description

    Parameter
    Mandatory
    Default Value
    Description
    hosts
    Yes
    -
    ES address, can be one or multiple addresses, or the Cloud Load Balancer address of ES.
    user
    No
    Empty
    ES username
    password
    No
    Empty
    Password of the corresponding user
    doc_value_scan
    No
    true
    Whether to obtain value of the target field by ES/Lucene columnar storage
    keyword_sniff
    No
    true
    Whether to sniff the text.fields in ES based on keywords. If this is set to false, the system will perform matching after according to the content after tokenization.
    nodes_discovery
    No
    true
    Whether to enable ES node discovery, set to true by default, set to false in network isolation environments and only connected to the specified nodes
    ssl
    No
    false
    Whether to enable HTTPS access mode for ES, currently follows a "Trust All" method in FE/BE
    mapping_es_id
    No
    false
    Whether to map the _id field in the ES index
    like_push_down
    No
    true
    Whether to convert like into wildcard push down to ES, which will increase ES consumption of CPU.
    Note:
    In terms of authentication, only Http Basic authentication is supported, and it requires the user to have read permission for the index and paths including /_cluster/state/ and_nodes/http; If you have not enabled security authentication for the cluster, you dont't need to set the user and password.
    If there are multiple types in the index in 5.x and 6.x, the first type is taken by default.

    Column type map

    ES Type
    Doris Type
    Comment
    null
    null
    -
    boolean
    boolean
    -
    byte
    tinyint
    -
    short
    smallint
    -
    integer
    int
    -
    long
    bigint
    -
    unsigned_long
    largeint
    -
    float
    float
    -
    half_float
    float
    -
    double
    double
    -
    scaled_float
    double
    -
    date
    date
    It only supports default/yyyy-MM-dd HH:mm:ss/yyyy-MM-dd/epoch_millis format
    keyword
    string
    -
    text
    string
    -
    ip
    string
    -
    nested
    string
    -
    object
    string
    -
    other
    unsupported
    -

    Array Type

    Elasticsearch does not have an explicit array type, but one of its fields can contain0 or more values. To indicate that a field is an array type, a specificdorisstructural annotation can be added to the_metasection of the index mapping. For Elasticsearch 6.x and before versions, please see_meta.
    For example, suppose there is an indexdoccontaining the following data structure:
    {
    "array_int_field": [1, 2, 3, 4],
    "array_string_field": ["doris", "is", "the", "best"],
    "id_field": "id-xxx-xxx",
    "timestamp_field": "2022-11-12T12:08:56Z",
    "array_object_field": [
    {
    "name": "xxx",
    "age": 18
    }
    ]
    }
    The array fields of this structure can be defined by using the following command to add the field property definition to the_meta.dorisproperty of the target index mapping.
    # ES 7.x and above
    curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
    {
    "_meta": {
    "doris":{
    "array_fields":[
    "array_int_field",
    "array_string_field",
    "array_object_field"
    ]
    }
    }
    }'
    
    # ES 6.x and before
    curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
    {
    "_doc": {
    "_meta": {
    "doris":{
    "array_fields":[
    "array_int_field",
    "array_string_field",
    "array_object_field"
    ]
    }
    }
    }
    }
    array_fields: used to indicate a field that is an array type.

    Best Practice

    Filter Condition Pushdown

    ES Catalog supports filter condition pushdown to ES, which means only the filtered data will be returned. This could significantly improve query performance and reduce usage of CPU, memory, and IO in both Doris and Elasticsearch. The following operators will be optimized into the following ES queries:
    SQL syntax
    ES 5.x+ syntax
    =
    term query
    in
    terms query
    > , < , >= , ⇐
    range query
    and
    bool.filter
    or
    bool.should
    not
    bool.must_not
    not in
    bool.must_not + terms query
    is\\_not\\_null
    exists query
    is\\_null
    bool.must_not + exists query
    esquery
    ES-native json QueryDSL

    Columnar Scan for Faster Queries (enable\\_docvalue\\_scan=true)

    Set "enable_docvalue_scan" = "true"
    After enabling, Doris will obtain data from ES according to the following two rules:
    Try and see: Doris will automatically check whether columnar storage is enabled for the target field (doc_value: true). If it is, Doris will obtain all the values in the fields from the columnar storage.
    Auto-downgrading: If any one of the target fields is not available in columnar storage, Doris will parse and obtain all target data from row storage from row storage_source.
    Benefits
    By default, Doris On ES obtains all target columns from row storage_source, which is in row storage and JSON format. Compared to columnar storage,_source is slow in batch read. In particular, when the system only needs to read small number of columns, the performance of docvalue is about a dozen times faster than that of_source.
    Note:
    Columnar storage is not available fortextfields in ES. Thus, if you need to obtain fields containingtextvalues, you will need to obtain them from_source.
    When obtaining large number of fields ( > = 25), the performances ofdocvalueandsource are basically equivalent.

    Sniffing Keyword Fields

    Set "enable_keyword_sniff" = "true"
    ES allows direct data ingestion without creating an index since it will automatically create an index after ingestion. For string fields, ES will create a field with bothtextandkeywordtypes. This is how the multi-field feature of ES works. The mapping is as follows:
    "k4": {
    "type": "text",
    "fields": {
    "keyword": {
    "type": "keyword",
    "ignore_above": 256
    }
    }
    }
    For example, to conduct "=" filtering on k4, Doris On ES will convert the filtering operation into an ES TermQuery.
    The original SQL filtering conditions:
    k4 = "Doris On ES"
    The converted ES query DSL:
    "term" : {
    "k4": "Doris On ES"
    }
    Since the first field of k4 istext, it will be tokenized by the analyzer set for k4 (or by the standard analyzer if no analyzer has been set for k4) after data import. As a result, it will be tokenized into three terms: doris, on, and ES.. The details are as follows:
    POST /_analyze
    {
    "analyzer": "standard",
    "text": "Doris On ES"
    }
    The tokenization results:
    {
    "tokens": [
    {
    "token": "doris",
    "start_offset": 0,
    "end_offset": 5,
    "type": "<ALPHANUM>",
    "position": 0
    },
    {
    "token": "on",
    "start_offset": 6,
    "end_offset": 8,
    "type": "<ALPHANUM>",
    "position": 1
    },
    {
    "token": "es",
    "start_offset": 9,
    "end_offset": 11,
    "type": "<ALPHANUM>",
    "position": 2
    }
    ]
    }
    If you conduct a query as follows:
    "term" : {
    "k4": "Doris On ES"
    }
    Since there is no term in the dictionary that matches the termDoris On ES, no results will be returned. However, if you have setenable_keyword_sniff: true, the system will automatically convertk4 = "Doris On ES"tok4.keyword = "Doris On ES"to fully match the SQL semantics. The converted ES query DSL will be:
    "term" : {
    "k4.keyword": "Doris On ES"
    }
    k4.keyword is ofkeywordtype, so the data is written into ES as a complete term, allowing for successful matching.

    Auto Node Discovery, Set to True by Default (nodes\\_discovery=true)

    Set "nodes_discovery" = "true"
    When the configuration is set to true, Doris will discover all available data nodes (the allocated tablets) in ES. If the addresses of ES data nodes are not accessed by Doris BE, then set "nodes_discovery" = "false". ES clusters are deployed in private networks that are isolated from the public Internet, so users will need proxy access.

    HTTPS Access Mode for ES Clusters

    Set "ssl" = "true". A temporary solution is to implement a "Trust All" method in FE/BE. The real user configuration certificate will be used in the future.

    Query Usage

    You can use the ES external tables in Doris the same way as using Doris internal tables, except that the Doris data models (Rollup, Pre-Aggregation, and Materialized Views, etc.) are unavailable.

    Basic Query

    select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'

    Extended esquery(field, QueryDSL)

    Theesquery(field, QueryDSL) function can be used to push some queries that cannot be expressed in SQL, such as match_phrase, and geoshape, etc., to ES for filtering. The first parameter (the column name) inesqueryis used to associate withindex, and the second parameter is the JSON expression of basicQuery DSL in ES, which is surrounded by {}. Theroot key in JSON is unique, which can be match_phrase, geo_shape, orbool, etc.
    match_phrase query:
    select * from es_table where esquery(k4, ' {
    "match_phrase": {
    "k4": "doris on es"
    }
    }');
    geo related query:
    select * from es_table where esquery(k4, ' {
    "geo_shape": {
    "location": {
    "shape": {
    "type": "envelope",
    "coordinates": [
    [
    13,
    53
    ],
    [
    14,
    52
    ]
    ]
    },
    "relation": "within"
    }
    }
    }');
    bool query:
    select * from es_table where esquery(k4, ' {
    "bool": {
    "must": [
    {
    "terms": {
    "k1": [
    11,
    12
    ]
    }
    },
    {
    "terms": {
    "k2": [
    100
    ]
    }
    }
    ]
    }
    }');

    Suggestions for Time Fields

    Note:
    These are only applicable to ES external tables. Time fields will be automatically mapped to Date or Datetime type in ES Catalogs.
    In ES, the usage of time fields is very flexible, but in ES external tables, improper type setting of time fields in ES external tables will result in filter conditions pushdown failures.
    It is recommended to allow the highest level of format compatibility for time fields when creating an index:
    "dt": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
    }
    It is recommended to set its type asdateordatetime(orvarchar) when creating this field in Doris. You can use the following SQL statements to push the filter conditions down to ES:
    select * from doe where k2 > '2020-06-21';
    
    select * from doe where k2 < '2020-06-21 12:00:00';
    
    select * from doe where k2 < 1593497011;
    
    select * from doe where k2 < now();
    
    select * from doe where k2 < date_format(now(), '%Y-%m-%d');
    Note:
    If theformatis not set for the time field in ES, the default format of time fields is:
    strict_date_optional_time||epoch_millis
    If the date fields imported into ES that are Timestamps, they need to be converted intoms.msis the internal processing format in ES; otherwise, errors will occur in ES external tables.

    Obtaining ES Metadata Field _id

    Each imported file, if not specified with an _id, will be given a globally unique _id, which is the primary key. Users can also specify an _id with unique business meanings to the files during import;
    If you need to obtain this field value from ES external tables, you can add a varchar-typed _id when creating tables:
    CREATE EXTERNAL TABLE `doe` (
    `_id` varchar COMME "",
    `city` varchar COMMENT ""
    ) ENGINE=ELASTICSEARCH
    PROPERTIES (
    "hosts" = "http://127.0.0.1:8200",
    "user" = "root",
    "password" = "root",
    "index" = "doe"
    }
    To obtain this field values from ES Catalogs, please set "mapping_es_id" = "true" .
    Note:
    The_id field only supports=andinfiltering.
    The _id field must be of varchar type.

    FAQs

    Are X-Pack authenticated ES clusters supported?

    All ES clusters with HTTP Basic authentications are supported.

    Does some queries require longer response time than those in ES?

    Yes, such as _count related queries, ES can directly read the metadata associated with the number of the specified files instead of filtering the original data. This is a huge time saver.

    Can aggregation operations be pushed down?

    Currently, Doris On ES does not support pushdown for aggregations such as sum, avg, and min/max, etc. In such operations, Doris obtains all files that meet the specified conditions from ES and then conduct computing internally.

    Appendix

    How Doris Conducts Queries in ES

    +----------------------------------------------+
    | |
    | Doris +------------------+ |
    | | FE +--------------+-------+
    | | | Request Shard Location
    | +--+-------------+-+ | |
    | ^ ^ | |
    | | | | |
    | +-------------------+ +------------------+ | |
    | | | | | | | | |
    | | +----------+----+ | | +--+-----------+ | | |
    | | | BE | | | | BE | | | |
    | | +---------------+ | | +--------------+ | | |
    +----------------------------------------------+ |
    | | | | | | |
    | | | | | | |
    | HTTP SCROLL | | HTTP SCROLL | |
    +-----------+---------------------+------------+ |
    | | v | | v | | |
    | | +------+--------+ | | +------+-------+ | | |
    | | | | | | | | | | |
    | | | DataNode | | | | DataNode +<-----------+
    | | | | | | | | | | |
    | | | +<--------------------------------+
    | | +---------------+ | | |--------------| | | |
    | +-------------------+ +------------------+ | |
    | Same Physical Node | |
    | | |
    | +-----------------------+ | |
    | | | | |
    | | MasterNode +<-----------------+
    | ES | | |
    | +-----------------------+ |
    +----------------------------------------------+
    
    
    1. FE sends a request to the specified host for table creation in order to obtain information about the HTTP port and the index shard allocation. If the request fails, it will sequentially traverse the host list until it succeeds or fails completely.
    2. Based on the information about node and index metadata from FE, Doris generates a query plan and sends it to the corresponding BE node.
    3. The BE node will preferentially send request to the locally deployed ES node following the principle of proximity, and obtain data from each tablet of the ES index concurrently through theHTTP Scrollmethod from_sourceordocvalue.
    4. Doris returns the computing results to the user after calculation.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support