Tencent Cloud

Recent Pages

ES Catalog

Last updated: 2024-07-04 10:17:13

Elasticsearch Catalog (ES) supports auto-mapping of ES metadata. Users can utilize the full-text search capalibity of ES in combibation of the distributed query planning capability of Doris to provide a full-fledged OLAP analysis scenario solution.
1. Multi-index distributed Join query in ES.
2. Joint query across Doris and ES as well as full-text search and filter.
Note:
This feature applies to Tencent Cloud TCHouse-D 1.2 and subsequent versions and supports Elasticsearch 5.x and above versions.
Create Catalog
CREATE CATALOG es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200"
);
Since there is no concept of database in Elasticsearch, after connecting to ES, Doris will automatically generate a unique Database:default_db. After switching to the ES Catalog, you will be in the default_db, so you don't need to execute the USE default_db command.
Parameter description
Parameter
Mandatory
Default Value
Description
hosts
Yes
-
ES address, can be one or multiple addresses, or the Cloud Load Balancer address of ES.
user
No
Empty
ES username
password
No
Empty
Password of the corresponding user
doc_value_scan
No
true
Whether to obtain value of the target field by ES/Lucene columnar storage
keyword_sniff
No
true
Whether to sniff the text.fields in ES based on keywords. If this is set to false, the system will perform matching after according to the content after tokenization.
nodes_discovery
No
true
Whether to enable ES node discovery, set to true by default, set to false in network isolation environments and only connected to the specified nodes
ssl
No
false
Whether to enable HTTPS access mode for ES, currently follows a "Trust All" method in FE/BE
mapping_es_id
No
false
Whether to map the _id field in the ES index
like_push_down
No
true
Whether to convert like into wildcard push down to ES, which will increase ES consumption of CPU.
Note:
In terms of authentication, only Http Basic authentication is supported, and it requires the user to have read permission for the index and paths including /_cluster/state/ and_nodes/http; If you have not enabled security authentication for the cluster, you dont't need to set the user and password.
If there are multiple types in the index in 5.x and 6.x, the first type is taken by default.
Column type map
ES Type
Doris Type
Comment
null
null
-
boolean
boolean
-
byte
tinyint
-
short
smallint
-
integer
int
-
long
bigint
-
unsigned_long
largeint
-
float
float
-
half_float
float
-
double
double
-
scaled_float
double
-
date
date
It only supports default/yyyy-MM-dd HH:mm:ss/yyyy-MM-dd/epoch_millis format
keyword
string
-
text
string
-
ip
string
-
nested
string
-
object
string
-
other
unsupported
-
Array Type
Elasticsearch does not have an explicit array type, but one of its fields can contain0 or more values. To indicate that a field is an array type, a specificdorisstructural annotation can be added to the_metasection of the index mapping. For Elasticsearch 6.x and before versions, please see_meta.
For example, suppose there is an indexdoccontaining the following data structure:
{
  "array_int_field": [1, 2, 3, 4],
  "array_string_field": ["doris", "is", "the", "best"],
  "id_field": "id-xxx-xxx",
  "timestamp_field": "2022-11-12T12:08:56Z",
  "array_object_field": [
    {
      "name": "xxx",
      "age": 18
    }
  ]
}
The array fields of this structure can be defined by using the following command to add the field property definition to the_meta.dorisproperty of the target index mapping.
# ES 7.x and above
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_meta": {
        "doris":{
            "array_fields":[
                "array_int_field",
                "array_string_field",
                "array_object_field"
            ]
        }
    }
}'
﻿
# ES 6.x and before
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_doc": {
        "_meta": {
            "doris":{
                "array_fields":[
                    "array_int_field",
                    "array_string_field",
                    "array_object_field"
                ]
            }
    }
    }
}
array_fields: used to indicate a field that is an array type.
Best Practice
Filter Condition Pushdown
ES Catalog supports filter condition pushdown to ES, which means only the filtered data will be returned. This could significantly improve query performance and reduce usage of CPU, memory, and IO in both Doris and Elasticsearch. The following operators will be optimized into the following ES queries:
SQL syntax
ES 5.x+ syntax
=
term query
in
terms query
> , < , >= , ⇐
range query
and
bool.filter
or
bool.should
not
bool.must_not
not in
bool.must_not + terms query
is\\_not\\_null
exists query
is\\_null
bool.must_not + exists query
esquery
ES-native json QueryDSL
Columnar Scan for Faster Queries (enable\\_docvalue\\_scan=true)
Set "enable_docvalue_scan" = "true"
After enabling, Doris will obtain data from ES according to the following two rules:
Try and see: Doris will automatically check whether columnar storage is enabled for the target field (doc_value: true). If it is, Doris will obtain all the values in the fields from the columnar storage.
Auto-downgrading: If any one of the target fields is not available in columnar storage, Doris will parse and obtain all target data from row storage from row storage_source.
Benefits
By default, Doris On ES obtains all target columns from row storage_source, which is in row storage and JSON format. Compared to columnar storage,_source is slow in batch read. In particular, when the system only needs to read small number of columns, the performance of docvalue is about a dozen times faster than that of_source.
Note:
Columnar storage is not available fortextfields in ES. Thus, if you need to obtain fields containingtextvalues, you will need to obtain them from_source.
When obtaining large number of fields ( > = 25), the performances ofdocvalueandsource are basically equivalent.
Sniffing Keyword Fields
Set "enable_keyword_sniff" = "true"
ES allows direct data ingestion without creating an index since it will automatically create an index after ingestion. For string fields, ES will create a field with bothtextandkeywordtypes. This is how the multi-field feature of ES works. The mapping is as follows:
"k4": {
   "type": "text",
   "fields": {
      "keyword": {   
         "type": "keyword",
         "ignore_above": 256
      }
   }
}
For example, to conduct "=" filtering on k4, Doris On ES will convert the filtering operation into an ES TermQuery.
The original SQL filtering conditions:
k4 = "Doris On ES"
The converted ES query DSL:
"term" : {
    "k4": "Doris On ES"
}
Since the first field of k4 istext, it will be tokenized by the analyzer set for k4 (or by the standard analyzer if no analyzer has been set for k4) after data import. As a result, it will be tokenized into three terms: doris, on, and ES.. The details are as follows:
POST /_analyze
{
  "analyzer": "standard",
  "text": "Doris On ES"
}
The tokenization results:
{
   "tokens": [
      {
         "token": "doris",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "on",
         "start_offset": 6,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "es",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
If you conduct a query as follows:
"term" : {
    "k4": "Doris On ES"
}
Since there is no term in the dictionary that matches the termDoris On ES, no results will be returned. However, if you have setenable_keyword_sniff: true, the system will automatically convertk4 = "Doris On ES"tok4.keyword = "Doris On ES"to fully match the SQL semantics. The converted ES query DSL will be:
"term" : {
    "k4.keyword": "Doris On ES"
}
k4.keyword is ofkeywordtype, so the data is written into ES as a complete term, allowing for successful matching.
Auto Node Discovery, Set to True by Default (nodes\\_discovery=true)
Set "nodes_discovery" = "true"
When the configuration is set to true, Doris will discover all available data nodes (the allocated tablets) in ES. If the addresses of ES data nodes are not accessed by Doris BE, then set "nodes_discovery" = "false". ES clusters are deployed in private networks that are isolated from the public Internet, so users will need proxy access.
HTTPS Access Mode for ES Clusters
Set "ssl" = "true". A temporary solution is to implement a "Trust All" method in FE/BE. The real user configuration certificate will be used in the future.
Query Usage
You can use the ES external tables in Doris the same way as using Doris internal tables, except that the Doris data models (Rollup, Pre-Aggregation, and Materialized Views, etc.) are unavailable.
Basic Query
select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
Extended esquery(field, QueryDSL)
Theesquery(field, QueryDSL) function can be used to push some queries that cannot be expressed in SQL, such as match_phrase, and geoshape, etc., to ES for filtering. The first parameter (the column name) inesqueryis used to associate withindex, and the second parameter is the JSON expression of basicQuery DSL in ES, which is surrounded by {}. Theroot key in JSON is unique, which can be match_phrase, geo_shape, orbool, etc.
match_phrase query:
select * from es_table where esquery(k4, ' {
        "match_phrase": {
           "k4": "doris on es"
        }
    }');
geo related query:
select * from es_table where esquery(k4, ' {
      "geo_shape": {
         "location": {
            "shape": {
               "type": "envelope",
               "coordinates": [
                  [
                     13,
                     53
                  ],
                  [
                     14,
                     52
                  ]
               ]
            },
            "relation": "within"
         }
      }
   }');
bool query:
select * from es_table where esquery(k4, ' {
         "bool": {
            "must": [
               {
                  "terms": {
                     "k1": [
                        11,
                        12
                     ]
                  }
               },
               {
                  "terms": {
                     "k2": [
                        100
                     ]
                  }
               }
            ]
         }
      }');
Suggestions for Time Fields
Note:
These are only applicable to ES external tables. Time fields will be automatically mapped to Date or Datetime type in ES Catalogs.
In ES, the usage of time fields is very flexible, but in ES external tables, improper type setting of time fields in ES external tables will result in filter conditions pushdown failures.
It is recommended to allow the highest level of format compatibility for time fields when creating an index:
 "dt": {
     "type": "date",
     "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
 }
It is recommended to set its type asdateordatetime(orvarchar) when creating this field in Doris. You can use the following SQL statements to push the filter conditions down to ES:
select * from doe where k2 > '2020-06-21';
﻿
select * from doe where k2 < '2020-06-21 12:00:00'; 
﻿
select * from doe where k2 < 1593497011; 
﻿
select * from doe where k2 < now();
﻿
select * from doe where k2 < date_format(now(), '%Y-%m-%d');
Note:
If theformatis not set for the time field in ES, the default format of time fields is:
  strict_date_optional_time||epoch_millis
If the date fields imported into ES that are Timestamps, they need to be converted intoms.msis the internal processing format in ES; otherwise, errors will occur in ES external tables.
Obtaining ES Metadata Field _id
Each imported file, if not specified with an _id, will be given a globally unique _id, which is the primary key. Users can also specify an _id with unique business meanings to the files during import;
If you need to obtain this field value from ES external tables, you can add a varchar-typed _id when creating tables:
CREATE EXTERNAL TABLE `doe` (
  `_id` varchar COMME "",
  `city`  varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
}
To obtain this field values from ES Catalogs, please set "mapping_es_id" = "true" .
Note:
The_id field only supports=andinfiltering.
The _id field must be of varchar type.
FAQs
Are X-Pack authenticated ES clusters supported?
All ES clusters with HTTP Basic authentications are supported.
Does some queries require longer response time than those in ES?
Yes, such as _count related queries, ES can directly read the metadata associated with the number of the specified files instead of filtering the original data. This is a huge time saver.
Can aggregation operations be pushed down?
Currently, Doris On ES does not support pushdown for aggregations such as sum, avg, and min/max, etc. In such operations, Doris obtains all files that meet the specified conditions from ES and then conduct computing internally.
Appendix
How Doris Conducts Queries in ES
+----------------------------------------------+
|                                              |
| Doris      +------------------+              |
|            |       FE         +--------------+-------+
|            |                  |  Request Shard Location
|            +--+-------------+-+              |       |
|               ^             ^                |       |
|           |                       |          |       |
|  +-------------------+ +------------------+  |       |
|  |            |      | |    |             |  |       |
|  | +----------+----+ | | +--+-----------+ |  |       |
|  | |      BE       | | | |      BE      | |  |       |
|  | +---------------+ | | +--------------+ |  |       |
+----------------------------------------------+       |
   |        |          | |        |         |          |
   |        |          | |        |         |          |
   |    HTTP SCROLL    | |    HTTP SCROLL   |          |
+-----------+---------------------+------------+       |
|  |        v          | |        v         |  |       |
|  | +------+--------+ | | +------+-------+ |  |       |
|  | |               | | | |              | |  |       |
|  | |   DataNode    | | | |   DataNode   +<-----------+
|  | |               | | | |              | |  |       |
|  | |               +<--------------------------------+
|  | +---------------+ | | |--------------| |  |       |
|  +-------------------+ +------------------+  |       |
|   Same Physical Node                         |       |
|                                              |       |
|           +-----------------------+          |       |
|           |                       |          |       |
|           |      MasterNode       +<-----------------+
| ES        |                       |          |
|           +-----------------------+          |
+----------------------------------------------+
﻿
﻿
1. FE sends a request to the specified host for table creation in order to obtain information about the HTTP port and the index shard allocation. If the request fails, it will sequentially traverse the host list until it succeeds or fails completely.
2. Based on the information about node and index metadata from FE, Doris generates a query plan and sends it to the corresponding BE node.
3. The BE node will preferentially send request to the locally deployed ES node following the principle of proximity, and obtain data from each tablet of the ES index concurrently through theHTTP Scrollmethod from_sourceordocvalue.
4. Doris returns the computing results to the user after calculation.

Contact Us

Contact our sales team or business advisors to help your business.

Technical Support

Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

7x24 Phone Support

tencent cloud

Recent Pages

ES Catalog

Create Catalog

Parameter description

Column type map

Array Type

Best Practice

Filter Condition Pushdown

Columnar Scan for Faster Queries (enable\\_docvalue\\_scan=true)

Sniffing Keyword Fields

Auto Node Discovery, Set to True by Default (nodes\\_discovery=true)

HTTPS Access Mode for ES Clusters

Query Usage

Basic Query

Extended esquery(field, QueryDSL)

Suggestions for Time Fields

Obtaining ES Metadata Field `_id`

FAQs

Are X-Pack authenticated ES clusters supported?

Does some queries require longer response time than those in ES?

Can aggregation operations be pushed down?

Appendix

How Doris Conducts Queries in ES

Was this page helpful?

Was this page helpful?

Parameter	Mandatory	Default Value	Description
`hosts`	Yes	-	ES address, can be one or multiple addresses, or the Cloud Load Balancer address of ES.
`user`	No	Empty	ES username
`password`	No	Empty	Password of the corresponding user
`doc_value_scan`	No	true	Whether to obtain value of the target field by ES/Lucene columnar storage
`keyword_sniff`	No	true	Whether to sniff the text.fields in ES based on keywords. If this is set to false, the system will perform matching after according to the content after tokenization.
`nodes_discovery`	No	true	Whether to enable ES node discovery, set to true by default, set to false in network isolation environments and only connected to the specified nodes
`ssl`	No	false	Whether to enable HTTPS access mode for ES, currently follows a "Trust All" method in FE/BE
`mapping_es_id`	No	false	Whether to map the `_id` field in the ES index
`like_push_down`	No	true	Whether to convert like into wildcard push down to ES, which will increase ES consumption of CPU.

ES Type	Doris Type	Comment
null	null	-
boolean	boolean	-
byte	tinyint	-
short	smallint	-
integer	int	-
long	bigint	-
unsigned_long	largeint	-
float	float	-
half_float	float	-
double	double	-
scaled_float	double	-
date	date	It only supports default/yyyy-MM-dd HH:mm:ss/yyyy-MM-dd/epoch_millis format
keyword	string	-
text	string	-
ip	string	-
nested	string	-
object	string	-
other	unsupported	-

SQL syntax	ES 5.x+ syntax
=	term query
in	terms query
> , < , >= , ⇐	range query
and	bool.filter
or	bool.should
not	bool.must_not
not in	bool.must_not + terms query
is\\_not\\_null	exists query
is\\_null	bool.must_not + exists query
esquery	ES-native json QueryDSL

tencent cloud

Sign Up

Log in

Recent Pages

ES Catalog

Create Catalog

Parameter description

Column type map

Array Type

Best Practice

Filter Condition Pushdown

Columnar Scan for Faster Queries (enable\\_docvalue\\_scan=true)

Sniffing Keyword Fields

Auto Node Discovery, Set to True by Default (nodes\\_discovery=true)

HTTPS Access Mode for ES Clusters

Query Usage

Basic Query

Extended esquery(field, QueryDSL)

Suggestions for Time Fields

Obtaining ES Metadata Field _id

FAQs

Are X-Pack authenticated ES clusters supported?

Does some queries require longer response time than those in ES?

Can aggregation operations be pushed down?

Appendix

How Doris Conducts Queries in ES

Was this page helpful?

Was this page helpful?

Obtaining ES Metadata Field `_id`