Tencent Cloud

ES External Table

Last updated: 2024-07-04 10:14:17

Note:
The content showcased in this document is only suitable for Tencent Cloud TCHouse-D v1.1 and below. For later versions, it is recommended to use the Multi-Catalog feature for interfacing with external data directories.
Doris-On-ES combines the distributed query planning capability of Doris with the full-text search capability of ES to provide a more complete OLAP solution that supports:
1. Distributed Join queries across multiple indexes in ES.
2. Joint queries over tables in Doris and ES, with more complex full-text search filtering.
This document introduces the implementation principles and usage of this feature.
Definitions
Doris-related
FE: Frontend, the front-end node of Doris, responsible for metadata management and request access.
BE: Backend, the back-end node of Doris, responsible for query execution and data storage.
ES-related
DataNode: Data storage and computing node of ES.
MasterNode: The master node of ES, managing metadata, node, and data distribution, etc.
scroll: ES built-in dataset cursor feature, used for stream scanning and filtering of data.
_source: The original JSON format document content passed in during import.
doc_values: column storage definition of fields in ES/Lucene.
keyword: string type field, ES/Lucene does not tokenize text content.
text: string type field, ES/Lucene will tokenize text content. The tokenizer needs to be specified by the user. The standard English tokenizer is used by default.
Usage
Creating ES Index
PUT test
{
   "settings": {
      "index": {
         "number_of_shards": "1",
         "number_of_replicas": "0"
      }
   },
   "mappings": {
      "doc": { // ES 7.x and later do not require type to be specified when creating an index. There will be a default and unique `_doc` type
         "properties": {
            "k1": {
               "type": "long"
            },
            "k2": {
               "type": "date"
            },
            "k3": {
               "type": "keyword"
            },
            "k4": {
               "type": "text",
               "analyzer": "standard"
            },
            "k5": {
               "type": "float"
            }
         }
      }
   }
}
Importing data into the ES Index
POST /_bulk
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": "Trying out Elasticsearch", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out Doris", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
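The bulk request above is newline-delimited JSON: one action line followed by one document line per record. As a minimal sketch, the body could be generated as shown below. The helper name `build_bulk_body` is hypothetical and not part of Doris or ES; the index name, type, and field values are taken from the example above.

```python
import json


def build_bulk_body(index, docs, doc_type=None):
    """Build an NDJSON body for the ES _bulk API (hypothetical helper).

    Each document produces two lines: an "index" action line and the
    document source itself.
    """
    lines = []
    for doc in docs:
        action = {"_index": index}
        if doc_type is not None:
            # Only needed for clusters earlier than ES 7.x.
            action["_type"] = doc_type
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(doc))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"k1": 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", "k5": 10.0},
    {"k1": 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0},
]
body = build_bulk_body("test", docs, doc_type="doc")
print(body)
```

The resulting string can be sent as the body of POST /_bulk with any HTTP client.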
Creating an ES External Table in Doris
For more information on the syntax of creating a table, see CREATE TABLE.
CREATE EXTERNAL TABLE `test` -- Schema is not specified; the ES mapping is pulled automatically to create the table
ENGINE=ELASTICSEARCH 
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"type" = "doc",
"user" = "root",
"password" = "root"
);

CREATE EXTERNAL TABLE `test` (
  `k1` bigint(20) COMMENT "",
  `k2` datetime COMMENT "",
  `k3` varchar(20) COMMENT "",
  `k4` varchar(100) COMMENT "",
  `k5` float COMMENT ""
) ENGINE=ELASTICSEARCH -- The ENGINE must be ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"type" = "doc",
"user" = "root",
"password" = "root"
);
Parameter description:
Parameter	Description
hosts	ES cluster address. Can be one or more addresses, or the address of an ES load balancer.
index	Name of the corresponding ES index. Aliases are supported, but if doc_value is used, the real index name must be specified.
type	The index type. Not required for ES 7.x and later.
user	ES cluster username.
password	Password of the corresponding user.
For clusters earlier than ES 7.x, make sure to specify the correct index type when creating the table.
Currently, only HTTP Basic authentication is supported. The user must have access to paths such as /_cluster/state/ and _nodes/http, as well as read permission on the index. If security authentication is not enabled for the cluster, the username and password do not need to be set.
The column names in the Doris table need to match the field names in ES exactly, and the field types should be consistent.
The ENGINE must be Elasticsearch.
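For reference, HTTP Basic authentication sends the username and password as a Base64-encoded `Authorization` header. The sketch below shows how such a header is formed, which can help when checking with an HTTP client that the configured user can read the paths mentioned above. The function name is hypothetical, and the `root`/`root` credentials are the placeholder values from the example.

```python
import base64


def basic_auth_header(user, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials from the CREATE TABLE example above.
headers = {"Authorization": basic_auth_header("root", "root")}
print(headers)
```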
Filter Condition Pushdown
One significant feature of Doris On ES is the pushdown of filter conditions: the conditions are pushed down to ES, so only data that truly meets the criteria is returned, significantly improving query performance and reducing the CPU, memory, and IO usage of both Doris and Elasticsearch.
enable_new_es_dsl indicates whether to use the new DSL generation logic. All subsequent bug fixes and iterations are developed in the new DSL. The value is true by default, and can be modified in fe.conf.
The following operators are optimized into the following ES queries:
SQL syntax	ES 5.x+ syntax
=	term query
in	terms query
> , < , >= , <=	range query
and	bool.filter
or	bool.should
not	bool.must_not
not in	bool.must_not + terms query
is_not_null	exists query
is_null	bool.must_not + exists query
esquery	ES native JSON-format QueryDSL
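To make the mapping above concrete, here is a minimal sketch that translates a single predicate into the corresponding ES query clause. It is an illustration of the table only, not the DSL generator used by Doris; the function name `to_es_query` and its arguments are hypothetical, and the exact DSL that Doris produces may be structured differently.

```python
# Range operators and their ES range-query keys.
RANGE_OPS = {">": "gt", "<": "lt", ">=": "gte", "<=": "lte"}


def to_es_query(op, field=None, value=None, children=None):
    """Translate one SQL-style predicate into an ES query clause (illustrative only)."""
    if op == "=":
        return {"term": {field: value}}
    if op == "in":
        return {"terms": {field: list(value)}}
    if op in RANGE_OPS:
        return {"range": {field: {RANGE_OPS[op]: value}}}
    if op == "and":
        return {"bool": {"filter": children}}
    if op == "or":
        return {"bool": {"should": children}}
    if op == "not":
        return {"bool": {"must_not": children}}
    if op == "not in":
        return {"bool": {"must_not": [{"terms": {field: list(value)}}]}}
    if op == "is_not_null":
        return {"exists": {"field": field}}
    if op == "is_null":
        return {"bool": {"must_not": [{"exists": {"field": field}}]}}
    raise ValueError(f"unsupported operator: {op}")


# k1 > 1000 and k3 = 'term'
query = to_es_query("and", children=[
    to_es_query(">", "k1", 1000),
    to_es_query("=", "k3", "term"),
])
print(query)
```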
Data Type mapping
Doris/ES	byte	short	integer	long	float	double	keyword	text	date
tinyint	✓	-	-	-	-	-	-	-	-
smallint	✓	✓	-	-	-	-	-	-	-
int	✓	✓	✓	-	-	-	-	-	-
bigint	✓	✓	✓	✓	-	-	-	-	-
float	-	-	-	-	✓	-	-	-	-
double	-	-	-	-	-	✓	-	-	-
char	-	-	-	-	-	-	✓	✓	-
varchar	-	-	-	-	-	-	✓	✓	-
date	-	-	-	-	-	-	-	-	✓
datetime	-	-	-	-	-	-	-	-	✓
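The type mapping above can be encoded as a lookup for checking a planned Doris schema against an ES mapping before creating the table. This is a sketch only: the dictionary restates the table, and the helper `is_compatible` is a hypothetical name, not a Doris function.

```python
# Doris column type -> ES field types it can map, as listed in the table above.
COMPATIBLE = {
    "tinyint": {"byte"},
    "smallint": {"byte", "short"},
    "int": {"byte", "short", "integer"},
    "bigint": {"byte", "short", "integer", "long"},
    "float": {"float"},
    "double": {"double"},
    "char": {"keyword", "text"},
    "varchar": {"keyword", "text"},
    "date": {"date"},
    "datetime": {"date"},
}


def is_compatible(doris_type, es_type):
    """Return True if the Doris type can map the ES type according to the table."""
    # Strip a length/precision suffix such as "(20)" from "bigint(20)".
    base = doris_type.lower().split("(")[0].strip()
    return es_type in COMPATIBLE.get(base, set())


print(is_compatible("bigint(20)", "long"))   # Doris bigint over ES long
print(is_compatible("tinyint", "short"))     # Doris tinyint over ES short
```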
Enable Columnar Scan for Faster Queries (enable_docvalue_scan=true)
CREATE EXTERNAL TABLE `test` (
  `k1` bigint(20) COMMENT "",
  `k2` datetime COMMENT "",
  `k3` varchar(20) COMMENT "",
  `k4` varchar(100) COMMENT "",
  `k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"enable_docvalue_scan" = "true"
);
Parameter description:
Parameter	Description
enable_docvalue_scan	Whether to obtain the values of queried fields from ES/Lucene columnar storage. The default value is false.
After enabling, Doris accesses data from ES according to the following two principles:
Try and see: Doris automatically checks whether columnar storage is enabled for the target fields (doc_value: true). If it is enabled for all of them, Doris obtains all field values from columnar storage.
Automatic downgrade: If any field to be accessed does not have columnar storage, all field values are parsed from the row storage (_source).
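The two principles above amount to an all-or-nothing decision per query. The following sketch illustrates that decision against an ES mapping. It is not the Doris implementation: the function name `choose_scan_source` is hypothetical, and it assumes a field has columnar storage unless it is of type text or its mapping sets doc_values to false.

```python
def choose_scan_source(mapping_properties, fields):
    """Return "docvalue" only if every requested field has columnar storage."""
    for name in fields:
        spec = mapping_properties.get(name)
        if spec is None:
            # Unknown field: fall back to row storage.
            return "_source"
        if spec.get("type") == "text" or spec.get("doc_values") is False:
            # One field without doc_values downgrades the whole scan.
            return "_source"
    return "docvalue"


# Mapping of the `test` index created earlier in this document.
properties = {
    "k1": {"type": "long"},
    "k2": {"type": "date"},
    "k3": {"type": "keyword"},
    "k4": {"type": "text", "analyzer": "standard"},
    "k5": {"type": "float"},
}
print(choose_scan_source(properties, ["k1", "k3"]))
print(choose_scan_source(properties, ["k1", "k4"]))
```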
Benefits
By default, Doris On ES obtains all required columns from _source, which is stored row-wise in JSON format and is slower than columnar storage for batch reads. In particular, when only a small number of columns need to be read, docvalue can be about a dozen times faster than _source.
Note
1. Fields of type text in ES have no columnar storage, so if a field to be accessed is of text type, Doris automatically downgrades to reading from _source.
2. When the number of accessed fields is large (>= 25), reading field values from docvalue performs about the same as reading them from _source.
Sniff Keyword Field (enable_keyword_sniff=true)
CREATE EXTERNAL TABLE `test` (
  `k1` bigint(20) COMMENT "",
  `k2` datetime COMMENT "",
  `k3` varchar(20) COMMENT "",
  `k4` varchar(100) COMMENT "",
  `k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"enable_keyword_sniff" = "true"
);
Parameter description:
Parameter	Description
enable_keyword_sniff	Whether to sniff tokenized string (text) fields in ES and obtain the name of the additional non-tokenized (keyword) field (multi-fields mechanism).
ES allows direct data import without creating an index since it will automatically create a new index after import. For string fields, ES will create a field with both text and keyword types. This is how the multi-fields feature of ES works. The mapping is as follows:
"k4": {
   "type": "text",
   "fields": {
      "keyword": {   
         "type": "keyword",
         "ignore_above": 256
      }
   }
}
When an equality filter is applied to k4, the filter is converted into an ES term query.
SQL filter conditions:
k4 = "Doris On ES"
The converted ES query DSL is:
"term" : {
    "k4": "Doris On ES"
}
Because the first field of k4 is text, it will be tokenized by the analyzer set for k4 (or by the standard analyzer if no analyzer has been set for k4) after data ingestion. As a result, it will be tokenized into three terms: Doris, On, and ES. The details are as follows:
POST /_analyze
{
  "analyzer": "standard",
  "text": "Doris On ES"
}
The tokenization result is:
{
   "tokens": [
      {
         "token": "doris",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "on",
         "start_offset": 6,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "es",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
The query that is actually issued is:
"term" : {
    "k4": "Doris On ES"
}
The term Doris On ES does not match any term in the dictionary, so no results are returned. With enable_keyword_sniff set to true, k4 = "Doris On ES" is automatically converted to k4.keyword = "Doris On ES", which fully matches the SQL semantics. The converted ES query DSL is:
"term" : {
    "k4.keyword": "Doris On ES"
}
k4.keyword is of type keyword, and when data is written into ES, it is a complete term, so it can match.
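The sketch below illustrates both halves of this behavior: why the exact string does not match the indexed tokens of a text field, and how a keyword sub-field can be located in the mapping. It is not the Doris implementation. Both function names are hypothetical, and `approx_standard_tokens` is only a rough approximation of the ES standard analyzer (split on non-word characters, then lowercase).

```python
import re


def approx_standard_tokens(text):
    """Rough approximation of the standard analyzer: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def resolve_term_field(mapping_properties, field):
    """Return the keyword sub-field name for a text field, or the field itself."""
    spec = mapping_properties.get(field, {})
    if spec.get("type") == "text":
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") == "keyword":
                return f"{field}.{sub_name}"
    return field


mapping = {
    "k4": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}

tokens = approx_standard_tokens("Doris On ES")
print(tokens)                             # indexed terms of the text field
print("Doris On ES" in tokens)            # the unanalyzed term is not among them
print(resolve_term_field(mapping, "k4"))  # field used when sniffing is enabled
```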
Auto Node Discovery
Set to true by default (nodes_discovery=true).
CREATE EXTERNAL TABLE `test` (
  `k1` bigint(20) COMMENT "",
  `k2` datetime COMMENT "",
  `k3` varchar(20) COMMENT "",
  `k4` varchar(100) COMMENT "",
  `k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"nodes_discovery" = "true"
);
Parameter description:
Parameter	Description
nodes_discovery	Whether to enable ES node discovery. The default value is true.
When set to true, Doris discovers all available data nodes in ES (those to which shards are allocated). If the Doris BE cannot reach the ES data node addresses, set this parameter to false. This is typically the case when the ES cluster is deployed in a private network isolated from the public Internet and users access it through a proxy.
HTTPS Access to the ES Cluster (http_ssl_enabled=true)
Set this parameter to true if HTTPS access is enabled for the ES cluster. The default value is false.
CREATE EXTERNAL TABLE `test` (
  `k1` bigint(20) COMMENT "",
  `k2` datetime COMMENT "",
  `k3` varchar(20) COMMENT "",
  `k4` varchar(100) COMMENT "",
  `k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"http_ssl_enabled" = "true"
);
Parameter description:
Parameter	Description
http_ssl_enabled	Whether HTTPS access is enabled for the ES cluster.
As a temporary solution, the current FE/BE implementation trusts all certificates. In the future, certificates configured by the user will be used.
Query Usage
You can use the ES external tables in Doris the same way as using Doris internal tables, except that the Doris data models (Rollup, Pre-Aggregation, and Materialized Views) cannot be used.
Basic query
select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
Extended esquery(field, QueryDSL)
The esquery(field, QueryDSL) function can be used to push queries that cannot be expressed in SQL, such as match_phrase and geo_shape, down to ES for filtering. The first parameter of esquery is used to associate with the index, while the second parameter is the JSON expression of the basic ES Query DSL, enclosed in {}. The JSON must have exactly one root key, such as match_phrase, geo_shape, or bool.
A match_phrase query:
select * from es_table where esquery(k4, ' {
        "match_phrase": {
           "k4": "doris on es"
        }
    }');
A geo related query:
select * from es_table where esquery(k4, ' {
      "geo_shape": {
         "location": {
            "shape": {
               "type": "envelope",
               "coordinates": [
                  [
                     13,
                     53
                  ],
                  [
                     14,
                     52
                  ]
               ]
            },
            "relation": "within"
         }
      }
   }');
A bool query:
select * from es_table where esquery(k4, ' {
         "bool": {
            "must": [
               {
                  "terms": {
                     "k1": [
                        11,
                        12
                     ]
                  }
               },
               {
                  "terms": {
                     "k2": [
                        100
                     ]
                  }
               }
            ]
         }
      }');
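When such statements are generated by a program, the Query DSL can be built as a dictionary and serialized, which also makes it easy to check the single-root-key rule. The helper below is a hypothetical sketch, not a Doris API; it assumes MySQL-style string literals in which backslashes and single quotes must be escaped.

```python
import json


def esquery_predicate(column, query_dsl):
    """Render an esquery(...) predicate from a Query DSL dictionary (hypothetical helper)."""
    if len(query_dsl) != 1:
        raise ValueError("the Query DSL must have exactly one root key")
    body = json.dumps(query_dsl)
    # Assumption: MySQL-style literals, so escape backslashes and single quotes.
    escaped = body.replace("\\", "\\\\").replace("'", "''")
    return f"esquery({column}, '{escaped}')"


predicate = esquery_predicate("k4", {"match_phrase": {"k4": "doris on es"}})
sql = f"select * from es_table where {predicate};"
print(sql)
```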
Principles
+----------------------------------------------+
|                                              |
| Doris      +------------------+              |
|            |       FE         +--------------+-------+
|            |                  |  Request Shard Location
|            +--+-------------+-+              |       |
|               ^             ^                |       |
|           |                       |          |       |
|  +-------------------+ +------------------+  |       |
|  |            |      | |    |             |  |       |
|  | +----------+----+ | | +--+-----------+ |  |       |
|  | |      BE       | | | |      BE      | |  |       |
|  | +---------------+ | | +--------------+ |  |       |
+----------------------------------------------+       |
   |        |          | |        |         |          |
   |        |          | |        |         |          |
   |    HTTP SCROLL    | |    HTTP SCROLL   |          |
+-----------+---------------------+------------+       |
|  |        v          | |        v         |  |       |
|  | +------+--------+ | | +------+-------+ |  |       |
|  | |               | | | |              | |  |       |
|  | |   DataNode    | | | |   DataNode   +<-----------+
|  | |               | | | |              | |  |       |
|  | |               +<--------------------------------+
|  | +---------------+ | | |--------------| |  |       |
|  +-------------------+ +------------------+  |       |
|   Same Physical Node                         |       |
|                                              |       |
|           +-----------------------+          |       |
|           |                       |          |       |
|           |      MasterNode       +<-----------------+
| ES        |                       |          |
|           +-----------------------+          |
+----------------------------------------------+
1. After the ES external table is created, the FE sends a request to the specified hosts to obtain the HTTP port of each node and the shard allocation of the index. If a request fails, the FE tries the hosts in the list in order until one succeeds or all of them have failed.
2. At query time, a query plan is generated and sent to the corresponding BE nodes based on the node information obtained by the FE and the metadata of the index.
3. Following the principle of proximity, each BE node sends requests to the locally deployed ES node and concurrently fetches data from _source or docvalue of each shard of the ES index through HTTP Scroll.
4. After Doris finishes the computation, it returns the result to the user.
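The host traversal in step 1 can be sketched as a simple fallback loop. This is an illustration under stated assumptions, not the FE code: `fetch_from_first_available` is a hypothetical name, and `fetch` stands for whatever call retrieves node and shard information from one host.

```python
def fetch_from_first_available(hosts, fetch):
    """Try each host in order; return (host, result) for the first one that succeeds."""
    errors = []
    for host in hosts:
        try:
            return host, fetch(host)
        except Exception as exc:  # any failure moves on to the next host
            errors.append((host, str(exc)))
    raise ConnectionError(f"all hosts failed: {errors}")


def demo_fetch(host):
    """Stub standing in for a real HTTP request: only the second host responds."""
    if host != "http://192.168.0.2:8200":
        raise ConnectionError("unreachable")
    return {"http_port": 8200}


hosts = ["http://192.168.0.1:8200", "http://192.168.0.2:8200"]
host, info = fetch_from_first_available(hosts, demo_fetch)
print(host, info)
```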
Best Practice
Recommended usage of time type fields
In ES, time fields can be used very flexibly, but in Doris On ES an improper type setting for a time field causes filter condition pushdown to fail.
It is recommended to allow the highest level of format compatibility for time fields when creating an index:
 "dt": {
     "type": "date",
     "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
 }
When creating this field in Doris, it is recommended to set its type to date or datetime (or varchar). You can use the following SQL statements to push the filter conditions down to ES:
select * from doe where k2 > '2020-06-21';

select * from doe where k2 < '2020-06-21 12:00:00';

select * from doe where k2 < 1593497011;

select * from doe where k2 < now();

select * from doe where k2 < date_format(now(), '%Y-%m-%d');
Note that if the format of the time field in ES is not set, the default time field format is:
strict_date_optional_time||epoch_millis
Timestamps imported into ES need to be converted to milliseconds, which is the format ES uses internally. Otherwise, errors will occur in ES external tables.
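A minimal sketch of that conversion before import is shown below. The function names are illustrative only, and the sample value is the epoch-seconds literal used in the query above.

```python
from datetime import datetime, timezone


def seconds_to_millis(epoch_seconds):
    """Convert an epoch timestamp in seconds to milliseconds."""
    return int(epoch_seconds) * 1000


def datetime_to_millis(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(dt.timestamp() * 1000)


print(seconds_to_millis(1593497011))
print(datetime_to_millis(datetime(2020, 6, 30, 6, 3, 31, tzinfo=timezone.utc)))
```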
Obtain ES Metadata Field _id
Each imported document that is not given an _id is assigned a globally unique _id by ES, which serves as its primary key. Users can also assign an _id with business meaning to a document during import. To obtain this field value in Doris On ES, add an _id field of varchar type when creating the table.
CREATE EXTERNAL TABLE `doe` (
  `_id` varchar COMMENT "",
  `city`  varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
);
Note
The _id field only supports = and in filtering.
The _id field must be of varchar type.
FAQs
1. What are the ES version requirements for Doris On ES?
The ES major version must be 5 or later. Data is scanned differently in ES 2.x and earlier than in ES 5.x and later, and currently only ES 5.x and later are supported.
2. Are ES clusters with X-Pack authentication supported?
All ES clusters that use HTTP Basic authentication are supported.
3. Are some queries slower than when they are run directly in ES?
Yes. For _count queries, for example, ES can directly read metadata about the number of matching documents instead of filtering the original data.
4. Can Aggregation operations be pushed down?
Currently, Doris On ES does not support pushdown of aggregation operations such as sum, avg, and min/max. For such operations, Doris fetches all documents that meet the specified conditions from ES and then performs the computation internally.
