tencent cloud

Feedback

Index, Sorted Column and Prefix Index

Last updated: 2024-06-27 10:55:08

    Index

    Doris supports a wide range of index structures to reduce data scans and improve query efficiency. The currently supported index types are:
    Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This index can effectively prune data to better support high concurrency reporting scenarios.
    Z-order Index: It allows highly efficient range queries for any combination of fields in the data model.
    Min/Max: This enables effective filtering of equivalence and range queries for numeric types.
    Bloom Filter: It is highly effective in equivalence filtering and pruning of high cardinality columns.
    Invert Index: It enables quick search for any field.
    Unlike traditional database, Doris does not support creating indexes on any column. MPP-structured OLAP databases like Doris usually handle large amounts of data through improved concurrency.

    Sort Column (Sort Key):

    To improve the query performance, Doris has optimized the organization structure of data storage. Essentially, Doris stores data in an ordered data structure similar to SSTable (Sorted String Table). This structure can sort storage according to specific columns (one or multiple), also known as sort columns. In such a data structure, it is very efficient to use sorting columns as conditions to retrieve data.
    In the three data models of Aggregate, Unique, and Duplicate, the underlying data is sorted and stored according to the columns specified in AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY of the table creation statement. Rollup can specify its own sort keys, but the sort keys must be a prefix of the Rollup column order.
    Note
    In the column definitions of the table creation statement, the definition of sort column must appear before the definitions of other columns.
    The order of sort columns is determined by the column order in the table creation statement.

    Prefix Index

    This is an index method based on the sorted data structure (SSTable), implementing fast data query with given prefix columns. For queries that can utilize the sorting structure, Doris uses binary search algorithm to locate the target data range. However, if there are a lot of rows in the table, direct binary search on sort columns requires loading all the data in the filter column into memory, which would consume a large amount of memory. To optimize this, Doris introduces sparse Shortkey Index (prefix index) based on the sort key. The content of the Sortkey Index is 1024 times less than the data volume (Doris combines every 1024 rows of data into a logical data block, called Data Block, and each Data Block stores a row of index in the prefix index), therefore the content will be fully cached in memory, which can effectively speed up the query in actual search.
    Large number of Sort Key columns will occupy a lot of memory. For performance purposes, prefix index items are restricted as follows:
    At most 3 columns can be selected as Shortkey columns.
    Columns of FLOAT / DOUBLE type cannot be used.
    Prefix index can only be constructed in the order of sort keys.
    VARCHAR / CHAR type columns can only appear once and must be at the last position.
    All columns do not exceed 36 bytes, and VARCHAR / CHAR columns are truncated according to the remaining byte number.
    When the user specifies the property of the short_key in the table creation statement, for example, "short_key" = "4" specifies 4 columns as short_key, which can break through the above restrictions.

    Sample code

    1. The prefix index of the following table structure is user_id(8 Bytes) + age(4 Bytes) + message(prefix 20 Bytes).
    ColumnName
    Type
    user_id
    BIGINT
    age
    INT
    message
    VARCHAR(100)
    max_dwell_time
    DATETIME
    min_dwell_time
    DATETIME
    2. The prefix index of the following table structure is user_name(20 Bytes). Even if it does not contain 36 bytes, it is directly truncated and does not continue backwards because it encounters VARCHAR.
    ColumnName
    Type
    user_name
    VARCHAR(20)
    age
    INT
    message
    VARCHAR(100)
    max_dwell_time
    DATETIME
    min_dwell_time
    DATETIME
    When our query condition is aprefix of the prefix index, it can greatly speed up the query speed. For example, we perform the following query in the first example:
    SELECT * FROM table WHERE user_id=1829239 and age=20;
    The efficiency of this query will be much higher than the following query:
    SELECT * FROM table WHERE age=20;
    Therefore, when creating a table, selecting the column order correctly can greatly improve the query efficiency.

    Best Practice

    Adjusting the Prefix Index through ROLLUP

    Since the column order has been specified when a table is created, a table only has one prefix index. This may not satisfy the requirements of efficiency for queries that use the column that cannot hit the prefix index as the conditions. Therefore, we can adjust the column order artificially by creating a ROLLUP.

    Optimizing the Order of Sort Columns to Improve Query Performance

    When the Sort Key involves multiple columns, pay attention to the order. It is recommended to place highly distinct and frequently queried columns in front.
    Note the number of sort columns:
    1. If a large number of columns are selected for sort columns, the sorting when importing data will increase the time of the whole importing process.
    2. A small number of well-designed sort columns can also quickly locate the position of the data row. Adding more columns for sorting will not improve query.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support