tencent cloud

Feedback

Bitmap Index

Last updated: 2024-06-27 11:10:34
    Users can speed up their queries by creating a bitmap index. This document mainly introduces how to create a bitmap index, as well as some considerations and common problems when creating a bitmap index.

    Definitions

    Bitmap index: An index represented by a bitmap, with a bitmap established for every key value of the index column. It is a fast data structure that can speed up query performance. Schematic diagram of index principle:
    

    How It Works

    It is mainly created for columns with a large number of identical values (for example: categories, operators, department IDs, warehouse IDs, etc.). An index block stores the key values and the start and end Rowids in an index row, as well as the position encoding of these key values. Each bit in the position encoding indicates whether there is a data row corresponding to the key value. A block may point to the location of tens or even hundreds of rows of data.
    When querying by key value, you can quickly locate the data based on the starting Rowid and the status of the bitmap.
    When performing and, or, or in(x, y, ...) queries based on key values, you can directly use the bitmap of the index to perform bitwise operations and quickly get the results of the row data.
    When executing select count(xx), you can quickly get statistical data by directly accessing the index.

    Syntax

    Creating and deleting are essentially a schema change job. For more details, see Schema Change.

    Creating Index

    Supports designating a Bitmap index when creating a table.
    CREATE TABLE create_table_with_bitmap ( l_shipdate date NOT NULL, l_comment varchar(44) NOT NULL, INDEX shipdate_bm_index (l_shipdate) USING BITMAP COMMENT 'shipdate bitmap index' ) ENGINE=OLAP DUPLICATE KEY(l_shipdate) COMMENT 'OLAP' DISTRIBUTED BY HASH(l_shipdate) BUCKETS 96 PROPERTIES ( "replication_allocation" = "tag.location.default: 3", "in_memory" = "false", "storage_format" = "V2", "disable_auto_compaction" = "false" );
    Also supports modifying an existing table to add a Bitmap index for a certain column, such as creating a Bitmap index for siteid on table1:
    CREATE INDEX [IF NOT EXISTS] index_name ON table1 (siteid) USING BITMAP COMMENT 'balabala';

    Viewing Index

    Displays the index of the specified table_name:
    SHOW INDEX FROM example_db.table_name;

    Deleting Index

    Deletes the index of the specified table_name:
    DROP INDEX [IF EXISTS] index_name ON [db_name.]table_name;
    

    Basic Principles

    TChouse-D, the data warehouse used by Tencent Cloud, uses column storage. Without an index, to determine whether a column F is equal to a certain value V, you would have to traverse all the data in a column to get the row number equal to V. Bitmap index, to avoid traversing all data during retrieval, sequentially saves all values of the column (ordered dictionary) based on the original column stored data, as well as the row number corresponding to each value. By this value, you can quickly find all row numbers equal to this value during retrieval.
    For example, if a column of data is [x, x, y, y, y, z, y, x, z, x], including 10 rows, then the ordered dictionary of the bitmap index of this column data is {x, y, z}, and the bitmaps corresponding to x,y,z are:
    Bitmap of x: [0, 1, 7, 9]
    Bitmap of y: [2, 3, 4, 6]
    Bitmap of z: [5, 8]
    

    Notes

    At present, the index command ([show | create | drop] index) only supports bitmap type index.
    Bitmap index is created only on a single column.
    Bitmap index can be applied on all columns of Duplicate, Uniq data models and key columns of Aggregate model.
    The supported data types for bitmap index are:
    TINYINT
    SMALLINT
    INT
    BIGINT
    CHAR
    VARCHAR
    DATE
    DATETIME
    LARGEINT
    DECIMAL
    BOOL
    Bitmap index is effective only under Segment V2. When creating an index, the table's storage format will be converted to the V2 format by default (you can view the table's storage format by the command: show create table table_name).

    Use Cases of Bitmap Index

    Bitmap index is suitable for columns with low cardinality, and it is advised to be in the range of 100 to 100,000, such as: job, city, etc. If there is too much repetition, it has no obvious advantage compared with other types of indexes; if there is too little repetition, both spatial efficiency and performance will significantly decrease.
    For certain types of queries such as count, or, and, etc. logical operations, only bit operation is needed, for example: combined query through multiple conditions, select count(*) from table where job = 'doctor' and phonetype = 'iphone' and gender ='male'; for scenarios like this, if a bitmap index is established on each query condition column, the database can perform efficient bit operation, precisely pinpoint the needed data, reduce disk IO. Moreover, the smaller the query result set, the more obvious the advantage of bitmap index.
    It is suitable for ad-hoc queries, multi-dimensional analysis, and other OLAP scenarios. If a table has 100 columns, users would use 20 columns as query conditions (arbitrarily using any of the N columns on these 20 columns). It's almost impossible to create a suitable b-tree index. However, if 20 bitmap indexes are created on these columns, then all queries can apply the index.
    Bitmap index supports exact value queries and range lookups, supporting =, <, <=, >, >= expressions.

    Non-Use Cases of Bitmap Index

    Columns with low repetitive values, such as: ID Card numbers, mobile phone numbers, etc.
    Columns with high redundancy, such as: gender, which can establish bitmap index, but it is not recommended to use alone as a query condition, it's better to filter with other conditions.
    Column that often needs to be updated.

    Best Practice

    Below is based on the lineitem table of the TPCH test set (600 million records), building a Bitmap index and using it.
    enter image description here
    
    Assuming for the lineitem table, the following sql is expected to be optimized, and the duration before optimization is about 900ms.
    select l_quantity,l_returnflag,l_commitdate from lineitem where l_suppkey = 'xxx';

    Analyze whether Bitmap index is applicable

    Query condition l_suppkey, at this moment it does not hit the prefix index and there is no other index. The field type of l_suppkey is INT, which can use the Bitmap index.
    View the cardinality of the l_suppkey column. In the 600 million data table, the value of l_suppkey is 1 million. The cardinality isn't particularly high, here it tries to use the Bitmap index for optimization.
    enter image description here
    

    Create Bitmap and wait for index building

    Creating Bitmap index:
    create index suppkey_bitmap_index on lineitem(l_suppkey) using bitmap comment 'test';
    View the index building progress via the 'show alter table column' command
    enter image description here
    
    View the existing index of the lineitem table via 'show create table lineitem' command
    enter image description here
    

    Comparing the effects before and after optimization

    After optimization, the duration is 40ms.
    enter image description here
    
    Before optimization, the duration is 240ms (here, the lineitem2 table used is a backup of the lineitem table, the data volume and table structure are consistent).
    enter image description here
    

    FAQs

    Does 'select count(*)' have an impact on Bitmap index columns?

    Yes, by viewing the profile of the statement select count(*) from lineitem where l_suppkey = '388041';, you can see that the RowsBitmapIndexFiltered field has filtered out a lot of data.
    RowsBitmapIndexFiltered: 200.01525M (200015250)

    Why is it not suitable to use bitmap index when the cardinality of column values is large?

    Through testing, a bitmap index was built on the l_partkey column (20 million values) of the 600 million records in the lineitem table. Before optimization, the single query duration was 600ms, and after optimization, it was 120ms. The efficiency has increased fivefold. It can be seen that creating an index for columns with high cardinality can also improve query efficiency.
    However, storage consumption is quite large. The lineitem table with 600 million data and 3 replicas occupies about 62G of storage. After the bitmap index is built on the l_partkey column, the Storage of the lineitem table increased to 95G. In theory, the storage growth of a bitmap index build without compression is 20 million (value) x 600 million (number of rows) bits.
    It can be seen that building a bitmap index for high cardinality columns is not cost-effective. In this case, it is recommended to use a bloomfilter index.
    
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support