BloomFilter is a fast search algorithm for multi-hash function maps proposed by Bloom in 1970. It is typically used in applications that need to quickly determine whether an Element belongs to a collection, but it does not strictly require 100% accuracy. BloomFilter has the following features:
It is a high-space-efficiency probabilistic data structure, and used to check whether an Element is in a collection.
For an invocation to check whether an Element exists, BloomFilter tells the invoker one of the two outcomes: the Element may exist or it definitely does not exist.
The disadvantage is that there may be false judgement. That is, it tell you that the Element may exist, but actually it does not necessarily exist.
BloomFilter is actually made up of a super-long binary bit array and a series of hash functions. The binary bit array is initially all 0. When a given Element to be queried is given, this Element will be calculated by a series of hash functions to map out a series of values, and all of the values are set to 1 at the offset of the bit array.
The diagram below shows a Bloom Filter example with m=18, k=3 (m is the size of the Bit array, and k is the number of Hash functions). The three Elements x, y, z in the collection are hashed into the bit array through the three different hash functions. When we query Element w, if there is a bit that is 0 after the Hash function calculation, we know that w is not in the set.
Then how do we determine whether an Element is in the collection? This Element is also hashed to get all the offset positions. If all these positions are 1, then the Element is in the collection. If there is one that is not 1, then it is not in the collection. It's as simple as that.
Doris BloomFilter Index and Use Cases
The following is an example: in HBase, if you want to find a short row that takes up 100 bytes of storage, a 64KB HFile data block should contain (64 * 1024) /100 = 655.36, rounded to about 700 rows. If you can only index on the starting row key of the entire data block, it won't provide you with fine-grained index information, because the row you are looking for may fall on the row range of that data block, or it may not be on that data block at all, or there may not even be that row in the table, or the row data is in another HFile, or even in the MemStore. All these situations would cause additional IO overhead when reading the data block from disk, and it would also overuse the data block cache. When dealing with a huge data set and high concurrency reads, it will severely affect performance.
Therefore, HBase provides a BloomFilter, which allows you to do an inverse test of the data in each data block in your storage.
When a row is requested, the BloomFilter checks if the row is in this data block or not. The BloomFilter will answer "not in the row"/"unknown". This is why we call it an inverse test. BloomFilters can also be applied to cells in the row, so when you access a column, you can use the same inverse test.
However, using a BloomFilter does come with a cost. Storing this index takes up additional space. BloomFilters grow as their indexed data objects grow. When space is not an issue, they can help you completely use the performance of your system.
Doris's BloomFilter index can be specified at the time of table creation, or completed through the ALTER operation on the table. The BloomFilter is essentially a bitmap structure used for quick determination of whether a given value is in a set. This type of judgement may have a probability of error. So, if it returns false then it definitely is not in the set. However, if it returns true, it may still not be in the set.
Doris's BloomFilter Index is also created on a Block basis. Each Block generates a BloomFilter index entry from the values of specified columns in the Block as a collection, allowing for quick filtering of data that does not meet the query conditions.
Creating a BloomFilter Index
The creation of a Doris BloomFilter index is achieved through adding the attribute "bloom_filter_columns"="k1, k2, k3" in the PROPERTIES of the create table statement, where k1, k2, and k3 are the key column names of the BloomFilter index you want to create. The following example shows that we've created a BloomFilter index for saler_id, category_id in the table.
CREATE TABLE IF NOT EXISTS sale_detail_bloom (
sale_date date NOT NULL COMMENT "Sale Date",
customer_id int NOT NULL COMMENT "Customer ID",
saler_id int NOT NULL COMMENT "Salesperson",
sku_id int NOT NULL COMMENT "Product ID",
category_id int NOT NULL COMMENT "Product Category",
sale_count int NOT NULL COMMENT "Sale Quantity",
sale_price DECIMAL(12,2) NOT NULL COMMENT "Unit Price",
sale_amt DECIMAL(20,2) COMMENT "Total Sale Amount"
)
Duplicate KEY(sale_date, customer_id,saler_id,sku_id,category_id)
PARTITION BY RANGE(sale_date)
(
PARTITION P_202111 VALUES [('2021-11-01'), ('2021-12-01'))
)
DISTRIBUTED BY HASH(saler_id) BUCKETS 10
PROPERTIES (
"replication_num" = "3",
"bloom_filter_columns"="saler_id,category_id",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-2147483648",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "P_",
"dynamic_partition.replication_num" = "3",
"dynamic_partition.buckets" = "3"
);
You can also modify an already created table to build Bloomfilter index on specified columns.
alter table sale_detail_bloom set ("bloom_filter_columns" = "saler_id,category_id");
Viewing BloomFilter Index
To view the BloomFilter index we've built on the table, use:
SHOW CREATE TABLE <table_name>;
Deleting BloomFilter Index
To delete an index is to remove the index column from the bloom_filter_columns attribute:
ALTER TABLE <db.table_name> SET ("bloom_filter_columns" = "");
Modifying BloomFilter Index
Modifying an index is to modify the bloom_filter_columns attribute of the table:
ALTER TABLE <db.table_name> SET ("bloom_filter_columns" = "k1,k3");
Use Cases of Doris BloomFilter
Creating a BloomFilter index on a column can be considered when the following conditions are met:
1. Firstly, BloomFilter is applicable for non-prefix filtering.
2. Queries will be filtered frequently based on this column, and most of the query conditions are 'in' and '=' filtering.
3. Unlike Bitmap, BloomFilter is suitable for high cardinality columns, such as UserID. This is because if it is created on a low cardinality column, such as the "gender" column, almost every Block will contain all values, and then the BloomFilter index will be meaningless.
Usage Precautions for Doris BloomFilter
1. BloomFilter index creation is not supported for Tinyint, Float, Double type columns.
2. BloomFilter index only speeds up in and = filtering queries.
3. To check whether a specific query has hit the BloomFilter index, one can check the query's Profile information.
4. It supports creating bloomfilter indexes on multiple columns of a single table, but only one of bitmap index/bloomfilter index is supported for a specific column.
Usage Requirements
Supported column types:
SMALLINT
INT
UNSIGNEDINT
BIGINT
LARGEINT
CHAR
VARCHAR
DATE
DATETIME
DECIMAL
Other types are not supported yet
Conditions under which the query takes effect:
"=", for example: field = value
"is", for example: field IS NULL
"in", for example: field IN {value1, value2, ...}
Best Practice
Here, through the query optimization of the lineitem2
table that has 600 million records, we showcase the best practice of using BloomFilter index.
Assuming we have the following SQL query that needs optimization:
select l_partkey,l_orderkey from lineitem2 where l_partkey =xxx;
Analyze whether the SQL query can adopt a BloomFilter index for optimization
The column l_partkey, which is the condition for the SQL query, has a high resolution of 20,000,000 distinct entries, of type INT, and uses an equality condition in the query. It can therefore be optimized using BloomFilter.
Create and wait for the BloomFilter index to get built
Creating BloomFilter Index:
alter table lineitem2 set ("bloom_filter_columns" = "l_partkey");
View the completion of index creation with show alter table column. Note that if the building process isn't done, it won't use the BloomFilter index while querying.
Upon successful creation, you can examine it with the statement: show create table.
Verifying the effectiveness of the optimization
Querying the non-optimized table lineitem, which takes 490ms.
Viewing with profile, it directly uses vectorized processing for filtering.
Using lineitem2 table optimized by Bloomfilter, which takes 50ms.
Viewing with profile, we see that it applies BloomFilter index for filtering, complemented with a small amount of vectorized processing.
Compared with lineitem, lineitem2 has increased the storage by 1.7G.
Was this page helpful?