How to choose a data model?
For data that is already aggregated, either the Aggregate or Duplicate model can be used.
For detail data that has already been cleaned, if you need to perform aggregation for querying statistical values, choose the Aggregate model.
For detail data that has already been cleaned, if you need to perform detailed (per-row) queries, choose the Duplicate model. The Duplicate model performs no pre-aggregation, so aggregation queries are less efficient; you can use materialized views to improve their performance.
If rows have a unique key and you need to deduplicate by that key, choose the Unique model.
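As a rough sketch, creating a table under each model in Doris-style SQL differs mainly in the key clause. Table names, column names, and bucket counts below are hypothetical placeholders:

```sql
-- Aggregate model: value columns carry an aggregation type (here SUM)
-- and are pre-aggregated as data is loaded.
CREATE TABLE site_visits_agg (
    event_date DATE,
    city       VARCHAR(32),
    pv         BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY (event_date, city)
DISTRIBUTED BY HASH (event_date) BUCKETS 8;

-- Unique model: rows sharing the same key are deduplicated,
-- keeping the most recently loaded row.
CREATE TABLE users_unique (
    user_id   BIGINT,
    user_name VARCHAR(64)
)
UNIQUE KEY (user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 8;

-- Duplicate model: raw detail rows are stored as-is, with no pre-aggregation.
CREATE TABLE events_detail (
    event_time DATETIME,
    user_id    BIGINT,
    url        VARCHAR(256)
)
DUPLICATE KEY (event_time, user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 8;
```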
How to set up partitions?
Choose partition fields with values that are as evenly distributed as possible. It is recommended to use fields such as dates or IDs for partitioning.
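For example, a table range-partitioned by a date column might be sketched as follows (names and ranges are hypothetical):

```sql
-- Range partitioning on an evenly distributed date column;
-- each month of data lands in its own partition.
CREATE TABLE access_logs (
    log_date DATE,
    msg      VARCHAR(1024)
)
DUPLICATE KEY (log_date)
PARTITION BY RANGE (log_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
DISTRIBUTED BY HASH (log_date) BUCKETS 8;
```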
How to set up replicas?
Set the number of replicas to 3. It can also be set to 2, but 1 is not recommended: with a single replica, data becomes unavailable whenever a node is taken offline during a rolling upgrade.
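The replica count is typically set as a table property at creation time; a minimal sketch with hypothetical names:

```sql
-- Three replicas per tablet (the recommended setting).
CREATE TABLE t_example (
    id BIGINT,
    v  INT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id) BUCKETS 8
PROPERTIES ("replication_num" = "3");
```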
How to set up buckets?
Use Keys with a uniform Hash distribution as bucket Keys to avoid data skewing.
The number of buckets should be neither too large nor too small. It is recommended to keep each bucket between 1 GB and 10 GB of data; for small tables, a few buckets are usually sufficient.
The bucket key can be one or more columns. Multiple columns give a more balanced data distribution, while a single column is easier for queries to match for bucket pruning (a single bucket key should generally be a high-cardinality column).
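The bucketing guidance above boils down to the `DISTRIBUTED BY` clause; a fragment with hypothetical columns:

```sql
-- Two bucket keys spread data more evenly; a single high-cardinality
-- key such as user_id alone would make bucket pruning easier for
-- queries that filter on it.
DISTRIBUTED BY HASH (user_id, event_date) BUCKETS 16
```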
How to choose field formats?
For each field in the table, prefer integer types over string types; integers greatly improve query and version-merge efficiency.
Use DECIMAL for fractional numbers instead of DOUBLE or FLOAT.
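A sketch of these type choices in a table definition (all names hypothetical):

```sql
CREATE TABLE orders (
    order_id BIGINT,          -- integer key rather than a string ID
    status   TINYINT,         -- small encoded code rather than VARCHAR
    amount   DECIMAL(18, 2)   -- exact DECIMAL rather than DOUBLE/FLOAT
)
DUPLICATE KEY (order_id)
DISTRIBUTED BY HASH (order_id) BUCKETS 8;
```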
Must-Knows for Data Import
For real-time imports, Stream Load is recommended; for offline imports, Broker Load is preferred. The basic principle of importing is batching: reduce concurrency and load as much data as possible in a single job to minimize merge (compaction) costs and avoid impacting read efficiency. For example, keep the total number of imports under 20 per minute; given the various kinds of concurrency involved, high-frequency imports are currently not suitable. Too many small files can severely affect query efficiency.
When importing data, pay special attention to filtering out NULL values in fields used as hash (bucket) keys, to avoid data skew.
When importing into large tables, always specify the target partitions; otherwise the import can easily fail.
It is recommended to use CSV for data import and avoid JSON format.
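An offline Broker Load that follows the points above (batched job, explicit partitions, CSV) might be sketched as follows; the label, path, table, partition, and broker name are all hypothetical:

```sql
LOAD LABEL example_db.orders_20240101
(
    DATA INFILE ("hdfs://namenode:8020/data/orders/2024-01-01/*.csv")
    INTO TABLE orders
    PARTITION (p202401)           -- always name the target partitions
    COLUMNS TERMINATED BY ","     -- CSV rather than JSON
)
WITH BROKER "my_broker";
```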
Must-Knows for Schema Changes
To ensure cluster stability and meet business requirements, we recommend thoroughly evaluating field types before creating a table. When the table schema must be modified, only the Add Column operation is recommended; new columns take effect quickly.
Must-Knows for Data Clearing
If you need to clear partition data, prioritize the truncate operation (equivalent to dropping and re-creating the partition) rather than delete; the delete operation significantly impacts query performance.
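For example, clearing one month's partition (hypothetical table and partition names):

```sql
-- Clears the partition in place, equivalent to dropping and re-creating it.
TRUNCATE TABLE access_logs PARTITION (p202401);
```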
Other
For unfamiliar operations, it is best to type `help <keyword>` on the command line for guidance (e.g., `help stream load` to learn how to perform a real-time data import). Alternatively, you can submit a ticket to get more detailed assistance with your issue.