Background
Statistics are critical to cluster operation: CDWPG's query optimizer chooses an execution plan based on dynamically calculated costs. How is the cost calculated? Usually from a cost model combined with statistics, so the soundness of the cost model and the accuracy of the statistics directly affect the effectiveness of query optimization.
Tablespace utilization also affects query cost. When a table undergoes update operations (including `INSERT VALUES`, `UPDATE`, `DELETE`, and `ALTER TABLE ... ADD COLUMN`), garbage data that is no longer used is left behind in the system tables and in the updated table, degrading system performance and taking up a lot of disk space. Therefore, you need to monitor tables regularly for data bloat.
The following sections detail the regular monitoring and maintenance of statistics and data bloat.
Statistics Collection
`ANALYZE` collects table statistics in a database and stores the results in the `pg_statistic` system catalog. The query planner uses these statistics to determine the most efficient execution plan for a query. The statistics include information such as the amount of data and the indexes in the table. A good query plan depends on accurate table statistics.
ANALYZE description
`ANALYZE` is a command provided by Greenplum to collect statistics. It supports column, table, and database granularities, as shown below:
CREATE TABLE foo (id int NOT NULL, bar text NOT NULL) DISTRIBUTED BY (id); -- Create the test table `foo`
ANALYZE foo(bar); -- Collect statistics of the `bar` column only
ANALYZE foo; -- Collect statistics of the `foo` table
ANALYZE; -- Collect statistics of all tables in the current database (requires the appropriate permission)
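After running `ANALYZE`, you can inspect the stored statistics through the standard `pg_stats` view, a readable wrapper over `pg_statistic`. A minimal sketch against the `foo` table above:
-- Per-column null fraction, distinct-value estimate, and most common values
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'foo';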
Usage limits of ANALYZE
`ANALYZE` takes a `SHARE UPDATE EXCLUSIVE` lock on the target table, which conflicts with DDL statements.
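If a DDL statement appears to hang while `ANALYZE` is running, you can check for lock conflicts through the `pg_locks` system view. A minimal sketch:
-- List lock requests that have not been granted and the relations they wait on
SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted;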
ANALYZE speed
`ANALYZE` is a sampling-based statistics collector that usually does not scan all data in a table, but it still consumes some time and computing resources for large tables.
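The sample size is controlled by the per-column statistics target, so you can trade speed for accuracy. A minimal sketch using the `foo` table above (the target value 400 is an arbitrary illustration):
-- Raise the statistics target for a skewed column, then re-collect its statistics
ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 400;
ANALYZE foo(bar);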
When to run ANALYZE
As mentioned above, `ANALYZE` locks tables and consumes system resources, so it is important to run it at the right time and no more often than necessary. We recommend running `ANALYZE` in the following four scenarios (a sketch of the first one follows the list):
After data is batch loaded; for example, after data is imported to a newly created table.
After an index is created.
After `INSERT`, `UPDATE`, or `DELETE` operations are performed on a large amount of data.
After `VACUUM FULL` is executed.
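A minimal sketch of the first scenario, batch-loading into the `foo` table above and then refreshing its statistics (the file path is a placeholder assumption):
-- Batch-load data, then collect statistics before querying the table
COPY foo FROM '/data/foo.csv' CSV; -- hypothetical path
ANALYZE foo;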
Analyzing partitioned tables
As long as you keep the default value of the system parameter `optimizer_analyze_root_partition`, partitioned tables need no special handling: just run `ANALYZE` on the root table, and the system will automatically collect the statistics of all leaf partitions.
If a table has many partitions, running `ANALYZE` on the root table can be time-consuming. Partitioned tables are usually organized along a time dimension, and historical partitions are rarely modified; therefore, we recommend running `ANALYZE` separately on the partitions whose data has changed, as sketched below.
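A minimal sketch, assuming a hypothetical range-partitioned table `sales` with a current partition named `p202401` (Greenplum names leaf tables `<parent>_1_prt_<partition_name>`):
-- Analyze only the leaf partition that received new data
ANALYZE sales_1_prt_p202401;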
Data Bloat
The Greenplum database's heap tables use PostgreSQL's multiversion concurrency control (MVCC) storage implementation. A deleted or updated row is only logically deleted: an invisible image of the row remains in the table. As more operations are performed, the table accumulates more and more invisible data, which occupies a lot of storage space and seriously degrades the performance of table operations. This data bloat therefore needs to be fixed regularly.
Data bloat monitoring
The `gp_toolkit` schema provides a `gp_bloat_diag` view that determines table bloat from the ratio of the actual number of pages to the expected number of pages. To use this view, make sure that up-to-date statistics have been collected for all tables in the database. Then, run the following SQL:
gpadmin=# SELECT * FROM gp_toolkit.gp_bloat_diag;
bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages | bdidiag
----------+------------+------------+-------------+-------------+---------------------------------------
21488 | public | t1 | 97 | 1 | significant amount of bloat suspected
(1 row)
Here, `bdirelpages` is the actual number of pages and `bdiexppages` is the expected number of pages in the `t1` table. Only tables whose bloat ratio exceeds 4 are recorded in the view, so a table may still be slightly bloated even if it is not listed. You can also compare a table's size at different points in time to determine whether its data has bloated, as sketched below.
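A minimal sketch of the size comparison, using the standard size functions (`t1` is the table from the example above):
-- Record these values periodically and compare them across runs
SELECT pg_size_pretty(pg_total_relation_size('t1')) AS total_size,
       pg_size_pretty(pg_relation_size('t1')) AS heap_size;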
Cleaning data bloat in a table
The `VACUUM <tablename>` command adds expired rows to the shared free space map so that their space can be reused. When `VACUUM` is run periodically on a table with frequent `UPDATE` operations, the space occupied by expired rows can be reused quickly, easing table bloat. How often to run `VACUUM` is determined by the rate of `UPDATE` and `DELETE` operations on the table.
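A minimal sketch (`t1` as above); the `VERBOSE` option reports how many expired rows were found, which can help calibrate the cycle:
-- Reclaimable-space pass; reads and writes on the table are not blocked
VACUUM VERBOSE t1;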
Note:
`VACUUM` holds a `SHARE UPDATE EXCLUSIVE` lock just as `ANALYZE` does and may conflict with DDL operations.
When a table is already severely bloated, `VACUUM` can only slow the bloat down; it cannot immediately reclaim the space. To reclaim all the bloated space at once, run `VACUUM FULL`. However, `VACUUM FULL` takes an `ACCESS EXCLUSIVE` lock on the table, blocking all other DDL and DML statements on it until the command finishes. Proceed with caution, as `VACUUM FULL` may be time-consuming for large tables.
Another way to address bloat is to redistribute the data in the table (a concrete example follows these steps):
1. Record the distribution key of the table.
2. Change the distribution policy of the table to random distribution.
ALTER TABLE <tablename> SET WITH (REORGANIZE=false)
DISTRIBUTED RANDOMLY;
3. Change the distribution policy back to the initial one.
ALTER TABLE <tablename> SET WITH (REORGANIZE=true)
DISTRIBUTED BY (<original distribution columns>);
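Applied to the `foo` table created earlier (distribution key `id`), the sequence looks like this; the second statement rewrites the table, which is what reclaims the bloated space:
-- Mark the table randomly distributed without moving data
ALTER TABLE foo SET WITH (REORGANIZE=false) DISTRIBUTED RANDOMLY;
-- Restore the original distribution key; this rewrites the table
ALTER TABLE foo SET WITH (REORGANIZE=true) DISTRIBUTED BY (id);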
Handling index bloat
The `VACUUM FULL` command only reclaims space from tables. To reclaim space from indexes, rebuild them with the `REINDEX` command.
REINDEX TABLE <tablename>; -- Rebuild all indexes on the table
REINDEX INDEX <index_name>; -- Rebuild a specified index
Note:
Both `REINDEX` and `VACUUM FULL` take an `ACCESS EXCLUSIVE` lock on the table.
Regular Cluster Maintenance
When using a cluster, you need to eliminate data bloat and maintain statistics regularly so that the cluster delivers optimal performance.
Cleaning without table locking
As mentioned above, `VACUUM <tablename>` marks reclaimable space and slows data bloat without blocking table access: reads and writes on the table are unaffected, but DDL statements on it are blocked, and some CPU and memory resources are consumed.
Recommended frequency:
Once a day or at least twice a week if a large amount of data is updated in real time and many data records change every day.
Once a week under normal circumstances or once a month if data is not updated frequently.
Use the following script to clean a table with a cron job.
#!/bin/bash
export PGHOST=xxx.xxx.x.x
export PGPORT=5436
export PGUSER=test
export PGPASSWORD=test
db="test"
psql -d $db -e -c "VACUUM test_table;"
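A minimal crontab entry to schedule the script daily; the script path, log path, and time are placeholder assumptions:
# Run the lock-free VACUUM script at 02:00 every day (hypothetical paths)
0 2 * * * /home/gpadmin/vacuum_daily.sh >> /tmp/vacuum_daily.log 2>&1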
Full cleaning with table locking
If the data in a table is updated and deleted frequently, we recommend planning a business pause window in which to run `VACUUM FULL` and `REINDEX` to reclaim all the bloated space in the table. Cleaning with table locking puts an `ACCESS EXCLUSIVE` lock on the table, during which you cannot perform any operations on it.
1. Run `VACUUM FULL <tablename>`.
2. Run `REINDEX TABLE <tablename>` (you can skip this step for tables without indexes).
3. Run `ANALYZE <tablename>`.
Recommended frequency: Once a week or once a day if almost all data is updated daily.
Use the following script to regularly clean cluster tables, preferably during off-peak hours such as early weekend mornings.
#!/bin/bash
export PGHOST=xxx.xxx.x.x
export PGPORT=5436
export PGUSER=test
export PGPASSWORD=test
db="test"
psql -d $db -e -c "VACUUM FULL test_table;"
psql -d $db -e -c "REINDEX TABLE test_table;"
psql -d $db -e -c "ANALYZE test_table;"
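A minimal crontab entry matching this recommendation; the script path, log path, and time are placeholder assumptions:
# Run the full-cleaning script at 03:00 every Saturday (hypothetical paths)
0 3 * * 6 /home/gpadmin/vacuum_full_weekly.sh >> /tmp/vacuum_full_weekly.log 2>&1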