DLC Source Table FAQs

Last updated: 2024-07-31 17:35:14

    Why Must Data Optimization Be Enabled for Upsert Write Scenarios in DLC Native Table (Iceberg)?

    1. DLC Native Table (Iceberg) uses the MOR (Merge On Read) table format. When upsert writes occur upstream, each update writes a delete file that marks the old record as deleted and then adds a new data file containing the modified record.
    2. Without compaction, the job engine must merge the original data files, the delete files, and the new data files at read time to produce the latest data, which consumes significant resources and time. Small file merging in data optimization reads and merges these files in advance and writes the results into new data files, so the job engine can read the latest files directly without merging at query time.
    3. DLC native tables (Iceberg) use a snapshot mechanism: new snapshots are generated on every write, and historical snapshots are not cleaned up automatically. The snapshot expiration capability of data optimization removes old snapshots, freeing up storage space and preventing unused historical data from occupying storage.
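    As a sketch, an upsert-ready native table can be declared with Iceberg's standard merge-on-read table properties. The catalog, database, table, and column names below are hypothetical; the properties themselves (format-version, write.upsert.enabled, and the write.*.mode settings) are standard Iceberg table properties, though whether a given engine honors write.upsert.enabled depends on the writer (it is primarily read by Flink):

    ```sql
    -- Hypothetical table configured for upsert writes in merge-on-read mode.
    CREATE TABLE `DataLakeCatalog`.`demo_db`.`user_events` (
        id   BIGINT,
        name STRING,
        ts   TIMESTAMP
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'       = '2',              -- v2 is required for row-level delete files
        'write.upsert.enabled' = 'true',           -- upserts produce delete files (MOR pattern)
        'write.update.mode'    = 'merge-on-read',
        'write.delete.mode'    = 'merge-on-read',
        'write.merge.mode'     = 'merge-on-read'
    );
    ```

    With this configuration, every upstream upsert accumulates delete files and small data files, which is exactly why the data optimization service described above needs to be enabled alongside it.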

    How to Handle Timeout in Data Optimization Tasks?

    The system sets a default timeout (2 hours) for running data optimization tasks to prevent a single task from occupying resources for too long and blocking other tasks. When the timeout expires, the system cancels the optimization task. Handle timeouts according to the task type, as follows.
    1. If small file merge tasks frequently time out, data has accumulated and the current resources are insufficient to merge it. Temporarily expand resources (or set the table to use dedicated optimization resources) to work through the backlog, then revert the settings.
    2. If small file merge tasks occasionally time out, optimization resources may be insufficient. Consider scaling out data optimization resources moderately and monitoring whether timeouts recur in subsequent governance cycles. Occasional small file merge timeouts do not immediately impact query performance, but if left unaddressed they can become continuous timeouts and eventually degrade query performance. DLC enables segmented commits for small file merges by default, so the completed portions of a timed-out task are still committed successfully and remain effective.
    3. If a snapshot expiration task times out, note that it runs in two stages. In the first stage, the snapshot is removed from the metadata; this stage usually does not time out. In the second stage, the data files associated with the removed snapshot are deleted from storage; this stage compares files one by one and may time out when there are many files to delete. Timeouts of this type can be ignored: files that were not deleted before the timeout are treated as orphan files and are cleaned up by subsequent orphan file removal tasks.
    4. If orphan file removal tasks time out, no action is needed either. Orphan file removal is a periodic task: as long as undeleted files are still identified as orphans when scanned, they will be scanned and removed again in subsequent cycles. A timed-out task is simply retried in the next cycle.
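    For reference, the same maintenance operations can also be triggered manually through Iceberg's standard Spark procedures, assuming the engine has the Iceberg SQL extensions enabled (the table name reuses the example table from this document; the timestamp is illustrative):

    ```sql
    -- Merge small files (Iceberg's rewrite_data_files procedure).
    CALL `DataLakeCatalog`.system.rewrite_data_files(table => 'axitest.upsert_case');

    -- Expire snapshots older than a given timestamp.
    CALL `DataLakeCatalog`.system.expire_snapshots(
        table      => 'axitest.upsert_case',
        older_than => TIMESTAMP '2024-07-01 00:00:00');

    -- Remove files no longer referenced by any snapshot.
    CALL `DataLakeCatalog`.system.remove_orphan_files(table => 'axitest.upsert_case');
    ```

    Running these manually against a backlogged table can complement a temporary resource expansion, since each call operates in a bounded scope rather than one long-running governance task.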

    Why Does Iceberg Occasionally Read an Old Snapshot Shortly after Inserting Data?

    1. Iceberg provides a default caching capability for the catalog, with a default expiration of 30 seconds. In extreme cases, if two queries against the same table are issued very close together in time and are not executed in the same session, there is a very low probability that the later query reads the previous snapshot before the cache expires and the update becomes visible.
    2. The Iceberg community recommends enabling this parameter, and earlier DLC versions enabled it by default to speed up task execution and reduce metadata access during queries. However, when the read and write intervals of two tasks are very close, the situation described above can occur in extreme cases.
    3. In the latest versions of the DLC engine, this parameter is disabled by default. Users who purchased an engine before January 2024 and need strong data consistency in queries can disable it manually by modifying the engine parameters as follows:
    "spark.sql.catalog.DataLakeCatalog.cache-enabled": "false"
    "spark.sql.catalog.DataLakeCatalog.cache.expiration-interval-ms": "0"

    Why Should DLC Native Table (Iceberg) Be Partitioned?

    1. Data optimization jobs are divided by partition first. If a native table (Iceberg) has no partitions, most small file merges that modify the table run as a single job, so merges cannot run in parallel, which significantly reduces merge efficiency.
    2. If the table has no upstream partition field, how can it be partitioned? In this case, consider using Iceberg's bucket partitioning. For a detailed description, see DLC Native Table Core Capabilities.
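    A minimal sketch of bucket partitioning with Iceberg's Spark DDL (the table and column names are hypothetical; the bucket count of 16 is illustrative and should be sized to the table's data volume):

    ```sql
    -- Hash the primary key into 16 buckets so small file merges can run
    -- in parallel, roughly one optimization job per bucket.
    CREATE TABLE `DataLakeCatalog`.`demo_db`.`orders` (
        order_id BIGINT,
        amount   DECIMAL(10, 2),
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (bucket(16, order_id))
    TBLPROPERTIES ('format-version' = '2');
    ```

    Because bucketing is a hash transform on an existing column, it gives data optimization parallelism even when the upstream data carries no natural partition field such as a date.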

    How to Handle Write Conflicts in DLC Native Table (Iceberg)?

    1. To ensure ACID compliance, Iceberg checks whether the current table view has changed during a commit. If changes are detected, a conflict is assumed: the commit is rolled back, the current view is merged in, and the commit is retried.
    2. The system provides default retry counts and intervals for conflicts. If the commit still conflicts after multiple attempts, the write operation fails. For the default conflict parameters, see DLC Native Table Core Capabilities.
    3. If conflicts occur, users can adjust the number and interval of retries. The following example sets the conflict retry count to 10. For details on the parameter meanings, see DLC Native Table Core Capabilities.
    -- Set conflict retry count to 10
    ALTER TABLE `DataLakeCatalog`.`axitest`.`upsert_case` SET TBLPROPERTIES('commit.retry.num-retries' = '10');
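    The retry interval can be tuned through Iceberg's standard commit retry properties in the same way; the values below are illustrative, not recommendations:

    ```sql
    -- Standard Iceberg commit retry properties.
    ALTER TABLE `DataLakeCatalog`.`axitest`.`upsert_case` SET TBLPROPERTIES(
        'commit.retry.min-wait-ms'      = '1000',     -- minimum wait between retries
        'commit.retry.max-wait-ms'      = '60000',    -- maximum wait between retries
        'commit.retry.total-timeout-ms' = '1800000'   -- give up after 30 minutes in total
    );
    ```

    Longer waits reduce the chance that two frequent writers repeatedly collide, at the cost of higher write latency when conflicts do occur.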

    The DLC Native Table (Iceberg) Has Been Deleted, but Why Is the Storage Space Not Released?

    When a DLC native table (Iceberg) is dropped, the metadata is deleted immediately, while the data is deleted asynchronously: it is first moved to a recycle bin directory and then removed from storage one day later.