Data Integration Overview

Last updated: 2024-11-01 17:00:28
    Data Integration quickly connects and integrates various cloud-based and self-built on-premises data sources, supporting scenarios that involve data integration and synchronization, such as data platform construction, database migration and backup, business upgrades and integration, data access acceleration, and full-text search.

    Use Limits

    1. Data Synchronization: Data Integration supports only the transmission of data objects that can be abstracted into logical two-dimensional tables. Structured, semi-structured, and unstructured data (e.g., data in COS) can be synchronized as long as it can be abstracted into structured data. The FTP method supports synchronizing completely unstructured files (e.g., meteorological files) to HDFS, but this method does not extract data content.
    2. Network Connectivity: Data synchronization is supported within a single region and between some region pairs, meeting data exchange and synchronization needs. Some regions can transfer data over the classic network, but connectivity is not guaranteed. If a classic network test shows no connectivity, it is recommended to connect over the public network.
    3. Task Execution: Running data integration tasks requires Data Integration resource groups. Create a resource group before using any data integration feature. Resource groups include offline packages, real-time packages, etc., which can be purchased as needed based on the type of task to be run.
    4. Data Consistency: Data Integration synchronization supports at-least-once delivery but does not fully guarantee exactly-once delivery; that is, data may be duplicated. Deduplication relies on the primary key and the capabilities of the destination.
    5. Data Types and Precision: During offline or real-time synchronization, pay attention to field type matching and precision conversion between the source and target sides of a synchronization task. If the source and target types are incompatible, or if a target field's range is narrower than the source's (a smaller maximum, a larger minimum, or lower precision), writes may fail or precision may be truncated.
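    As a hedged illustration of point 5, the sketch below checks whether a target integer type can hold a source type's full value range. The type names and ranges are common SQL conventions used for illustration, not a WeData API.

```python
# Hypothetical pre-flight check: verify that each target column can hold
# the source column's value range. Ranges below are standard SQL integer
# ranges, used here purely as an illustrative assumption.

INT_RANGES = {
    "TINYINT": (-128, 127),
    "INT": (-2**31, 2**31 - 1),
    "BIGINT": (-2**63, 2**63 - 1),
}

def check_numeric_mapping(source_type: str, target_type: str) -> list[str]:
    """Return a list of warnings for a source -> target integer mapping."""
    warnings = []
    s_min, s_max = INT_RANGES[source_type]
    t_min, t_max = INT_RANGES[target_type]
    if s_max > t_max or s_min < t_min:
        warnings.append(
            f"{source_type} -> {target_type}: target range is narrower; "
            "writes may fail or overflow"
        )
    return warnings

print(check_numeric_mapping("BIGINT", "INT"))   # narrower target: warning emitted
print(check_numeric_mapping("INT", "BIGINT"))   # wider target: no warning
```

    A real check would also cover floating-point precision and DECIMAL scale, which follow the same narrower-target principle.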

    Offline Synchronization

    Data Integration provides offline data synchronization, which periodically reads data in bulk from the source and synchronizes it to the destination.

    Real-time Synchronization

    Data Integration offers real-time data synchronization with support for streaming data transmission. It consumes real-time data at single-table, sharded database/table, and multi-database multi-table granularity, with task types including single table synchronization, whole database synchronization, and log collection.
    Single table synchronization: The source is a single table or sharded databases/tables, and the target is a single table. Single table synchronization uses fixed schema matching: the task must specify the field mapping between the source and target tables, and during execution only the specified source fields are written to the target fields.
    Whole database synchronization: Syncs all data in an entire source instance, or a specified set of databases and tables, to multiple tables on the target side. This task type does not require specifying field mappings between source and target; by default, all source table fields are read and matched to target fields by name.
    Log collection: Uses an Agent or SDK to actively report log file data from CVM instances, self-built servers, or TKE to an external destination.
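    The default name-based field matching used by whole database synchronization can be sketched as follows; the field lists are illustrative, not taken from any real instance.

```python
# Sketch of name-based field matching: each source field is written to
# the target field with the same name; unmatched fields on either side
# are ignored. The schemas below are illustrative assumptions.

def match_fields(source_fields, target_fields):
    """Map each source field to a target field with the same name."""
    target_set = set(target_fields)
    return {f: f for f in source_fields if f in target_set}

source = ["id", "name", "created_at", "legacy_flag"]
target = ["id", "name", "created_at", "updated_at"]

print(match_fields(source, target))
# 'legacy_flag' (source only) and 'updated_at' (target only) are not mapped.
```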

    Concepts

    Data Source
    During the Data Integration process, data sources are the objects that tasks read from and write to. A data source can be a database or a data warehouse (such as an EMR engine instance). Before configuring a Data Integration synchronization task, configure the necessary source and target database or data warehouse information on the Data Source Management page. Once configured, you can select the data source by name in a synchronization task to control which database or data warehouse is read from or written to.
    Network Connectivity
    Before using Data Integration synchronization tasks, ensure network connectivity between the data sources (both read and write ends) and the Data Integration resource groups, and that access is not denied by whitelist restrictions or similar settings; otherwise, data transmission cannot be completed.
    If the data source is accessible over the public network: Purchase and create a NAT Gateway so that integration resources can connect to the data source through the gateway.
    If the data source is within a VPC:
    If it is in the same VPC as the integration resources: It can be used directly.
    If the integration resources are in a different VPC: Purchase a Peering Connection to connect the integration resources' VPC with the data source's VPC.
    If the data source is in an IDC or another classic network environment: Purchase a VPN or Direct Connect Gateway to connect the integration resources with the data source's network.
    Speed Limit
    The speed limit is the maximum transmission speed allowed for Data Integration synchronization tasks.
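    The speed-limit concept can be illustrated with a minimal token-bucket sketch. This shows the idea of capping average transmission speed; it is not WeData's actual implementation, and the rate and burst values are illustrative.

```python
# Minimal token-bucket rate limiter: on average, at most `rate` bytes
# per second may be sent, with short bursts up to `burst` bytes.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst
        self.tokens = burst          # start with a full burst budget
        self.last = time.monotonic()

    def consume(self, nbytes: float) -> None:
        """Block until `nbytes` of transmission budget is available."""
        while True:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

bucket = TokenBucket(rate_bytes_per_s=1_000_000, burst=64 * 1024)  # ~1 MB/s cap
bucket.consume(32 * 1024)  # returns immediately while burst budget remains
```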
    Concurrency Level
    Concurrency is the maximum number of parallel read or write operations in data synchronization tasks. Concurrency affects the efficiency of data synchronization. A higher concurrency setting corresponds to higher resource consumption. Due to resource limitations or the nature of the task itself, the actual concurrency may be less than or equal to this value.
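    The relationship between the configured concurrency and parallel execution can be sketched with a thread pool: at most `concurrency` operations run at once, and actual parallelism never exceeds that value. The partition names and copy function are placeholders, not WeData APIs.

```python
# Sketch of concurrency-limited synchronization: a pool of at most
# `concurrency` workers copies partitions in parallel.
from concurrent.futures import ThreadPoolExecutor

def copy_partition(partition: str) -> str:
    # Placeholder for reading a source partition and writing it to the target.
    return f"{partition}: done"

partitions = [f"part-{i:02d}" for i in range(8)]
concurrency = 3  # configured maximum number of parallel operations

with ThreadPoolExecutor(max_workers=concurrency) as pool:
    results = list(pool.map(copy_partition, partitions))

print(results[0])  # -> "part-00: done"
```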
    Dirty Data
    Dirty data refers to records that fail to be written during synchronization, for example because of field type mismatches or exceptions raised when writing to the target data source. All data that fails to be written is classified as dirty data. For example, a source String value that cannot be converted to the target's INT type fails to write and is counted as dirty data.
    In offline synchronization, you can configure a dirty data threshold within the task to control the maximum number of dirty data records allowed during synchronization. When this threshold is exceeded, the task is interrupted.
    In real-time synchronization, you can configure a dirty data archiving method to write the failed dirty data into archive storage uniformly, ensuring that the real-time data flow is not interrupted.
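    The two dirty-data policies above can be sketched as follows. `try_write` is a placeholder for an actual write to the target data source; everything here is illustrative, not a WeData API.

```python
# Offline mode: abort once dirty records exceed a threshold.
# Real-time mode: archive failed records and keep the stream going.

class DirtyDataThresholdExceeded(Exception):
    pass

def sync_offline(records, try_write, threshold: int) -> int:
    """Interrupt the task when dirty records exceed the threshold."""
    dirty = 0
    for rec in records:
        if not try_write(rec):
            dirty += 1
            if dirty > threshold:
                raise DirtyDataThresholdExceeded(f"{dirty} dirty records")
    return dirty

def sync_realtime(records, try_write, archive: list) -> None:
    """Archive failed records (e.g., to archive storage) without stopping."""
    for rec in records:
        if not try_write(rec):
            archive.append(rec)

# Usage: writes succeed only for integer values.
records = [1, "x", 2, "y", 3]
ok = lambda r: isinstance(r, int)

archive = []
sync_realtime(records, ok, archive)
print(archive)  # -> ['x', 'y']
```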