Data Integration Overview

Last updated: 2024-11-01 17:00:28
    Data Integration quickly connects and integrates various cloud-based and self-built on-premises data sources, supporting scenarios that involve data integration and synchronization, such as data platform construction, database migration and backup, business upgrades and integration, data access acceleration, and full-text search.

    Use Limits

    1. Data Synchronization: Data Integration supports only the transmission of data objects that can be abstracted into logical two-dimensional tables. Structured, semi-structured, and unstructured data (e.g., data in COS) can be synchronized as long as it can be abstracted into structured data. The FTP method supports synchronizing completely unstructured files (e.g., meteorological files) to HDFS, but this method does not extract data content.
    2. Network Connectivity: Data synchronization is supported within a single region and between some region pairs, meeting data exchange and synchronization needs. Some regions can transfer data over the classic network, but connectivity is not guaranteed. If a classic network test shows no connectivity, it is recommended to connect over the public network.
    3. Task Execution: Running data integration tasks requires Data Integration resource groups. Create a resource group before using any data integration feature. Resource groups include offline packages, real-time packages, etc., which can be purchased as needed based on the type of task to be run.
    4. Data Consistency: Data Integration synchronization supports at-least-once delivery but does not fully guarantee exactly-once delivery; that is, data may be duplicated. Deduplication relies on the primary key and the capabilities of the destination.
    5. Data Types and Precision: During offline or real-time synchronization, pay attention to field type matching and precision conversion between the source and target sides of a synchronization task. If the source and target types are incompatible, or if a target field's range is narrower than the source's (a smaller maximum, a larger minimum, or lower precision), writes may fail or precision may be truncated.
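    As a hedged illustration of point 5, the sketch below checks whether a target integer type can hold a source type's full value range. The type names and ranges are common SQL conventions used for illustration, not a WeData API.

```python
# Hypothetical pre-flight check: verify that each target column can hold
# the source column's value range. Ranges below are standard SQL integer
# ranges, used here purely as an illustrative assumption.

INT_RANGES = {
    "TINYINT": (-128, 127),
    "INT": (-2**31, 2**31 - 1),
    "BIGINT": (-2**63, 2**63 - 1),
}

def check_numeric_mapping(source_type: str, target_type: str) -> list[str]:
    """Return a list of warnings for a source -> target integer mapping."""
    warnings = []
    s_min, s_max = INT_RANGES[source_type]
    t_min, t_max = INT_RANGES[target_type]
    if s_max > t_max or s_min < t_min:
        warnings.append(
            f"{source_type} -> {target_type}: target range is narrower; "
            "writes may fail or overflow"
        )
    return warnings

print(check_numeric_mapping("BIGINT", "INT"))   # narrower target: warning emitted
print(check_numeric_mapping("INT", "BIGINT"))   # wider target: no warning
```

    A real check would also cover floating-point precision and DECIMAL scale, which follow the same narrower-target principle.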

    Offline Synchronization

    Data Integration provides offline data synchronization, which periodically reads data in bulk from the source and synchronizes it to the destination.

    Real-time Synchronization

    Data Integration offers real-time data synchronization with support for streaming data transmission. It consumes real-time data at single-table, sharded database/table, and multi-database multi-table granularity, with task types including single table synchronization, whole database synchronization, and log collection.
    Single table synchronization: The source is a single table or sharded databases/tables, and the target is a single table. Single table synchronization uses fixed schema matching: the task must specify the field mapping between the source and target tables, and during execution only the specified source fields are written to the target fields.
    Whole database synchronization: Syncs all data in an entire source instance, or a specified set of databases and tables, to multiple tables on the target side. This task type does not require specifying field mappings between source and target; by default, all source table fields are read and matched to target fields by name.
    Log collection: Uses an Agent or SDK to actively report log file data from CVM instances, self-built servers, or TKE to an external destination.
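    The default name-based field matching used by whole database synchronization can be sketched as follows; the field lists are illustrative, not taken from any real instance.

```python
# Sketch of name-based field matching: each source field is written to
# the target field with the same name; unmatched fields on either side
# are ignored. The schemas below are illustrative assumptions.

def match_fields(source_fields, target_fields):
    """Map each source field to a target field with the same name."""
    target_set = set(target_fields)
    return {f: f for f in source_fields if f in target_set}

source = ["id", "name", "created_at", "legacy_flag"]
target = ["id", "name", "created_at", "updated_at"]

print(match_fields(source, target))
# 'legacy_flag' (source only) and 'updated_at' (target only) are not mapped.
```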

    Concepts

    Data Source
    During the Data Integration process, data sources are the objects that tasks read from and write to. A data source can be a database or a data warehouse (such as an EMR engine instance). Before configuring a Data Integration synchronization task, configure the necessary source and target database or data warehouse information on the Data Source Management page. Once configured, you can select the data source by name in a synchronization task to control which database or data warehouse is read from or written to.
    Network Connectivity
    Before using Data Integration synchronization tasks, ensure network connectivity between the data sources (both read and write ends) and the Data Integration resource groups, and that access is not denied by whitelist restrictions or similar settings; otherwise, data transmission cannot be completed.
    If the data source is accessible over the public network: Purchase and create a NAT Gateway so that integration resources can connect to the data source through the gateway.
    If the data source is within a VPC:
    If it is in the same VPC as the integration resources: It can be used directly.
    If the integration resources are in a different VPC: Purchase a Peering Connection to connect the integration resources' VPC with the data source's VPC.
    If the data source is in an IDC or another classic network environment: Purchase a VPN or Direct Connect Gateway to connect the integration resources with the data source's network.
    Speed Limit
    The speed limit is the maximum transmission speed allowed for Data Integration synchronization tasks.
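    The speed-limit concept can be illustrated with a minimal token-bucket sketch. This shows the idea of capping average transmission speed; it is not WeData's actual implementation, and the rate and burst values are illustrative.

```python
# Minimal token-bucket rate limiter: on average, at most `rate` bytes
# per second may be sent, with short bursts up to `burst` bytes.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst
        self.tokens = burst          # start with a full burst budget
        self.last = time.monotonic()

    def consume(self, nbytes: float) -> None:
        """Block until `nbytes` of transmission budget is available."""
        while True:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

bucket = TokenBucket(rate_bytes_per_s=1_000_000, burst=64 * 1024)  # ~1 MB/s cap
bucket.consume(32 * 1024)  # returns immediately while burst budget remains
```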
    Concurrency Level
    Concurrency is the maximum number of parallel read or write operations in data synchronization tasks. Concurrency affects the efficiency of data synchronization. A higher concurrency setting corresponds to higher resource consumption. Due to resource limitations or the nature of the task itself, the actual concurrency may be less than or equal to this value.
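    The relationship between the configured concurrency and parallel execution can be sketched with a thread pool: at most `concurrency` operations run at once, and actual parallelism never exceeds that value. The partition names and copy function are placeholders, not WeData APIs.

```python
# Sketch of concurrency-limited synchronization: a pool of at most
# `concurrency` workers copies partitions in parallel.
from concurrent.futures import ThreadPoolExecutor

def copy_partition(partition: str) -> str:
    # Placeholder for reading a source partition and writing it to the target.
    return f"{partition}: done"

partitions = [f"part-{i:02d}" for i in range(8)]
concurrency = 3  # configured maximum number of parallel operations

with ThreadPoolExecutor(max_workers=concurrency) as pool:
    results = list(pool.map(copy_partition, partitions))

print(results[0])  # -> "part-00: done"
```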
    Dirty Data
    Dirty data refers to records that fail to be written during synchronization, for example because of field type mismatches or exceptions raised when writing to the target data source. All data that fails to be written is classified as dirty data. For example, a source String value that cannot be converted to the target's INT type fails to write and is counted as dirty data.
    In offline synchronization, you can configure a dirty data threshold within the task to control the maximum number of dirty data records allowed during synchronization. When this threshold is exceeded, the task is interrupted.
    In real-time synchronization, you can configure a dirty data archiving method to write the failed dirty data into archive storage uniformly, ensuring that the real-time data flow is not interrupted.
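    The two dirty-data policies above can be sketched as follows. `try_write` is a placeholder for an actual write to the target data source; everything here is illustrative, not a WeData API.

```python
# Offline mode: abort once dirty records exceed a threshold.
# Real-time mode: archive failed records and keep the stream going.

class DirtyDataThresholdExceeded(Exception):
    pass

def sync_offline(records, try_write, threshold: int) -> int:
    """Interrupt the task when dirty records exceed the threshold."""
    dirty = 0
    for rec in records:
        if not try_write(rec):
            dirty += 1
            if dirty > threshold:
                raise DirtyDataThresholdExceeded(f"{dirty} dirty records")
    return dirty

def sync_realtime(records, try_write, archive: list) -> None:
    """Archive failed records (e.g., to archive storage) without stopping."""
    for rec in records:
        if not try_write(rec):
            archive.append(rec)

# Usage: writes succeed only for integer values.
records = [1, "x", 2, "y", 3]
ok = lambda r: isinstance(r, int)

archive = []
sync_realtime(records, ok, archive)
print(archive)  # -> ['x', 'y']
```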