Data Management
Last updated: 2024-11-01 16:26:14

Enter the Data Management page

1. Log in to the WeData console.
2. Click Project List in the left menu and find the target project for the data management feature.
3. After selecting a project, click to enter the Data Development module.
4. Click Data Management in the left menu.

Data Management Overview

Currently, WeData supports creating Hive and DLC databases and tables on the EMR and DLC engines of the system source.
Note:
A data source is displayed in the Data Management directory only after its compute and storage engine is bound on the project management page.

Data Management Directory

The directory tree displays the hierarchy and relationships of all databases and tables in the data source. It allows you to:
Quickly locate target tables. Finding a table's position in the tree shortens operation time and reduces the chance of errors.
Display relationships between database tables. The tree shows the hierarchy of databases and tables, making their associations and dependencies easier to analyze and understand.
Manage and maintain the data warehouse. You can classify and manage databases by data warehouse layer, which simplifies maintenance tasks such as deleting or renaming tables and fields.
Search conveniently. The search box in the directory tree lets you browse and search database tables and jump to the target table.

Database Table Search

The search feature helps you quickly locate and browse target databases and tables through a clear hierarchy view, improving data management and query efficiency.
Enter the name of a database or data table in the search box. The directory returns the matching database structure. Fuzzy search is supported.
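The exact matching rule behind the fuzzy search is not documented on this page. As a rough sketch, assuming it behaves like a case-insensitive substring match over database and table names, the lookup could resemble the following Python. The `catalog` dictionary and the `search_catalog` function are hypothetical names used only for illustration; they are not part of WeData.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory catalog: database name -> table names.
Catalog = Dict[str, List[str]]


def search_catalog(catalog: Catalog, keyword: str) -> List[Tuple[str, str]]:
    """Return (database, table) pairs whose names contain the keyword.

    Assumption: "fuzzy" means a case-insensitive substring match.
    A hit on a database name is reported with an empty table name.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    hits: List[Tuple[str, str]] = []
    for database, tables in sorted(catalog.items()):
        if needle in database.lower():
            hits.append((database, ""))
        for table in sorted(tables):
            if needle in table.lower():
                hits.append((database, table))
    return hits


catalog = {
    "wedata_demo_db": ["user_info", "order_detail"],
    "ods_sales": ["user_orders"],
}
print(search_catalog(catalog, "USER"))
# [('ods_sales', 'user_orders'), ('wedata_demo_db', 'user_info')]
```

Sorting the databases and tables keeps the result order stable, which mirrors how a directory tree is usually browsed.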




Refresh Directory

The directory tree refresh feature is used to reload the data source, database, and data table to update the contents displayed in the directory tree. This helps users update and sync the latest data from the data source to ensure that users get the latest data table information.




Database management

New database

Depending on the bound data source, you can create a database under a Hive or DLC data source.
In the data management directory, click Create New Database. Follow the prompts to select the data source type and data source, then enter a database name and, optionally, a description. Once configured, the database is created in the corresponding data source.




Hive database




Hive database information:
Data source type: Select the Hive type.
Data source: Select a Hive data source.
Database name: Enter a custom Hive database name.
Description: Optional. Enter a custom description.

DLC database

When creating a database under a DLC data source, you can configure event policies and governance rules for the database.



DLC database information:
Basic Information Configuration
Data source type: Select the DLC type.
Data source: Select a DLC data source.
Database name: Enter a custom DLC database name.
Description: Optional. Enter a custom description.
Event Policy Configuration
AddDataFiles: Set the maximum number of files to be added. Exceeding this value will trigger small file merging.
AddPositionDeletes: Set the maximum number of Position deletes. Exceeding this value will trigger small file merging.
AddEqualityDeletes: Set the maximum number of Equality deletes. Exceeding this value will trigger small file merging.
AddDeleteFiles: Set the number of delete files. When an expired snapshot's AddDataFiles + AddDeleteFiles total exceeds the configured AddDataFiles + AddDeleteFiles threshold, snapshots are deleted from that point.
Governance Rule Configuration
Small File Combination: Once enabled, a large number of data files smaller than the threshold will be combined into larger files, reducing the number of files and improving query performance.
Delete Expired Snapshot: Once enabled, expired historical snapshot information will be automatically cleaned up, reducing the number of metadata and data files, saving storage space, and improving query speed.
Delete Orphan Files: Once enabled, invalid data files will be automatically cleaned up periodically, saving storage space.
Metadata Merge: Once enabled, metadata manifest files will be automatically merged, reducing the number of manifest files and improving data query efficiency.
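How DLC evaluates these thresholds internally is not described on this page. The sketch below only illustrates the stated rule that exceeding any of the AddDataFiles, AddPositionDeletes, or AddEqualityDeletes values triggers small file merging. The class names, the counters, and the threshold values are all hypothetical, and "exceeding" is assumed to mean strictly greater than.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventPolicy:
    """Thresholds named after the Event Policy Configuration items."""
    add_data_files: int
    add_position_deletes: int
    add_equality_deletes: int


@dataclass(frozen=True)
class TableActivity:
    """Hypothetical counters accumulated since the last merge."""
    added_data_files: int
    position_deletes: int
    equality_deletes: int


def should_merge_small_files(policy: EventPolicy, activity: TableActivity) -> bool:
    """True when any counter is strictly greater than its threshold."""
    return (
        activity.added_data_files > policy.add_data_files
        or activity.position_deletes > policy.add_position_deletes
        or activity.equality_deletes > policy.add_equality_deletes
    )


# Example thresholds chosen for illustration only.
policy = EventPolicy(add_data_files=20, add_position_deletes=1000, add_equality_deletes=1000)
print(should_merge_small_files(policy, TableActivity(21, 0, 0)))     # True
print(should_merge_small_files(policy, TableActivity(20, 1000, 5)))  # False
```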

Dropping a Database

In the data management directory tree, move the cursor over the database you want to delete, click the icon to expand the database operation menu, then click Delete Database. Confirm in the popup to delete the corresponding database.



Note:
A database cannot be recovered once deleted, so delete with care.

Managing a data table

Note:
The database must be created before creating a data table.
In the data management directory, click Create New Data Table. In the popup, follow the prompts to select the data source type, data source, and database. Define the data table name and complete the configuration, then click OK to enter the data table's Basic Attributes and Field design page.




Hive Data Table

1. When using EMR as the computational storage engine, you can create Hive data tables under the Hive data source in data management.
Note:
The Hive service must be running in the EMR cluster. If Ranger is enabled for Hive, make sure the Ranger username and password are correct. Modifying and adding fields is not currently supported.



2. After completing the basic information in the popup for creating a new table, you can enter the data table design page to configure the table's basic attributes and field information.



3. Hive Table Configuration:
Table Creation Method
Wizard Mode: Add fields manually. After inserting a field, define its field name, field Chinese name, field English name, column type, whether it is a partition field, and description.
DDL Mode: Use a SQL CREATE TABLE statement to create the data table. Only the CREATE TABLE statement is supported for new tables, and only the ALTER TABLE ADD / REPLACE COLUMNS statement is supported for editing tables. For example:
create table if not exists WeData_demo_db.user_info (
  user_id string COMMENT 'User ID',
  user_name string COMMENT 'Username',
  user_age int COMMENT 'Age',
  city string COMMENT 'City'
) COMMENT 'User Information Table';
Note:
During table creation, ensure the table name in the DDL statement matches the name entered when creating the new data table.
Permissions on Table
Project sharing: Assign data table permissions to the current project. All members of the project have data table permissions, including editing, querying, and deleting.
Individuals and administrators only: Assign data table permissions to the creator and the current project's administrator.
(Note: Data permissions take effect in approximately 30 seconds.)
Lifecycle: EMR-Hive tables do not support lifecycle configuration, so this setting has no effect. This configuration item will be removed in a future iteration.
Storage Class: Four storage formats are supported:
TEXTFILE: A plain text storage format in which each line represents a record.
PARQUET: A columnar storage format that divides data into rows and columns and stores them by column on disk. It can be faster than row-based storage in certain scenarios and supports column compression.
ORC: An optimized columnar storage format for storing and processing large-scale data. It uses advanced compression algorithms and indexing to improve processing speed and query efficiency.
CSV: A common text format that uses commas as field delimiters and encloses each field value in quotation marks.
Field Separator: The character that separates fields in the data table so that a program or system can read and process them. The following delimiters are supported: \u0001 (Hive default), | (vertical bar), ' ' (space), ; (semicolon), , (comma), \t (tab).
Field configuration: A field contains configuration information such as field name, field description, column type, and partition status.
Partition Field Description: Not every field can be a partition field; at least one field must be a non-partition field. Partition fields do not support the array, map, or decimal types.
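The two partition rules stated above (at least one non-partition field, and no array, map, or decimal partition fields) can be checked before saving a table design. The following Python is a minimal sketch of such a check, not the validation WeData itself performs. The `Field` class and `validate_fields` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Column types that the partition rule lists as unsupported for partition fields.
UNSUPPORTED_PARTITION_TYPES = ("array", "map", "decimal")


@dataclass(frozen=True)
class Field:
    name: str
    column_type: str
    is_partition: bool = False


def validate_fields(fields: List[Field]) -> List[str]:
    """Return a list of problems; an empty list means the design passes."""
    problems: List[str] = []
    if not any(not f.is_partition for f in fields):
        problems.append("at least one field must be a non-partition field")
    for f in fields:
        base_type = f.column_type.strip().lower()
        # startswith also catches parameterized types such as decimal(10,2).
        if f.is_partition and base_type.startswith(UNSUPPORTED_PARTITION_TYPES):
            problems.append(
                f"partition field '{f.name}' has unsupported type '{f.column_type}'"
            )
    return problems


good = [Field("user_id", "string"), Field("dt", "string", is_partition=True)]
bad = [Field("amount", "decimal(10,2)", is_partition=True)]
print(validate_fields(good))       # []
print(len(validate_fields(bad)))   # 2
```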
4. After completing the configuration of the data table's Basic Attributes and Fields, click Save in the upper left corner to finish creating the data table. You can see the created data table in the data management directory on the left.
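For DDL Mode, the note in the configuration above requires the table name in the statement to match the name entered in the popup. A quick local check before pasting a statement might look like the sketch below. The regular expression covers only common CREATE TABLE forms (optional EXTERNAL, IF NOT EXISTS, a database prefix, and backticks), and the function names are hypothetical. The comparison is case-insensitive because Hive treats table names as case-insensitive.

```python
import re
from typing import Optional

# Matches the target of a CREATE TABLE statement, with optional
# EXTERNAL, IF NOT EXISTS, a database prefix, and backtick quoting.
_CREATE_TABLE = re.compile(
    r"^\s*create\s+(?:external\s+)?table\s+(?:if\s+not\s+exists\s+)?"
    r"(?:`?(?P<db>\w+)`?\s*\.\s*)?`?(?P<table>\w+)`?",
    re.IGNORECASE,
)


def ddl_table_name(ddl: str) -> Optional[str]:
    """Extract the table name from a CREATE TABLE statement, or None."""
    match = _CREATE_TABLE.match(ddl)
    return match.group("table") if match else None


def ddl_matches_table(ddl: str, entered_name: str) -> bool:
    """Compare the DDL table name with the name entered in the popup."""
    name = ddl_table_name(ddl)
    return name is not None and name.lower() == entered_name.lower()


ddl = (
    "create table if not exists WeData_demo_db.user_info ( "
    "user_id string COMMENT 'User ID' ) COMMENT 'User Information Table';"
)
print(ddl_table_name(ddl))               # user_info
print(ddl_matches_table(ddl, "orders"))  # False
```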




DLC Data Table

1. When using DLC as the computational storage engine, you can create DLC data tables under the DLC data source in data management.
Note:
Currently, DLC tables can only be created visually; DDL table creation is not yet supported on this page. To create a table with DDL, run the SQL statement directly in data development.



2. After completing the basic information in the new table popup, you can enter the data table design page. You need to configure the data table format, field information, and parameter attributes.



3. DLC Table Creation Configuration:
Data Table Format
Select Table Creation Type: Choose to create an internal table or an external table.
Data Table Source: When creating an internal table, specify whether to create an empty table or use COS as the source.
Storage Path: Tables sourced from COS and external tables require the full location path.
Data Format: CSV, JSON, PARQUET, ORC, or AVRO.
Data Table Version: Select the data table version, V1 or V2.
upsert: When the data table version is V2, you can choose whether to use upsert for writing.
Basic Attributes
Chinese name: Enter a custom Chinese name for the table.
Description: Enter a custom description.
Field Information
Field name: Define the table's field names.
Field Type: Supports DLC data table field types.
Description: Enter a custom field description.
Whether to use partitioning: Design partitioning, including the partition field, conversion strategy, and policy parameters.
Event Policy Configuration
AddDataFiles: Set the maximum number of files to be added. Exceeding this value will trigger small file merging.
AddPositionDeletes: Set the maximum number of Position deletes. Exceeding this value will trigger small file merging.
AddEqualityDeletes: Set the maximum number of Equality deletes. Exceeding this value will trigger small file merging.
AddDeleteFiles: Set the number of delete files. When an expired snapshot's AddDataFiles + AddDeleteFiles total exceeds the configured AddDataFiles + AddDeleteFiles threshold, snapshots are deleted from that point.
Governance Rule Configuration
Supports enabling governance rules for the data table. The table can inherit the governance rules of the database selected when it was created, or define its own. The following governance rules are included:
Small File Merge: Once enabled, a large number of data files smaller than the threshold will be combined into larger files, reducing the number of files and improving query performance.
Delete Expired Snapshots: Once enabled, expired historical snapshot information will be automatically cleaned up, reducing the number of metadata and data files, saving storage space, and improving query speed.
Delete Orphaned Files: Once enabled, invalid data files will be automatically cleaned up periodically, saving storage space.
Metadata Merge: Once enabled, metadata manifest files will be automatically merged, reducing the number of manifest files and improving data query efficiency.
Attribute settings
Parameter configuration: Supports custom data table parameters, such as format-version and write.upsert.enabled.
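The relationship between the Data Table Version, the upsert option, and the two example parameters can be sketched as follows. This assumes that V1 and V2 correspond to format-version values 1 and 2 and that the upsert choice is expressed through write.upsert.enabled; the page does not confirm this mapping. The `build_table_properties` function is a hypothetical helper.

```python
from typing import Dict


def build_table_properties(version: str, upsert: bool = False) -> Dict[str, str]:
    """Map the Data Table Version and upsert choices to parameter settings.

    Assumption: "V1"/"V2" correspond to format-version 1/2, and upsert
    is expressed through write.upsert.enabled.
    """
    versions = {"V1": "1", "V2": "2"}
    if version not in versions:
        raise ValueError(f"unknown data table version: {version!r}")
    if upsert and version != "V2":
        # The configuration above offers upsert only for version V2.
        raise ValueError("upsert can only be chosen for data table version V2")
    properties = {"format-version": versions[version]}
    if version == "V2":
        properties["write.upsert.enabled"] = "true" if upsert else "false"
    return properties


print(build_table_properties("V2", upsert=True))
# {'format-version': '2', 'write.upsert.enabled': 'true'}
```

Rejecting upsert on V1 early keeps an invalid combination from reaching the table creation step.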




Upload Data Table

1. In the data management directory or main data management interface, click Create Table by Uploading File. Currently, only Hive Type Data Table uploads are supported.
Upload Example:
Note:
1. CSV and TSV files are currently supported, with a maximum file size of 100 MB.
2. The WeData project must be bound to an EMR cluster that includes the corresponding Hive service.
3. If Ranger is configured in project management, the Ranger username and password must be correct.
4. The EMR_QCSRole role set for the COS bucket must have access permissions to COS. Otherwise, importing data fails with an error indicating a problem with the COS path.
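The file type and size limits in the note can be checked locally before uploading. The sketch below is an illustration only. It assumes that 100 MB means 100 * 1024 * 1024 bytes, which the page does not specify, and `check_upload` is a hypothetical helper. It returns the file name without the suffix, which is the default the upload form uses for the table name.

```python
import os
import tempfile

ALLOWED_SUFFIXES = (".csv", ".tsv")
# Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def check_upload(path: str) -> str:
    """Validate a local file and return the default table name."""
    stem, suffix = os.path.splitext(os.path.basename(path))
    if suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"file is {size} bytes, above the 100 MB limit")
    return stem


# Create a small sample file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "user_info.csv")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("user_id,user_name\n1,alice\n")
    print(check_upload(sample))  # user_info
```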



2. In the pop-up window, follow the prompts to select the data source type, data source, database, and bucket, enter a table name, and upload the file used to create the table.



3. File upload configuration:
Data source type: Hive data sources are supported.
Data source: Select the WeData data source under the corresponding data source type.
Database: Displays the Hive databases bound to the current project, linked to the selected data source type. Searching by database name is supported.
Bucket: COS bucket for temporarily storing uploaded files.
Table name: Defaults to the uploaded file name without the suffix; you can customize it.
Upload resources: Click to upload or drag and drop a file. A progress bar is shown. The file must be in CSV or TSV format.
4. Here, as an example, the data format for a CSV file is as follows:



5. After completing the configuration in the popup, click OK to enter the table creation page.



6. On the table creation page, you can set the table permissions, the table's Chinese name, and the table description. After the uploaded file is parsed, the page shows the fields and a data preview, and lets you configure the file format, column delimiter, column quotes, whether the first line is the column names, file encoding, and field attributes.
Basic Attributes
Permissions on Table: Select who holds permissions on the data table after it is created: shared within the project, or the individual and administrator only.
Chinese name: Defaults to the file name without the suffix; can be customized.
Description: Enter a custom data table description.
File Attributes
Data preview: After the file is parsed, only the first 500 rows of data are displayed. Click Re-upload to open the file upload dialog and upload the table file again.
File Format: Select CSV or TSV from the drop-down list.
Column delimiter: Enter a single character or a Unicode escape sequence such as \u0001. CSV default: , (comma). TSV default: \t (tab character).
Column Quotes: The default is double quotes. You can switch to single quotes.
First line is column name: The default is no. It can be switched to yes.
File encoding method: The default is UTF-8. You can choose UTF-8, GBK, or ISO-8859-1.
Field attributes
Field name: If the first line of the file is set as the column names, field names are parsed from it. Otherwise, field names are filled in sequentially as column_1, column_2, column_3, ... column_x. You can also modify the field names.
Field Chinese Name: Enter a custom Chinese name for the field.
Field English Name: Enter a custom English name for the field.
Column type: Choose a data type supported by the data source, based on the data source type.
Description: Enter a custom field description.
7. After configuring the table creation information on the page, click Save at the top left corner of the page to generate the data table.



8. After saving, you can view the progress of data table generation in the progress popup. Once all creation steps succeed, the data table is generated.
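The file attributes from step 6 (column delimiter, column quotes, first line as column names, file encoding, and the 500-row preview) map closely onto ordinary CSV parsing. The following Python sketch uses the standard `csv` module to show how those settings interact. It is not WeData's parser, and `parse_upload` is a hypothetical function name.

```python
import csv
import io
from itertools import islice
from typing import List, Tuple

PREVIEW_ROWS = 500  # the page shows only the first 500 parsed rows


def parse_upload(
    raw: bytes,
    delimiter: str = ",",
    quotechar: str = '"',
    first_line_is_header: bool = False,
    encoding: str = "utf-8",
) -> Tuple[List[str], List[List[str]]]:
    """Return (field_names, preview_rows) for an uploaded CSV/TSV payload."""
    text = io.StringIO(raw.decode(encoding), newline="")
    reader = csv.reader(text, delimiter=delimiter, quotechar=quotechar)
    # Read one extra row when the first line holds the column names.
    limit = PREVIEW_ROWS + 1 if first_line_is_header else PREVIEW_ROWS
    rows = list(islice(reader, limit))
    if not rows:
        return [], []
    if first_line_is_header:
        return rows[0], rows[1:]
    # No header line: fall back to column_1, column_2, ...
    names = [f"column_{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


raw = b'1,alice,"Shenzhen, CN"\n2,bob,Beijing\n'
names, rows = parse_upload(raw)
print(names)    # ['column_1', 'column_2', 'column_3']
print(rows[0])  # ['1', 'alice', 'Shenzhen, CN']
```

For a TSV file, pass `delimiter="\t"`; for a file that uses the Hive default separator, pass `delimiter="\u0001"`.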


Edit Data Table

1. In the data management directory tree, double-click the data table you want to edit to open its edit page, where some of the table's parameters can be edited.
Editable content in Hive includes lifecycle and field description.



2. Editable content in DLC includes table field description, event policy configuration, and governance rule configuration.



3. After editing the data table, click Save to complete the data table editing operation.




Export Table DDL

1. In the data management directory tree, move the cursor to the database that contains the data table whose DDL you want to export, click the icon to expand the database operation menu, then click Export Table DDL. In the left panel of the popup, select the data table whose DDL you want to export under the current database, add it to the right panel, and confirm to export the corresponding data table's DDL file.



2. Select the data table whose DDL you want to export.




Dropping a Table

1. In the data management directory tree, move the cursor to the data table you want to delete, click the icon to expand the data table operation menu, then click Delete, and confirm in the popup to delete the corresponding data table.




Viewing table details

In the data management directory tree, move the cursor to the data table whose details you want to view, click the icon to expand the data table operation menu, then click View Table Details to see basic information, storage information, field information, data preview, and table DDL.




Table Information


Table detail information:
Basic information
Data Type: The storage and computing engine type to which the data table belongs.
Database name: The name of the database to which the data table belongs.
Table name: The identifier name of the data table.
Owner: The person in charge of the data table.
Chinese name: The Chinese name of the data table.
Description: User-defined description information.
Storage Information
Table Size: The physical storage space occupied by the data in the current table.
Lifecycle: The lifecycle of the current table, which controls how long the table remains in effect. It improves overall security and saves storage and computing resources during data governance.
Creation Time: Creation date and time of the current table.

Field Information

Displays the field metadata of the current table, including field sequence number, field name, field Chinese name, field English name, column type, partition status, and description information.




Data preview

Shows a sample of the actual data in the current table so that you can quickly understand the table's contents and use it as a reference for data cleaning and data analysis.


DDL

By viewing the table's DDL, you can understand important information such as table name, column names, data types, and constraints, thereby better understanding the structure and characteristics of the table.

