Tencent Cloud WeData: Unlock Flexible and Efficient Scheduling Configuration Capabilities to Easily Handle Complex Business Scenarios

Introduction: The scheduling system in a big data development platform is the key component for automatically managing and executing data processing tasks. Through capabilities such as timed scheduling, dependency management, and task monitoring, it keeps data flowing smoothly and serves as the core foundation of the data development platform. WeData's scheduling system provides data development engineers, algorithm engineers, and other users with a rich set of timed scheduling, dependency configuration, and task operation and maintenance monitoring capabilities. These cover a variety of scheduling business scenarios and improve the efficiency of development and of operation and maintenance. This article introduces the highlight features and best practices of scheduling configuration in the WeData scheduling system.

Overview of Scheduling Configuration Capabilities

Task scheduling configuration capabilities include scheduling configuration, task dependency configuration, event dependency configuration, scheduling parameter configuration, and retry & timeout strategies. Below is an overview diagram of the task scheduling configuration capabilities. The following text will introduce the highlight features within the scheduling configuration capabilities.

Highlight Features

1. Rich Scheduling Configuration Capabilities

Diverse Scheduling Periods: In terms of scheduling periods, the WeData scheduling system offers various configuration methods. It supports configuring scheduling times in crontab mode at the workflow granularity. It also allows for the configuration of scheduling in minutes, hours, days, weeks, months, and years at the task granularity.
Scheduling Calendar: The scheduling calendar is intended for cases where the desired run dates do not follow a regular time period. Users can mark specific days as scheduling days or non-scheduling days, and the scheduling system starts or skips scheduling accordingly. This capability is often used in financial services. For example, if a financial workload should run on trading days but not on non-trading days, this can be achieved by configuring the scheduling calendar.
Dry Run Scheduling: In some scenarios, users may not want a task to actually run. For example, a task has been configured but its data is still being verified, so execution is not yet desired. In this case, dry run scheduling can be selected: when the task is scheduled, the system sets the instance to success directly without executing the task. Instances with dry run scheduling still undergo upstream detection.
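
The calendar and dry run behavior described above can be sketched in a few lines of Python. This is only an illustration, not WeData's implementation: the names `SchedulingCalendar` and `run_instance`, the status strings, and the order of the checks are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SchedulingCalendar:
    """Hypothetical calendar: days listed here are non-scheduling days."""
    non_scheduling_days: set = field(default_factory=set)

    def is_scheduling_day(self, day: date) -> bool:
        return day not in self.non_scheduling_days


def run_instance(day, calendar, upstream_ready, dry_run, action):
    """Return an assumed status string for the instance of one data date."""
    if not calendar.is_scheduling_day(day):
        return "skipped"   # non-scheduling day: the instance is not started
    if not upstream_ready:
        return "waiting"   # upstream detection applies to dry run instances too
    if dry_run:
        return "success"   # dry run: set to success without executing the task
    action(day)            # normal run: actually execute the task
    return "success"
```

For example, with 2024-12-14 (a Saturday) marked as a non-scheduling day, `run_instance` returns `"skipped"` for that date, and a dry run on 2024-12-13 returns `"success"` without ever calling `action`.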

2. Flexible Dependency Relationship Configuration Capability

Cross-Cycle Dependency: Cross-cycle dependency refers to a scenario where downstream data depends on the previous partition of upstream data. For example, an hourly task depends on a minute task, and the current hour's instance of the downstream depends on the previous hour's instance of the upstream. This type of scenario can be achieved through cross-cycle dependency (only some dependency relationships are supported; unsupported ones can be implemented through custom dependency configuration). The configuration method is shown in the figure below.
Custom Dependency Configuration: When users need to configure dependency relationships flexibly, such as when today's instance should depend on all upstream instances from the three days before today, this can be achieved through the custom dependency configuration feature. The configuration method is shown in the figure below: select custom for the configuration method, select interval for the time dimension, and select -3,-1 for the instance range, indicating a range from three periods before the current instance's data time to one period before. Two modes are supported:
Interval Mode: The input format is x,y. It specifies a range of data time offsets for the upstream task instances that are depended on. For example, with interval (day), entering -10,-1 means depending on the upstream task's instances in the closed interval from 10 days before to 1 day before.
List Mode: The input format is x,y,z. It specifies the individual data time offsets of the upstream task instances that are depended on. For example, with list (day), entering -3,-2,-1 means depending on the upstream task's instances from 3, 2, and 1 days before.

After the dependency configuration is completed, you can use the dependency preview to check whether the configuration meets expectations.
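
To make the two input formats concrete, here is a minimal Python sketch that expands an interval or list specification into the upstream data dates it selects, roughly what a dependency preview would show for a daily task. The function names `expand_offsets` and `upstream_data_dates` are hypothetical, and accepting the interval bounds in either order is an assumption of this sketch, not documented WeData behavior.

```python
from datetime import date, timedelta


def expand_offsets(mode: str, spec: str) -> list[int]:
    """Turn an 'x,y' interval or an 'x,y,z' list into explicit offsets."""
    values = [int(part) for part in spec.split(",")]
    if mode == "interval":
        if len(values) != 2:
            raise ValueError("interval mode expects exactly two values: x,y")
        start, end = sorted(values)           # assumption: order-insensitive
        return list(range(start, end + 1))    # closed interval, both ends kept
    if mode == "list":
        return values                         # each value is one offset
    raise ValueError(f"unknown mode: {mode}")


def upstream_data_dates(current: date, mode: str, spec: str) -> list[date]:
    """Data dates of the upstream daily instances the current one depends on."""
    return [current + timedelta(days=o) for o in expand_offsets(mode, spec)]
```

With a current data date of 2024-12-14, the interval `-3,-1` and the list `-3,-2,-1` both select the upstream instances for December 11, 12, and 13.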

Task Self-Dependency and Workflow Self-Dependency: In addition to dependencies between upstream and downstream tasks, we also provide the capability for instance dependency within the same task and instance dependency between workflows. Task self-dependency refers to the current instance of the same task depending on the status of the previous cycle's instance. Workflow self-dependency refers to the current task depending on all tasks from the previous cycle of the same workflow. This is commonly used in scenarios where data from different cycles is related and needs to be produced sequentially.
Circular Dependency: In certain business scenarios, an upstream task may need to rely on the instance of a downstream task from the previous cycle to fulfill business logic. This is where the circular dependency feature comes into play. For instance, if Task A is the upstream task and Task B is the downstream task, the typical dependency relationship would be that the instance of Task B on December 14th depends on the instance of Task A on December 14th, while simultaneously, the instance of Task A on December 14th depends on the instance of Task B on December 13th. This kind of scenario can be facilitated using the circular dependency functionality.
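
A circular dependency between tasks does not create a cycle between instances, because the edge from A back to B points at the previous cycle. The sketch below builds the instance-level graph for the Task A / Task B example and orders it with Python's standard `graphlib` module. It assumes that the first cycle in the window has no earlier instance to wait for; the helper name `build_instance_graph` is hypothetical.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter


def build_instance_graph(days):
    """Map each (task, day) instance to the set of instances it depends on."""
    graph = {}
    for day in days:
        previous = day - timedelta(days=1)
        # B(day) depends on A(day): the ordinary upstream/downstream edge.
        graph[("B", day)] = {("A", day)}
        # A(day) depends on B(day - 1): the edge that makes the task-level loop.
        # Assumption: the first day in the window has no earlier instance.
        graph[("A", day)] = {("B", previous)} if previous in days else set()
    return graph


days = [date(2024, 12, 13), date(2024, 12, 14)]
# static_order() raises graphlib.CycleError if the instances formed a real cycle.
order = list(TopologicalSorter(build_instance_graph(days)).static_order())
```

Here `order` is A on December 13, B on December 13, A on December 14, then B on December 14, which is the sequential production order the feature is meant to guarantee.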

3. Comprehensive Failure & Timeout Handling Mechanism

Failure Retry: Supports configuring the number of failure retries and the time interval between retries. That is, when an instance execution fails, the system automatically retries the operation. In big data processing tasks, jobs may fail due to network fluctuations, resource contention, and other reasons. By setting up a failure retry mechanism, these jobs can be automatically retried to ensure the integrity and accuracy of data processing.
Timeout Strategy: Supports configuring a timeout strategy, meaning that if an instance's execution time or waiting time exceeds a certain threshold, the system will automatically terminate the instance. By implementing a timeout mechanism, jobs that run longer than the predetermined limit can be automatically terminated, thereby freeing up resources and ensuring that other jobs can execute smoothly.
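
The retry and timeout rules above can be illustrated with a small Python sketch. It is not WeData's scheduler code: `run_with_retry`, `is_timed_out`, their parameters, and the status strings are assumptions chosen for the example, and the `sleep` argument is injectable only so the logic can be exercised without real waiting.

```python
import time


def run_with_retry(action, max_retries, retry_interval_s, sleep=time.sleep):
    """Run action; on failure, retry up to max_retries times with a pause."""
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return "success", attempts
        except Exception:
            if attempts > max_retries:
                return "failed", attempts   # retries exhausted
            sleep(retry_interval_s)         # wait before the next attempt


def is_timed_out(started_at_s, now_s, timeout_s):
    """True once the elapsed running or waiting time exceeds the threshold."""
    return now_s - started_at_s > timeout_s
```

An action that fails twice and then succeeds finishes with `("success", 3)` when `max_retries` is 3. A scheduler loop could call `is_timed_out` periodically and terminate any instance for which it returns true.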

Summary and Outlook

Over the past year, WeData has made significant advancements in scheduling configuration capabilities. We have introduced several new features, including custom dependency configuration, circular dependency, cross-cycle dependency, dependency preview, and the scheduling calendar. These features apply to a broad range of real-world business scenarios and help users implement their scheduling logic. Going forward, we will continue to optimize and add more capabilities so that our services can comprehensively cover the ever-growing demand for scheduling configurations.
