Data engines power the data analysis and computing services in Data Lake Compute. Every computing operation runs on a data engine, which can be public or private depending on your needs.
Public engine
The Data Lake Compute service comes with a shared public engine, which is suitable for low-frequency analysis use cases with small data volumes. This engine is highly flexible and available, and you don't need to configure or manage resources. Fees are charged by the volume of data scanned by running tasks. For billing details, see Billing Overview. Because Data Lake Compute uses a serverless architecture, it must schedule the data engine when a task is run for the first time after a period of inactivity, so that task may take longer to start.
Private engine
A private engine is a dedicated data engine that you purchase on a pay-as-you-go or monthly subscription basis. For billing details, see Billing Overview.
Pay-as-you-go: This billing mode is highly flexible and stable, and fees are charged by CU usage. It is suitable for use cases where data is analyzed regularly, with compute resources elastically scaled based on the business load.
Monthly subscription: This billing mode is suitable for use cases where large amounts of data require long-term, stable analysis, with compute resources elastically scaled based on the business load. It guarantees that resources are always available, with no need to wait for resource startup. Fees are charged monthly based on the cluster specification, although elastic clusters are billed by CU usage.
Compute engine types
A private engine can work with different compute engines in different use cases.
SparkSQL: It is suitable for stable and efficient offline SQL tasks.
Spark job: It is suitable for native Spark streaming and batch job processing.
Presto: It is suitable for agile and fast interactive query and analysis.
Note:
The compute engine type does not affect the unit price of a private engine.
Engine scaling rules
The elastic scaling rules for an engine can be configured either in Create Engine or in SuperSQL Engine under Data Engine in the console. The number of clusters refers to the number of resident clusters in the engine. The sum of the number of clusters and the number of elastic clusters is the maximum number of clusters the engine can scale out to during elastic scaling.
Basic rule: Engine scaling will only occur when the number of elastic clusters is greater than zero.
Scale-out rule: The system will scale out the data engine based on the configured rules when the number of queued tasks exceeds the available concurrent capacity, the task queue time surpasses the queue time limit, and no clusters are being initialized.
Scale-in rule: The system will scale in the data engine when the current number of clusters exceeds the number of resident clusters, the overall average load of the clusters is below 20%, and there are idle clusters.
As shown in the figure below, suppose that at purchase the number of clusters is set to 2, the number of elastic clusters to 3, and the task queue time limit to 5 minutes. When task concurrency is high, if the number of queued tasks exceeds 2 and the queue time exceeds 5 minutes, the system scales out the data engine to relieve the queue. After a successful scale-out, once the queue has cleared, clusters are idle, and the load is low, the system scales the data engine back in.
In the case of elastic scaling, the number of clusters in the data engine will not be less than the configured cluster count and will not exceed the sum of the configured cluster count and the elastic clusters.
For example, if the configured number of clusters is 2 and the number of elastic clusters is 3, after scaling out, the number of clusters will not exceed 5, and after scaling in, the number of clusters will not be fewer than 2.
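The scaling rules above can be read as a small decision function. The following Python sketch is illustrative only: the class, field, and function names are hypothetical and are not part of any Data Lake Compute API, and the real scheduler may evaluate these conditions differently. It assumes the queue time is measured as the longest time any task has been waiting, and it adds the upper bound from the paragraph above as an explicit check.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this only mirrors the documented rules.
SCALE_IN_LOAD_THRESHOLD = 0.20  # "overall average load ... below 20%"


@dataclass
class EngineState:
    resident_clusters: int                 # configured "number of clusters"
    elastic_clusters: int                  # configured "number of elastic clusters"
    current_clusters: int                  # clusters currently provisioned
    queued_tasks: int = 0
    available_concurrency: int = 0         # assumed: free concurrent task slots
    longest_queue_minutes: float = 0.0     # assumed: longest wait of any queued task
    queue_time_limit_minutes: float = 5.0  # configured task queue time limit
    initializing_clusters: int = 0
    average_load: float = 0.0              # fraction between 0.0 and 1.0
    idle_clusters: int = 0


def max_clusters(state: EngineState) -> int:
    """Upper bound: resident clusters plus elastic clusters."""
    return state.resident_clusters + state.elastic_clusters


def decide(state: EngineState) -> str:
    """Return 'scale_out', 'scale_in', or 'none' for the given state."""
    # Basic rule: scaling only happens when elastic clusters are configured.
    if state.elastic_clusters <= 0:
        return "none"

    # Scale-out rule: all three documented conditions must hold, and the
    # engine must still be below its maximum cluster count.
    if (
        state.queued_tasks > state.available_concurrency
        and state.longest_queue_minutes > state.queue_time_limit_minutes
        and state.initializing_clusters == 0
        and state.current_clusters < max_clusters(state)
    ):
        return "scale_out"

    # Scale-in rule: never drop below the resident cluster count.
    if (
        state.current_clusters > state.resident_clusters
        and state.average_load < SCALE_IN_LOAD_THRESHOLD
        and state.idle_clusters > 0
    ):
        return "scale_in"

    return "none"
```

With 2 resident clusters and 3 elastic clusters, `max_clusters` evaluates to 5, and `decide` never returns `scale_in` while only 2 clusters are running, which matches the bounds described above.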
Note:
The cluster count of a data engine cannot be smaller than the minimum cluster count. A pay-as-you-go cluster can be suspended if it is not needed.
Engine running status
A cluster may be in one of the following eight statuses: Starting, Running, Suspended, Suspending, Changing configuration, Isolated, Isolating, Recovering.
Starting: The cluster is being started. In this case, a pay-as-you-go private engine is not billed. A starting cluster cannot be selected for data computing.
Running: The cluster is running and can be selected for data computing.
Suspended: The cluster is suspended and cannot be selected for data computing.
Suspending: The cluster is being suspended and cannot be selected for data computing. This will affect running tasks.
Changing configuration: The cluster is undergoing a configuration change and cannot be selected for data computing.
Isolated: The cluster is isolated due to overdue payments and cannot be selected for data computing.
Isolating: The cluster is being isolated due to overdue payments and cannot be selected for data computing. This will affect running tasks.
Recovering: The cluster is being recovered from the Isolated status to the Running status after the account is topped up. It cannot be selected for data computing.
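The eight statuses above can be summarized in code. The sketch below is a hypothetical client-side model, not an official SDK type: the enum and helper names are assumptions, and the string values simply reuse the status names listed on this page. It encodes two facts from the descriptions: only a Running cluster can be selected for data computing, and the Suspending and Isolating transitions affect running tasks.

```python
from enum import Enum


class ClusterStatus(Enum):
    """Hypothetical model of the cluster statuses described above."""
    STARTING = "Starting"
    RUNNING = "Running"
    SUSPENDED = "Suspended"
    SUSPENDING = "Suspending"
    CHANGING_CONFIGURATION = "Changing configuration"
    ISOLATED = "Isolated"
    ISOLATING = "Isolating"
    RECOVERING = "Recovering"


# Transitional statuses that the page says will affect running tasks.
AFFECTS_RUNNING_TASKS = frozenset(
    {ClusterStatus.SUSPENDING, ClusterStatus.ISOLATING}
)


def can_select_for_computing(status: ClusterStatus) -> bool:
    """Only a Running cluster can be selected for data computing."""
    return status is ClusterStatus.RUNNING


def selectable(statuses):
    """Filter an iterable of statuses down to the selectable ones."""
    return [s for s in statuses if can_select_for_computing(s)]
```

A caller could use `can_select_for_computing` as a guard before submitting a task, and check membership in `AFFECTS_RUNNING_TASKS` to warn users before a suspension or isolation completes.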