Overview
Job tuning can be time-consuming. For example, when publishing a new job, you need to decide on its parallelism, the number of TaskManagers, the TaskManager spec (in CUs), and other configurations. While a job is running, you may also need to adjust its resources to maximize resource utilization, or allocate more resources to it when backpressure occurs or latency increases.
To address this, Stream Compute Service provides the auto tuning feature, which adjusts the parallelism and resources of a job appropriately. It optimizes jobs from a global perspective and addresses common performance tuning issues such as insufficient throughput, busy jobs, and wasted resources.
Use limits
Auto tuning cannot resolve performance bottlenecks within the streaming job itself, because the tuning policy relies on certain assumptions about the job: traffic changes smoothly, there is no data skew, and the throughput of each operator improves linearly as the parallelism increases. If the actual behavior of the job deviates substantially from these assumptions, the job may run into exceptions, and manual tuning is required. Common problems with the job itself include:
The job parallelism cannot be changed.
The job cannot run properly or is continuously restarted.
Issues related to user-defined functions (UDFs).
Severe data skew.
Auto tuning cannot resolve issues arising from external systems.
If an external system fails or responds slowly, auto tuning may increase the job parallelism, which in turn puts more pressure on the external system and may even crash it. You need to resolve external system issues yourself. Common issues include:
Insufficient partitions or throughput of the source message queue.
Performance issues with the downstream sink.
Deadlocks in a downstream database.
Precautions
Auto tuning is currently in beta, so we recommend not enabling it for critical production jobs.
When auto tuning is triggered, the job is restarted and stops processing data for a short period. A job with a large state takes longer to restart, which may interrupt data processing for a longer time, so we recommend not enabling auto tuning for such jobs.
By default, the interval between two consecutive auto tuning operations is 10 minutes.
If you enable auto tuning for a JAR job, make sure the job parallelism is not set in the job code; otherwise, auto tuning cannot adjust the job resources and is effectively disabled (see the sketch after this list).
Auto tuning processes jobs serially, so to avoid interference, do not enable it for all jobs on a cluster with limited resources.
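For reference, the following is a minimal sketch of an auto-tuning-friendly JAR job, assuming an Apache Flink DataStream job (the class name and pipeline are illustrative, not taken from this documentation). The parallelism is left unset so the platform can manage it, and the commented-out calls show the overrides to avoid.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AutoTuningFriendlyJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Avoid hard-coding parallelism anywhere when auto tuning is enabled:
        // env.setParallelism(4);               // job-level override blocks auto-scaling
        // stream.map(...).setParallelism(8);   // operator-level override blocks auto-scaling

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)            // no explicit parallelism, so the platform can adjust it
           .print();

        env.execute("auto-tuning-friendly-job");
    }
}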
Default tuning rules
For a job with auto tuning enabled, Stream Compute Service automatically tunes its parallelism and TaskManager spec.
1. The parallelism is tuned so that the job can keep up with changes in traffic: the CPU utilization of the TaskManagers and the data processing time of each operator are monitored, and the parallelism of the job as a whole is adjusted accordingly. Details are as follows:
When the CPU utilization of all TaskManagers is greater than 80% for 10 minutes, the default parallelism of the job is increased to twice the existing value, but the number of job CUs will not exceed the specified max available CUs (default: 64 CUs).
When the data processing time percentage of any Vertex node in the job is greater than 80% for 10 minutes, the default parallelism of the job is increased to twice the existing value, but the number of job CUs will not exceed the specified max available CUs (default: 64 CUs).
When the CPU utilization of all TaskManagers is less than 20% for 4 hours, and the data processing time percentage of all Vertex nodes in the job is less than 20% for 4 hours, the default parallelism of the job is reduced to half the existing value (but never below 1).
2. When auto tuning is enabled, the memory usage of TaskManagers is also monitored to adjust the memory settings of the job. Details are as follows:
When the heap memory utilization of all TaskManagers is greater than 80% for 1 hour, the TaskManager spec (in CUs) is increased to twice the existing value.
When the heap memory utilization of all TaskManagers is less than 30% for 4 hours, the TaskManager spec (in CUs) is reduced to half the existing value.
Note
The job parallelism can be reduced to a minimum of 1. The available TaskManager specs depend on whether fine-grained resources are used: with fine-grained resources, the spec can be 0.25, 0.5, 1, or 2 CUs; otherwise, it can only be 1 CU.
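The rules above can be summarized as a simple decision procedure. The following is only an illustrative sketch of that logic, not the service's actual implementation; the method and parameter names (such as nextParallelism and the aggregated metric inputs) are hypothetical, while the thresholds mirror the values documented above.

public class DefaultTuningRules {
    static final double CPU_HIGH = 0.80;   // all TaskManagers above this for 10 minutes -> scale up
    static final double BUSY_HIGH = 0.80;  // any Vertex node above this for 10 minutes -> scale up
    static final double CPU_LOW = 0.20;    // all TaskManagers below this for 4 hours
    static final double BUSY_LOW = 0.20;   // all Vertex nodes below this for 4 hours
    static final double HEAP_HIGH = 0.80;  // all TaskManagers above this for 1 hour -> bigger spec
    static final double HEAP_LOW = 0.30;   // all TaskManagers below this for 4 hours -> smaller spec
    static final int MAX_JOB_CUS = 64;     // default cap on total job CUs

    // Rule 1: adjust the default parallelism of the job as a whole.
    static int nextParallelism(int current, double minTmCpu10m, double maxVertexBusy10m,
                               double maxTmCpu4h, double maxVertexBusy4h, int jobCusIfDoubled) {
        boolean scaleUp = minTmCpu10m > CPU_HIGH || maxVertexBusy10m > BUSY_HIGH;
        if (scaleUp && jobCusIfDoubled <= MAX_JOB_CUS) {
            return current * 2;              // double, but stay within the max available CUs
        }
        if (maxTmCpu4h < CPU_LOW && maxVertexBusy4h < BUSY_LOW) {
            return Math.max(1, current / 2); // halve, but never below parallelism 1
        }
        return current;
    }

    // Rule 2: adjust the TaskManager spec (in CUs) based on heap memory utilization.
    static double nextTaskManagerCu(double currentCu, double minHeap1h, double maxHeap4h) {
        if (minHeap1h > HEAP_HIGH) return currentCu * 2;  // all TMs above 80% heap for 1 hour
        if (maxHeap4h < HEAP_LOW) return currentCu / 2;   // all TMs below 30% heap for 4 hours
        return currentCu;                                 // allowed specs per the Note: 0.25/0.5/1/2 CUs, or only 1 CU
    }

    public static void main(String[] args) {
        // Example: every TaskManager has exceeded 80% CPU for 10 minutes and doubling stays within 64 CUs.
        System.out.println(nextParallelism(4, 0.85, 0.40, 0.85, 0.40, 16)); // prints 8
    }
}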