Confirm that you have activated Tencent Cloud and created an EMR cluster. When creating the EMR cluster, you need to select the spark_hadoop component on the software configuration page.
Spark is installed in the /usr/local/service/ path (/usr/local/service/spark) in the CVM instance for the EMR cluster.
You need to copy spark-<version>-yarn-shuffle.jar to the /usr/local/service/hadoop/share/hadoop/yarn/lib directory of all nodes in the cluster. You can copy it node by node as follows:
1. Confirm the IP addresses of all nodes in the cluster.
2. Run ssh $user@$ip, where $user is the login username, and $ip is the remote server IP (i.e., the IP address confirmed in step 1).
3. Locate the spark-<version>-yarn-shuffle.jar file (see the sketch after this list).
4. Copy spark-<version>-yarn-shuffle.jar to /usr/local/service/hadoop/share/hadoop/yarn/lib.
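The following is a minimal sketch of steps 3 and 4 for a single node. The JAR version and the node IP are placeholders; replace them with the values in your own cluster.
# Step 3: locate the YARN shuffle JAR shipped with Spark.
# In this document's environment it resides under /usr/local/service/spark/yarn.
[root@172 ~]# find /usr/local/service/spark -name "spark-*-yarn-shuffle.jar"
# Step 4: copy it to the Hadoop YARN lib directory on a node (<node_ip> is the IP confirmed in step 1).
[root@172 ~]# scp /usr/local/service/spark/yarn/spark-2.3.2-yarn-shuffle.jar root@<node_ip>:/usr/local/service/hadoop/share/hadoop/yarn/lib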
To transfer the file to all nodes in batch instead, log in to any node (preferably a master node) in the EMR cluster. For more information on how to log in to EMR, see Logging in to Linux Instance. Here, you can log in by using XShell.
Write the following shell script for batch file transfer. When there are many nodes in a cluster, to avoid entering the password multiple times, you can use sshpass for the transfer. sshpass supplies the password non-interactively, so you do not need to type it repeatedly; however, the password is stored in plaintext and is prone to disclosure, for example via the history command.
[root@172 ~]# yum install sshpass
Write the following script:
#!/bin/bash
nodes=(ip1 ip2 … ipn) # List of IPs of all nodes in the cluster separated by spaces
len=${#nodes[@]}
password=<your password>
file=" spark-2.3.2-yarn-shuffle.jar "
source_dir="/usr/local/service/spark/yarn"
target_dir="/usr/local/service/hadoop/share/hadoop/yarn/lib"
echo $len
for node in ${nodes[*]}
do
echo $node;
sshpass -p $password scp "$source_dir/$file" root@$node:"$target_dir";
done
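If you do not want the password stored in the script in plaintext, one option (a sketch, still using sshpass) is to replace the password=<your password> line with a runtime prompt using the bash builtin read -s:
# Prompt for the password once at runtime; it is not written to the script and is not echoed on screen.
read -s -p "SSH password: " password
echo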
If you do not use sshpass, you can use the following script instead and enter the password for each node when prompted:
#!/bin/bash
nodes=(ip1 ip2 … ipn) # List of IPs of all nodes in the cluster separated by spaces
len=${#nodes[@]}
file="spark-2.3.2-yarn-shuffle.jar"
source_dir="/usr/local/service/spark/yarn"
target_dir="/usr/local/service/hadoop/share/hadoop/yarn/lib"
echo $len
for node in ${nodes[*]}
do
echo $node;
scp "$source_dir/$file" root@$node:"$target_dir";
done
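To run either script, save it to a file, fill in the node IPs (and the password where applicable), and execute it as root on the node you logged in to. The file name below is only an example:
# copy_shuffle_jar.sh is an example name for the script above
[root@172 ~]# bash copy_shuffle_jar.sh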
In Cluster Service > YARN, select Operation > Configuration Management. Then:
1. Select the configuration file yarn-site.xml and select "cluster level" as the level (modifications of configuration items at the cluster level will be applied to all nodes in the cluster).
2. Modify the yarn.nodemanager.aux-services configuration item and add spark_shuffle.
3. Create the configuration item yarn.nodemanager.aux-services.spark_shuffle.class and set it to org.apache.spark.network.yarn.YarnShuffleService.
4. Create the configuration item spark.yarn.shuffle.stopOnFailure and set it to false.
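Once the modified yarn-site.xml has been distributed to the nodes, you can spot-check any node to confirm that the new items are present. This is only a sketch; it assumes the Hadoop configuration directory is /usr/local/service/hadoop/etc/hadoop, which may differ in your environment:
# Show the aux-services related properties and the value lines that follow them
[root@172 ~]# grep -A 1 "yarn.nodemanager.aux-services" /usr/local/service/hadoop/etc/hadoop/yarn-site.xml
# Confirm the copied shuffle JAR is in the YARN lib directory
[root@172 ~]# ls /usr/local/service/hadoop/share/hadoop/yarn/lib/ | grep yarn-shuffle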
In Cluster Service > SPARK, select Operation > Configuration Management.
Select the configuration file spark-defaults.conf, click Modify Configuration, and create configuration items as shown below:
Configuration Item | Value | Remarks |
---|---|---|
spark.shuffle.service.enabled | true | Enables the shuffle service. |
spark.dynamicAllocation.enabled | true | Enables dynamic resource allocation. |
spark.dynamicAllocation.minExecutors | 1 | The minimum number of executors allocated for each application. |
spark.dynamicAllocation.maxExecutors | 30 | The maximum number of executors allocated for each application. |
spark.dynamicAllocation.initialExecutors | 1 | Generally, its value is the same as that of `spark.dynamicAllocation.minExecutors`. |
spark.dynamicAllocation.schedulerBacklogTimeout | 1s | If there are pending jobs backlogged for more than this duration, new executors will be requested. |
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout | 5s | If the backlog of pending jobs persists, a new request is triggered at this interval, and the number of executors requested per round grows exponentially compared with the previous round. |
spark.dynamicAllocation.executorIdleTimeout | 60s | If an executor has been idle for longer than this duration, it will be removed by the application. |
Save and distribute the configuration and restart the component.
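If you want to verify these settings for a single job without relying on spark-defaults.conf, the same configuration items can also be passed on the spark-submit command line with --conf. A minimal sketch using the values from the table above (the shuffle service configured in yarn-site.xml must already be in place):
[hadoop@172 spark]$ spark-submit --master yarn-client \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.3.2.jar 100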
In the testing environment, there are two nodes where NodeManager is deployed, and each node has 4 CPU cores and 8 GB of memory. The total resources of the cluster are 8 CPU cores and 16 GB of memory.
Go to the /usr/local/service/spark directory, switch to the "hadoop" user, and run spark-submit to submit a job. The data needs to be stored in HDFS.
[root@172 ~]# cd /usr/local/service/spark/
[root@172 spark]# su hadoop
[hadoop@172 spark]$ hadoop fs -put ./README.md /
[hadoop@172 spark]$ spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn-client --num-executors 10 --driver-memory 4g --executor-memory 4g --executor-cores 2 ./examples/jars/spark-examples_2.11-2.3.2.jar /README.md /output
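While the job is running, you can check how the application's resource usage changes over time, either in the YARN ResourceManager web UI or with the YARN CLI. A minimal sketch (the application ID is a placeholder taken from the list output):
# List running applications and their IDs
[hadoop@172 spark]$ yarn application -list
# Print the report (state, progress, tracking URL) of one application
[hadoop@172 spark]$ yarn application -status <application_id>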
Conclusion: after dynamic resource scheduling is configured, the scheduler will allocate more resources based on the real-time needs of applications.
Go to the /usr/local/service/spark directory, switch to the "hadoop" user, and run spark-sql to start the interactive SparkSQL console, which is set to use most of the resources in the testing cluster. Configure dynamic resource scheduling and check resource allocation before and after the configuration.
[root@172 ~]# cd /usr/local/service/spark/
[root@172 spark]# su hadoop
[hadoop@172 spark]$ spark-sql --master yarn-client --num-executors 5 --driver-memory 4g --executor-memory 2g --executor-cores 1
Then, submit a pi calculation job while the SparkSQL console is holding resources:
[root@172 ~]# cd /usr/local/service/spark/
[root@172 spark]# su hadoop
[hadoop@172 spark]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 5 --driver-memory 4g --executor-memory 4g --executor-cores 2 examples/jars/spark-examples_2.11-2.3.2.jar 500
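To watch the idle SparkSQL executors being reclaimed after spark.dynamicAllocation.executorIdleTimeout elapses, you can poll the ResourceManager REST API and compare the resources held by the two applications. This is only a sketch; it assumes the ResourceManager web port is the default 8088:
# Returns JSON for running applications, including allocated memory, vcores, and running containers
[hadoop@172 spark]$ curl -s "http://<resourcemanager_ip>:8088/ws/v1/cluster/apps?states=RUNNING"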
Conclusion: although the SparkSQL job applies for a large amount of resources at submission time, no analysis jobs are executed, so most of those resources are actually idle. When the idle duration exceeds the limit set by spark.dynamicAllocation.executorIdleTimeout, the idle executors are released and other jobs can obtain the resources. In this test, the cluster resource utilization of the SparkSQL job drops from 90% to 28%, and the freed resources are allocated to the pi calculation job; therefore, automatic scheduling is effective.
Note: The value of the configuration item spark.dynamicAllocation.executorIdleTimeout affects the speed of dynamic resource scheduling. In the test, the resource scheduling delay was found to be roughly equal to this value. You are advised to adjust this value based on your actual needs to achieve optimal performance.