Last updated: 2024-11-01 16:26:14
    Note:
    The Hive and Spark component services must be started in the EMR cluster, you must have COS permissions, and the resource files must be stored in COS.
    The following example assumes the user already has the required permissions in the EMR cluster.

    Feature Overview

    Submit a Spark task to WeData's workflow scheduling platform for execution.

    
    
    

    Parameter description

    Parameter: Spark program ZIP package
    Description: The user uploads the written Spark program code file. After packaging the code into a JAR, package the JAR together with all of its dependencies into a single ZIP file. Package only the files themselves, not directories.

    Parameter: Execution parameters
    Description: Execution parameters for the Spark program. You do not need to write spark-submit or specify the submit user, submission queue, or submission mode (the default is yarn). The parameter format is: --class mainClass run.jar args, or wordcount.py input output.

    Parameter: Application parameters
    Description: Application parameters for the Spark program.
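
    As a rough mental model (an assumption about how WeData assembles the final command, not something you run yourself), the execution parameters correspond to what would follow a spark-submit invocation that the platform issues on your behalf:
    # Hypothetical illustration only: the platform fills in the submit user, queue, and mode;
    # you only supply the trailing arguments shown in the parameter format above.
    spark-submit --master yarn --class mainClass run.jar args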

    Spark JAR Example:

    Submit a task to count words, i.e., a wordcount task. The file to be counted needs to be uploaded to COS in advance.

    Step 1: Write the Spark JAR task locally

    Create the project

    1. Using Maven as an example, create a project and introduce Spark dependencies.
    Note:
    Here, the groupId and artifactId need to be replaced with your specific groupId and artifactId.
    Here, the Spark dependency scope is set to provided, indicating that Spark is required only during compilation and packaging; at runtime the dependency is provided by the platform.
    # Generate a Maven project; you can also do this from an IDE
    mvn archetype:generate -DgroupId=com.example -DartifactId=my-spark -DarchetypeArtifactId=maven-archetype-quickstart
    2. The generated directory structure is as follows:
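    A project generated from maven-archetype-quickstart with the groupId and artifactId above looks roughly like this (a sketch; minor details may differ by archetype version):
    my-spark
    ├── pom.xml
    └── src
        ├── main
        │   └── java
        │       └── com
        │           └── example
        │               └── App.java
        └── test
            └── java
                └── com
                    └── example
                        └── AppTest.java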
    
    3. Import dependencies:
    <!-- Introduce Spark dependencies in pom.xml -->
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.7</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    Write code

    1. Create a new Java Class in the src/main/java/com/example directory. Enter the Class name, using WordCount here, and add the sample code to the Class as follows:
    package com.example;
    
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    
    public class WordCount {
        public static void main(String[] args) {
            // create SparkConf object
            SparkConf conf = new SparkConf().setAppName("WordCount");
            // create JavaSparkContext object
            JavaSparkContext sc = new JavaSparkContext(conf);
            // read input file to RDD
            JavaRDD<String> lines = sc.textFile(args[0]);
            // split each line into words
            JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            // count the occurrence of each word
            JavaPairRDD<String, Integer> wordCounts = words.mapToPair(word -> new Tuple2<>(word, 1)).reduceByKey((x, y) -> x + y);
            // save the word counts to output file
            wordCounts.saveAsTextFile(args[1]);
        }
    }
    2. Package the code into a JAR file and add the following packaging plugin in Maven:
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    3. Then execute the following in the project root directory:
    mvn package
    4. You can see the JAR file with dependencies in the target directory. Here it's named my-spark-1.0-SNAPSHOT-jar-with-dependencies.jar.
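
    Optionally, you can smoke-test the JAR before uploading it. This is only a sketch outside the WeData submission flow; it assumes a local Spark 2.4.x installation and uses placeholder local paths:
    # Run the job locally for a quick check (the output directory must not exist yet)
    spark-submit --master "local[*]" \
      --class com.example.WordCount \
      target/my-spark-1.0-SNAPSHOT-jar-with-dependencies.jar \
      file:///tmp/wordcount.txt file:///tmp/wordcount-output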

    Data preparation

    Since WeData data development only supports ZIP files, you first need to package the JAR into a ZIP file by running the following command. If there are other dependent configuration files, you can also include them in the ZIP package.
    zip spark-wordcount.zip my-spark-1.0-SNAPSHOT-jar-with-dependencies.jar
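    To confirm that the archive contains the JAR at the top level rather than inside a directory, you can list its contents (assuming the unzip tool is available locally):
    # List the files inside the ZIP package
    unzip -l spark-wordcount.zip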

    Step 2: Upload the SparkJAR task package

    1. In Resource Management, create a new resource file and upload the resource file package.
    
    2. New Resource Configuration:
    

    Step 3: Create a SparkJAR task and configure the schedule

    1. Create a new workflow in the Orchestration Space, and create a Spark task within the workflow.
    
    2. Enter the task parameters.
    
    3. Example of execution parameter format:
    --class mainClass run.jar args or wordcount.py input output
    4. The complete format in the example is as follows:
    --class com.example.WordCount my-spark-1.0-SNAPSHOT-jar-with-dependencies.jar cosn://wedata-demo-1314991481/wordcount.txt
    cosn://wedata-demo-1314991481/result/output
    Note:
    Here, cosn://wedata-demo-1314991481/wordcount.txt is the COS path of the file to be processed.
    cosn://wedata-demo-1314991481/result/output is the COS path for the calculation result. Do not create this output directory in advance; otherwise, the execution will fail.
    5. The sample file for wordcount.txt is as follows:
    hello WeData
    hello Spark
    hello Scala
    hello PySpark
    hello Hive
    6. After debugging, view the calculation results in COS (a sample of the expected output is shown after this list).
    7. Publish the Spark task and enable scheduling. Submit the SparkJAR task:
    
    8. SparkJAR task operation and maintenance are shown below:
    
    
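    For reference, with the sample wordcount.txt above, the part files written under the output path would contain lines roughly like the following (the exact partitioning and ordering may vary):
    (hello,5)
    (WeData,1)
    (Spark,1)
    (Scala,1)
    (PySpark,1)
    (Hive,1)
    You can inspect them from the EMR cluster, for example with hadoop fs -cat cosn://wedata-demo-1314991481/result/output/part-*.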