SparkJar Job Development Guide
Last updated: 2024-08-15 17:54:48

Use Cases

DLC is fully compatible with open-source Apache Spark, allowing users to write business programs that read, write, and analyze data on the DLC platform. This example demonstrates how to write Java code that reads and writes data on COS and performs operations such as creating databases and tables and reading and writing tables on DLC, helping users complete job development on DLC.

Environment Preparation

Dependencies: JDK 1.8, Maven, IntelliJ IDEA

Development Process

Development Flowchart

The development flowchart for DLC Spark JAR jobs is as follows:


Creating a Resource

The first time you run a job on DLC, you need to create Spark job computing resources. For example, create a Spark job resource named "dlc-demo".
1. Log in to the Data Lake Compute DLC Console. Select your service region, and click Data Engine in the navigation menu.
2. Click Create Resource in the upper left corner to enter the resource configuration purchase page.
3. In Cluster Configuration > Calculation Engine Type, select Spark Job Engine.

In Information Configuration > Resource Name, enter "dlc-demo". For detailed instructions on creating resources, see Purchasing Private Data Engine.

4. Click Activate Now and confirm the resource configuration information.
5. After confirming the information is correct, click Submit to complete the resource configuration.

Uploading Data to COS

Create a bucket named "dlc-demo" and upload the people.json file to use as an example for reading and writing data from COS. The content of the people.json file is as follows:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":3}
{"name":"WangHua", "age":19}
{"name":"ZhangSan", "age":10}
{"name":"LiSi", "age":33}
{"name":"ZhaoWu", "age":37}
{"name":"MengXiao", "age":68}
{"name":"KaiDa", "age":89}
1. Log in to the Cloud Object Storage (COS) console and click Bucket List in the left navigation menu.
2. Create a bucket: Click Create Bucket in the upper left corner, enter "dlc-demo" for the name, and click Next to complete the configuration.
3. Upload files: Click File List > Upload File, select the local "people.json" file, and upload it to the "dlc-demo-1305424723" bucket (-1305424723 is the APPID suffix automatically appended by the platform when the bucket is created), then click Upload to complete the file upload. For detailed instructions on creating a new bucket, see Create Bucket.
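If you prefer to script the upload instead of using the console, the following is a minimal sketch using the COS Java SDK (the com.qcloud cos_api artifact); the region, credentials, and local file path are placeholder assumptions to replace with your own values:
package com.tencent.dlc;

import com.qcloud.cos.COSClient;
import com.qcloud.cos.ClientConfig;
import com.qcloud.cos.auth.BasicCOSCredentials;
import com.qcloud.cos.auth.COSCredentials;
import com.qcloud.cos.region.Region;

import java.io.File;

public class UploadDemo {

    public static void main(String[] args) {
        // Placeholder credentials: replace with your own SecretId/SecretKey
        COSCredentials cred = new BasicCOSCredentials("SECRET_ID", "SECRET_KEY");
        // Placeholder region: use the region where the bucket was created
        ClientConfig clientConfig = new ClientConfig(new Region("ap-guangzhou"));
        COSClient cosClient = new COSClient(cred, clientConfig);
        try {
            // The bucket name includes the APPID suffix appended by the platform
            cosClient.putObject("dlc-demo-1305424723", "people.json", new File("people.json"));
        } finally {
            cosClient.shutdown();
        }
    }
}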


Creating a Maven Project

1. Create a new Maven project named "demo" through IntelliJ IDEA.
2. Add dependencies: Add the following dependencies to the pom.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
</dependency>
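The provided scope means the DLC engine supplies the Spark libraries at runtime, so they are not bundled into your JAR. Since the environment assumes JDK 1.8, it can also help to pin the compiler level in pom.xml; a minimal sketch:
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>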

Writing Code

The code covers reading and writing data on COS, as well as creating databases and tables, writing data, and querying data on DLC.
1. Example code for reading and writing data from COS (a CSV-to-Parquet variation of this example is sketched at the end of this section):
package com.tencent.dlc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CosService {

    public static void main(String[] args) {
        //1. Create the SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("Operate data on cos")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        //2. Read the JSON file from COS into a dataset; json, csv, parquet, orc, text, and other file types are supported
        String readPath = "cosn://dlc-demo-1305424723/people.json";
        Dataset<Row> readData = spark.read().json(readPath);
        //3. Perform business computations on the dataset to generate result data; both the Dataset API and SQL are supported. Here, a temporary view is created and queried with SQL.
        readData.createOrReplaceTempView("people");
        Dataset<Row> result = spark.sql("SELECT * FROM people WHERE age > 3");
        //4. Save the result data back to COS; writing json, csv, parquet, orc, text, and other file types is supported
        String writePath = "cosn://dlc-demo-1305424723/people_output";
        result.write().mode(SaveMode.Append).json(writePath);
        spark.read().json(writePath).show();
        //5. Close the session
        spark.stop();
    }
}
2. Example code for creating databases and tables, writing data, and querying data on DLC:
package com.tencent.dlc;

import org.apache.spark.sql.SparkSession;

public class DbService {

    public static void main(String[] args) {
        //1. Initialize the SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("Operate DB Example")
                .getOrCreate();
        //2. Create a database
        String dbName = "`DataLakeCatalog`.`dlc_db_test`";
        String dbSql = "CREATE DATABASE IF NOT EXISTS " + dbName + " COMMENT 'demo test'";
        spark.sql(dbSql);
        //3. Create an internal table
        String tableName = "`test`";
        String tableSql = "CREATE TABLE IF NOT EXISTS " + dbName + "." + tableName
                + " (`id` int, `name` string, `age` int)";
        spark.sql(tableSql);
        //4. Write data
        spark.sql("INSERT INTO " + dbName + "." + tableName + " VALUES (1,'Andy',12),(2,'Justin',3)");
        //5. Query data
        spark.sql("SELECT * FROM " + dbName + "." + tableName).show();
        //6. Create an external table
        String extTableName = "`ext_test`";
        spark.sql(
                "CREATE EXTERNAL TABLE IF NOT EXISTS " + dbName + "." + extTableName
                        + " (`id` int, `name` string, `age` int) "
                        + "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' "
                        + "STORED AS TEXTFILE LOCATION 'cosn://dlc-demo-1305424723/ext_test'");
        //7. Write data to the external table
        spark.sql("INSERT INTO " + dbName + "." + extTableName + " VALUES (1,'LiLy',12),(2,'Lucy',3)");
        //8. Query data from the external table
        spark.sql("SELECT * FROM " + dbName + "." + extTableName).show();
        //9. Close the session
        spark.stop();
    }
}
When you create an external table, first follow the steps in Uploading Data to COS to create a folder in the bucket named after the table (here, ext_test) to store the table files.
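As a variation of the first example, the same Dataset API covers the other supported formats; a minimal sketch (the CSV input path and Parquet output path are illustrative) that reads a CSV file with a header row and writes the result as Parquet:
package com.tencent.dlc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("CSV to Parquet on COS")
                .getOrCreate();
        // Read a CSV file whose first row is a header (path is illustrative)
        Dataset<Row> csvData = spark.read()
                .option("header", "true")
                .csv("cosn://dlc-demo-1305424723/people.csv");
        // Write the same data back to COS as Parquet
        csvData.write().mode(SaveMode.Overwrite)
                .parquet("cosn://dlc-demo-1305424723/people_parquet");
        spark.stop();
    }
}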


Debugging, Compiling Code, and Packaging It as a JAR File

Compile and package the demo project through IntelliJ IDEA (or by running mvn clean package) to generate the JAR package demo-1.0-SNAPSHOT.jar in the project's target folder.
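Because the Spark dependencies use the provided scope, the plain JAR produced by Maven's default packaging is sufficient for this demo. If your job adds third-party libraries, they may need to be bundled into a fat JAR; a minimal maven-shade-plugin sketch (the version is an assumption to adapt):
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.4.1</version>
            <executions>
                <execution>
                    <!-- Bind the shade goal to the package phase so that mvn package produces a fat JAR -->
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>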

Uploading JAR Files to COS

Log in to the COS console and follow the steps in Uploading Data to COS to upload demo-1.0-SNAPSHOT.jar to COS.

Creating a Spark Jar Data Job

Before creating a data job, you need to configure the CAM role arn to ensure the data job can securely access the data. For details on configuring the CAM role arn, see Configuring Data Access Policy. If you have already configured a data access policy, the role arn might look like: qcs::cam::uin/100018379117:roleName/dlc-demo.
1. Log in to the Data Lake Compute DLC Console. Select your service region, and click Data Jobs in the navigation menu.
2. Click Create Job in the upper left corner to enter the creation page.
3. On the job configuration page, configure the job running parameters as follows:
Configuration Parameter | Note
Job Name | A custom Spark JAR job name, for example: cosn-demo.
Job Type | Select Batch Processing Type.
Data Engine | Select the dlc-demo computing engine created in the Creating a Resource step.
Program Package | Select COS, and choose the JAR file demo-1.0-SNAPSHOT.jar uploaded in the Uploading JAR Files to COS step.
Main Class | Fill in according to the program code: com.tencent.dlc.CosService for reading and writing data from COS, or com.tencent.dlc.DbService for creating databases and tables on DLC.
CAM role arn | Select the role arn configured in the previous step, for example: qcs::cam::uin/100018379117:roleName/dlc-demo.
Keep the other parameters at their default values.

4. Click Save. The created job appears on the Spark Job page.

Running and Viewing Job Results

1. Running the Job: On the Spark Job page, find the newly created job and click Run to run the job.
2. Viewing Job Running Results: You can view the job running logs and results.

Viewing Job Running Logs

1. Click Job Name > Tasks History to view the task's running status.

2. Click Task ID > Run Log to view the job running logs.

Viewing Job Running Results

1. If you run the example of reading and writing data from COS, check the COS console for the written results.

2. For jobs that create databases and tables on DLC, check the DLC Data Exploration page to view the created databases and tables.

