Scenarios
Data Lake Compute supports the execution of programs written in Python. This example demonstrates the detailed operations of reading and writing data on Cloud Object Storage (COS), creating databases and tables on Data Lake Compute, and reading and writing those tables, to help you develop jobs on Data Lake Compute.
Environment Preparation
Dependencies: PyCharm or another Python development tool.
Development Process
Development Flowchart
The development process for Data Lake Compute Spark jobs is as follows: resource creation > uploading data to COS > creating a Python project > writing code > debugging > uploading PY files to COS > creating a Spark data job > executing and viewing job results.
Resource Creation
When running a job on Data Lake Compute for the first time, you need to create Spark job compute resources, for example a Spark job resource named "dlc-demo".
1. Log in to the Data Lake Compute console and go to the Data Engine page.
2. Click Create Resource in the upper left corner to enter the resource configuration purchase page.
3. In the Cluster Configuration > Calculation Engine Type option, select Spark as the job engine.
Fill in "dlc-demo" for Information Configuration > Resource Name. For a detailed introduction to creating new resources, please refer to Purchasing a Dedicated Data Engine. 4. Click Activate Now to confirm the resource configuration information.
5. Upon verifying that the information is accurate, click Submit to complete the resource configuration.
Uploading Data to COS
Create a bucket named "dlc-demo" and upload the people.json file used in the example of reading and writing data on COS. The content of the people.json file is as follows:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":3}
{"name":"WangHua", "age":19}
{"name":"ZhangSan", "age":10}
{"name":"LiSi", "age":33}
{"name":"ZhaoWu", "age":37}
{"name":"MengXiao", "age":68}
{"name":"KaiDa", "age":89}
1. Log in to the Cloud Object Storage (COS) console.
2. Create a bucket:
Click Create Bucket in the upper left corner, fill in the name field with "dlc-demo", and click Next to complete the configuration.
3. Upload File:
Click File List > Upload File, select the local people.json file, and upload it to the "dlc-demo-1305424723" bucket (-1305424723 is a random string generated by the platform when the bucket is created), then click Upload to complete the file upload. For details on creating a new bucket, please refer to Create Bucket.
Creating a Python Project
Create a new project named "demo" using PyCharm.
Writing Code
1. Create a new cos.py file and write code that reads data from COS, registers a temporary view, queries it, and writes the query result back to COS.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Operate data on cos") \
        .getOrCreate()

    # Read the sample JSON file uploaded to COS.
    read_path = "cosn://dlc-demo-1305424723/people.json"
    peopleDF = spark.read.json(read_path)

    # Register a temporary view and query it with Spark SQL.
    peopleDF.createOrReplaceTempView("people")
    data_src = spark.sql("SELECT * FROM people WHERE age BETWEEN 13 AND 19")
    data_src.show()

    # Write the query result back to COS as CSV files.
    write_path = "cosn://dlc-demo-1305424723/people_output"
    data_src.write.csv(path=write_path, header=True, sep=",", mode="overwrite")

    spark.stop()
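With the sample people.json shown above, the filter age BETWEEN 13 AND 19 matches only WangHua (age 19), so the CSV output written to people_output should contain a single data row plus the header.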
2. Create a new db.py file and write code that creates a database and tables on Data Lake Compute, inserts data, and queries data.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Operate DB Example") \
        .getOrCreate()

    # Create a database under the DataLakeCatalog catalog.
    spark.sql("CREATE DATABASE IF NOT EXISTS `DataLakeCatalog`.`dlc_db_test_py` COMMENT 'demo test'")

    # Create a managed table, insert two rows, and query them.
    spark.sql("CREATE TABLE IF NOT EXISTS `DataLakeCatalog`.`dlc_db_test_py`.`test` (`id` int, `name` string, `age` int)")
    spark.sql("INSERT INTO `DataLakeCatalog`.`dlc_db_test_py`.`test` VALUES (1,'Andy',12),(2,'Justin',3)")
    spark.sql("SELECT * FROM `DataLakeCatalog`.`dlc_db_test_py`.`test`").show()

    # Create an external table stored as JSON text files on COS, insert two rows, and query them.
    spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `DataLakeCatalog`.`dlc_db_test_py`.`ext_test` (`id` int, `name` string, `age` int) "
              "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE "
              "LOCATION 'cosn://dlc-demo-1305424723/ext_test'")
    spark.sql("INSERT INTO `DataLakeCatalog`.`dlc_db_test_py`.`ext_test` VALUES (1,'Andy',12),(2,'Justin',3)")
    spark.sql("SELECT * FROM `DataLakeCatalog`.`dlc_db_test_py`.`ext_test`").show()

    spark.stop()
Before creating the external table, you can follow the steps in Uploading Data to COS to first create a folder in the bucket named after the table, which will store the table's files.
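If you prefer to create this folder programmatically instead of in the COS console, the following is a minimal sketch using the COS Python SDK (cos-python-sdk-v5); the region and the SecretId/SecretKey placeholders are assumptions that you need to replace with your own values. In COS, a folder is simply a zero-byte object whose key ends with a slash.
# Minimal sketch: create the ext_test/ folder in the example bucket before creating the external table.
# Assumes the cos-python-sdk-v5 package is installed; replace region and credentials with your own.
from qcloud_cos import CosConfig, CosS3Client

config = CosConfig(
    Region="ap-beijing",            # assumed region; use the region of your bucket
    SecretId="YOUR_SECRET_ID",      # placeholder credential
    SecretKey="YOUR_SECRET_KEY",    # placeholder credential
)
client = CosS3Client(config)

# A COS "folder" is just an empty object whose key ends with "/".
client.put_object(
    Bucket="dlc-demo-1305424723",
    Key="ext_test/",
    Body=b"",
)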
Debugging
Run and debug the code in PyCharm to make sure it is free of syntax errors.
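Because the cosn:// paths can only be resolved on the Data Lake Compute engine (which provides the COS connector), a convenient way to debug in PyCharm is to run the same logic against a local copy of people.json. The following is a minimal local debugging sketch, assuming pyspark is installed in your PyCharm interpreter and ./people.json is a local copy of the sample file; it is not part of the job itself.
# Local debugging sketch: mirrors the cos.py logic with local paths instead of cosn:// paths.
# Assumes `pip install pyspark` and a local ./people.json copy of the sample data.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Local debug for cos.py") \
    .getOrCreate()

people_df = spark.read.json("./people.json")
people_df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age BETWEEN 13 AND 19").show()
spark.stop()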
Upload PY Files to COS
Log in to the COS console and follow the steps in Uploading Data to COS to upload cos.py and db.py to COS.
Creating a Spark Data Job
Before creating a data job, you need to complete the data access policy configuration to ensure that the job can safely access the data. For details on configuring the data access policy, please refer to Configuring Data Access Policy. In this example, the data access policy name is configured as: qcs::cam::uin/100018379117:roleName/dlc-demo
1. Log in to the Data Lake Compute console and go to the Spark Job page.
2. Click the Create Job button in the upper left corner to enter the creation page.
3. On the job configuration page, set the job running parameters as detailed below:
| Parameter | Description |
| --- | --- |
| Job name | Specify a custom Spark job name, for example: cosn_py |
| Job type | Select Batch Processing |
| Data engine | Select the dlc-demo compute engine created in the Resource Creation step |
| Application package | Select COS and choose the py file uploaded in Upload PY Files to COS: to read and write data on COS, select cosn://dlc-demo-1305424723/cos.py; to create a database and tables on Data Lake Compute, select cosn://dlc-demo-1305424723/db.py |
| CAM role arn | Select the data access policy configured in the previous step: qcs::cam::uin/100018379117:roleName/dlc-demo |
Retain the default values of other parameters.
4. Click Save to view the created job on the Spark Job page.
Execute and View Job Results
1. Run the job: On the Spark Job page, locate the newly created job and click Run to execute the job.
2. Viewing Job Execution Results: You can view the job execution logs and results.
Viewing Job Execution Logs
1. Click Job Name > Tasks History to view the task execution status.
2. Click Task ID > Run Log to view the job execution log.
Viewing Job Execution Results
1. For the example of reading and writing data on COS, go to the COS console to view the written results.
2. For the example of creating a database and tables on Data Lake Compute, go to the Data Exploration page in Data Lake Compute to view the new database and tables.
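For example, after the db.py job succeeds, running SELECT * FROM DataLakeCatalog.dlc_db_test_py.test in Data Exploration should return the two rows (Andy and Justin) inserted by the job.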