Feature Overview
WeData has launched the Notebook Exploration feature, which supports reading data from Tencent Cloud's big data engines EMR and DLC through Jupyter Notebook. With interactive data analysis, users can perform data exploration and machine learning.
Currently, Notebook Exploration is available on the Chinese site in the Beijing, Shanghai, Guangzhou, Singapore, and Silicon Valley regions, as well as on the international site in the Singapore and Frankfurt regions. You can submit a ticket to request allowlist access for trial use.
Features
One-click workspace creation
There is no need to manually install a Python environment or configure dependencies. A Notebook workspace can be created with one click, complete with a full Jupyter Notebook environment and commonly used dependency packages.
User and resource isolation
Each user has a dedicated workspace under different projects. The storage and computing resources of each workspace are isolated from one another. Users' tasks and file resources do not interfere with each other.
Integration with big data engines
Supports binding the EMR and DLC big data engines. Data can be read directly from these storage and compute engines for interactive exploration, algorithm model training, and predictive data analysis.
Built-in practice tutorial
The Notebook workspace comes with built-in Big Data tutorials, allowing users to get started quickly and easily.
Overall Usage Process
The full process for users to use Notebook in WeData is shown below:
Operation Steps
Create a Notebook Workspace
1. Click Project List in the left menu and find the target project for the Notebook Exploration feature.
2. After selecting the project, go to the Data Analysis > Notebook Exploration module.
3. On the Notebook Exploration list page, click Create workspace.
4. On the workspace configuration page, set the basic information and resource configuration.
| Configuration Item | Description | Required |
| --- | --- | --- |
| Basic Information | Configure the basic information of the Notebook workspace to create a workspace instance. | - |
| Workspace Name | Name of the Notebook workspace. Supports Chinese, English, numbers, underscores, and hyphens; no longer than 32 characters. | Yes |
| Permission Scope | If "Individual Use Only" is selected, only the current user can access the workspace; if "Project Share" is selected, all project members can access the workspace for collaborative development. | Yes |
| Description | Description of the Notebook workspace. Supports Chinese, English, numbers, special characters, etc.; no longer than 255 characters. | No |
| Engines | Select the EMR or DLC compute and storage engine bound to the current project. Once selected, the workspace pre-connects to the engine, allowing Notebook tasks to access it through PySpark. | No |
| Network | When the EMR engine is selected, further network configuration is required to establish connectivity. Defaults to the VPC and subnet where the EMR engine resides. | Yes |
| DLC Data Engine | When the DLC engine is selected, you must also choose a DLC data engine bound to the project to execute DLC PySpark tasks. Note: only DLC Spark job type compute resources are supported. | Yes |
| Machine Learning | If the selected DLC data engine contains a "machine learning" type resource group, this option appears and is selected by default; if it does not, the option is hidden. To use it, create a machine learning resource group in DLC first. | No |
| RoleArn | When the DLC engine is selected, a RoleArn must also be selected to authorize access to data stored in COS. Note: the RoleArn is the data access policy (CAM role arn) used by the DLC engine to access Cloud Object Storage (COS); it must be configured by the user in DLC. | Yes |
| Advanced Configuration | Optionally use MLflow to manage experiments, data, and models in Notebook Exploration. This feature currently requires allowlist access. | - |
| MLflow Service | When checked, experiments created and machine learning runs produced with MLflow functions in Notebook tasks are reported to the MLflow service. You can later view them under Machine Learning > Experiment Management and Model Management. | No |
| Resource Configuration | Configure the storage and compute resources of the workspace, used for executing Notebook tasks; workspace storage is provided by CFS. | - |
| Specification Selection | Supported specifications: 2 cores, 4 GB memory / 8 GB storage (Trial Version); 4 cores, 8 GB memory / 16 GB storage (Advanced Edition); 8 cores, 16 GB memory / 32 GB storage (Express Version). | Yes |
Start/Stop Workspace Management
Launch workspace
1. Click Create Now to enter the Notebook workspace launch page.
2. During startup, the PySpark environment is configured for you, and common Python packages such as numpy, pandas, and scikit-learn are installed. Installation may take some time; please wait until it completes.
3. When the following page appears, the Notebook workspace has launched successfully, and you can begin creating Notebook tasks.
Log out of the workspace
1. Click the log out button at the top left to exit the current workspace and return to the list page.
The workspace will automatically stop ten minutes after exiting. Restarting a stopped workspace will restore the development environment and data.
Editing workspace
Click the Edit button on the list page to modify the configuration of the current workspace. Configurable items include the workspace name, description, and resource configuration.
Deleting workspace
Click the Delete button on the list page to delete the current workspace.
Create and Run a Notebook File
1. Create a Notebook File
You can create folders and Notebook files in the left resource manager.
Note:
Notebook file names must end with the .ipynb extension.
2. Select a running kernel
Open the Notebook file, click Select Kernel at the top left, and choose a kernel from the dropdown options.
Note:
In Jupyter Notebook, the kernel is the backend process that executes code: it handles the execution of code cells, returns computation results, and interacts with the user interface.
WeData Notebook currently supports two types of kernels:
Python Environment: The default IPython kernel in Jupyter Notebook, which supports Python code execution.
DLC resource group: A remote kernel provided by Tencent Cloud Big Data, allowing Python tasks to be submitted to the DLC resource group for execution.
If you select the DLC resource group, choose a machine learning resource group instance from the DLC data engine in the next level of options.
3. Run a Notebook File
Click Run to generate a Notebook kernel instance and start running the code. The results will be displayed below each cell.
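For example, a first cell on the default Python Environment kernel might look like the minimal sketch below (pandas and numpy are among the commonly pre-installed packages; the DataFrame contents are purely illustrative):

```python
# Minimal sketch for the default Python Environment kernel: build a small
# DataFrame and summarize it. The output renders below the cell after Run.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) ** 2})
df.describe()
```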
Periodic Scheduling of Notebook Tasks
Create a notebook task
1. Enter the project and open the menu Data Research and Development > Offline Development.
2. In the left directory, click Create Workflow and configure the workflow properties, including the workflow name, folder, etc.
3. Create a task in the workflow, with the task type as General-Notebook Exploration. Configure the task's basic attributes on the Create New Task page, including the task name, task type, etc.
Configure and run a notebook task
On the Notebook task configuration page, reference a file from a Notebook workspace.
1. Select a Notebook workspace
The dropdown lists all Notebook workspaces in the current project.
2. Select a Notebook file
The dropdown lists all files in the current Notebook workspace.
Note:
If the current user does not have permission on the Notebook workspace, they cannot access it for operations.
3. Preview code
After selecting a Notebook file, you can preview the specific content of the Notebook file below.
4. Run a notebook task
In the upper right corner, select a scheduling resource group and click Run to run the current Notebook file online. The running logs, code, and execution results are displayed below.
Configuring scheduling
1. Click Scheduling Configuration on the right to set the scheduling cycle for the current Notebook task. For example, the figure below sets it to run every 5 minutes.
2. Click the Submit button to submit the current task to periodic scheduling.
Task Ops
1. Go to Data Research and Development > Ops Center.
2. Task Ops
Click Task Ops to see the workflows submitted to the scheduler and the task nodes within the workflows.
3. Instance Ops
Click Instance Ops to view each period instance generated by the workflow.
4. Enter the instance detail to view the running logs and results.
Practical Tutorial
The Notebook workspace comes with built-in Big Data tutorials, allowing users to get started quickly and easily.
Tutorial 1: Data Analysis Using the DLC Jupyter Plugin
This sample Notebook demonstrates how to analyze data in Data Lake Compute (DLC). The Notebook workspace has the DLC Jupyter Plugin built-in, which can be loaded directly. The example syntax includes running Spark code, SparkSQL code, and using SparkML.
Note:
To use this tutorial, the Notebook workspace must be bound to the DLC engine with "Use Machine Learning Resource Group" unchecked. Set the kernel to Python Environment; WeData Notebook will interact with DLC through the Jupyter Plugin.
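The plugin's own load and connection commands are shown in the built-in tutorial notebook itself; as a rough, hypothetical illustration of the kind of Spark SQL analysis involved (the database and table names are invented), the general pattern looks like:

```python
# Illustrative only: a generic PySpark / Spark SQL pattern of the kind the
# tutorial demonstrates. The actual DLC Jupyter Plugin cell commands are
# documented in the built-in tutorial notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlc-demo").getOrCreate()

# Run Spark SQL against a table in the bound engine (hypothetical names).
df = spark.sql(
    "SELECT category, COUNT(*) AS cnt FROM demo_db.orders GROUP BY category"
)
df.show()
```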
Tutorial 2: Reading EMR Data for Model Prediction
1. This sample Notebook demonstrates how to create EMR Hive tables and import local data into them. It then reads data from the Hive tables and converts it into a pandas DataFrame for data preparation.
2. After data preparation is complete, you can use the Prophet time series algorithm to train a predictive model, then evaluate its accuracy and make predictions (a minimal sketch follows the note below).
Note:
To use this tutorial, the Notebook workspace must be bound to the EMR engine.
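A minimal sketch of the Prophet workflow this tutorial follows, assuming the Hive data has already been pulled into a pandas DataFrame (the column names and synthetic data below are hypothetical stand-ins):

```python
import pandas as pd
from prophet import Prophet

# Synthetic stand-in for the DataFrame read from the EMR Hive table.
pdf = pd.DataFrame({
    "order_date": pd.date_range("2023-01-01", periods=120, freq="D"),
    "sales": range(120),
})

# Prophet expects the columns to be named ds (date) and y (value).
train = pdf.rename(columns={"order_date": "ds", "sales": "y"})

model = Prophet()
model.fit(train)

# Forecast 30 days past the training window and inspect prediction bounds.
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```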
Tutorial 3: Creating and Managing Machine Learning Experiments
This sample Notebook demonstrates how to use MLflow to create experiments, record data, and manage models. The experiment is based on the Iris dataset, uses the KNeighborsClassifier algorithm for model training, and uses MLflow to record and trace experiment data, ultimately producing an optimal model for data classification and prediction.
Note:
To use this tutorial, the Notebook workspace must be bound to the DLC engine with "Use Machine Learning Resource Group" checked. Set the kernel to DLC Resource Group; WeData Notebook will submit the Notebook file to DLC for remote execution.
MLflow is an open-source machine learning platform that provides end-to-end support for the data science lifecycle, including experiment management, model versioning, model deployment, and model monitoring. If the current workspace has the MLflow service enabled, you can record each experiment's parameters, metrics, and results by calling MLflow functions in the experiment, and view them under Machine Learning > Experiment Management and Model Management in WeData, achieving experiment traceability and reproducibility.
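A minimal sketch of the MLflow logging pattern this tutorial uses: train a KNeighborsClassifier on the Iris dataset and record the run. The experiment name is hypothetical, and runs are reported to WeData only when the workspace has the MLflow service enabled:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("iris-knn-demo")  # hypothetical experiment name

with mlflow.start_run():
    n_neighbors = 5
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Log the parameter, metric, and trained model so they can later be
    # viewed under Experiment Management and Model Management.
    mlflow.log_param("n_neighbors", n_neighbors)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```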