Using AVX-512 Instructions to Accelerate AI Applications on CVM
Last updated: 2024-01-06 17:49:55

Overview

The fifth-generation Tencent Cloud CVM instances (including S6, S5, M5, C4, IT5, D3, etc.) all come with the 2nd generation Intel® Xeon® scalable processor Cascade Lake. These instances provide more instruction sets and features that can accelerate artificial intelligence (AI) applications. Integrated hardware enhancement technologies such as Advanced Vector Extensions 512 (AVX-512) boost the parallel computing performance for AI inference and produce better deep learning results.
This document describes how to use AVX-512 on S5 and M5 CVM instances to accelerate AI applications.
Tencent Cloud provides various CVM instance types for different application development needs. The Standard S6, Standard S5, and Memory Optimized M5 instance types come with the 2nd generation Intel® Xeon® processor and support Intel® DL Boost, making them suitable for machine learning or deep learning. The recommended configurations are as follows:
Scenario                                        Instance Specifications
Deep learning training platform                 84vCPU Standard S5 or 48vCPU Memory Optimized M5
Deep learning inference platform                8/16/24/32/48vCPU Standard S5 or Memory Optimized M5
Deep learning training or inference platform    48vCPU Standard S5 or 24vCPU Memory Optimized M5
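Before deploying a framework, you can verify that a created instance actually exposes the AVX-512 and Intel® DL Boost (VNNI) instruction sets. The following minimal Python sketch (an illustrative check, not part of the official directions) reads the CPU flags from /proc/cpuinfo on a Linux instance:

# Minimal sketch: check the CPU flags reported by the Linux guest.
# "avx512f" indicates base AVX-512 support; "avx512_vnni" indicates
# Intel DL Boost (Vector Neural Network Instructions) on Cascade Lake.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
print("AVX-512F:", "avx512f" in flags)
print("AVX-512 VNNI (DL Boost):", "avx512_vnni" in flags)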

Advantages

Running machine learning or deep learning workloads on Intel® Xeon® scalable processors has the following advantages:
Suitable for processing 3D-CNN topologies used in scenarios such as big-memory workloads, medical imaging, GAN, seismic analysis, gene sequencing, etc.
Flexible control over the number of cores used, simply by running the numactl command; applicable to small-scale online inference.
Powerful ecosystem to directly perform distributed training on large clusters, without the need for a large-scale architecture containing additional large-capacity storage and expensive caching mechanisms.
Support for many workloads (such as HPC, BigData, and AI) in a single cluster to deliver better TCO.
Support for SIMD acceleration to meet the computing requirements of various deep learning applications.
The same infrastructure can be used directly for both training and inference.

Directions

Creating an instance

Create an instance as instructed in Creating Instances via CVM Purchase Page. Select a recommended model that suits your actual use case.


Note:
For more information on instance specifications, see Instance Types.

Logging in to the instance

Log in to the CVM instance you created.

Deploying a platform

Deploy an AI platform as instructed below to perform the machine learning or deep learning task:
Sample 1: optimizing the deep learning framework TensorFlow* with Intel®
The Intel®-optimized TensorFlow* on the 2nd generation Intel® Xeon® scalable processor Cascade Lake automatically uses AVX-512 instructions to maximize the computing performance.
TensorFlow* is a widely used large-scale machine learning and deep learning framework. You can improve the training and inference performance of your instance as instructed in the sample below. For more information about the framework, see the Intel® Optimization for TensorFlow* Installation Guide. Follow these steps:

Deploying the TensorFlow* framework

1. Install Python in the CVM instance. This document uses Python 3.7 as an example.
2. Run the following command to install Intel® Optimization for TensorFlow* (intel-tensorflow).
Note:
Version 2.4.0 or later is recommended to obtain the latest features and optimizations.
pip install intel-tensorflow
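After the installation finishes, you can run a quick sanity check. The short Python sketch below (illustrative only) prints the installed version and runs a small matrix multiplication, which goes through the oneDNN-optimized kernels when the Intel® build is in use:

# Minimal sketch: confirm that intel-tensorflow imports and computes correctly.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
x = tf.random.normal([1024, 1024])
y = tf.random.normal([1024, 1024])
z = tf.linalg.matmul(x, y)
print("Matmul output shape:", z.shape)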

Setting runtime parameters

Optimize the runtime parameters according to one of the following two common inference modes as needed. For more information about the optimization settings, see General Best Practices for Intel® Optimization for TensorFlow.
Batch inference: measures how many input tensors can be processed per second with batches of size greater than one. Typically, for batch inference, optimal performance is achieved by exercising all the physical cores on a CPU socket.
Online inference (also called real-time inference): measures the time it takes to process a single input tensor, i.e., a batch size of one. In a real-time inference scenario, optimal throughput is achieved by running multiple instances concurrently.
Follow the steps below:
1. Run the following command to obtain the number of physical cores in the system.
lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
2. Set the optimization parameters using either method:
Set the runtime parameters via environment variables. Add the following configurations to the environment variable file, replacing <physical cores> with the number of physical cores obtained in step 1:
export OMP_NUM_THREADS=<physical cores>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export TF_NUM_INTRAOP_THREADS=<physical cores>
export TF_NUM_INTEROP_THREADS=1
export TF_ENABLE_MKL_NATIVE_FORMAT=0
Add the environment variables in code. Add the following configurations to the Python code you run:
import os
# "FLAGS" and "tf" are provided by the surrounding benchmark script:
# FLAGS holds its command-line arguments and tf is the imported TensorFlow module.
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
if FLAGS.num_intra_threads > 0:
    os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"
config = tf.compat.v1.ConfigProto()  # ConfigProto/Session live under tf.compat.v1 in TensorFlow 2.x
config.intra_op_parallelism_threads = FLAGS.num_intra_threads  # set to the number of physical cores
config.inter_op_parallelism_threads = 1
tf.compat.v1.Session(config=config)
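If you prefer not to fill in the core count by hand, the following sketch (a convenience script assumed here, not part of the official guide) reads the physical core count with the same lscpu command from step 1 and sets the recommended variables before TensorFlow is imported, so that the OpenMP runtime picks them up:

# Minimal sketch: auto-detect physical cores per socket and configure the
# environment before importing TensorFlow.
import os
import subprocess

physical_cores = subprocess.check_output(
    "lscpu | grep 'Core(s) per socket' | cut -d':' -f2 | xargs",
    shell=True, text=True).strip()

os.environ["OMP_NUM_THREADS"] = physical_cores
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["TF_NUM_INTRAOP_THREADS"] = physical_cores
os.environ["TF_NUM_INTEROP_THREADS"] = "1"
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

import tensorflow as tf  # import after the environment is configured
print("Configured TensorFlow threading for", physical_cores, "physical cores")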

Running inference on the TensorFlow* deep learning model

This document describes how to run an inference benchmark with ResNet50; for more information, see ResNet50 (v1.5). To run inference on other machine learning or deep learning models, see Image Recognition with ResNet50, ResNet101 and InceptionV3.
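If you want a quick local measurement without downloading the benchmark scripts referenced above, the sketch below (illustrative only; it uses a Keras ResNet50 with random weights and is not the official benchmark) estimates batch-inference throughput:

# Minimal sketch: rough batch-inference throughput with a Keras ResNet50.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # random weights are enough for a throughput test
batch = np.random.rand(32, 224, 224, 3).astype("float32")

model.predict(batch)  # warm-up run
start = time.time()
for _ in range(20):
    model.predict(batch)
elapsed = time.time() - start
print("Approximate images/sec:", 20 * batch.shape[0] / elapsed)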

Training on the TensorFlow* deep learning model

This document describes how to run a training benchmark with ResNet50. For more information, see the FP32 Training Instructions.
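As a lightweight illustration of a training run (this is not the FP32 training script referenced above; it uses synthetic data and random weights), you can time one epoch of ResNet50 training as follows:

# Minimal sketch: time one epoch of ResNet50 training on synthetic data.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, classes=10)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

x = np.random.rand(64, 224, 224, 3).astype("float32")
y = np.random.randint(0, 10, size=(64,))

start = time.time()
model.fit(x, y, batch_size=16, epochs=1, verbose=0)
print("One epoch over 64 synthetic images took %.1f s" % (time.time() - start))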

TensorFlow performance

The performance data is as shown in Improving TensorFlow* Inference Performance on Intel® Xeon® Processors and may vary slightly with the models and actual configurations.
Latency performance: We tested image classification and object detection models at batch size one and found that Intel® Optimization for TensorFlow with AVX-512 instructions improves inference performance over the non-optimized version. For example, the latency of the optimized ResNet 50 is reduced to 45% of that of the original version.
Throughput performance: We tested image classification and object detection models for throughput at a large batch size and found significant improvements. The throughput of the optimized ResNet 50 is increased to 1.98 times that of the original version.
