Platform type | Instance specification
--- | ---
Deep learning training platform | Standard S5 instance with 84 vCPUs, or Memory-optimized M5 instance with 48 vCPUs.
Deep learning inference platform | Standard S5 or Memory-optimized M5 instance with 8, 16, 24, 32, or 48 vCPUs.
Machine learning training or inference platform | Standard S5 instance with 48 vCPUs, or Memory-optimized M5 instance with 24 vCPUs.
The numactl command can be used for flexible core control, which also suits small-batch real-time inference (see the usage sketch after the installation command below). Install Intel® optimized TensorFlow:
pip install intel-tensorflow
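As a rough illustration of numactl-based core binding (the script name, core IDs, and NUMA node IDs below are placeholders; inspect your instance topology with numactl --hardware first):

numactl --hardware                                       # list available NUMA nodes and CPUs
numactl --cpunodebind=0 --membind=0 python infer.py      # bind one inference process to NUMA node 0
numactl --physcpubind=8-15 --membind=0 python infer.py   # or pin a second process to explicit physical cores

Launching one process per core group in this way keeps each small-batch inference instance on its own set of physical cores.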
lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
export OMP_NUM_THREADS=<physical cores>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export TF_NUM_INTRAOP_THREADS=<physical cores>
export TF_NUM_INTEROP_THREADS=1
export TF_ENABLE_MKL_NATIVE_FORMAT=0
import os
import tensorflow as tf

os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
if FLAGS.num_intra_threads > 0:
    os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)  # <physical cores>
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

config = tf.ConfigProto()
config.intra_op_parallelism_threads = FLAGS.num_intra_threads  # <physical cores>
config.inter_op_parallelism_threads = 1
tf.Session(config=config)
lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
export OMP_NUM_THREADS=<physical cores>
export GOMP_CPU_AFFINITY="0-<physical cores - 1>"
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export OMP_NUM_THREADS=<physical cores>
export LD_PRELOAD=<path_to_libiomp5.so>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
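If libiomp5.so is not already present on the instance, one way to obtain it (an illustration, not the only option) is the intel-openmp pip package; the exact path under your Python prefix will vary by environment:

pip install intel-openmp                                                  # bundles libiomp5.so
find "$(python -c 'import sys; print(sys.prefix)')" -name "libiomp5.so"   # locate it for LD_PRELOAD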
import intel_pytorch_extension
...
net = net.to('xpu')        # Move the model to IPEX format
data = data.to('xpu')      # Move the input data to IPEX format
...
output = net(data)         # Run inference with IPEX
output = output.to('cpu')  # Move the output back to ATen (CPU) format
jemalloc is a malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support, and is intended to serve as the system-provided memory allocator. jemalloc offers introspection, memory-management, and tuning capabilities beyond those of a standard allocator. For details, see jemalloc and the sample code.
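A minimal sketch of enabling jemalloc for an inference run via LD_PRELOAD follows; the library path, the MALLOC_CONF tuning values, and the script name are assumptions to adapt to your installation and workload:

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2                          # path is installation-dependent
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto"    # example tuning options
python inference.py                                                                   # hypothetical inference script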
The validated framework versions include Intel® optimized TensorFlow v1.15.0, v1.15.0up1, v1.15.0up2, v2.0.0, v2.1.0, v2.2.0, v2.3.0, and v2.4.0;
Intel® optimized PyTorch v1.5.0+cpu and v1.6.0+cpu; Intel® optimized MXNet v1.6.0 and v1.7.0; and ONNX-Runtime v1.6.0.

Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | Realtime Latency Ratio [FP32/INT8]
--- | --- | --- | --- | --- | --- | ---
tensorflow | 2.4.0 | resnet50v1.5 | 76.92% | 76.46% | 0.60% | 3.37x |
tensorflow | 2.4.0 | resnet101 | 77.18% | 76.45% | 0.95% | 2.53x |
tensorflow | 2.4.0 | inception_v1 | 70.41% | 69.74% | 0.96% | 1.89x |
tensorflow | 2.4.0 | inception_v2 | 74.36% | 73.97% | 0.53% | 1.95x |
tensorflow | 2.4.0 | inception_v3 | 77.28% | 76.75% | 0.69% | 2.37x |
tensorflow | 2.4.0 | inception_v4 | 80.39% | 80.27% | 0.15% | 2.60x |
tensorflow | 2.4.0 | inception_resnet_v2 | 80.38% | 80.40% | -0.02% | 1.98x |
tensorflow | 2.4.0 | mobilenetv1 | 73.29% | 70.96% | 3.28% | 2.93x |
tensorflow | 2.4.0 | ssd_resnet50_v1 | 37.98% | 38.00% | -0.05% | 2.99x |
tensorflow | 2.4.0 | mask_rcnn_inception_v2 | 28.62% | 28.73% | -0.38% | 2.96x |
tensorflow | 2.4.0 | vgg16 | 72.11% | 70.89% | 1.72% | 3.76x |
tensorflow | 2.4.0 | vgg19 | 72.36% | 71.01% | 1.90% | 3.85x |
Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | Realtime Latency Ratio [FP32/INT8]
--- | --- | --- | --- | --- | --- | ---
pytorch | 1.5.0+cpu | resnet50 | 75.96% | 76.13% | -0.23% | 2.46x |
pytorch | 1.5.0+cpu | resnext101_32x8d | 79.12% | 79.31% | -0.24% | 2.63x |
pytorch | 1.6.0a0+24aac32 | bert_base_mrpc | 88.90% | 88.73% | 0.19% | 2.10x |
pytorch | 1.6.0a0+24aac32 | bert_base_cola | 59.06% | 58.84% | 0.37% | 2.23x |
pytorch | 1.6.0a0+24aac32 | bert_base_sts-b | 88.40% | 89.27% | -0.97% | 2.13x |
pytorch | 1.6.0a0+24aac32 | bert_base_sst-2 | 91.51% | 91.86% | -0.37% | 2.32x |
pytorch | 1.6.0a0+24aac32 | bert_base_rte | 69.31% | 69.68% | -0.52% | 2.03x |
pytorch | 1.6.0a0+24aac32 | bert_large_mrpc | 87.45% | 88.33% | -0.99% | 2.65x |
pytorch | 1.6.0a0+24aac32 | bert_large_squad | 92.85 | 93.05 | -0.21% | 1.92x |
pytorch | 1.6.0a0+24aac32 | bert_large_qnli | 91.20% | 91.82% | -0.68% | 2.59x |
pytorch | 1.6.0a0+24aac32 | bert_large_rte | 71.84% | 72.56% | -0.99% | 1.34x |
pytorch | 1.6.0a0+24aac32 | bert_large_cola | 62.74% | 62.57% | 0.27% | 2.67x |
conda create -n lpot python=3.7
conda activate lpot
pip install lpot
git clone https://github.com/intel/lpot.git
cd lpot
pip install -r requirements.txt
python setup.py install
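Either way, a quick sanity check that the package is importable (printing the module path is just for confirmation):

python -c "import lpot; print(lpot.__file__)"   # should print the installed package location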
mkdir -p img_raw/val && cd img_raw
wget http://www.image-net.org/challenges/LSVRC/2012/dd31405981ef5f776aa17412e1f0c112/ILSVRC2012_img_val.tar
tar -xvf ILSVRC2012_img_val.tar -C val
cd val
wget -qO - https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
cd examples/tensorflow/image_recognition
bash prepare_dataset.sh --output_dir=./data --raw_dir=/PATH/TO/img_raw/val/ --subset=validation
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb
Modify examples/tensorflow/image_recognition/resnet50_v1.yaml so that the dataset paths in the quantization\calibration, evaluation\accuracy, and evaluation\performance sections point to your actual local paths, i.e. the location of the TFRecord data generated during dataset preparation (see the sketch below). For details, see ResNet50 V1.0.
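As a hypothetical illustration of that edit, assuming the config ships with /path/to/... placeholder roots (check the actual values in your copy of resnet50_v1.yaml before running, or simply edit the three entries by hand):

TFRECORD_DIR=/PATH/TO/data                                                  # TFRecord directory from the dataset-preparation step
sed -i "s#/path/to/calibration/dataset#${TFRECORD_DIR}#g" resnet50_v1.yaml
sed -i "s#/path/to/evaluation/dataset#${TFRECORD_DIR}#g" resnet50_v1.yaml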
Then run the tuning:
cd examples/tensorflow/image_recognition
bash run_tuning.sh --config=resnet50_v1.yaml \
    --input_model=/PATH/TO/resnet50_fp32_pretrained_model.pb \
    --output_model=./lpot_resnet50_v1.pb
bash run_benchmark.sh --input_model=./lpot_resnet50_v1.pb --config=resnet50_v1.yaml
accuracy mode benchmark result:
Accuracy is 0.739
Batch size = 32
Latency: 1.341 ms
Throughput: 745.631 images/sec

performance mode benchmark result:
Accuracy is 0.000
Batch size = 32
Latency: 1.300 ms
Throughput: 769.302 images/sec