Scenario | Instance Specifications
--- | ---
Deep learning training platform | 84 vCPU Standard S5 or 48 vCPU Memory Optimized M5
Deep learning inference platform | 8/16/24/32/48 vCPU Standard S5 or Memory Optimized M5
Deep learning training or inference platform | 48 vCPU Standard S5 or 24 vCPU Memory Optimized M5
You can bind processes to specific cores or NUMA nodes with the `numactl` command, which is applicable to small-scale online inference.

Install Intel® Optimization for TensorFlow:

```
pip install intel-tensorflow
```
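After installing, you can confirm that the build actually uses the oneDNN (MKL-DNN) backend. One common check is TensorFlow's internal `IsMklEnabled` helper; the sketch below assumes a TensorFlow version that still exposes it, since internal APIs can move between releases:

```python
import tensorflow as tf
from tensorflow.python.framework import test_util

# Intel-optimized builds such as intel-tensorflow should report True here.
print("TensorFlow version:", tf.__version__)
print("oneDNN/MKL enabled:", test_util.IsMklEnabled())
```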
Query the number of physical cores per socket:

```
lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
```
Set the following environment variables before launching inference, replacing the placeholders with the physical core count obtained above:

```
export OMP_NUM_THREADS=<physical cores>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export TF_NUM_INTRAOP_THREADS=<physical cores>
export TF_NUM_INTEROP_THREADS=1
export TF_ENABLE_MKL_NATIVE_FORMAT=0
```
Alternatively, apply the equivalent configuration in your Python script:

```python
import os
import tensorflow as tf

os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
# FLAGS comes from the script's command-line flag parsing.
if FLAGS.num_intra_threads > 0:
    os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)  # number of physical cores
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

config = tf.ConfigProto()
config.intra_op_parallelism_threads = FLAGS.num_intra_threads  # number of physical cores
config.inter_op_parallelism_threads = 1
tf.Session(config=config)
```
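`ConfigProto` and `tf.Session` are TensorFlow 1.x APIs. On TensorFlow 2.x, the same thread-pool settings can be applied through `tf.config.threading`; a minimal sketch follows, where the core count is derived with a simple heuristic that assumes two hyperthreads per physical core (use the `lscpu` result above for an exact value):

```python
import os
import tensorflow as tf

# Heuristic: assumes SMT/hyperthreading is enabled with 2 threads per core.
physical_cores = (os.cpu_count() or 2) // 2

os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# TF2 equivalents of intra_op/inter_op_parallelism_threads above.
tf.config.threading.set_intra_op_parallelism_threads(physical_cores)
tf.config.threading.set_inter_op_parallelism_threads(1)
```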
Query the number of physical cores per socket:

```
lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
```
When using the GNU OpenMP runtime, set:

```
export OMP_NUM_THREADS=<physical cores>
export GOMP_CPU_AFFINITY="0-<physical cores - 1>"
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
```
When using the Intel OpenMP runtime, preload `libiomp5.so` and set:

```
export OMP_NUM_THREADS=<physical cores>
export LD_PRELOAD=<path_to_libiomp5.so>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
```
```python
import intel_pytorch_extension
...
net = net.to('xpu')        # Move the model to IPEX format
data = data.to('xpu')      # Move the input data to IPEX format
...
output = net(data)         # Perform inference with IPEX
output = output.to('cpu')  # Move the output back to ATen format
```
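Putting the fragments above together, a minimal end-to-end sketch might look like the following. The torchvision ResNet-50 model and the random input tensor are illustrative assumptions, and the `'xpu'` device string follows the snippet above, which applies to the v1.x `intel_pytorch_extension` package:

```python
import torch
import torchvision.models as models
import intel_pytorch_extension  # registers the 'xpu' device (IPEX v1.x)

# Illustrative model and input; substitute your own.
net = models.resnet50(pretrained=True).eval()
data = torch.rand(1, 3, 224, 224)

net = net.to('xpu')    # move the model to IPEX format
data = data.to('xpu')  # move the input to IPEX format

with torch.no_grad():  # inference only; skip autograd bookkeeping
    output = net(data)

output = output.to('cpu')  # move the output back to ATen format
print(output.shape)
```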
jemalloc is a general-purpose `malloc(3)` implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is intended for use as the system-provided memory allocator, and it provides introspection, memory management, and tuning features beyond the standard allocator functionality. For more information, see jemalloc and sample codes.

The Intel® Low Precision Optimization Tool (LPOT) supports the following Intel® optimized frameworks:

- Intel® optimized TensorFlow, including v1.15.0, v1.15.0up1, v1.15.0up2, v2.0.0, v2.1.0, v2.2.0, v2.3.0, and v2.4.0.
- Intel® optimized PyTorch, including v1.5.0+cpu and v1.6.0+cpu.
- Intel® optimized MXNet, including v1.6.0 and v1.7.0; ONNX-Runtime: v1.6.0.
Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | Realtime Latency Ratio [FP32/INT8]
--- | --- | --- | --- | --- | --- | ---
tensorflow | 2.4.0 | resnet50v1.5 | 76.92% | 76.46% | 0.60% | 3.37x |
tensorflow | 2.4.0 | resnet101 | 77.18% | 76.45% | 0.95% | 2.53x |
tensorflow | 2.4.0 | inception_v1 | 70.41% | 69.74% | 0.96% | 1.89x |
tensorflow | 2.4.0 | inception_v2 | 74.36% | 73.97% | 0.53% | 1.95x |
tensorflow | 2.4.0 | inception_v3 | 77.28% | 76.75% | 0.69% | 2.37x |
tensorflow | 2.4.0 | inception_v4 | 80.39% | 80.27% | 0.15% | 2.60x |
tensorflow | 2.4.0 | inception_resnet_v2 | 80.38% | 80.40% | -0.02% | 1.98x |
tensorflow | 2.4.0 | mobilenetv1 | 73.29% | 70.96% | 3.28% | 2.93x |
tensorflow | 2.4.0 | ssd_resnet50_v1 | 37.98% | 38.00% | -0.05% | 2.99x |
tensorflow | 2.4.0 | mask_rcnn_inception_v2 | 28.62% | 28.73% | -0.38% | 2.96x |
tensorflow | 2.4.0 | vgg16 | 72.11% | 70.89% | 1.72% | 3.76x |
tensorflow | 2.4.0 | vgg19 | 72.36% | 71.01% | 1.90% | 3.85x |
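As a quick sanity check of the Acc Ratio column, the first row of the table above can be reproduced directly from its INT8 and FP32 values:

```python
# resnet50v1.5 row: Acc Ratio = (INT8 - FP32) / FP32
int8_acc, fp32_acc = 76.92, 76.46
print(f"{(int8_acc - fp32_acc) / fp32_acc:.2%}")  # -> 0.60%
```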
Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | Realtime Latency Ratio [FP32/INT8]
--- | --- | --- | --- | --- | --- | ---
pytorch | 1.5.0+cpu | resnet50 | 75.96% | 76.13% | -0.23% | 2.46x |
pytorch | 1.5.0+cpu | resnext101_32x8d | 79.12% | 79.31% | -0.24% | 2.63x |
pytorch | 1.6.0a0+24aac32 | bert_base_mrpc | 88.90% | 88.73% | 0.19% | 2.10x |
pytorch | 1.6.0a0+24aac32 | bert_base_cola | 59.06% | 58.84% | 0.37% | 2.23x |
pytorch | 1.6.0a0+24aac32 | bert_base_sts-b | 88.40% | 89.27% | -0.97% | 2.13x |
pytorch | 1.6.0a0+24aac32 | bert_base_sst-2 | 91.51% | 91.86% | -0.37% | 2.32x |
pytorch | 1.6.0a0+24aac32 | bert_base_rte | 69.31% | 69.68% | -0.52% | 2.03x |
pytorch | 1.6.0a0+24aac32 | bert_large_mrpc | 87.45% | 88.33% | -0.99% | 2.65x |
pytorch | 1.6.0a0+24aac32 | bert_large_squad | 92.85 | 93.05 | -0.21% | 1.92x |
pytorch | 1.6.0a0+24aac32 | bert_large_qnli | 91.20% | 91.82% | -0.68% | 2.59x |
pytorch | 1.6.0a0+24aac32 | bert_large_rte | 71.84% | 72.56% | -0.99% | 1.34x |
pytorch | 1.6.0a0+24aac32 | bert_large_cola | 62.74% | 62.57% | 0.27% | 2.67x |
It is recommended to install `lpot` in an isolated Anaconda environment. This document uses Python 3.7 as an example:

```
conda create -n lpot python=3.7
conda activate lpot
```
Method 1: install the release binary with pip:

```
pip install lpot
```
Method 2: install from source:

```
git clone https://github.com/intel/lpot.git
cd lpot
pip install -r requirements.txt
python setup.py install
```
Download and extract the ImageNet validation images:

```
mkdir -p img_raw/val && cd img_raw
wget http://www.image-net.org/challenges/LSVRC/2012/dd31405981ef5f776aa17412e1f0c112/ILSVRC2012_img_val.tar
tar -xvf ILSVRC2012_img_val.tar -C val
```
Move the images into labeled subdirectories:

```
cd val
wget -qO - https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
```
Convert the raw images to the TFRecord format:

```
cd examples/tensorflow/image_recognition
bash prepare_dataset.sh --output_dir=./data --raw_dir=/PATH/TO/img_raw/val/ --subset=validation
```
Download the pretrained FP32 ResNet50 model:

```
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb
```
Modify the `examples/tensorflow/image_recognition/resnet50_v1.yaml` file so that the dataset paths under `quantization\calibration`, `evaluation\accuracy`, and `evaluation\performance` point to your actual local path, i.e., the location of the TFRecord data generated during dataset preparation. For more information, see ResNet50 V1.0. Then run the tuning script:

```
cd examples/tensorflow/image_recognition
bash run_tuning.sh --config=resnet50_v1.yaml \
    --input_model=/PATH/TO/resnet50_fp32_pretrained_model.pb \
    --output_model=./lpot_resnet50_v1.pb
```
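The `run_tuning.sh` script wraps LPOT's Python API. If you prefer to drive quantization from your own code, a minimal sketch along these lines should work with the LPOT v1.x releases listed above; note that the callable `Quantization` entry point moved to `lpot.experimental` in later releases, so check the API of the version you installed:

```python
from lpot import Quantization

# Reuses the same resnet50_v1.yaml edited above (dataloaders and
# accuracy metric are defined in the YAML config).
quantizer = Quantization('./resnet50_v1.yaml')

# Returns the tuned INT8 model; the input is the FP32 model downloaded earlier.
quantized_model = quantizer('/PATH/TO/resnet50_fp32_pretrained_model.pb')
```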
Evaluate the tuned model:

```
bash run_benchmark.sh --input_model=./lpot_resnet50_v1.pb --config=resnet50_v1.yaml
```
The output is similar to the following:

```
accuracy mode benchmark result:
Accuracy is 0.739
Batch size = 32
Latency: 1.341 ms
Throughput: 745.631 images/sec

performance mode benchmark result:
Accuracy is 0.000
Batch size = 32
Latency: 1.300 ms
Throughput: 769.302 images/sec
```

Accuracy is reported as 0.000 in performance mode because that mode measures latency and throughput only and does not evaluate accuracy.