mxnet-commits mailing list archives

From GitBox <>
Subject [GitHub] [incubator-mxnet] ThomasDelteil commented on a change in pull request #15448: [MKLDNN]Enhance Quantization APIs and Tutorial
Date Thu, 04 Jul 2019 13:40:07 GMT

 File path: docs/tutorials/mkldnn/
 @@ -0,0 +1,259 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements.  See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership.  The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License.  You may obtain a copy of the License at -->
+<!--- -->
+<!---   http://www.apache.org/licenses/LICENSE-2.0 -->
+<!--- -->
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied.  See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+# Quantize custom models with MKL-DNN backend
+This document introduces how to quantize custom models from FP32 to INT8 with the
Apache MXNet toolkit and APIs on Intel CPUs.
+If you are not familiar with the Apache MXNet quantization flow, please refer to the [quantization
flow]( first; performance data is shown in [Apache/MXNet C++ interface](
and [GluonCV](
+## Installation and Prerequisites
+Installing MXNet with the MKL-DNN backend is an easy and essential process. You can follow [How
to build and install MXNet with MKL-DNN backend](
to build and install MXNet from source. Alternatively, you can install the release or nightly version
via PyPI with pip by running:
+
+```
+# release version
+pip install mxnet-mkl
+# nightly version
+pip install mxnet-mkl --pre
+```
+## Image Classification Demo
+A quantization script [](
has been designed to launch quantization for image-classification models. This script is 
integrated with [Gluon-CV modelzoo](,
so that all pre-trained models can be downloaded from Gluon-CV and then converted for quantization.
+For details, you can refer to [Model Quantization with Calibration Examples](
+## Integrate Quantization Flow to Your Project
+Quantization flow works for both symbolic and Gluon models. If you're using Gluon, you can
first refer to [Saving and Loading Gluon Models](
to hybridize your computation graph and export it as a symbol before running quantization.
+In general, the quantization flow includes 4 steps. Users can reach an acceptable accuracy
with steps 1 to 3 with minimal effort. Most of the work in this stage is out-of-the-box, and data
scientists and researchers only need to focus on how to represent data and layers in their
model. After a quantized model is generated, you may want to deploy it online, and performance
becomes the next key point. Step 4, calibration, can improve performance significantly by
reducing a lot of runtime calculation.
+![quantization flow](quantization.png)
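To build intuition for what this flow produces, here is a minimal, framework-independent sketch of symmetric INT8 quantization in plain Python (illustrative only, not the MXNet implementation): a threshold maps FP32 values into the int8 range, and dequantization maps them back with some rounding error; values beyond the threshold saturate.

```python
def quantize_int8(x, threshold):
    # Symmetric int8 quantization: map [-threshold, threshold] to [-127, 127].
    scale = 127.0 / threshold
    q = int(round(x * scale))
    return max(-127, min(127, q))

def dequantize_int8(q, threshold):
    # Map the int8 value back to an approximate FP32 value.
    return q * threshold / 127.0

# A value inside the threshold survives the round trip with small error;
# a value outside the threshold saturates at 127.
print(quantize_int8(0.5, threshold=1.0))   # 64
print(quantize_int8(3.0, threshold=1.0))   # 127 (saturated)
print(dequantize_int8(64, threshold=1.0))
```

The choice of threshold is exactly what the calibration step (step 4) optimizes.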
+Now, we are going to take Gluon ResNet18 as an example to show how each step works.
+### Initialize Model
+```python
+import logging
+import mxnet as mx
+from mxnet.gluon.model_zoo import vision
+from mxnet.contrib.quantization import *
+
+logger = logging.getLogger('logger')
+batch_shape = (1, 3, 224, 224)
+resnet18 = vision.resnet18_v1(pretrained=True)
+# export the Gluon model as a symbol so it can be loaded for quantization
+resnet18.hybridize()
+resnet18(mx.nd.ones(batch_shape))
+resnet18.export('resnet18_v1')
+sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v1', 0)
+# (optional) visualize float32 model
+```
+First, we download the resnet18-v1 model from the Gluon model zoo and export it as a symbol. You
can visualize the float32 model. Below is a raw residual block.
+![float32 model](fp32_raw.png)
+#### Model Fusion
+```python
+sym = sym.get_backend_symbol('MKLDNN_QUANTIZE')
+# (optional) visualize fused float32 model
+```
+It's important to add this line to enable graph fusion before quantization to get better
performance. Below is a fused residual block. Batchnorm, Activation and elemwise_add are fused
into Convolution.
+![float32 fused model](fp32_fusion.png)
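As a rough illustration of why folding BatchNorm into Convolution is free at inference time, here is a scalar (single-channel) sketch in plain Python (illustrative only, not MXNet code): the BatchNorm scale and shift can be merged into the convolution's weight and bias ahead of time, so no separate BatchNorm computation remains at runtime.

```python
import math

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # Conv output: y = w*x + b.
    # BatchNorm output: (y - mean) / sqrt(var + eps) * gamma + beta.
    # Folding both into one affine op gives a new weight w' and bias b'.
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

# With identity BatchNorm statistics the weights are unchanged.
w2, b2 = fold_bn_into_conv(2.0, 0.0, gamma=1.0, beta=0.0, mean=0.0, var=1.0, eps=0.0)
print(w2, b2)  # 2.0 0.0
```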
+### Quantize Model
+A python interface `quantize_graph` is provided for the user. Thus, it is very flexible for
the data scientist to construct the expected models based on different requirements in a real
deployment.
+```python
+# quantize configs
+# set exclude layers
+excluded_names = []
+# set calib mode
+calib_mode = 'none'
+# set calib_layer
+calib_layer = None
+# set quantized_dtype
+quantized_dtype = 'auto'
+'Quantizing FP32 model Resnet18-V1')
+qsym, qarg_params, aux_params, collector = quantize_graph(sym=sym, arg_params=arg_params,
+                                                          aux_params=aux_params,
+                                                          excluded_sym_names=excluded_names,
+                                                          calib_mode=calib_mode, calib_layer=calib_layer,
+                                                          quantized_dtype=quantized_dtype,
+                                                          logger=logger)
+# (optional) visualize quantized model
+# save quantized model
+mx.model.save_checkpoint('quantized-resnet18_v1', 0, qsym, qarg_params, aux_params)
+```
+By applying `quantize_graph` to the symbolic model, a new quantized model named `qsym` is generated,
along with its parameters. We can see that `_contrib_requantize` operators are inserted
after `Convolution` to convert the INT32 output to INT8.
+![none calibrated model](none_calib.png)
+The table below describes each parameter.
+
+| param              | type            | description |
+|--------------------|-----------------|-------------|
+| excluded_sym_names | list of strings | A list of strings representing the names of the symbols that users want to exclude from being quantized. |
+| calib_mode         | str             | If calib_mode='none', no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower than those with calibration during inference.<br>If calib_mode='naive', the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization.<br>If calib_mode='entropy', the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset. |
+| calib_layer        | function        | Given a layer's output name as a string, return True or False to decide whether to calibrate this layer.<br>If True, the statistics of the layer's output will be collected; otherwise, no information about the layer's output will be collected.<br>If not provided, all the layers' outputs that need requantization will be collected. |
+| quantized_dtype    | str             | The quantized destination type for input data. Currently supports 'int8', 'uint8' and 'auto'.<br>'auto' means the output type is selected automatically according to the calibration result. |
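To illustrate the difference between the runtime ('none') and calibrated ('naive') modes described above, here is a minimal plain-Python sketch (illustrative only, not the MXNet implementation): calibration records min/max statistics over a calibration dataset once, ahead of time, instead of computing them for every inference batch.

```python
def naive_threshold(calibration_outputs):
    # calib_mode='naive': take the observed absolute min/max of a layer's
    # FP32 outputs over the calibration dataset as the quantization threshold.
    lo = min(min(batch) for batch in calibration_outputs)
    hi = max(max(batch) for batch in calibration_outputs)
    return max(abs(lo), abs(hi))

# Two calibration batches of one layer's FP32 outputs (toy numbers).
calib = [[-0.2, 0.1, 3.5], [0.7, -1.2, 2.0]]
print(naive_threshold(calib))  # 3.5
```

With calib_mode='entropy', the threshold would instead be searched so that the KL divergence between the FP32 and quantized output distributions is minimized, at the cost of a longer calibration pass.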
+### Evaluate & Tune
+Now, you have a pair of quantized symbol and params files for inference. For Gluon inference, the
only difference is to load the model and params via a SymbolBlock, as in the example below:
+
+```python
+quantized_net = mx.gluon.SymbolBlock.imports('quantized-resnet18_v1-symbol.json', 'data',
+                                             'quantized-resnet18_v1-0000.params')
+quantized_net.hybridize(static_shape=True, static_alloc=True)
+batch_size = 1
+data = mx.nd.ones((batch_size, 3, 224, 224))
+quantized_net(data)
+```
+Now, you can get the accuracy of the quantized network. Furthermore, you can try excluding
different layers or OPs from quantization via the `excluded_sym_names` parameter to find an
acceptable accuracy.
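When tuning `excluded_sym_names`, the usual acceptance criterion is the accuracy drop of the INT8 model relative to the FP32 baseline. A minimal plain-Python helper for that comparison (illustrative only, not part of the MXNet API) might look like:

```python
def top1_accuracy(predictions, labels):
    # Fraction of samples where the predicted class matches the label.
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / len(labels)

# Toy predicted class ids from the FP32 and INT8 models on the same samples.
fp32_preds = [3, 1, 2, 7, 5]
int8_preds = [3, 1, 2, 0, 5]
labels     = [3, 1, 2, 7, 9]
fp32_acc = top1_accuracy(fp32_preds, labels)  # 0.8
int8_acc = top1_accuracy(int8_preds, labels)  # 0.6
print('accuracy drop:', fp32_acc - int8_acc)
```

If the drop is too large, exclude more sensitive layers and re-quantize.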
 Review comment:
   More details would be appreciated here: which layers should be excluded, fully connected
ones? Why should the 'flatten' layer be excluded for it to work?
