Thanks. Let's lay down the high-level API design for some of the quantized operators. A large
portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene
and @jnorwood for sharing their experiences with quantization, and also to @shoubhik for helping
design this RFC.
* [Discussion](https://discuss.tvm.ai/t/tflitequantizedconv2doperatorconversion/2651)
Other non-TVM-related links that were used to understand quantization
* gemmlowp - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFLite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

**Covered frameworks for now** - TFLite and MXNet
**Target network for now** - Inception V3 from TFLite. (I will create one for MXNet.)
**Target platforms for now** - ARM and Intel (will create a separate issue as the project progresses)

**List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d,
quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The
initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other
quantized_* operators will be along the same lines as quantized_conv2d).
## Op quantize
```python
def quantize(data, scale, zero_point, out_dtype):
    """Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data : FP32 tensor
        The input tensor in FP32.
    scale : FP32 scalar (an attribute of the op)
        The float scalar to scale the int8 values back to FP32.
    zero_point : Int32 scalar (an attribute of the op)
        The zero point of the distribution.
    out_dtype : str
        The dtype of the output. Can only be int8/uint8.

    Returns
    -------
    quantized_data : int8/uint8 tensor
        The quantized tensor.
    """
```
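For reference, the op uses the affine mapping `real_value = scale * (quantized_value - zero_point)`. A minimal numpy sketch of the forward quantize math (illustrative only, not the proposed Relay implementation; the qmin/qmax bounds are assumed from out_dtype):
```python
import numpy as np

def quantize_ref(data, scale, zero_point, out_dtype="uint8"):
    # Bounds assumed from out_dtype: uint8 -> [0, 255], int8 -> [-128, 127].
    qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
    # Invert real = scale * (q - zero_point), then saturate to the dtype range.
    q = np.round(data / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(out_dtype)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype="float32")
print(quantize_ref(x, scale=0.0078125, zero_point=128))  # -> [  0 128 192 255]
```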
Key points to discuss
* The scale and zero_point calculations happen outside the Relay graph, i.e., the framework
parsers will have to compute the scale and zero_point if only min and max are provided. [Reference
implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99)
in TFLite. This can also be thought of as a framework-parser utility where we handle min/max,
symmetric/asymmetric etc. and generate the scale and zero_point the way each framework handles
them (a sketch of such a utility follows below).
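A hedged sketch of what such a parser utility might look like for asymmetric uint8 quantization, in the spirit of the TFLite reference linked above (the function name and defaults here are hypothetical, and it assumes rmax > rmin):
```python
import numpy as np

def choose_qparams(rmin, rmax, qmin=0, qmax=255):
    # The real range must contain 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    # zero_point is the quantized value that maps back to real 0.0.
    zero_point = int(round(qmin - rmin / scale))
    return scale, int(np.clip(zero_point, qmin, qmax))

print(choose_qparams(-1.0, 1.0))  # -> (0.00784..., 128)
```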
## Op quantized_conv2d
```python
def quantized_conv2d(quantized_data, quantized_kernel,
                     input_scale, input_zero_point,
                     kernel_scale, kernel_zero_point,
                     output_scale, output_zero_point,
                     out_dtype,
                     # All the remaining attributes from conv2d
                     strides=(1, 1),
                     padding=(0, 0),
                     dilation=(1, 1),
                     groups=1,
                     channels=None,
                     kernel_size=None,
                     data_layout="NCHW",
                     kernel_layout="OIHW",
                     out_layout=""):
    """2D convolution on quantized int8/uint8 tensors. The int32 accumulator
    is requantized to int8/uint8 as part of the op compute. The scale and
    zero_point calculations happen outside the Relay graph, i.e., the
    framework parsers will have to compute them if only min and max are
    provided.

    Parameters
    ----------
    quantized_data : int8/uint8 tensor
        The quantized input tensor in int8/uint8.
    quantized_kernel : int8/uint8 tensor
        The quantized kernel tensor in int8/uint8.
    input_scale : FP32 scalar (an attribute of the op)
        The float scalar to scale the quantized_data int8 values back to FP32.
    input_zero_point : Int32 scalar (an attribute of the op)
        The zero point of the quantized_data distribution.
    kernel_scale : FP32 scalar (an attribute of the op)
        The float scalar to scale the quantized_kernel int8 values back to FP32.
    kernel_zero_point : Int32 scalar (an attribute of the op)
        The zero point of the quantized_kernel distribution.
    output_scale : FP32 scalar (an attribute of the op)
        Set during the quantization process using training/calibration.
        The float scalar to scale the quantized_output int8 values back to FP32.
    output_zero_point : Int32 scalar (an attribute of the op)
        Set during the quantization process using training/calibration.
        The zero point of the quantized_output distribution.
    out_dtype : str
        The dtype of the quantized_output. Can only be int8/uint8.
        The requantization from int32 to int8/uint8 is a part of the op compute.
    ..... Other attributes are the same as for conv2d.

    Returns
    -------
    quantized_output : int8/uint8 tensor
        The quantized output tensor.
    """
```
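To make the discussion points below concrete, here is a small numpy sketch of the integer arithmetic at the core of the op, shown on the matmul inside the convolution, following the gemmlowp doc linked above (illustrative only, not the proposed TVM compute; the requantize step is omitted, and the term numbering in the linked discussion may differ):
```python
import numpy as np

K = 4                                                          # reduction axis
qa = np.random.randint(0, 256, size=(2, K)).astype(np.int32)   # quantized data
qw = np.random.randint(0, 256, size=(K, 3)).astype(np.int32)   # quantized kernel
za, zw = 128, 120                                              # zero points

# Direct form: subtract zero points, accumulate in int32.
direct = (qa - za) @ (qw - zw)

# Expanded form: the term that depends only on the kernel and the constant
# term can, in principle, be folded offline.
term_qaqw   = qa @ qw
term_data   = zw * qa.sum(axis=1, keepdims=True)   # depends on input data
term_kernel = za * qw.sum(axis=0, keepdims=True)   # depends only on kernel
term_const  = K * za * zw                          # compile-time constant

assert (direct == term_qaqw - term_data - term_kernel + term_const).all()
```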
Key points to discuss further
* This op has a set of computations that could ideally be precomputed, but that is difficult
because fold-constant only works across Relay ops and not within a Relay op. This has been
discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tflitequantizedconv2doperatorconversion/2651).
* First precomputable - the core computation has some compute with the kernel (Term 2 and
Term 4 in the above link; see the matmul sketch above) that will be part of the TVM compute.
This is very hard to avoid. We need a fused compute to get the best performance.
* Second precomputable - the output scale and zero_point are used to calculate an integer
multiplier and shift so that all the computations stay in the integer domain (a sketch follows
after this list). This computation changes for each op (e.g., concat will handle it differently
from conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing
the API, replacing output_scale with output_multiplier and output_shift, but that seems very
specific to TFLite, and one might want to handle output_scale and output_zero_point differently.
**I am not sure about this part, so please comment.**
* The op already accounts for the requantization portion. As far as I understand, the
requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and
output_shift, as mentioned above, is for the calculation of the output quantized tensor and
not for requantization.)
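A hedged sketch of the second precomputable: turning the real-valued requantization ratio into a fixed-point multiplier and shift, along the lines of TFLite's QuantizeMultiplier (the function here is a simplified stand-in, assuming a positive multiplier):
```python
import math

def quantize_multiplier(real_multiplier):
    # real_multiplier = input_scale * kernel_scale / output_scale
    significand, shift = math.frexp(real_multiplier)  # significand in [0.5, 1)
    q = int(round(significand * (1 << 31)))           # Q31 fixed point
    if q == (1 << 31):                                # rounding overflowed
        q //= 2
        shift += 1
    return q, shift                                   # real ~= q * 2**(shift - 31)

input_scale, kernel_scale, output_scale = 0.5, 0.02, 0.1
m, s = quantize_multiplier(input_scale * kernel_scale / output_scale)
print(m, s, m * 2.0 ** (s - 31))  # -> 1717986918 -3 0.0999999...
```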
## Op dequantize
Dequantization is required when connecting a quantized operator to an FP32 operator. This
might be a temporary stage where we do not yet have a quantized implementation of the second
op. Dequantization might also be required at the end of the network to keep the output of the
graph in FP32.
```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """Dequantize takes the scale and zero_point attributes and dequantizes
    the int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data : int8/uint8 tensor
        The quantized input tensor in int8/uint8.
    scale : FP32 scalar (an attribute of the op)
        The float scalar to scale the int8 values back to FP32.
    zero_point : Int32 scalar (an attribute of the op)
        The zero point of the distribution.
    out_dtype : str
        The dtype of the output. Can only be float32.

    Returns
    -------
    data : FP32 tensor
        The dequantized tensor.
    """
```
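For completeness, the dequantize math is just the affine map itself; a minimal numpy sketch (illustrative only, not the proposed Relay implementation):
```python
import numpy as np

def dequantize_ref(quantized_data, scale, zero_point):
    # real = scale * (q - zero_point); widen to int32 before subtracting.
    return scale * (quantized_data.astype("int32") - zero_point)

q = np.array([0, 128, 192, 255], dtype="uint8")
print(dequantize_ref(q, scale=0.0078125, zero_point=128))
# -> [-1.  0.  0.5  0.9921875]
```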

