tvm-dev mailing list archives

From Animesh Jain <notificati...@github.com>
Subject [dmlc/tvm] [RFC] Reading quantized models from TFLite and MxNet - operators API (#3252)
Date Tue, 28 May 2019 17:57:03 GMT
To increase quantization support in TVM, it is necessary to support the pre-quantized models,
i.e., the models that have been quantized in the framework itself (outside of Relay). In this
issue, we are laying down the high-level API design for some of the quantized operators. A
large portion of this is coming from the following relevant discussions. Thanks to @jackwish,
@FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik
for helping design this RFC.

* RFC [Issue](https://github.com/dmlc/tvm/issues/2351)
* [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)

Other non-TVM related links that were used to understand quantization
* GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFlite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

---------

**Covered frameworks for now** - TFLite and MxNet
**Target network for now** - Inception V3 from TFLite. (I will create one for MxNet)
**Target platforms for now** - ARM and Intel (separate issues will be created as the project progresses)


---------


**List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d,
quantized_fully_connected, quantized_concat, dequantize

------------


It would be good if we can agree on the Relay ops - their inputs/outputs and attributes. The
initial proposal for the quantize, quantized_conv2d, and dequantize ops is as follows (the other
quantized_* operators will follow the same pattern as quantized_conv2d).


## Op quantize
```python
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """
```

Key points to discuss
* The scale and zero_point calculations happen outside the Relay graph, i.e., the framework
parsers will have to compute the scale and zero_point if only min and max are provided. [Reference
implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99)
in TFLite. This can also be thought of as a framework-parser utility where we handle min/max,
symmetric/asymmetric quantization, etc., and generate the scale and zero_point the same way the
frameworks do.
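As an illustration, here is a minimal NumPy sketch (not TVM/Relay code) of how a framework parser might derive scale and zero_point from min/max for the asymmetric uint8 scheme, and how the quantize op's compute could look. Function names are hypothetical, chosen for this example:

```python
import numpy as np

def compute_scale_zero_point(min_val, max_val, qmin=0, qmax=255):
    # Affine/asymmetric scheme: real_value = scale * (quant_value - zero_point)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = int(round(qmin - min_val / scale))
    # Clamp so that zero_point is a representable quantized value.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize_ref(data, scale, zero_point, qmin=0, qmax=255):
    # Divide by scale, shift by zero_point, clamp to the quantized range.
    q = np.round(data / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)
```

For a tensor with range [-1.0, 1.0] this yields scale = 2/255 and zero_point = 128, so real 0.0 maps exactly to the quantized value 128.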



## Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """
    
    Quantized 2D convolution. Takes quantized int8/uint8 data and kernel tensors
    along with their scales and zero points, and produces a quantized int8/uint8
    output. The scale and zero_point calculations happen outside the Relay graph,
    i.e., the framework parsers will have to compute them if only min and max
    are provided.

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.
    
    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ... Other attributes are the same as in conv2d.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """
```

Key points to discuss further
* This op has a set of computations that could ideally be pre-computed, but this is difficult
because fold-constant only works across Relay ops and not within a Relay op. This has been
discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
    * First pre-computable - the core computation includes terms that involve only the kernel
(Term 2 and Term 4 in the above link) and will be part of the TVM compute. This is very hard
to avoid; we need a fused compute to get the best performance.
    * Second pre-computable - the output scale and zero_point are used to calculate an integer
multiplier and shift so that all computations stay in the integer domain. This computation
changes for each op (e.g., concat handles it differently than conv), so it is also kept inside
the quantized_conv2d op. It could be avoided by changing the API to replace output_scale with
output_multiplier and output_shift, but that seems very specific to TFLite, and one might want
to handle output_scale and output_zero_point in a different manner. **I am not sure about this
part, so please comment.**
* The op already accounts for the requantization portion. As far as I understand, requantization
is just a clamp to out_dtype. (The handling of output_multiplier and output_shift, as mentioned
above, is for the calculation of the output quantized tensor, not for requantization.)
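The multiplier/shift decomposition mentioned above can be sketched as follows, loosely following the gemmlowp/TFLite scheme: the real multiplier (input_scale * kernel_scale / output_scale) is decomposed into a 31-bit fixed-point integer and a power-of-two exponent. The helper names are hypothetical, and the rounding here is a plain-Python simplification of TFLite's saturating doubling high-mul:

```python
import math

def quantize_multiplier(real_multiplier):
    # Decompose a real multiplier M in (0, 1) as M = m0 * 2^shift,
    # where m0 is a fixed-point integer in [2^30, 2^31).
    assert 0.0 < real_multiplier < 1.0
    m, shift = math.frexp(real_multiplier)   # m in [0.5, 1), M = m * 2^shift
    m0 = int(round(m * (1 << 31)))
    if m0 == (1 << 31):                      # rounding pushed m to 1.0
        m0 //= 2
        shift += 1
    return m0, shift

def requantize(acc_int32, m0, shift, zero_point, qmin=0, qmax=255):
    # Multiply the int32 accumulator by the fixed-point multiplier,
    # round, shift back down, add the output zero point, and clamp.
    total_shift = 31 - shift
    rounded = (acc_int32 * m0 + (1 << (total_shift - 1))) >> total_shift
    return max(qmin, min(qmax, rounded + zero_point))
```

For example, a real multiplier of 0.5 decomposes to m0 = 2^30 with shift = 0, so an int32 accumulator of 100 requantizes to 50 (before the zero-point offset and clamp).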




## Op dequantize
Dequantization is required while connecting a quantized operator and an FP32 operator. This
might be a temporary stage where we do not have a quantized implementation of the second op.
Dequantization might also be required at the end of the network to keep the output of the
graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """
```
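A minimal sketch of the dequantize computation (a hypothetical NumPy reference helper, not the Relay op itself) - the inverse of the affine mapping used by quantize:

```python
import numpy as np

def dequantize_ref(quantized_data, scale, zero_point):
    # real_value = scale * (quant_value - zero_point); widen to int32 first
    # so the subtraction cannot wrap around in uint8/int8.
    return scale * (quantized_data.astype(np.int32) - zero_point)
```

With scale = 2/255 and zero_point = 128 (the example range [-1.0, 1.0] from the quantize discussion), the quantized value 128 dequantizes back to 0.0.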


https://github.com/dmlc/tvm/issues/3252