tvm-dev mailing list archives

From Zhao Wu <notificati...@github.com>
Subject Re: [dmlc/tvm] [RFC][Quantization] Support quantized models from TensorflowLite (#2351)
Date Wed, 29 May 2019 19:21:59 GMT
> > > > For the `q_conv2d`, we will add two more arguments:
> > > > ```python
> > > >   output_min=0,
> > > >   output_max=0
> > > > ```
> > > >
> > > > These will be used to restrict the output range, which can be calculated ahead of time.
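
For concreteness, a rough sketch of how such a signature might look. Only `output_min` / `output_max` come from the proposal above; every other parameter is an illustrative assumption, not the actual RFC signature:

```python
# Hypothetical operator stub, for illustration only. The quantization
# parameters (zero points, scales) are assumed; the proposal above only
# specifies the two extra clamp bounds, output_min and output_max.
def q_conv2d(data, weight,
             input_zero_point, kernel_zero_point,
             input_scale, kernel_scale,
             out_dtype="uint8",
             output_min=0,     # lower bound of the requantized output
             output_max=255):  # upper bound of the requantized output
    ...
```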
> > >
> > > I see what you are saying, but I am not sure this is the right approach. In my opinion, it would be better to keep this out of conv. The reason we have these two extra min/max values is the fused activation in TFLite. It seems better to keep them separate so that both MXNet and TFLite can share quantized_conv2d. For TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.
> >
> > Whether or not we have a fused activation function, we always need output_min / output_max. The conv produces an int32 result, but we need a uint8 result, so we must restrict the int32 values to the uint8 range. If there is no fused activation function (many TFLite quantized models have none), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. We then avoid producing another clamp; the restriction is simply applied during conv2d's requantize (int32 -> uint8) step, which is natural.
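
To make that requantize step concrete, here is a minimal numpy sketch of clamping while scaling the accumulator back to uint8. It uses a float multiplier for simplicity (real kernels use a fixed-point multiplier and shift), and all names are illustrative:

```python
import numpy as np

def requantize_int32_to_uint8(acc, multiplier, output_zero_point,
                              output_min=0, output_max=255):
    """Scale an int32 accumulator to uint8 and clamp in the same step.

    Simplified sketch: a real kernel would use a fixed-point multiplier
    plus shift instead of a float multiply. Names are illustrative.
    """
    scaled = np.round(acc.astype(np.int64) * multiplier) + output_zero_point
    return np.clip(scaled, output_min, output_max).astype(np.uint8)

# With a fused relu6, the caller would pass the quantized bounds of the
# real interval [0, 6] as output_min / output_max instead of 0 / 255.
```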
>
> In the case where the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically the out_dtype. So, other than out_dtype, we do not need any extra information in quantized_conv2d for going back to uint8/int8. Correct?
>
> Now, if the activation is fused, I agree that we will have two clamps: one inside the quantized_conv2d (0/255) and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with a single clamp operator.
>
> The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.
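
The back-to-back clamp rewrite mentioned above is straightforward. A toy sketch of the rule on a namedtuple expression; real code would use Relay's pattern-rewriting machinery, so everything here is illustrative:

```python
from collections import namedtuple

# Toy expression node standing in for Relay's clip/clamp op.
Clamp = namedtuple("Clamp", ["input", "a_min", "a_max"])

def merge_adjacent_clamps(expr):
    """clamp(clamp(x, a1, b1), a2, b2) -> clamp(x, max(a1, a2), min(b1, b2))."""
    if isinstance(expr, Clamp) and isinstance(expr.input, Clamp):
        inner = expr.input
        return Clamp(inner.input,
                     max(inner.a_min, expr.a_min),
                     min(inner.a_max, expr.a_max))
    return expr

# The conv's 0/255 clamp followed by a relu6 0/6 clamp collapses to one:
assert merge_adjacent_clamps(Clamp(Clamp("x", 0, 255), 0, 6)) == Clamp("x", 0, 6)
```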

Yes, I agree that when we don't have an activation, we don't need anything extra. However, there is another thing we should consider: how to integrate with other libraries, such as QNNPACK. QNNPACK also needs output min / output max: https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63
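
Either way, those bounds have to be computed somewhere before being handed to the kernel or library. A minimal sketch of that bookkeeping, assuming affine uint8 quantization; the function name and activation encoding are illustrative, not TFLite's or QNNPACK's actual API:

```python
def activation_range_uint8(activation, output_scale, output_zero_point):
    """Compute the uint8 clamp bounds implied by a fused activation.

    Illustrative helper: real value 0 quantizes to output_zero_point, and
    real value 6 quantizes to output_zero_point + round(6 / output_scale).
    """
    qmin, qmax = 0, 255  # full uint8 range when nothing is fused
    if activation == "RELU":
        qmin = max(qmin, output_zero_point)
    elif activation == "RELU6":
        qmin = max(qmin, output_zero_point)
        qmax = min(qmax, output_zero_point + int(round(6.0 / output_scale)))
    return qmin, qmax

# e.g. activation_range_uint8("RELU6", 0.047, 0) == (0, 128)
```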

