tvm-commits mailing list archives

From: GitBox <...@apache.org>
Subject: [GitHub] [incubator-tvm] u99127 commented on issue #4828: [QNN][TFLite] TFLite rounding mode support
Date: Wed, 08 Apr 2020 16:53:47 GMT
u99127 commented on issue #4828: [QNN][TFLite] TFLite rounding mode support
URL: https://github.com/apache/incubator-tvm/pull/4828#issuecomment-611071742
 
 
   > Actually, Requantize is the bottleneck in my observations on a Raspberry Pi 4. For TFLite quantized MobileNet v1, current master plus auto-tuning gives me 56 ms (TFLite is 35 ms). On deeper analysis, I found requantize to be the bottleneck (even with UPWARD rounding, which is the cheapest). The reason is that the calculations happen in int64, and either ARM does not have good instructions for that or LLVM does not pick the right ones.
   > 
   > As an experiment, I forced the datatype of Requantize to int32. This leads to bad accuracy, but it gives us an idea of the minimum latency of the Requantize op. The runtime dropped from the earlier 56 ms to 36 ms (almost as good as TFLite). So, requantize is the bottleneck.
   > 
   > Additionally, we should note that TFLite rounding is not a standard rounding: they worked backwards from the ARM instructions they want to use. Because of that, they have two roundings instead of one, the first in `SaturatingRoundingDoublingHighMul` (non-standard rounding) and the other in `RoundingDivideByPOT` (standard). Given that there are two roundings, they have also made an accuracy-performance tradeoff. So, my suggestion is that we should think carefully about whether it makes sense to follow TFLite rounding exactly as the golden reference. For example, consider the following reference function:
   > 
   > ```
   > // This function implements the same computation as the ARMv7 NEON VQRDMULH
   > // instruction.
   > 
   > template <>
   > inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
   >                                                       std::int32_t b) {
   >   bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
   >   std::int64_t a_64(a);
   >   std::int64_t b_64(b);
   >   std::int64_t ab_64 = a_64 * b_64;
   >   std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
   >   std::int32_t ab_x2_high32 =
   >       static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
   >   return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
   > }
   > ```
   > 
   > This function alone needs around 10 Relay ops (if not more), but in TFLite it is just one ARM instruction, VQRDMULH. I doubt that LLVM will be able to use that instruction automatically. (A scalar sketch of the full requantize path follows just below this quote.)
   > 
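
   For concreteness, here is a minimal scalar sketch of the full TFLite requantize path quoted above: the saturating doubling high multiply followed by `RoundingDivideByPOT`. The `RoundingDivideByPOT` body is paraphrased from the gemmlowp scalar reference, while the `RequantizeOne` wrapper and its parameter names are made up for illustration and are not a TVM or TFLite API.

   ```
   #include <cstdint>
   #include <limits>

   // Standalone copy of the quoted SaturatingRoundingDoublingHighMul.
   inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                         std::int32_t b) {
     // INT32_MIN * INT32_MIN is the only pair whose doubled high half
     // overflows; it saturates to INT32_MAX below.
     bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
     std::int64_t ab_64 =
         static_cast<std::int64_t>(a) * static_cast<std::int64_t>(b);
     // First rounding: nudge the doubled product to nearest, away from zero.
     std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
     return overflow ? std::numeric_limits<std::int32_t>::max()
                     : static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
   }

   // Second rounding: divide by 2^exponent, round to nearest, ties away
   // from zero (paraphrased from the gemmlowp scalar reference).
   inline std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
     const std::int32_t mask = (std::int32_t(1) << exponent) - 1;
     const std::int32_t remainder = x & mask;
     const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
     return (x >> exponent) + (remainder > threshold ? 1 : 0);
   }

   // Illustrative helper: one TFLite-style requantize step, i.e. the two
   // roundings above applied back to back. Note the int64 intermediate.
   inline std::int32_t RequantizeOne(std::int32_t acc,
                                     std::int32_t quantized_multiplier,
                                     int right_shift) {
     return RoundingDivideByPOT(
         SaturatingRoundingDoublingHighMul(acc, quantized_multiplier),
         right_shift);
   }
   ```

   Each of those scalar steps (the widening multiply, the sign-dependent nudge, the masks, compares, and selects) becomes a separate op when expressed in Relay, which is where the "around 10 Relay ops" estimate above comes from.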
   
   The vqrdmulh instruction is pretty standard on all Arm CPUs that have Advanced SIMD in 32-bit mode. C compilers will struggle to generate that instruction automagically: it's just too big an idiom to recognize, and I don't think LLVM has the infrastructure to detect it. Is there anything preventing us from having a specific TVM intrinsic that we can lower to this vqrdmulh instruction? Of course, if this fails we could instead lower to the C implementation.
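
   To make the idea concrete, here is a rough sketch of what such a lowered inner loop could look like, using the ACLE intrinsics from `arm_neon.h`. The function name, the per-tensor multiplier, and the fixed right shift of 8 are made up for the example. Note also that `vrshrq_n_s32` rounds ties upward while gemmlowp's `RoundingDivideByPOT` rounds ties away from zero, so negative ties can differ by one; that is exactly the kind of accuracy-performance trade-off mentioned above, and gemmlowp's own NEON path adds a small fixup before the shift to match the scalar semantics exactly.

   ```
   #include <arm_neon.h>
   #include <cstdint>

   // Sketch only: requantize four int32 accumulators per iteration.
   void RequantizeVector(const std::int32_t* acc, std::int32_t* out, int n,
                         std::int32_t quantized_multiplier) {
     const int32x4_t mult = vdupq_n_s32(quantized_multiplier);
     for (int i = 0; i + 4 <= n; i += 4) {
       int32x4_t v = vld1q_s32(acc + i);
       // One vqrdmulh.s32 instruction: the saturating rounding doubling
       // high multiply from the quoted reference function.
       v = vqrdmulhq_s32(v, mult);
       // Rounding shift right; the _n_ form needs a compile-time constant
       // shift (8 here is illustrative).
       v = vrshrq_n_s32(v, 8);
       vst1q_s32(out + i, v);
     }
     // The n % 4 tail elements would fall back to a scalar path.
   }
   ```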
   
   The equivalent AArch64 instruction is sqrdmulh IIRC, and it is available by default on all devices. There should be an LLVM intrinsic equivalent of it.
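
   A quick way to sanity-check that claim (the exact mnemonic is what I'd expect, not something I have verified here): compile the ACLE intrinsic with an AArch64 toolchain and look for sqrdmulh in the generated assembly.

   ```
   // Compile with, e.g.:  aarch64-linux-gnu-g++ -O2 -S check.cc
   // and look for "sqrdmulh" in the output.
   #include <arm_neon.h>

   int32x4_t DoublingHighMul(int32x4_t a, int32x4_t b) {
     // Same ACLE intrinsic on both architectures: it lowers to
     // vqrdmulh.s32 on 32-bit Arm and to sqrdmulh on AArch64.
     return vqrdmulhq_s32(a, b);
   }
   ```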
   
   ramana
   
   

