tvm-commits mailing list archives

From GitBox <>
Subject [GitHub] [incubator-tvm] giuseros edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures
Date Thu, 11 Jun 2020 10:04:03 GMT

giuseros edited a comment on pull request #5754:

   Hi @FrozenGene ,
   Thanks a lot for your comments. I will give general replies here, and address the code comments
in a separate reply.
   * I did read your discuss post,
but I thought that work was orthogonal to this one. My main goal here is to have a fast general
convolution algorithm for Armv8-A. Your post talks about mobilenet v2 and raspi 3.
   * In mobilenet v2 there are no deep convolutional layers, mostly depthwise convolutions
and 1x1 convolutions. With shallow convolutions the problem becomes memory bound, and the
differences among the algorithms become less evident. That is also why I picked inception_v3,
which has 1x1, 3x3, 5x5, 1x7 and 7x1 convolutions.
   * Raspi 3 comes with a 32-bit operating system, which means using Armv7-A. The problem with
Armv7-A is that instead of having 32 registers (as in Armv8-A) you have only 16, so the optimization
space is reduced. Also, I think (but I am not 100% sure) that the TFLite developers do not optimize
heavily for Armv7-A. Indeed, on Armv7-A @anijain2305 shows (in the same post you mention)
a 0.80 ratio for tflite/tvm (while I see a 0.60/0.30 ratio for multi/single thread scenarios,
respectively).
   * The Qnnpack post you mention explicitly says that: "the microkernel that leverages the
dual issue capability proves to be 15 percent to 20 percent faster for a sufficiently large
channel count (K > 64)"
   * The way they do convolution (and gemm) in Qnnpack for Armv8-A is by using a combination
of `smlal` and `smlal2` (plus a combination of `usubl` and `usubl2`), while `conv2d_nhwc_spatial_pack`
only uses `smlal`. It is true that on Armv7-A they only use `vmlal` (and `vsubl`). So, I wonder
if the autoscheduler (which I am not familiar with) is able to generate such combinations
for Armv8-A.
   * I did not try any CPU other than the Cortex-A76. The point is that I am not using
anything specific to that CPU, only things specific to the Armv8-A ISA.
   * I agree that for smaller convolutions (or depthwise convolutions) there are simpler
algorithms that work as well (or are even faster). I also agree with stacking multiple strategies
and letting TVM select the best.
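
   To make the `smlal`/`smlal2` point concrete, here is a minimal NumPy sketch that emulates the
lane semantics of the four instructions on a 16-byte vector. It is only an illustration of how the
"lower half" and "upper half" variants combine, under the assumption of a plain element-wise dot
product; the helper names and `dot16` are hypothetical and do not reflect the data layout used in
Qnnpack or in this PR.

```python
import numpy as np

def usubl(a_u8, b_u8):
    """Emulate usubl: widen the lower 8 uint8 lanes to 16 bits and subtract."""
    return a_u8[:8].astype(np.int16) - b_u8[:8].astype(np.int16)

def usubl2(a_u8, b_u8):
    """Emulate usubl2: same as usubl, but on the upper 8 uint8 lanes."""
    return a_u8[8:16].astype(np.int16) - b_u8[8:16].astype(np.int16)

def smlal(acc_s32, a_s16, b_s16):
    """Emulate smlal: multiply the lower 4 int16 lanes, accumulate into 4 int32 lanes."""
    return acc_s32 + a_s16[:4].astype(np.int32) * b_s16[:4].astype(np.int32)

def smlal2(acc_s32, a_s16, b_s16):
    """Emulate smlal2: same as smlal, but on the upper 4 int16 lanes."""
    return acc_s32 + a_s16[4:8].astype(np.int32) * b_s16[4:8].astype(np.int32)

def dot16(a_u8, b_u8, a_zp, b_zp):
    """Hypothetical helper: zero-point-corrected dot product of 16 uint8 values."""
    za = np.full(16, a_zp, dtype=np.uint8)
    zb = np.full(16, b_zp, dtype=np.uint8)
    # Subtract the zero points while widening 8 -> 16 bits (two halves).
    a_lo, a_hi = usubl(a_u8, za), usubl2(a_u8, za)
    b_lo, b_hi = usubl(b_u8, zb), usubl2(b_u8, zb)
    # Each 8x16-bit register feeds one smlal (lanes 0-3) and one smlal2 (lanes 4-7).
    acc = np.zeros(4, dtype=np.int32)
    acc = smlal(acc, a_lo, b_lo)
    acc = smlal2(acc, a_lo, b_lo)
    acc = smlal(acc, a_hi, b_hi)
    acc = smlal2(acc, a_hi, b_hi)
    # Horizontal reduction of the four 32-bit lanes.
    return int(acc.sum())
```

   Using only `smlal` would leave the upper four 16-bit lanes of each widened register unused
unless they are first moved into the lower half, which is the extra work the `smlal2` variant avoids.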
   I will reply on the code in the following comment. 
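
   As a rough illustration of the memory-bound argument for shallow and depthwise convolutions,
the sketch below estimates multiply-accumulates per byte of int8 traffic. It assumes stride 1,
"same" padding, int8 inputs, weights and outputs, and that every byte is moved exactly once
(no cache effects, no im2col overhead). The function name and the layer shapes in the usage note
are hypothetical and not taken from a specific model.

```python
def conv_macs_per_byte(h, w, c_in, c_out, kh, kw, depthwise=False):
    """Estimate MACs per byte of int8 traffic for one convolution layer.

    Assumes stride 1, 'same' padding, and that each input, weight and
    output byte is read or written exactly once.
    """
    if depthwise:
        # One kh x kw filter per channel; c_out is ignored (equals c_in).
        macs = h * w * c_in * kh * kw
        weight_bytes = kh * kw * c_in
        out_bytes = h * w * c_in
    else:
        macs = h * w * c_in * c_out * kh * kw
        weight_bytes = kh * kw * c_in * c_out
        out_bytes = h * w * c_out
    in_bytes = h * w * c_in
    return macs / (in_bytes + weight_bytes + out_bytes)
```

   For example, `conv_macs_per_byte(56, 56, 144, 144, 3, 3, depthwise=True)` stays below 4.5 MACs
per byte, while a dense 3x3 layer such as `conv_macs_per_byte(35, 35, 64, 96, 3, 3)` is far higher,
so the choice of compute kernel matters much more for the latter.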
