giuseros edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642520722

Hi @FrozenGene,

Thanks a lot for your comments. I will address the general points here, and the code comments in a separate reply.

* I did indeed read your discuss [post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4), but I thought that work was orthogonal to this one. My main goal here is a fast general convolution algorithm for Armv8-A, while your post focuses on mobilenet v2 and the Raspberry Pi 3.
* Mobilenet v2 has no deep convolutional layers; it is mostly depthwise convolutions and 1x1 convolutions. With shallow convolutions the problem becomes memory bound, and the differences among the algorithms become less evident. That is also why I picked inception_v3, which contains 1x1, 3x3, 5x5, 1x7, and 7x1 convolutions.
* The Raspberry Pi 3 runs a 32-bit operating system, which means Armv7-A. The problem with Armv7-A is that instead of 32 registers (as in Armv8-A) you have only 16, so the optimization space is reduced. I also believe (though I am not 100% sure) that the TFLite developers do not optimize as heavily for Armv7-A. Indeed, on Armv7-A @anijain2305 shows (in the same post you mention) a 0.80 tflite/tvm ratio, while I see ratios of 0.60 and 0.30 in the multi-threaded and single-threaded scenarios, respectively.
* The QNNPACK post you mention explicitly says that "the microkernel that leverages the dual issue capability proves to be 15 percent to 20 percent faster for a sufficiently large channel count (K > 64)".
* The way QNNPACK implements convolution (and GEMM) for Armv8-A is by combining `smlal` and `smlal2` (plus a combination of `usubl` and `usubl2`), while `conv2d_nhwc_spatial_pack` only uses `smlal` (see the sketch after this list). It is true that on Armv7-A they only use `vmlal` (and `vsubl`). So I wonder whether the autoscheduler (which I am not familiar with) is able to generate such combinations for Armv8.
* I did not try CPUs other than the Cortex-A76. The point is that I am not using anything specific to that CPU, only to the Armv8-A ISA.
* I agree that for smaller convolutions (or depthwise convolutions) there are simpler algorithms that work as well or even better. I also agree with stacking multiple strategies and letting TVM select the best one.

I will reply on the code in the following comment.
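For readers unfamiliar with the instruction pairing discussed above, here is a minimal sketch in C (NEON intrinsics) of what one accumulation step of such an int8 micro-kernel could look like on Armv8-A. This is not the actual QNNPACK or TVM kernel; the function name, pointer layout, and zero-point handling are illustrative assumptions. `vsubl_u8`/`vsubl_high_u8` lower to `usubl`/`usubl2`, and `vmlal_s16`/`vmlal_high_s16` lower to `smlal`/`smlal2`:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical single step of an int8 dot-product micro-kernel.
 * acc holds four int32 partial sums; a and b each point at 16 uint8 values. */
static inline int32x4_t accumulate_step(int32x4_t acc,
                                        const uint8_t *a, uint8_t a_zero,
                                        const uint8_t *b, uint8_t b_zero)
{
    uint8x16_t va = vld1q_u8(a);   /* 16 quantized activations */
    uint8x16_t vb = vld1q_u8(b);   /* 16 quantized weights     */

    /* usubl / usubl2: subtract the zero point while widening uint8 -> 16 bit;
     * reinterpreting the wrapped uint16 result as int16 yields the correct
     * signed difference, since it always fits in [-255, 255].               */
    int16x8_t a_lo = vreinterpretq_s16_u16(
        vsubl_u8(vget_low_u8(va), vdup_n_u8(a_zero)));
    int16x8_t a_hi = vreinterpretq_s16_u16(
        vsubl_high_u8(va, vdupq_n_u8(a_zero)));
    int16x8_t b_lo = vreinterpretq_s16_u16(
        vsubl_u8(vget_low_u8(vb), vdup_n_u8(b_zero)));
    int16x8_t b_hi = vreinterpretq_s16_u16(
        vsubl_high_u8(vb, vdupq_n_u8(b_zero)));

    /* smlal / smlal2: lane-wise multiply-accumulate into int32. The low/high
     * pairs are independent, so a dual-issue core can overlap them -- the
     * effect the QNNPACK post quantifies at 15 to 20 percent.              */
    acc = vmlal_s16(acc, vget_low_s16(a_lo), vget_low_s16(b_lo));
    acc = vmlal_high_s16(acc, a_lo, b_lo);
    acc = vmlal_s16(acc, vget_low_s16(a_hi), vget_low_s16(b_hi));
    acc = vmlal_high_s16(acc, a_hi, b_hi);
    return acc;  /* a real kernel would horizontally reduce acc at the end */
}
```

By contrast, a schedule that emits only `smlal` touches just the low half of each vector per instruction, so it needs twice as many multiply-accumulate instructions for the same data and cannot exploit the dual-issue pairing.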