Date: Tue, 28 May 2019 10:57:03 -0700
From: Animesh Jain
Reply-To: dmlc/tvm
To: dmlc/tvm
Subject: [dmlc/tvm] [RFC] Reading quantized models from TFLite and MxNet - operators API (#3252)

To increase quantization support in TVM, it is necessary to support pre-quantized models, i.e., models that have been quantized in the framework itself (outside of Relay). In this issue, we lay down the high-level API design for some of the quantized operators.
A large portion of this proposal comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also to @shoubhik for helping design this RFC.

* RFC [Issue](https://github.com/dmlc/tvm/issues/2351)
* [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)

Other non-TVM links that were used to understand quantization:

* GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFLite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

---------

**Covered frameworks for now** - TFLite and MxNet

**Target network for now** - Inception V3 from TFLite. (I will create one for MxNet)

**Target platforms for now** - ARM and Intel (will create a separate issue as the project progresses)

---------

**List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

------------

It would be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other quantized_* operators will follow the same pattern as quantized_conv2d).

## Op quantize

```python
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data: FP32 tensor
        The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be int8/uint8.

    Returns
    -------
    quantized_data: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss
* The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. [Reference implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99) in TFLite. This can also be thought of as a framework parser utility where we handle min/max, symmetric/asymmetric, etc. and generate the scale and zero_point in whatever way each framework handles them. A small sketch of this conversion is given below.
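To make the parser-side computation concrete, here is a minimal sketch of how a min/max range could be turned into the scale and zero_point attributes for asymmetric quantization, along the lines of the TFLite reference linked above. The helper name and the exact nudging/rounding choices are illustrative assumptions, not part of the proposed API.

```python
def minmax_to_scale_zero_point(rmin, rmax, out_dtype="uint8"):
    """Hypothetical parser utility: derive an affine (scale, zero_point) pair
    such that real_value ~= scale * (quantized_value - zero_point)."""
    qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
    # Extend the range to include 0.0 so that real zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        # Degenerate all-zero range; any positive scale works.
        return 1.0, qmin
    scale = (rmax - rmin) / float(qmax - qmin)
    # zero_point is the quantized value that maps back to real 0.0.
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))
```

For example, a tensor with an observed range of [-1.0, 6.0] would map to scale ~= 7/255 ~= 0.0275 and zero_point = 36 for uint8.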
## Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel,
                     input_scale, input_zero_point,
                     kernel_scale, kernel_zero_point,
                     output_scale, output_zero_point,
                     out_dtype,
                     # All the remaining attributes from conv2d
                     strides=(1, 1), padding=(0, 0), dilation=(1, 1),
                     groups=1, channels=None, kernel_size=None,
                     data_layout="NCHW", kernel_layout="OIHW", out_layout=""):
    """
    Quantized conv2d convolves the quantized data with the quantized kernel
    and produces a quantized output, using the scale and zero_point attributes
    of the input, kernel and output tensors. The scale and zero_point
    calculations happen outside the Relay graph, i.e., the framework parsers
    will have to compute the scale and offset if only min and max are provided.

    Parameters
    ----------
    quantized_data: int8/uint8 tensor
        The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
        The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
        The output scale is set during the quantization process using
        training/calibration. The float scalar to scale the quantized_output
        int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
        The output zero point is set during the quantization process using
        training/calibration. The zero point of the quantized_output
        distribution.

    out_dtype: String
        The dtype of the quantized_output. Can only be int8/uint8. The
        requantization from int32 to int8/uint8 is a part of the op compute.

    Other attributes are the same as in conv2d.

    Returns
    -------
    quantized_output: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss further
* This op has a set of computations that could ideally be pre-computed, but that is difficult because constant folding only works across Relay ops and not within a Relay op. This has been discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
    * First pre-computable - The core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid; we need a fused compute to get the best performance.
    * Second pre-computable - The output scale and zero_point are used to calculate an integer multiplier and shift to keep all the computations in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv). So, this computation is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift, but that seems very specific to TFLite, and one might want to handle the output_scale and output_offset in a different manner (a small sketch of this decomposition is included after the dequantize op below). **I am not sure about this part, so please comment.**
* The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)

## Op dequantize

Dequantization is required when connecting a quantized operator to an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data: int8/uint8 quantized input tensor
        The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
        The dequantized tensor.
    """
```
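For reference, below is a minimal sketch of the multiplier/shift decomposition mentioned in the second pre-computable point above. In the gemmlowp/TFLite scheme, the effective requantization scale is M = (input_scale * kernel_scale) / output_scale, and it is rewritten as a fixed-point int32 multiplier plus a shift so that the int32 accumulator can be rescaled without floating point. The function name and exact rounding here are illustrative assumptions; see TFLite's QuantizeMultiplier for the reference behavior.

```python
import math

def decompose_scale(real_multiplier):
    """Split a float multiplier M into a fixed-point int32 multiplier and a
    right shift so that M ~= quantized_multiplier * 2**-(31 + right_shift).
    Assumes the common conv case where 0 < M < 1."""
    assert 0.0 < real_multiplier < 1.0
    # frexp returns (mantissa, exp) with 0.5 <= mantissa < 1 and M = mantissa * 2**exp.
    mantissa, exp = math.frexp(real_multiplier)
    quantized_multiplier = int(round(mantissa * (1 << 31)))
    # Rounding may push the mantissa up to exactly 2**31; renormalize if so.
    if quantized_multiplier == (1 << 31):
        quantized_multiplier //= 2
        exp += 1
    right_shift = -exp
    return quantized_multiplier, right_shift
```

At runtime the int32 accumulator would then be scaled roughly as `(acc * quantized_multiplier) >> (31 + right_shift)` (with rounding), after which the output zero_point is added and the result is clamped to the range of out_dtype.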
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/dmlc/tvm/issues/3252