tvm-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [tvm-rfcs] masahi commented on a change in pull request #18: [RFC] Adding initial SVE implementation
Date Fri, 27 Aug 2021 06:52:04 GMT

masahi commented on a change in pull request #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#discussion_r697195878



##########
File path: rfcs/0018-initial-sve-addition.md
##########
@@ -0,0 +1,221 @@
+- Feature Name: Adding Initial SVE Support to TVM 
+- Start Date: 2021-07-30
+- RFC PR: https://github.com/apache/tvm-rfcs/pull/18
+
+Authors: Meera Nakrani, Sjoerd Meijer
+
+## Introduction
+
+In this RFC we would like to propose a TIR extension to support scalable
+vectorisation. Scalable vectorisation is extracting data parallelism from 
+code, but as opposed to a fixed width vectorisation, the vector length is 
+unknown at compile time. A scalable vector's total number of elements is a 
+constant multiple of a specified number of elements. The 
+[LLVM LangRef](https://llvm.org/docs/LangRef.html) refers to this constant 
+multiple as vscale. It is a positive integer that is unknown at compile time, 
+therefore the overall vector length (VL) is also unknown. The value of vscale, 
+and therefore VL, will depend on the architecture that is running the program. 
+More details and an overview of this is given in 
+[this tutorial](https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf),

+where an example of a daxpy kernel is given from slide 17 onwards. In this RFC, 
+we will show an example of lowering from TE for a (scalable) vector addition 
+kernel all the way down to LLVM IR, further illustrating the vscale concept. 
+We will also cover TIR support and how it affects the LLVM codegen. This is an 
+introductory RFC to see if the design of our prototype implementation, see 
+https://github.com/apache/tvm/pull/8655, is sound and we welcome any feedback 
+on this prosposal.
+
+Before we explain this in more detail, let's first briefly look at the current
+state and terminology with an example. Vectorisation along the x-axis of an
+addition of two one-dimensional tensors A and B of size 18, writing the result
+to C, will result in the following TIR:
+
+```
+C[ramp(0, 1, 17)] = A[ramp(0, 1, 17)] + B[ramp(0, 1, 17)]`
+```
+where the Ramp TIR node has the form 'Ramp(base, stride, lanes)' showing that
+these elements are processed in (vector) lanes.
+
+The size of 18 has been chosen to demonstrate the challenges of vectorising
+this example. Vector architecture extensions (e.g. X86 AVX512 or AArch Neon)
+typically allow to pack and operate on a power-of-2 number of elements, so 2,
+4, 8, 16, etc.  elements. If the elements are integers, and a vector register
+is 128-bits wide, we can pack 4 integer elements into one vector register (if
+an integer is 4 bytes). This is an example of fixed width vectorisation,
+because the vector registers have a fixed width of 128-bits. Since we have 18, the
+number of elements in the vectors A, B, and C, is not a multiple of 4, we need
+4 vector operations processing 4 * 4 = 16 elements, and 2 scalar operations are
+required for processing the 16th and 17th elements which we call the scalar
+epilogue.
+
+## Motivation
+
+However, most modern vector architectures (e.g. X86 AVX512 and the Arm
+Architecture's MVE and SVE extensions) support predicated vector instructions,
+removing the need for such a scalar epilogue and also allowing more code to be
+vectorised.  Lane predication allows the enabling/disabling of certain lanes in
+vector operations.  This allows us to have just 5 vector operations for our
+example, and importantly no scalar epilogue. But since we do not need to
+process 5 * 4 = 20 elements, the last vector operation only needs to write two
+elements, which can be achieved by predication as we can enable the first two
+lanes and disable the last 2 lanes.
+
+In addition to predication, and also related to it, some new vector 
+architectures also allow scalable vectorisation. As opposed to so called fixed
+width vectorisation (e.g. AArch Neon), the Arm architecture SVE vector
+extension allows implementations to choose a vector register length between 128
+and 2048 bits.  It supports a vector length agnostic programming model which
+allows code to run and scale automatically across all vector lengths without
+recompilation.
+
+## Problem Statement
+
+We would like to add support for Arm Architecture's Scalable Vector Extension (SVE) 
+in TVM by introducing features for Vector Length Agnostic (VLA) programs and
+predication, i.e. the 2 main new SVE features. Thus we would like to express
+scalable vectorisation in both TE and TIR. The question is how to achieve that? In
+Tensor Expression language, our example to add two tensors A and B would look
+like this:
+
+```
+n = 17
+A = te.placeholder((n,), name="A", dtype = "int8")
+B = te.placeholder((n,), name="B", dtype = "int8")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+s = te.create_schedule(C.op)
+x, = C.op.axis
+s[C].vectorize(x)
+```
+
+Vectorisation along the x-axis is requested with _vectorize(x)_, and will
+result in the TIR example shown in the Introduction. However, this requires
+knowing the vector length at compile time; it is an example of fixed width
+vectorisation. Instead, we would like for it to work with an unknown vector
+length at compile time.
+
+## Solution Approach
+
+In order to address the problem of expressing scalable vectorisation, we would
+like to propose the addition of a new _vectorize_scalable_ function to the Tensor
+Expression language, for example:
+``` 
+s[C].vectorize_scalable(x)
+```
+The TIR output of this would be:
+
+```
+primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
+  attr = {"global_symbol": "main", "tir.noalias": True}
+  buffers = {C: Buffer(C_2: Pointer(int8), int8, [17], []),
+             A: Buffer(A_2: Pointer(int8), int8, [17], []),
+             B: Buffer(B_2: Pointer(int8), int8, [17], [])}
+  buffer_map = {A_1: A, B_1: B, C_1: C} {
+  for (i: int32, 0, 17;i+=VL) {
+    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
+  }
+}
+```
+
+In the above TIR, we can see the the for loop is looping with an agnostic
+stride _VL_, which stands for Vector Length. _VL_ is only showed for ease of
+representation and we don't store _VL_ anywhere inside the TIR data structures.
+
+We can also see the syntax of the Ramp nodes have now been modified to handle
+an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
+integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it
+are still the same, the only difference is that the number of lanes is unknown

Review comment:
       To avoid confusion, it is probably better to talk about different sense in which `Ramp`
node is used, between fixed-width vs scalable vectorization. The former one treats the entire
input as one chunk while the latter one is specifically for vector-length wide chunk.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message