tvm-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [incubator-tvm] yongfeng-nv edited a comment on issue #5367: Improve IntervalSet's floormod
Date Tue, 21 Apr 2020 05:38:20 GMT

yongfeng-nv edited a comment on issue #5367:
URL: https://github.com/apache/incubator-tvm/pull/5367#issuecomment-616035227


   The current behavior of IntervalSet floormod(a, b) is rough -- it returns [0, b-1], [-b+1,
b-1], or everything.  It causes extra iterations to my schedule.  This is a simplified version:
   
       import tvm
       from tvm import te
       
       def test_bound_tile_mod():
           def compute(M_tiles, N_tiles, factor, dtype):
               # Algo
               M = M_tiles * factor
               N = N_tiles * factor
       
               A = tvm.te.placeholder((N, M), name='A', dtype=dtype)
               C = tvm.te.compute((N, M), lambda n, m: A[n, m], name='C')
               s = tvm.te.create_schedule(C.op)
       
               return s, A, C
       
           def schedule(s, factor, padding, A, C):
               C_local = s.cache_write(C, "local")
       
               n, m = C.op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               nio, nii = s[C].split(ni, 2)
               n = s[C].fuse(nii, mi)
               C_shared = s.cache_write(C, "shared")
               bn, bm, ni, mi = C_shared.op.axis       
               s[C_shared].storage_align(ni, factor * 2, padding)
       
               n, m = s[C].op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               s[C].set_scope("global")
               niio, niii = s[C].split(ni, 32)
               s[C_shared].compute_at(s[C], niio)
       
               return s
       
           s, A, C = compute(2, 2, 128, "float16")
           s = schedule(s, 128, 8, A, C)
           bounds = tvm.te.schedule.InferBound(s)
           print(tvm.lower(s, [A, C], simple_mode=True))
   
   It does 256x256 point-wise copying in 128x128 tiles from local to shared then to global.
 I would like to allocate only a 32x128 shared memory and reuse it four times per tile.  In
addition, I want to pad 8 data every two tile rows to avoid bank conflicts in shared memory
with storage_align.  I expect C.shared (the local ->shared stage) does exactly 256x256
copying, but the following IR shows 4 times of that.
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 16896]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 64) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + n.inner.inner.m.inner.fused.c)] = C.local[(((((n.outer*32768)
+ (n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + (m.outer*128))
+ floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + (n.inner.inner*256)) + (m.outer*128))
+ m.inner)] = C.shared[((((n.inner.outer*4224) + (floordiv(n.inner.inner, 2)*264)) + (floormod(n.inner.inner,
2)*128)) + m.inner)]
               }
             }
           }
         }
       }
               
   If there is another schedule to achieve this, please let me know.  As of now, I can only
accomplish this by tiling both C.shared and C.  C's compute now involves several floormod:
`C.shared(floordiv(n, 128), floordiv(m, 128), floordiv(floormod(n, 128), 2), ((floormod(floormod(n,
128), 2)*128) + floormod(m, 128)))`.  `m` and `n` further bring in leaf IterVars to floormod
to PropBoundToInputs() during InferBound.  Because the best result from floormod(x, 128) is
[0, 128-1], C.shared doesn't reduce domain, although it does compute_at s[C]'s n.inner.outer
with range [0,3].
   
   With this PR, I am able to get expected IR -- `(n.inner.outer.c, 0, 64)` becoming `(n.inner.outer.c,
0, 16)`:
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 4224]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 16) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + n.inner.inner.m.inner.fused.c)] = C.local[((((((n.outer*32768)
+ (n.inner.outer*8192)) + (n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c,
128)*256)) + (m.outer*128)) + floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + (n.inner.inner*256)) + (m.outer*128))
+ m.inner)] = C.shared[(((floordiv(n.inner.inner, 2)*264) + (floormod(n.inner.inner, 2)*128))
+ m.inner)]
               }
             }
           }
         }
       }
   
   
   The improvement has two folds:
   1.   EvalSet takes a range map as input to help calculating floormod.  For example:  `[x*4,
x*4+3] % 8 = [x*4-x/2*8, x*4+3-(x*4+3)/8*8)]`, if we don't know `x`'s range; but `[x*4, x*4+3]`,
if we know `x`'s range is `[0, 1]`.  This is common from tiling.
         
   2.   For `a mod b`, if b is an IntImm, single point, and greater than 0, we do the following:
   
       // a mod b = a - b * (a/b) if
       // (a) a_max - a_min < b, i.e. that before mod, a's range doesn't cover [0, b)
       // and (b) a_min mod b <= a_max mod b, i.e. that a's range is still continuous after
mod
   so that `[13, 15] % 10 = [3, 5]`
   
   IntervalSet evaluation is hard.  These improvements are adhoc to the problems I encountered.
 I am open to any alternatives.
   
   Question: how about mod?  It's implementation looks same as floormod.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message