tvm-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-tvm] boh-inspur commented on a change in pull request #5428: [CODEGEN][CUDA] Fix a bug when vectorized load&store was involved for…
Date Tue, 28 Apr 2020 09:14:17 GMT

boh-inspur commented on a change in pull request #5428:
URL: https://github.com/apache/incubator-tvm/pull/5428#discussion_r416455958



##########
File path: src/target/source/codegen_cuda.cc
##########
@@ -274,9 +274,21 @@ void CodeGenCUDA::PrintVecElemLoad(
   static const char access[] = {'x', 'y', 'z', 'w'};
   CHECK(i >= 0 && i < (t.is_float16() ? 8 : 4));
   if ((t.is_int()) && t.bits() == 8) {
-    os << "((char)(" << vec << " >> " << i * 8 << "))";
+    if (t.lanes() == 1) {
+      os << vec;
+    } else if (t.lanes() == 2) {
+      os << vec << "." << access[i % 2];
+    } else {
+      os << "((char)(" << vec << " >> " << i * 8 << "))";
+    }
   } else if ((t.is_uint()) && t.bits() == 8) {
-    os << "((unsigned char)(" << vec << " >> " << i * 8 << "))";
+    if (t.lanes() == 1) {

Review comment:
    @wpan11nv, I implemented the type with `int`, and there is still a problem.
    When the type is int8x2, the generated CUDA code looks correct and also builds correctly:
    ```
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 610)
    #include <sm_61_intrinsics.h>
    #endif
    extern "C" __global__ void default_function_kernel0(void* __restrict__ B, void* __restrict__ A) {
      int _1;
      {
        int _2 = (( int*)(( signed char*)A + (((((int)blockIdx.x) * 16) + (((int)threadIdx.x) * 2)))))[0];
        int _3 = (int)16843009;
        _1=((((char)(_2 >> 0))+((char)(_3 >> 0))) << 0);
        _1=_1 & ~(0x000000ff << 8) |((((char)(_2 >> 8))+((char)(_3 >> 8))) << 8);
      }
      (( int*)(( signed char*)B + (((((int)blockIdx.x) * 16) + (((int)threadIdx.x) * 2)))))[0] = _1;
    }
    ```
    But there is a runtime error when copying from GPU to CPU memory: `Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: misaligned address`.
    Do you have any advice?
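    One plausible explanation for the misaligned-address failure, assuming the int8x2 buffer is read through a 4-byte `int*` as in the kernel above: the byte offset `blockIdx.x * 16 + threadIdx.x * 2` is only 2-byte aligned whenever `threadIdx.x` is odd, while a 32-bit load on the GPU must be 4-byte aligned. A small host-side check of that arithmetic:

    ```cpp
    #include <cassert>

    // Sketch: the kernel computes off = blockIdx.x * 16 + threadIdx.x * 2
    // and then loads 4 bytes through ((int*)((signed char*)A + off)).
    // For odd threadIdx.x the offset is 2 mod 4, so the 32-bit load is
    // misaligned, which matches the reported "misaligned address" error.
    int ByteOffset(int block_idx, int thread_idx) {
      return block_idx * 16 + thread_idx * 2;
    }

    int main() {
      assert(ByteOffset(0, 0) % 4 == 0);  // even thread: 4-byte aligned
      assert(ByteOffset(0, 1) % 4 == 2);  // odd thread: misaligned int load
      assert(ByteOffset(3, 5) % 4 == 2);  // 58 bytes: also misaligned
      return 0;
    }
    ```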
   
    Also, if we use int32_t when the type is int8x2 and the tensor is very large, it wastes a lot of memory unnecessarily, and runtime resources matter more here. So I think that, starting from the current code, only a small modification is needed to support int8x3 via char3, if char3 support is required. What is your opinion?
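    The alternative discussed above, mirrored as a plain host-side struct so it compiles off-device (CUDA's real `char2` is a built-in vector type with 2-byte alignment; `char3` would cover int8x3 the same way):

    ```cpp
    #include <cassert>

    // Sketch of representing int8x2 with a char2-style vector type
    // instead of packing both lanes into a 32-bit int. Element access
    // uses .x/.y, matching the lanes==2 branch in the diff, and the
    // 2-byte object avoids the wider 4-byte loads of the packed-int form.
    struct Char2 {
      signed char x;
      signed char y;
    };

    int main() {
      Char2 v{1, 2};
      assert(v.x + v.y == 3);    // per-lane access via members
      assert(sizeof(Char2) == 2);  // no padding to 4 bytes
      return 0;
    }
    ```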




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


