impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Youwei Wang (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-2809: Improve ByteSwap with builtin function or SSE or AVX2.
Date Fri, 27 May 2016 08:54:45 GMT
Youwei Wang has posted comments on this change.

Change subject: IMPALA-2809: Improve ByteSwap with builtin function or SSE or AVX2.

Patch Set 13:

File be/src/benchmarks/

Line 50: //                           SIMD               57.73               2.18X
> I think we should pause this code review while you figure out why SIMD is s
Hi Jim. I have re-written this benchmark. 
Please review the latest patch for more information.
And if you don't mind, I want to present some performance data first:
/// ByteSwap benchmark:   Function     Rate (iters/ms)          Comparison 	
/// ---------------------------------------------------------------------- 	
///                      OldImpala           1.224e+06                  1X 	
///                      FastScala           1.286e+06              1.051X 	
///                         SSE4.2           1.296e+07              10.59X 	
///                           AVX2           3.156e+07              25.78X 	
///                           SIMD           3.043e+07              24.86X 

This performance data comes from my latest benchmark.
For simplicity, SSE4.2 refers to the performance test result of the ByteSwapSSE_Unit function
in the bit-util.inline.h. AVX2 refers to the performance test result of the ByteSwapAVX_Unit
function in the bit-util.inline.h. The SIMD refers to the "inline void BitUtil::ByteSwap(void*
dest, const void* source, int len) ;" interface in the bit-util.inline.h. 

As you can see here, the performance data is explainable now:
1. FastScala vs. OldImpala: this result is pretty easy to understand.
2. AVX2 vs. SSE4.2: this result is also easy to understand since AVX2 has double data-width
than SSE4.2. We can expect such great performance boost here.
3. SIMD vs. AVX2: the arch-selector branch indeed causes some relatively-small performance
loss. But I think it is still acceptable according to the context.

And please allow me to give some explanation here:
1. Zuo Wang's idea to byteswap a long array one element by one element. The data type of maximum
size is 128bit, that is, the Decimal16 type. So here is some illustration:

D16: the abbreviation for Decimial16.
BS: the abbreviation for Byteswap.

D16[0] D16[1] D16[2] D16[3] D16[4] D16[5] D16[6] D16[7] ....
AVX2-BS(D16[0] D16[1])  
AVX2-BS(D16[2] D16[3])
AVX2-BS(D16[4] D16[5])
AVX2-BS(D16[6] D16[7])
BS(D16[0]) BS(D16[1]) BS(D16[2]) BS(D16[3]) BS(D16[4]) BS(D16[5]) BS(D16[6]) BS(D16[7])

2. In contrast, my logic is to deal with a very long array like this:
Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN
AVX2-BS(Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN)
ByteN ByteN-1 ... Byte4 Byte3 Byte2 Byte1 

My logic is based on my understanding of this interface:
inline void BitUtil::ByteSwap(void* dest, const void* source, int len) ;
This interface is just to swap an input array without taking its types into consideration.
I am not sure whether my understanding of the itnterface BitUtil::ByteSwap is right or not.

If you approve my logic, should we continue on my latest code 
since my new benchmark can output reasonable performance data now?
Please feel free to share your idea about this.

Line 250: /// For example, if FIXED_LEN_SIZE == 2, we are going to deal with the first
> I think it would be inappropriate to check this code in if Zuo Wang is the 

Line 255: /// we can get these constants by following definitions of corresponding AVX2 intrinsics:
> This is way too sparse. Please explain in more detail. Consider using one e
This part of code is removed in the latest patch.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I392ed5a8d5683f30f161282c228c1aedd7b648c1
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Youwei Wang <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Jim Apple <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: Youwei Wang <>
Gerrit-HasComments: Yes

View raw message