impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Date Fri, 07 Oct 2016 01:23:39 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4123: Fast bit unpacking

Patch Set 5:

File be/src/benchmarks/

PS4, Line 296: bool
> const

PS4, Line 303: UnpackValues
> Unpack32Values

PS4, Line 312: + offset
> This could go off the end of out_buffer if NUM_OUT_VALUES is not a multiple
Yeah, that's an assumption to simplify the benchmark code. Added a static_assert up the top.

PS4, Line 322: int64_t
> const
File be/src/util/

PS4, Line 30: compute_mask
> CamelCaseName

PS4, Line 35: smaller
> smaller than what?

Line 116:   const int num_in_values = 64 * 1024;
> constexpr, and all caps name

Line 119:     in[i] = rand();
> std::generate
File be/src/util/bit-packing.h:

PS2, Line 57: ast one return
> I don't understand yet why you have to choose a batch size of 8 for a bit w
8 is the smallest number such that (3 * 8) % 8 == 0. I.e. where the last bit of the batch
aligns to a byte boundary.
File be/src/util/bit-packing.h:

PS4, Line 66: bytes
> bits?

PS4, Line 66: values
> values of type OutType

PS4, Line 70: in_bytes
> Can you add to the comment a description of how this parameter is used?
Ended up rewording the comment a bit to fit this in.
File be/src/util/bit-packing.inline.h:

PS3, Line 143: undef 
got you - done.
File be/src/util/bit-packing.inline.h:

PS4, Line 125:     return lower_bits | (trailing_bits << TRAILING_BITS_SHIFT);
> This comes out to two shifts, a bitwise or, and a bitwise and. If you did u
That's true, but there's a downside - if the loads are aligned relative to the start of the
buffer, the optimiser can reuse the results of the loads. Doing the unaligned load adds an
extra load, which is probably comparable in cost to the two ALU ops that were saved.

Line 142:   BOOST_PP_REPEAT_FROM_TO(0, 33, UNPACK_VALUE_CALL, ignore);
> So, even after turning this into a templated function, a loop with static b
Even with #pragma unroll and ALWAYS_INLINE.

Inlining happens but the generated code is bad. I think either the unrolling or inlining happens
too late in GCC's optimisation pipeline for the rest of the optimisation to be fully effective.
I think part of it is that we have to demote the template argument VALUE_IDX to an argument,
so the template doesn't fully specialise the code when it's instantiated.

Here's an example of the generated assembly. Note that there are a lot of jmps and jes in
the loop version

I added a note in the UnpackValue() comment to explain how it should be stamped out.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Jim Apple <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message