impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Date Mon, 10 Oct 2016 21:28:35 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4123: Fast bit unpacking
......................................................................


Patch Set 7:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc
File be/src/benchmarks/bit-packing-benchmark.cc:

PS7, Line 330: int argc, char **argv
> unused
It seems weird to me to omit the arguments to main()


PS7, Line 336: int64_t
> const, or just make this the param to the data constructor and use data.siz
Done


PS7, Line 343:     suite.AddBenchmark(Substitute("UnpackScalar", bit_width), UnpackBenchmark,
&params);
> This line at bit_width 32 is not in the output you display in the comment a
Oops, fat finger error


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc
File be/src/util/bit-packing-test.cc:

PS7, Line 35:  a subset of
            : /// 'num_in_values'
> I still find this wording confusing - Which is the to buffer, and which the
I reworded to make it clearer that it's a subarray.


PS7, Line 42: misaligned
> When misalignment is 0, what alignment is the memory guaranteed to have? 0 
In practice it should be 64-bit aligned with TCMalloc beyond a certain allocation size. I
switched to using posix_memalign so that it's explicitly guaranteed.


PS7, Line 49: uint32_t
> const
Done


Line 93:     uint32_t* in, uint8_t* packed, int num_in_values, int bit_width, int misalignment)
{
> const T*
Done


PS7, Line 94: int
> const
Done


PS7, Line 99: vector
> std::aligned_storage?
Ended up adding a helper class.


PS7, Line 102:  sizeof(uint32_t)
> Why is this needed?
Fixed. It really should have been uint8_t.

It turned out this was masking a bug where we overran the buffer with 64-bit loads for odd
bitwidths. I ended up just switching to doing 32-bit loads always. There wasn't a significant
performance hit (if anything it looked a bit more consistent).


PS7, Line 132: min(length, 1)
> so the only misalignments allowed are 0 and 1? Why so restrictive?
Other ones should work but I don't think improve coverage - can't see a scenario where it
would work for 1 byte-aligned and 8 byte-aligned but not 2-byte or 4-byte aligned.


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h
File be/src/util/bit-packing.h:

Line 22: #ifndef IMPALA_UTIL_BIT_PACKING_H
> These usually go above the #includes
Done


PS7, Line 30: values
> 'value'
Done


PS7, Line 48: BitPacking
> Why not put this into use in this patch? In other words, why only the micro
The BitReader need some rework to be able to make use of the batched interface - I thought
it would be better to break it up into smaller patches. It also makes it easier to compare
again the older implementation before reworking that.


PS7, Line 67:  bits of data
> You can elide this phrase
Done


http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS4, Line 125:     const uint32_t trailing_bits = in_words[IN_WORD_IDX + 1] % 
> I'm not so sure. Let's consider reading 10 bit ints using 32-bit loads. To 
That doesn't seem right - the nonoverlapping aligned loads will always minimise the number
of loads required. If you have a series of overlapping unaligned loads if you shift each load
along by enough bytes that it doesn't overlap with the previous you get a series of nonoverlapping
aligned loads that read all the bits you need. 

For batches of 32, the input is always a multiple of 32 bits, so the aligned loads always
exactly cover the input bytes. So you can't shift one of the loads without creating a gap
in the input array that you need to add another load to read.

The example has a bug somewhere - those loads only get you bits 0 to 112, which only contains
the first 11 values.  Reading 11 10-bit values would require 4 aligned loads, so that might
be a case where you could slightly improve things by using unaligned loads. It does depend
on the # of input values though so it's unclear how you'd exploit the # of input values without
adding runtime overhead.

Anyway, from what I've seen it would be more impactful to put later optimisation effort into
adding AVX2, which could be 4x faster.


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS7, Line 45: make_
> make_ can be omitted.
Done


PS7, Line 57: const
> constexpr BATCH_SIZE
Done


PS7, Line 69:     in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT);
> so in_bytes does not move if BIT_WIDTH is 7?
Yep, that was a bug.


PS7, Line 90: onstants
> constants
Done


PS7, Line 91: should
> Should this be "must", since  VALUE_IDX is now a template param? Not that t
Yes, that's true - updated.


PS7, Line 115: else
> if branch returns, so you can drop the "else" here.
I found it clearer to have the if/else if/else just to make it obvious that the three branches
are mutually exclusive. Normally I'd use the early return but here with only one level of
indentation I didn't see a big benefit in readability.


Line 120:     DCHECK_LT(VALUE_IDX, 31) << "Trailing bits after last word.";
> Can you explain more why VALUE_IDX < 31? Can you add a static_assert at the
VALUE_IDX < 32 generally (added that static_assert) but this branch shouldn't be taken
for the last value. Changed the message so that it's a little more explicit that the branch
shouldn't be taken.


PS7, Line 126: warning
> What template values cause that warning to trigger?
BIT_WIDTH == 32, VALUE_IDX = *
  NUM_TRAILING_BITS == 0 (since we're not really unpacking, just reading 32-bit values)
  (BIT_WIDTH - NUM_TRAILING_BITS) == 32

The branch isn't actually taken in that case, but the warning apparently isn't able to reason
about control flow well enough to know that.


PS7, Line 178: const
> constexpr MAX_BATCH_SIZE
Done


PS7, Line 179: int
> const
Done


PS7, Line 183:   // Copy into padded temporary buffer to avoid reading past the end of 'in'.
> Can this be avoided sometimes? For instance, if BIT_WIDTH is 6 and num_valu
It's avoidable if roundup(bytes_to_read, sizeof(uint32_t)) <= in_bytes. Added this check
to decide whether to do the copy or not.


PS7, Line 184: BIT_WIDTH
> if BIT_WIDTH is 0, I don't think this is technically allowed.
Made sure it's 1 byte in that case.


-- 
To view, visit http://gerrit.cloudera.org:8080/4494
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message