impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Date Sun, 23 Oct 2016 21:16:50 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
......................................................................


Patch Set 1:

(4 comments)

LGTM, just want to make sure the comment is a little clearer.

http://gerrit.cloudera.org:8080/#/c/4813/1/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 167
> What's stringe is that without the pragma and without __restrict__, Matt Go
I experimented with some variants: https://godbolt.org/g/rFww4g .BloomFilterOrPointersUnrolledIvDep()
and BloomFilterOrIvDepInt() emit exactly the code you want.
There are a few things that make a difference:

* Avoiding the indirection via the in/out pointers helps a lot. If we extract vector.data()/vector.size()
from the loop body we can get the same effect.
* Rather subtly, using a signed/unsigned loop counter makes a difference (I think with the
signed type the compiler is allow to assume no overflow and this somehow helps).
* The unrolling and #pragma ivdep aren't always necessary to emit the code, but it's a lot
cleaner with them.  I think gcc's vectoriser is smart enough to insert a check for whether
the arrays overlap and two versions of the code for the overlapping/non-overlapping cases.

The explicit SIMD version has the virtue of not being sensitive to all these things.


PS1, Line 168: _mm256_loadu_pd
> BloomFilters do, but TBloomFilters do not, and from using gdb I see that th
Oh, duh.


PS1, Line 199: _mm_storeu_si128
> Let's continue this conversation above.
Yeah, I doubt there's much difference between the aligned/unaligned versions, just didn't
want to use the unaligned ones if there was already an invariant that the memory was aligned.


http://gerrit.cloudera.org:8080/#/c/4813/2/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 181:   // The trivial loop out[i] |= in[i] should auto-vectorize with gcc at -O3, but
it is not
Let's fix this comment so it's clear that the issue is that the code is not written in a way
that is friendly to auto-vectorization.

I've seen other cryptic comments about these kind of things ("compiler is confused by x")
 and they're hard to know what to do with (if not outright misleading).


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple <jbapple@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message