impala-reviews mailing list archives

From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Date Wed, 24 May 2017 18:08:41 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries

Patch Set 7:

Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
> There is a bit of a disconnect with the JIRA as specified and the intention
Thanks for the clarification. Agree this is the best path forward for the moment.

We should make sure to straighten out the JIRAs, though, so that the JIRA attached to this diff
actually reflects the work you're doing, and so that we track the other optimisation as a separate JIRA.
File be/src/exec/

Line 77:        tuple_index = scratch_batch_->NextValidTuple()) {
> Your intuition and mine are similar.  I did my best to make the "simple" ve
+1 no point optimising it until the code around it has settled down.
File be/src/exec/

Line 876:     RETURN_IF_ERROR(scalar_reader->InitDictionary(dict_filter_tuple));
This is orthogonal, but we really should evaluate single-slot runtime filters against the
dictionary as well. The payoff there is potentially really big - they can be quite selective
and are expensive to evaluate. Do we have a JIRA for that?
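The payoff comes from the dictionary being tiny relative to the row count: the filter is evaluated once per distinct value instead of once per row, and if no dictionary entry passes, the whole row group can be skipped. A minimal sketch of that idea (hypothetical names, not Impala's actual runtime-filter API):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch: evaluate a single-slot filter once per dictionary entry.
// If no entry passes, every row in the row group would fail the
// filter, so the row group can be skipped without decoding any
// data pages.
bool DictEntriesMatch(const std::vector<int64_t>& dict,
                      const std::function<bool(int64_t)>& filter) {
  for (int64_t v : dict) {
    if (filter(v)) return true;  // At least one value can pass: must scan.
  }
  return false;  // No dictionary value passes: skip the row group.
}
```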
File be/src/exec/

Line 317:   bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
> My guess is that inlining the functions from the header should get about as
There's also a design layering question about whether the ColumnReader abstraction should
know about ScratchTupleBatch. IMO there's a cleaner separation if ColumnReaders only care
about materialising a column into a memory buffer and setting bits based on predicates as
fast as possible, not about how we're staging tuples in a scratch batch. 

Even after inlining it adds an additional pointer indirection via 'scratch'. We can try to
coax the compiler into generating what we want, but at some point it's more robust, and adds
less cognitive load, if we just write code that's closer to what we want the compiler
to generate.

On a lot of queries we spend up to 50% of CPU time in this function, so the perf matters a
lot (I'm not just being anal about shaving CPU cycles). We've had a couple of rounds of attempting
to optimise this function and it is *very* sensitive. Even then we left a lot of performance
on the floor for subtle reasons. The big problem
is that we have multiple layers of non-trivial inlined functions inside that loop and it's
hard to reason about what the compiler and CPU will actually be doing.

PS7, Line 408: IS_FILTERED
> The behavior is different as a class, not just at this function level, so I
It looks like the only perf-critical places that reference IS_FILTERED are these two templatised
functions, so I'm still skeptical that it makes sense to templatise the whole class - otherwise
it could just be a member variable. I don't see a reason to treat it differently from IS_DICT_ENCODED.

Line 410:             return false;
I'm going to retract my previous comment about materializing the indices as a separate step.
I played around with a prototype of batching the slot materialisation, and the best design
looks like making the dictionary decoder RLE-aware so that it only does one dictionary
lookup per run. See
(warning - it's pretty rough).

I think that would mean pushing the bitmap check down into the dictionary decoder, which is
something very different from what I was talking about.

That requires rearranging a lot of code in a more fundamental way, so the short-term change to
ReadFilteredSlot() doesn't really matter either way.
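The run-at-a-time decode described above amounts to this (a rough sketch with hypothetical types, not the prototype's actual code): for a repeated run of the same dictionary code, the dictionary lookup happens once per run rather than once per value.

```cpp
#include <cstdint>
#include <vector>

// Sketch: one RLE run = one dictionary code repeated 'length' times.
struct RleRun {
  uint32_t code;    // Dictionary index.
  uint32_t length;  // Number of consecutive values with this code.
};

// Decode runs into output values with one dictionary lookup per run.
void DecodeRuns(const std::vector<RleRun>& runs,
                const std::vector<int64_t>& dict,
                std::vector<int64_t>* out) {
  for (const RleRun& run : runs) {
    int64_t value = dict[run.code];  // One lookup per run, not per value.
    out->insert(out->end(), run.length, value);
  }
}
```

Pushing a predicate bitmap check down to this level would similarly become one check per run instead of one per value.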

Line 417:           if (IS_FILTERED) batch.InvalidateTuple(val_index);
> True.  (Was not so in original code).
Do you have a repro? If it's a correctness bug we should make sure we're on top of it.

I agree we're not strict about validating that the def levels are in the expected range and
we could accept a technically-invalid Parquet file as valid input. I think it's reasonable
that we don't exhaustively validate the file so long as we don't overrun any buffers, etc., as
a result.

I don't see how that leads to any further consequences though. Impala assumes that all slots
coming out of HDFS scans are nullable, so it can't be non-NULLable from Impala's point of view.

If the Parquet file claims that a column is non-NULLable within that file (max_def_level ==
0) then I don't think we will ever materialise a NULL value, so I don't think we do the wrong
thing there even if the file is invalid.
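Concretely, the NULL decision reduces to a def-level comparison, so with max_def_level == 0 nothing can ever be NULL (a trivial sketch with a hypothetical helper name):

```cpp
#include <cstdint>

// Sketch: a value is NULL when its definition level is below the
// column's max definition level. With max_def_level == 0, no def
// level can be below it, so no value is ever materialised as NULL,
// even if the file is invalid.
bool IsNull(int16_t def_level, int16_t max_def_level) {
  return def_level < max_def_level;
}
```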

Line 609:   bool true_on_null_;
> I tried to do some optimization based on this and allowed passing in predic
It does seem quite useful - in practice it's common to have columns with many NULLs (less
so in benchmarks). I'm in favour of keeping the diff manageable, but we should file
a JIRA for it as a follow-on IMO so we don't lose track of it.


Gerrit-MessageType: comment
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <>
Gerrit-Reviewer: Joe McDonnell <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: Zach Amsden <>
Gerrit-HasComments: Yes
