impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dan Hecht (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
Date Thu, 15 Sep 2016 17:08:57 GMT
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

Patch Set 14:


Flushing out some more comments, and will move onto the later patch set.

One thing that's not clear to me yet is what the right separation between node and sink is
-- they are currently very tightly coupled which means they don't really benefit from being
separated into separate classes.  Is the goal of this change solely to be able to expose the
DataSink interface, or is it to provide some separation as well? Do we see them becoming less
coupled over time?
File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 45: build row partitions
is this the same as the "hash partitions" referred to in the next paragraph. let's be consistent
in terminology.

Line 86: 
is it expected that the renaming public functions would only be used by PHJN?  If so, would
be good to state that.

PS14, Line 90: Acquire ownership
Doesn't the ownership pass from this class to the exec node?  In which case this method wouldn't
acquire ownership.  So let's come up with better terminology and naming.

PS14, Line 95: Called after probing of the partitions
             :   /// is done. The partitions are not closed or destroyed, since they may be
spilled or
             :   /// may contain unmatched build rows for certain join modes (e.g. right outer
The fact that we need some much explanation kinda makes this seem like it might not be the
right abstraction.  Could we instead just clear the vector when creating the next set of partitions
(it doesn't look like we delete partitions until either Close() or Reset() anyway), so why
do we need to proactively clearly the hash_partitions_ vector before starting he next iteration?

Line 106:   }
nit: missing blank line

PS14, Line 109: When this call returns
Is this stuff by this method or the caller?  i.e. if this method closes input_partition, then
say that.  If it doesn't then either say "after" or don't say it at all (why is it relevant
to this method documentation?).

PS14, Line 112: so that the probe phase has enough buffers preallocated
              :   /// execute successfully.
slightly garbled. seems like a word is missing or something.

PS14, Line 117: /// Iterates over all the partitions in 'hash_partitions_' and returns the
              :   /// number of build rows.
rather than say how and referring to private member, and leaving it to the reader to assume
that "largest number of build rows" is talking about a per-partition thing, let's say something

Returns the maximum number of build rows of a hash partition 

(or whatever terminology you choose in the class comment).

Line 123:   bool HashTableStoresNulls() const;
why is this here rather than the join node? especially given that the join node owns the HashTableCtx.
 on a related note, I'm not really sure how the HashTable context is going to work out once
this is driven like a real sync.  Will we still have a parent_ link to the exec node living
in a different fragment?

Line 128:   inline const std::vector<bool>& is_not_distinct_from() const {
and then should this be here? is it to be closer to being able to break the parent_ link at
some point and make the builder able to build the hash table without any info from parent_?

PS14, Line 182: partition side of this partition
not sure what this is saying. maybe just "for this build partition"?

Line 188:     /// false.
maybe clarify that an error is not returned in that case.  or clarify what cases are errors.

PS14, Line 195: unpin_all_build
why "build"?  shouldn't it be unpin_all_blocks?  or even better would be to name it according
to what the algorithm cares about. which I think is whether or not we might have more build
rows to append, right?

PS14, Line 234: build or probe
does this care about the probe side partitioning?

Line 237:   /// 'null_aware_probe_partition_' and 'null_probe_rows_'.
Also explain what this is computing, rather than just what the computation is.

PS14, Line 244: all hash partitions for partitioning level
this is kinda misleading. maybe qualify with ... for the current input, or the "current set
of hash partitions for partition level...". i.e. this isn't all the hash partitions that are
created for a particular level.

PS14, Line 252: in

PS14, Line 256: in memory
what does this mean? is it trying to say it fits in the current (pinned) block?
File be/src/exec/partitioned-hash-join-node.h:

PS14, Line 410:  we then iterate over all the probe rows
> Clarified that it's the partition's. Most of the text was just copied verba
Thanks. From the perspective of the reviewer, the comment is already part of the patch (the
code moved in non-trivial ways), so the reviewer has to read the comment. Since the reviewers
need to read these comments, we might as well improve the comments now when we can make meaningful
improvements since it's a relatively cheap way to improve readability (doesn't lead to additional
debugging, testing, etc).

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dan Hecht <>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message