impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
Date Thu, 25 Aug 2016 21:38:16 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

Patch Set 9:

File be/src/exec/

Line 206:     Status status = Status::MemLimitExceeded();
> let's make the error reporting for PrepareForWrite() and PrepareForRead() m
Good point, this discrepancy crept  in in a rebase.
File be/src/exec/

Line 165:   if (expr_mem_tracker_ != NULL) {
> the function comment makes it sound like we should also have:
Done. Makes it more obvious.
File be/src/exec/

Line 35
> revert?
I just fixed the formatting while I was looking at it, I think the new formatting is more
File be/src/exec/

Line 310:   for (unique_ptr<Partition>& partition: all_partitions_)  partition->Close(NULL);
> nit: extra space before 'partition'

Line 464:   // SwitchToIoBuffers() call below will succeed.
> Does this comment make sense? Can we use small buffers and have a stream in
Yes (aside from SwitchToIoBuffers() being a weird API). The build_stream() started off with
small buffers so might not have switched to I/O buffers at this point.

I think it's unlikely that we'll spill a partition that hadn't switched to I/O buffers but
I think it's possible.

This will go away with the new buffer pool.
File be/src/exec/partitioned-hash-join-builder.h:

Line 46: class PhjBuilder : public DataSink {
> PartitionedHashJoinBuilder for consistency?
It's for consistency with NljBuilder (Marcel suggested that name on the previous review).

Line 69:   /// This function walks the current hash partitions and picks one to spill.
> ... and spills the largest one
Done, although made it clear that it's a policy choice.

Line 174:     /// transferring ownership to them.
> this line doesn't seem to add anything

Line 189:     /// all blocks aside from the current read or write block are left pinned.
This comment was wrong - fixed it.
File be/src/exec/

Line 865: Status PartitionedHashJoinNode::BuildHashTablesAndPrepareProbePartitions() {
> It would be good to document what exactly is handed off from the PHJ builde
I agree that should be the end-goal.

I think for now it's tricky to do better in the spilling case than this approach of allocating
probe buffers, then hash tables, then remaining probe buffers until we have reservations.

When we need to implement the in-memory broadcast join, we could add a temporary in-memory
codepath that did the hash table build purely in the builder, but just bailed out if any partitions

One stopgap option (which is feasible but ugly) that lets us implement the current algorithm
in the builder is to allocate the initial probe Blocks in the builder, and then hand them
over to the join. PrepareForWrite() could then optionally take a Block as an argument. What
do you think of doing that?

Line 891:   for (int i = 0; i < PARTITION_FANOUT; ++i) {
I moved the hash table build calls into the builder, as an incremental step towards separating
this more cleanly.

Line 959:       RETURN_IF_ERROR(builder_->SpillPartition(&spilled_partition_idx));
> It feels like there must be a simpler and more efficient way to handle this
Yeah, if we know how much memory we have to work with we can precompute the sizes of the hash
tables and do some kind of bin-packing. I don't think this is feasible until we have reservations.
File be/src/exec/partitioned-hash-join-node.h:

Line 79:   /// Logically, the algorithm has the following modes:.
> would be good to document somewhere (not necessarily here) what phases are 
I added a comment to the builder class to better explain the responsibilities of the builder.

Line 150:   bool AppendProbeRow(BufferedTupleStream* stream, TupleRow* row, Status* status);
I changed this to just return a Status, since there's no situation where it can return false
but have ok status.

Line 445:     /// Close the partition and attach resources to 'batch' if non-NULL or free
> can you make the comment on BuildPartition::Close() similar to this one? th
File be/src/runtime/buffered-tuple-stream.h:

Line 231:   /// Prepares the stream for reading. Called after Init() and before the first
> maybe expand to:
File be/src/util/

Line 400:   AddChild(child, indent);
> shouldn't the addition and the swap be atomic under children_lock_?
Good point. I added an internal AddChildLocked() method and made sure to grab children_lock_
for the duration of this method.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message