impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Tue, 15 Nov 2016 02:40:03 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input

Patch Set 13:

File be/src/exec/

Line 66:             tsink.table_sink.hdfs_table_sink.skip_header_line_count :
if you break this up, then
? ...\n
: 0

is better

Line 212:         GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
we're standardizing on nullptr now

Line 266:   // The rows of this batch may span across multiple files. We repeatedly pass the
remove "across"

Line 273:     RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
rename this function, it's not appending the entire row batch (AppendRows?)

Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch)
to speed this up, you should:
- keep a current_clustered_partition_
- also current_clustered_partition_key_
- check the partition key of the last row in batch whether it matches current_partition_key_
- if so, skip the for loop and write the entire batch out

also: change WriteRowsToPartition to take the index vector separately; pass a nullptr in the
case where the entire batch gets written to the same partition

apply that same optimization to writer::AppendRowBatch (or however you rename that). no need
to spend time checking some vector in that case.
File be/src/exec/hdfs-table-sink.h:

Line 170:       const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
mention that current_row provides the partition key values.

Line 181:   /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object
pool and

Line 189:   void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>&
current_row -> row (it doesn't matter to this function that the row is 'current')

same for InitOutputPartition

Line 213:   /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
what "corresponding to partitions" refers to is unclear.

Line 214:   /// The input must be ordered by the partition key expressions.
the rows are expected to be ordered...?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: Zoltan Ivanfi <>
Gerrit-HasComments: Yes

View raw message