impala-reviews mailing list archives

From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Tue, 15 Nov 2016 02:40:03 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................


Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66:             tsink.table_sink.hdfs_table_sink.skip_header_line_count :
if you break this up, then

    ...
        ? ...
        : 0

is better


Line 212:         GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
we're standardizing on nullptr now


Line 266:   // The rows of this batch may span across multiple files. We repeatedly pass the row
remove "across"


Line 273:     RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
rename this function; it's not appending the entire row batch (AppendRows?)


Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) {
to speed this up, you should:
- keep a current_clustered_partition_
- also a current_clustered_partition_key_
- check whether the partition key of the last row in the batch matches current_clustered_partition_key_
- if so, skip the for loop and write the entire batch out

also: change WriteRowsToPartition to take the index vector separately; pass a nullptr in the
case where the entire batch gets written to the same partition

apply the same optimization to writer::AppendRowBatch (or however you rename it). no need
to spend time checking some vector in that case.


http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170:       const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
mention that current_row provides the partition key values.


Line 181:   /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and
"pool,"


Line 189:   void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>& ctxs,
current_row -> row (it doesn't matter to this function that the row is 'current')

same for InitOutputPartition


Line 213:   /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
it's unclear what "corresponding to partitions" refers to.


Line 214:   /// The input must be ordered by the partition key expressions.
the rows are expected to be ordered...?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi@cloudera.com>
Gerrit-HasComments: Yes
