impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans.
Date Mon, 15 May 2017 20:15:31 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans.

Patch Set 2:

Commit Message:

PS2, Line 12: ProcessSpit
> nit: typo
File be/src/exec/

Line 72:     header_(nullptr),
> Initialise these in the class itself to avoid the duplication.

PS2, Line 130: nullptr
> Don't need this argument.

Line 137:       static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch);
> The transfer of ownership of this row batch is confusing and inconsistent b
Not opposed to the change but want to understand the reasoning a little more. I don't see
how the modes are inconsistent.
The caller passes the last batch into Close(), that's the rule for both MT and non-MT cases.
I'd argue that having Close() create the last batch in the non-MT case is actually inconsistent.
It would also add  duplicate code for creating the batch in all Close() function. It's odd
that we are in Close() which gets passed a batch by the caller, but then create a new batch
inside this function anyway. Adjusting the comment on Close() to explain this behavior seems

Line 211:   // scan range
> The logic for updating 'eos_'is a little confusing here, since we set 'eos_
File be/src/exec/base-sequence-scanner.h:

Line 40: /// through the first complete sync in the next range.
> The positioning of line breaks is a bit weird in this paragraphs. Also mayb

Line 86:   virtual Status GetNextInternal(RowBatch* row_batch);
> Can you add WARN_UNUSED_RESULT to the functions that return Status?
Good idea. Do you mind if I do this in a follow-on change for all scanners? I'd like to keep
this patch focused on the
these specific scanners, if possible.

PS2, Line 115:  Read and validate sync marker against header_->syn
> Maybe "Read sync marker from 'stream_' and validate against header_->sync"
Better, thanks. Done.
File be/src/exec/

Line 60:     avro_header_(nullptr),
> Initialise these in the class itself to avoid having to update in multiple 

Line 535:       VLOG_FILE << "Decompressed " << compressed_size << " to
" << data_len_;
> Remove this log? It doesn't seem very useful.
Agree. Done.

Line 557:       if (codegend_decode_avro_data_ != nullptr) {
> Flatten out this else if {} to reduce nesting?
File be/src/exec/hdfs-avro-scanner.h:

PS2, Line 135: data_
> maybe data_block_* to be more explicit? Or group these into a struct for th
Renamed to data_block_. Struct does not seem necessary to me for 3 fields, but will change
if you feel strongly.

PS2, Line 140: num_records_
> num_records_in_block_ or similar? Otherwise it's hard to know what it's cou
File be/src/exec/

Line 410:     Status status = GetNextInternal(batch);
> It's not obvious why the batch needs to be added to the queue (I assume it'
Good point. Done.
File be/src/exec/

Line 58:       num_rows_(0),
> Init in header to be consistent with the changes I suggested to other scann

Line 242: Status HdfsRCFileScanner::ResetRowGroup() {
> This doesn't need to return a status now.
I coalesced this function into StartRowGroup().

PS2, Line 464: ReadRowGroup
> The name is a little confusing. ReadRowGroupHeader() or StartRowGroup()?
Renamed to StartRowGroup()
File be/src/exec/hdfs-rcfile-scanner.h:

Line 382:   /// number of rows in this rowgroup object
> Can we group all of the row-group-specific state into a struct? This would 
I coalesced ResetRowGroup() into StartRowGroup(). Still think we need a struct?

Line 404:   uint8_t* row_group_buffer_;
> Use a StringBuffer instead of row_group_buffer_ + row_group_length_ + row_g
StringBuffer seems misleading to me because the data is binary. StringBuffer also provides
a lot of functionality we do not need here. Is this just about grouping the variables into
a struct?

Line 408:   int row_group_length_;
> I think this and row_group_buffer_size_ should be int64_ts. There's a possi
Good catch. Overflow was definitely possible. Done.
File be/src/exec/hdfs-scan-node.h:

Line 113:   boost::scoped_ptr<RowBatchQueue> materialized_row_batches_;
> One of the harder things I found with this code is reasoning about the tran
As a follow-on? Seems like an independent but fundamental change, want to avoid scope creep.
I have a feeling the change is going to lead to more debugging.
File be/src/exec/

Line 80:     : scan_node_(NULL),
> Consider moving the common initialisation to the class body.

Line 139:     scan_node->AddMaterializedRowBatch(batch);
> Can you comment why we need to add the batch even on error?
File be/src/exec/

Line 47:       current_block_length_(-1),
> Init in header for consistency with suggestion elsewhere.
File be/src/exec/hdfs-sequence-scanner.h:

PS2, Line 191: int
> int64_t for consistency with byte counts elsewhere

Line 191:   static const int MAX_BLOCK_SIZE = (1024 * 1024 * 1024);
> extraneous parentheses
File be/src/util/

Line 34: #include "util/pretty-printer.h"
> unused.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: anujphadke <>
Gerrit-HasComments: Yes

View raw message