impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3905: Implements HdfsScanner::GetNext() for text scans.
Date Thu, 23 Feb 2017 00:30:03 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-3905: Implements HdfsScanner::GetNext() for text scans.

Patch Set 2:

File be/src/exec/

Line 64:       batch_start_ptr_(NULL),
> use nullptr throughout

Line 273:   MemPool* pool = row_batch->tuple_data_pool();
> please consider getting rid of this variable and referring to the rowbatch 

Line 440:       eos_ = true;
> advance scan_state_?
Done. I think it makes sense to not allow GetNext() once eos() is true because there does
seem to be a real use case for allowing that, and disallowing it helps simplify the logic
in GetNext().
Imo, we should have that contract for ExecNode::GetNext() as well, but that's another story.

Line 445:         << "FE should have generated SNAPPY_BLOCKED instead.";
> why not check this when creating the stream?
Good idea. Done.

Line 446:     RETURN_IF_ERROR(UpdateDecompressor(stream_->file_desc()->file_compression));
> why is this needed, given that we just started (ie, why not move it into in
Moved into InitNewRange()

Line 459:     eos_ = true;
> scan_state_ transition?

Line 469: Status HdfsTextScanner::CommitRows(RowBatch* row_batch, int num_rows) {
> how about coalescing that with hdfsscannode::commitrows? at the very least 
The main problem is that the HdfsScanner::CommitRows() may add the batch to the row-batch
queue and allocate a new batch. But for the new scanners we want to manually add batches to
the queue after getting another batch with GetNext().

I merged this CommitRows() with HdfsScanner::CommitRows() and added a flag to control the
queue-adding behavior. It's slightly awkward, let me know if you have a better idea.

Line 485:     ExprContext::FreeLocalAllocations(entry.second);
> why call this here?
Added a comment. Did you have place in mind? We can probably move it somewhere else, but not
sure what side-effects it may have.
File be/src/exec/hdfs-text-scanner.h:

Line 119:   /// regardless if whether it passed the conjuncts.
> regardless of whether

Line 124:   /// Only valid to call in scan state FIRST_TUPLE_FOUND or PAST_SCAN_RANGE.
> thanks for introducing that state variable, it's much clearer than a bunch 

Line 135:   Status CommitRows(RowBatch* row_batch, int num_rows);
> in/out params go last

Line 140:   /// otherwise it will just read num_bytes. If we are reading compressed text,
> if we are reading compressed text, should this even be called (or why is th
Added comment.

This functions internally handles uncompressed text, streaming compressed text (e.g., gzip),
and non-streaming compressed text. See FillByteBufferCompressedFile() and FillByteBufferCompressedStream().

In addition, this function is overridden by the LZO scanner to reuse much of the scanning
logic in here.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Id193aa223434d7cc40061a42f81bbb29dcd0404b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-HasComments: Yes

View raw message