impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
Date Fri, 22 Apr 2016 21:30:37 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................


Patch Set 5:

(7 comments)

Tim, I needed to make a fix to TransferScratchTuples(). The new/faster version had a division
by zero problem that is now fixed.

http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 367:   bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
> Consider exposing this templated function directly and getting rid of the t
In AssembleRows() we call Read*ValueBatch() on a ColumnReader* and not a concrete instance,
so I think we need a virtual function, and
hence the wrappers. If we wanted to expose this templated function directly, we'd need a virtual
templated function (which doesn't exist).

The other alternative I have considered is to separate the BaseScalarColumnReaders from the
CollectionColumnReaders in AssembleRows() and then do column-wise materialization for the
first, and for the produced rows, do row-size materialziation for the latter.

I opted for the current solution to have a uniformity in how all top-level slots are populated,
but I could go either way.

Thoughts?


Line 756:       uint8_t* tuple_mem, int* num_values) {
> You never call this function right? I would delete this, it's confusing and
This function is called, see the virtual wrappers in this class. That's where the perf improvement
comes from in this patch because we avoid the virtual calls to NextLevels() and ReadValue().


Line 1177:       parent_->state_->LogError(msg);
> Use LOG_OR_RETURN_ON_ERROR? Here and below
Done


Line 1680:         scan_node_->mem_tracker());
> I would consider making scatch_batch a member variable of HdfsParquetScanne
I'm not a fan of keeping everything as members because it makes the code hard to reason (side
effects vs. function input/outputs). Since we're already deep down the member path here, I'm
happy to conform. Done.


Line 1727:   // This function most not be called when the output batch is already full.
> Maybe further note that this should never happen as long as we always Commi
Done


Line 1842:     // The output batch must have enough room at this point.
> This confused me at first. I would clarify that this is necessary if the ro
Good point. A while loop below seems much clearer, thanks for the suggestion. Done.


http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 464:   /// tuple to be returned in a row batch.
> "to be returned in a row batch" is confusing, if you're trying to define wh
Removed.


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message