impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) PREVIEW: Basic column-wise slot materialization in Parquet scanner.
Date Tue, 19 Apr 2016 17:10:26 GMT
Tim Armstrong has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner.
......................................................................


Patch Set 2:

(10 comments)

Some initial comments. I'm not sure that I fully understand some aspects of the changes so
I'll have to do another pass.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 273:   /// once. Returns the number of values actually materialized in *num_values.
Can you document how errors are handled in these functions. E.g. what happens if it hits an
error partway through the batch.

The documentation was missing before but I think it's even more important now.


Line 706: RowGroupAtEnd
Is this somewhat redundant? We should know the # of rows so can compute max_values as min(max_values,
rows left in group). And if there's an error continue_execution should take care of that.

I figure you'll probably get to that when doing the perf optimisations but it would be good
now to remove any redundancy in the logic so it's easier to understand.


Line 718: val_count
Maybe convert the loop into for (int val_count = 0; val_count < max_values; ++val_count)
and just break out on exception conditions. It will make it easier to streamline the loop
down the track I think.

I think the LIKELY annotation should be redundant (if not already) if the loop is in a reasonably
canonical form.


Line 953: HdfsParquetScanner
Does this need to be nested inside the scanner? It seems unnecessary if it's local to the
.cc file.


Line 979: state->batch_size() * tuple_byte_size
The cast is in the wrong place to prevent overflow.

I think you want to do a similar calculation to what I added in a recent patch to compute
the batch size and allocate the memory. It probably makes sense to change the RowBatch interface
so the logic to size the batch is the same between all the scanners:

    IMPALA-3105: avoid overrunning allocated tuple buffer


Line 1752:     if (UNLIKELY(
It seems easy to hoist these checks out of the hot loop and should make the code more readable,
so maybe consider doing that in this patch set.


Line 1912:           ReadCollectionItem(column_readers, materialize_tuple, pool, tuple);
Would it make sense to switch this over to the column-wise version now? I know you're probably
doing it eventually but it might make the intermediate state of the code easier to understand
if it's not a mix of column-wise and row-wise materialisation.


Line 1917:         if (ExecNode::EvalConjuncts(&conjunct_ctxs[0], conjunct_ctxs.size(),
row)) {
Can we have conjuncts on nested collection items?


Line 2564:       return Status::OK();
?


Line 2893:       return Status::OK();
?


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message