impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Thu, 08 Jun 2017 01:26:52 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................


Patch Set 2:

(23 comments)

Code comments. Still going through the tests.

http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 432:       DCHECK(row_group_idx_ <= file_metadata_.row_groups.size());
> The file has to have more than 1 row group. I'm pretty sure tpch_parquet.li
Makes sense, batch_size=1 then


Line 455:     int max_tuples = min<int64_t>(row_batch->capacity(), rows_remaining);
> Not sure, should we be checking that stats is non-negative? What do you sug
I agree with Mostafa in principle and I believe we have a JIRA to track this. However, this
patch only optimizes existing behavior, it does not change it. Even before this patch we relied
on the same num_rows for count star queries, so I think we should tackle Mostafa's concern
separately.


Line 488:       eos_ = true;
> To keep it consistent with line 434. I don't like checking for greater or e
Ok, DCHECK_LE() then


Line 1455:     // Skip partition columns
> There can be several slots if we are grouping by a partition column. Left u
Mostafa, we need the per-column null count to optimize queries like select count(l_comment)
from lineitem. Good idea to do that, but not in this patch. Pooja is working on populating
the null count.


http://gerrit.cloudera.org:8080/#/c/6812/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 428:     RETURN_IF_ERROR(
In most cases we're allocating way too much here. We can get a more accurate capacity by taking
the minimum of the batch size and the number of row groups in this file. There's a static
ResizeAndAllocateTupleBuffer() in RowBatch that can be used for this purpose.


Line 437:       int64_t* dst_slot = reinterpret_cast<int64_t*>(dst_tuple->GetSlot(0));
The dst_tuple->GetSlot(0) looks wrong. There's no guarantee that the target slot is always
at that offset. I think we may need to plumb through the slot id from the FE and then get
the slot offset from the slot descriptor.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
File fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java:

Line 333:       FunctionCallExpr f = (FunctionCallExpr) substitutedAgg;
remove


Line 359:     //Preconditions.checkState(mergeAggInfo_ == null);
?


Line 659:     if (getMaterializedAggregateExprs().size() != 1) return false;
Avoid calling getMaterializedAggregateExprs(() twice


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 571:   public FunctionCallExpr getMergeAggInputFn() { return mergeAggInputFn_; }
Do we need these getters/setters?


Line 621:       if (!(substitutedFn instanceof FunctionCallExpr)) return e;
This looks like an error we should not ignore. Convert into Preconditions check.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java:

Line 334:     ArrayList<SlotDescriptor> materializedSlots = getMaterializedSlots();
The previous implementation seemed cheaper, I'd prefer to leave it. Using hdfsTable.isClusteringColumn()
is still better.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 105:  * TODO: pass in range restrictions.
Might be good to add a word or two about the agg optimization flow, i.e., the caller passes
in some info about the agg in this query block, then this scan decides whether whether to
apply the optimization or not, then the produced smap must be applied to the agg infos from
this query block.


Line 134:   // Set to true if the count(*) aggregation can be optimized by populating the
tuple with
Set to true if the query block of this scan has a count(*) aggregation that is amenable to
optimization by populating the scan tuple with Parquet metadata. This scan does additional
analysis in init() to determine whether it is correct to apply the optimization.


Line 238:     // Create two functions that we will put into an smap. We want to replace the
Integrate this into a function comment. It should describe what this function does, i.e. it
adds a new slot descriptor and it populates the optimizedAggSmap_. Describe what the entry
looks like.


Line 247:     sd.setType(Type.BIGINT);
set slot as non-nullable


Line 253:     sumFn.analyze(analyzer);
use analyzeNoThrow() and remove the throws declaration


Line 288:           (getTupleDesc().getMaterializedSlots().isEmpty() ||
use desc_ instead of getTupleDesc()


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 1207:    * If 'hdfsTblRef' only contains partition columns and 'fastPartitionKeyScans'
comment on new param


Line 1268:    * 'fastPartitionKeyScans' indicates whether to try to produce the slots with
comment new param


Line 1512:    * 'fastPartitionKeyScans' indicates whether to try to produce the slots with
comment new param


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

Line 378:     RewritesOk("count(id)", rule, null);
Also check count(1+1) and count(1+NULL)


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows)
> Not quite sure how to do that. We set the SlotDescriptor label and then it 
Looks like you fixed, thanks!


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message