impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
Date Wed, 15 Mar 2017 14:44:51 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
......................................................................


Patch Set 2:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

Line 212:   // True if this expr always returns the same value given the same input, false
if it
> True if this exprs always returns the same value given the same input. Fals
Done


Line 213:   // does not, or if the behavior is unknown (e.g. UDFs). Set at the end of analyze()
and
> e.g.
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 292:     // We currently don't have a way to indicate if a UDF is deterministic, so just
always
> Ideally we should not conflate the UDF and deterministic concepts. For exam
Not sure what you mean by 'two separate members". Maybe Expr.containsNonDeterministicBuiltin
and Expr.containsUDF?

Also, it doesn't seem like FunctionCallExpr.isNondeterministicBuiltinFn() and Expr.isDeterministic()
could ever be guaranteed to return opposite values, since a deterministic function that has
non-deterministic parameters would constitute a non-deterministic expr.


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java:

Line 253:       if (!(smap.getLhs().get(i) instanceof SlotRef)
> use {} for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 39:   // All Exprs with cost greater than this will be materialized. Since we don't currently
> Remove TODO, instead comment where this value came from, maybe with some ex
Done


Line 46: 
> needs a brief comment, in particular why we need to store them
Done


Line 172:    */
> We should collect the SlotRefs only for non-materialized ordering exprs.
Done


Line 180:     // substOrderBy is a mapping from slot refs in the sort node's input and ordering
> update comment to reflect the new contents of the substOrderBy smap
Done


Line 196:       SlotDescriptor origSlotDesc = origSlotRef.getDesc();
> update comment, we are not only substituting slot refs
Done


Line 208:     substituteOrderingExprs(substOrderBy, analyzer);
> Say something about the args of this function. Also mention side-effects (p
Done


Line 210:     // Update the tuple descriptor used to materialize the input of the sort.
> make this the first sentence in this comment
Done


Line 211:     setMaterializedTupleInfo(sortTupleDesc, sortTupleExprs);
> no more hint, right?
Done


Line 214:   }
> from the materialized sort exprs to the new SlotRefs
Done


Line 216:   /**
> Instead of returning a new smap and combining with the other one, why not p
I updated it to take the smap as a parameter, but the problem with returning the materialized
exprs to avoid side effects is that we also need to update isAsc and nullsFirst.


Line 217:    * Materialize ordering exprs that are non-deterministic, are more expensive than
a cost
> analyzer needed?
For 'addSlotDescriptor'


Line 219:    * 'sortTupleDesc' as their parent.
> remove
Done


Line 228:    * exprs that get materialized.
> Why TODO? What happens when there are no sort exprs at all left?
There's already a lot in this review, and I wasn't sure how easy this would be. I'm happy
to take a look if you want.

If there are no sort exprs, we still go through the whole sort but every row is considered
equal (by TupleRowComparator::Compare"). This is the same behavior as would currently happen,
except slightly faster as the literal expressions don't get repeatedly evaluated and compared.

I added a test to at least show it works as expected.


http://gerrit.cloudera.org:8080/#/c/6322/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 374:   public void testSortExprMaterialization() {
> testSortExprMaterialization
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test
File testdata/workloads/functional-planner/queries/PlannerTest/sort-materialization.test:

Line 19
> I think these tests need more examples cheap and expensive exprs to make su
I'm not sure how to make that work with PlannerTest, but I added a test to query_tests/test_udfs
that works.

For UDAs, assuming that you mean something like:
select bool_col, sum(min(int_col)) over(order by uda(int_col)) from functional.alltypes group
by 1;
I believe that will already be materialized, in a SlotRef created in AggregateInfoBase.createTupleDesc.


Line 37
> Can you ask Greg about some specific expensive ordering exprs that have com
Done


Line 43
> i think it's questionable whether something like this should be materialize
Its very difficult to say what a good threshold would be since our cost model is pretty bad
- partially my fault, since I wrote it in the first place, but also because we don't have
any information about the actual cost of functions.

Given the discussion on the design doc that not materializing an expensive expr is probably
a bigger problem than materializing a cheap expr, I'm inclined to set it relatively low, especially
if we're not going to provide an override to the user.

For example, the function 'regex_replace' should definitely be materialized but depending
on its parameters it can be costed as low as 13 (10 for the function call + 3 parameters).
Unfortunately that's also the lowest possible cost that 'if' could have (its actually more
expensive here due to the binary predicate).

So, maybe we set the threshold as FUNCTION_CALL_COST (10) so that way all expensive functions
are guaranteed to be materialized but things like simple arithmetic operations are left non-materialized
(for example slotref + slotref + slotref would be costed as 5).

And of course we should consider improving the cost model, though not in this patch.


Line 106
> It doesn't seem right to print this here because the merging exchange does 
Done


http://gerrit.cloudera.org:8080/#/c/6322/1/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

Line 165:     # "order by random()" with different seeds should produce different orderings.
> Confusing sentence, what is 'unordered'?
Done


Line 168:     results_seed1 = self.execute_query(seed_query % "1")
> remove
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6322
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message