impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
Date Fri, 07 Apr 2017 05:38:59 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs
......................................................................


Patch Set 4:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/6322/2/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 240:         materializedDesc.setIsMaterialized(true);
> As discussed, I removed the handling of literals.
Ahh right, let me explain just to close on this:

select r from (select rand() r from t) v order by r

Here "r" in the outer query block is a SlotRef into the virtual inline-view tuple. So we will
just materialize that SlotRef.

On the other hand, it means our cost-based materialization decision may not be accurate through
inline views (but that's a much less severe issue). For example,

select c from (select c1 + c2 c from t) v order by c

we will likewise materialize "c" even though the underlying expr "c1 + c2" is cheap according
to our costing.

Still sounds worth filing a JIRA for making this decision more accurate (i.e., during planning),
but I think we can definitely consider all JIRAs in the commit msg fixed after this patch
:)


http://gerrit.cloudera.org:8080/#/c/6322/4/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 39:   // All Exprs with cost greater than this will be materialized. Since we don't currently
All ordering exprs with cost...


Line 40:   // have any information about actual function costs, this value guarantees that
all
guarantees -> is intended to ensure


Line 42:   // operations unmaterialized, for example 'SlotRef + SlotRef' would have cost 3
and not
remove cost "3" since this might change and it will be hard to tell this comment needs to
be updated


Line 167:    * output by the sort node. Done by first calling createMaterailzedOrderExprs()
to
The "Done by" sentence sounds a little procedural/complicated, can you condense/simplify it?


Line 170:    * expressions. The materialized exprs are substituted with slot refs into the
new
exprs


Line 181:     // substOrderBy is a mapping from:
Can be improved for clarity. Also point 2 is not quite accurate.

Here's a proposal:

// Mapping from exprs evaluated on the sort input that get materialized into the sort tuple
to their corresponding SlotRefs in the sort tuple. The following exprs are materialized:
1. Ordering exprs that we chose to materialize
2. SlotRefs against the sort input contained in the result and ordering exprs after substituting
the materialized ordering exprs


Line 191:     Set<SlotRef> sourceSlots = Sets.newHashSet();
// SlotRefs in the result and ordering exprs after substituting the materialized ordering
exprs.


Line 201:       if (!substOrderBy.getRhs().contains(origSlotRef)) {
contains() is expensive (and every iteration might grow the rhs).
I suggest checking whether a SlotRef references the sort tuple id.

I know we use contains() elsewhere extensively, but mostly because we have no simple and cheap
alternative. Here the alternative is pretty simple.


Line 222:    * Materialize ordering exprs that are non-deterministic, contain a UDF (since
we don't
Easier to read if you have bullet points for the things we materialize, e.g.:
Materialize the following ordering exprs:
- non-deterministic exprs
- exprs that contain a UDF (since ...)


http://gerrit.cloudera.org:8080/#/c/6322/4/testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test
File testdata/workloads/functional-planner/queries/PlannerTest/sort-expr-materialization.test:

Line 1: #sort on a non-deterministic expr, gets materialized
space after #


Line 118: # expensive order expr materialized in merging exchange
Same as case above? We don't materialize in a merging exchange, the preceding sort always
does that.


Line 139: # sort on udf, gets materialized
Need additional tests to cover the tricky code to avoid redundant expr materializations. Examples
for inspiration:

select concat(a, b, c, d) c from t order by c


http://gerrit.cloudera.org:8080/#/c/6322/4/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

Line 158: class TestRandomSort(ImpalaTestSuite):
This will run over all test vectors right? I think it makes sense to only run this over text/none
or parquet/none. You can try running this with exhaustive and see how often this test is run.


Line 172:     # Include "random()" in the select list to check that its sorted correctly.
it's


Line 174:         "select random() as r from functional.alltypessmall order by r").data,
also add one with an inline view:

select r from (select random() r functional.alltypessmall) v order by r


Line 185:     """Tests that a window function over 'order by random()' worked as expected."""
works as expected


-- 
To view, visit http://gerrit.cloudera.org:8080/6322
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message