impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-2805: Order conjuncts based on selectivity and cost
Date Fri, 29 Apr 2016 16:57:21 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-2805: Order conjuncts based on selectivity and cost
......................................................................


Patch Set 15:

(1 comment)

File Edit Options Buffers Tools Help                                                     
                                                                                         
                                                                         
TPCH results from the latest version of the code:

 +----------+-----------------------+---------+------------+------------+----------------+
 | Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
 +----------+-----------------------+---------+------------+------------+----------------+
 | TPCH()   | parquet / none / none | 3.15    | -0.46%     | 2.65       | -0.75%         |
 +----------+-----------------------+---------+------------+------------+----------------+

 +----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | Workload | Query          | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) |
StdDev(%) | Base StdDev(%) | Num Clients | Iters |
 +----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | TPCH()   | TPCH-Q14       | parquet / none / none | 2.63   | 2.55        |   +3.19%   |
  1.68%   |   0.82%        | 1           | 5     |
 | TPCH()   | TPCH-Q22       | parquet / none / none | 1.18   | 1.15        |   +2.92%   |
  4.60%   |   2.52%        | 1           | 5     |
 | TPCH()   | TPCH-Q15       | parquet / none / none | 4.04   | 3.94        |   +2.60%   |
  3.02%   |   1.81%        | 1           | 5     |
 | TPCH()   | TPCH-Q3        | parquet / none / none | 4.13   | 4.09        |   +1.00%   |
  1.27%   |   1.07%        | 1           | 5     |
 | TPCH()   | TPCH-Q19       | parquet / none / none | 4.23   | 4.19        |   +0.94%   |
  1.03%   |   1.77%        | 1           | 5     |
 | TPCH()   | TPCH-Q18       | parquet / none / none | 5.40   | 5.37        |   +0.58%   |
  0.80%   |   1.37%        | 1           | 5     |
 | TPCH()   | TPCH-Q2        | parquet / none / none | 1.24   | 1.23        |   +0.58%   |
  4.61%   |   1.74%        | 1           | 5     |
 | TPCH()   | TPCH-Q17       | parquet / none / none | 3.46   | 3.44        |   +0.53%   |
  1.34%   |   1.89%        | 1           | 5     |
 | TPCH()   | TPCH-Q21       | parquet / none / none | 9.05   | 9.02        |   +0.40%   |
  0.50%   |   0.68%        | 1           | 5     |
 | TPCH()   | TPCH-Q20       | parquet / none / none | 3.09   | 3.10        |   -0.43%   |
  1.72%   |   1.99%        | 1           | 5     |
 | TPCH()   | TPCH-Q5        | parquet / none / none | 3.55   | 3.57        |   -0.51%   |
  0.11%   |   1.11%        | 1           | 5     |
 | TPCH()   | TPCH-Q7        | parquet / none / none | 4.07   | 4.10        |   -0.60%   |
  0.83%   |   1.55%        | 1           | 5     |
 | TPCH()   | TPCH-Q9        | parquet / none / none | 4.80   | 4.84        |   -0.82%   |
  0.76%   |   1.85%        | 1           | 5     |
 | TPCH()   | TPCH-Q4        | parquet / none / none | 3.77   | 3.80        |   -0.88%   |
  1.50%   |   3.12%        | 1           | 5     |
 | TPCH()   | TPCH-Q8        | parquet / none / none | 3.75   | 3.80        |   -1.27%   |
  1.46%   |   1.15%        | 1           | 5     |
 | TPCH()   | TPCH-Q13       | parquet / none / none | 1.83   | 1.86        |   -1.63%   |
  0.16%   |   1.08%        | 1           | 5     |
 | TPCH()   | TPCH-Q12       | parquet / none / none | 3.43   | 3.49        |   -1.73%   |
  2.01%   |   1.86%        | 1           | 5     |
 | TPCH()   | TPCH-Q10       | parquet / none / none | 3.70   | 3.79        |   -2.32%   |
  0.58%   |   1.33%        | 1           | 5     |
 | TPCH()   | TPCH-Q6        | parquet / none / none | 2.41   | 2.49        |   -3.00%   |
  0.96%   |   0.91%        | 1           | 5     |
 | TPCH()   | TPCH-Q16       | parquet / none / none | 1.17   | 1.21        |   -3.15%   |
  1.87%   |   4.64%        | 1           | 5     |
 | TPCH()   | TPCH-Q1        | parquet / none / none | 3.67   | 3.81        |   -3.84%   |
  1.13%   |   3.97%        | 1           | 5     |
 | TPCH()   | TPCH-Q11       | parquet / none / none | 1.29   | 1.35        |   -4.75%   |
  1.72%   |   3.07%        | 1           | 5     |
 +----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

I did 5 iterations this time, so there's a little less noise. The only query that shows an
increase of > 1 std is Q14, which wasn't affected by this change so that probably reflects
random variation.

The tpch queries that did have their plans change:

Q6:

-   predicates: l_shipdate >= '1994-01-01', l_shipdate < '1995-01-01', l_discount >=
0.05, l_discount <= 0.07, l_quantity < 24
+   predicates: l_discount >= 0.05, l_discount <= 0.07, l_quantity < 24, l_shipdate
>= '1994-01-01', l_shipdate < '1995-01-01'

All we did here is more numeric comparison in front of string comparisons. We don't have selectivity
estimates for any of these, so this seems to be clearly the best that we can do under our
current model.

Q16:

-|     predicates: p_brand != 'Brand#45', NOT p_type LIKE 'MEDIUM POLISHED%', p_size IN (49,
14, 23, 45, 19, 3, 36, 9)
+|     predicates: p_size IN (49, 14, 23, 45, 19, 3, 36, 9), p_brand != 'Brand#45', NOT p_type
LIKE 'MEDIUM POLISHED%'

We've discussed this one a lot, but to sum up: we know that the IN is highly selective and
less expensive, so putting it in front is what we want.

Q17:

-|  |     predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
+|  |     predicates: p_container = 'MED BOX', p_brand = 'Brand#23'

As two string comparisons of similar length, we're estimating their costs as about the same,
but we (correctly) estimate that "MED BOX" is more selective, so we want that one first.

There are a few other tpch plans that have multiple conjuncts under the same plan node where
we're not making any changes to the order, but none that look obviously like they would benefit
from being reordered (most have just two conjuncts that are eit\
her both string comparisons or both numeric comparisons), so I'm reasonably confident this
shows we're making good ordering decisions, at least for tpch. I'll have new results up for
tpcds soon.

http://gerrit.cloudera.org:8080/#/c/2598/15/testdata/workloads/functional-planner/queries/PlannerTest/joins.test
File testdata/workloads/functional-planner/queries/PlannerTest/joins.test:

Line 1648: |  hash predicates: string_col = string_col, bigint_col = bigint_col
> what happened?
Originally, we had just hard coded the idea that string comparisons are more expensive than
numeric comparisons (with VAR_LENGTH_BINARY_PRED_COST), and we don't have selectivity for
either, so we had reordered these.

All of the strings in both string_cols here are of length 1, though, so when I moved to basing
string comparison cost off of string length, the cost of the string_col comparison went down
a lot, which is reasonable - comparing two strings of length 1 is roughly the same cost as
comparing two integers.

However, we're actually estimating the cost of the string comparison here as 3x less than
the int comparison, which is probably not right and points to a flaw in my string comparison
cost calculation - I'm only using the string lengths, and not adding in the child costs as
I do everywhere else, which is wrong (eg. because a string in a comparison might actually
be produced by a complicated subexpression and we shouldn't drop the cost of that subexpression),
so I'll fix this and submit a new version.


-- 
To view, visit http://gerrit.cloudera.org:8080/2598
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message