[ https://issues.apache.org/jira/browse/DRILL5254?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Paul Rogers updated DRILL5254:

Description:
Drill uses Calcite for query parsing and optimization. Drill uses Calcite's default selectivity
(reduction factor) rules to compute the number of rows removed by a filter.
The default rules appear to be overly aggressive in estimating reductions. In a production
use case, an input with 4 billion rows was estimated to return just 40K rows from a filter.
That is, the filter estimated a 1/1,000,000 reduction in rows. As it turns out, the actual
reduction was closer to 1/2.
The result was that the planner compared the expected 40K rows against another input of 2.5
million rows, and decided the 40K rows would be best on the build side of a hash join. When
confronted with the actual 3 billion rows, the hash join ran out of memory.
The moral of the story is that, in Drill, it is worth being conservative when planning for
memoryintensive operations.
The (sanitized) filter is the following, annotated with (a guess at) the default reduction
factors in each term:
{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
'Value5','Value6','Value7','Value8','Value9')  25%
AND col2_i <=3  25%
AND col3_s1 = 'Y'  15%
AND col4_s1 = 'Y'  15%
AND col5_s6 not like '%str1%'  25%
AND col5_s6 not like '%str2%'  25%
AND col5_s6 not like '%str3%'  25%
AND col5_s6 not like '%str4%'  25%
{code}
Total reduction is something like:
{code}
.25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
{code}
Filter estimation is a known hard problem. In general, one needs statistics and other data,
and even then the estimates are just guesses.
Still it is possible to ensure that the defaults are at least unbiased. That is if we assume
that the probability of A LIKE B being 25%, then the probability of A NOT LIKE B should be
75%, not also 25%.
This JIRA suggests creating an experimental set of defaults based on the "core" Calcite defaults,
but with other reduction factors derived using the laws of probability. In particular:
 Operator  Revised  Explanation  Calcite Default
 =  0.15  Default in Calcite  0.15
 <>  0.85  1  p(=)  0.5
 <  0.425  p(<>) / 2  0.5
 >  0.425  p(<>) / 2  0.5
 <=  0.575  p(<) + p(=)  0.5
 >=  0.575  p(>) + p(=)  0.5
 LIKE  0.25  Default in Calcite  0.25
 NOT LIKE  0.75  1  p(LIKE)  0.25
 NOT NULL  0.90  Default in Calcite  0.90
 IS NULL  0.10  1  p(NOT NULL)  0.25
 IS TRUE  0.5  1 / 2  0.25
 IS FALSE  0.5  1 / 2  0.25
 IS NOT TRUE  0.55  1  p(IS TRUE)  p(IS NULL)  0.25
 IS NOT FALSE  0.55  1  p(IS FALSE)  p(IS NULL)  0.25
 A OR B  Varies  min(p(A) + p(B)  p(A ^ B), 0.5)  0.5
 IN (a)  0.15  p(=)  0.25
 x IN (a, b, c, ...)  Varies  p(x = a v x = b v x = c v ...)  0.25
 NOT A  Varies  1  p(A)  0.25
 BETWEEN a AND B  0.28  p(<=) / 2  0.25
The Calcite defaults were identified by inspection and verified by tests.
The probability of the IS NOT TRUE statement assumes the presence of nulls, while IS TRUE
does not. The rule for OR caps the reduction factor at 0.5 per standard practice.
With the revised rules, the example WHERE reduction becomes:
{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
'Value5','Value6','Value7','Value8','Value9')  50%
AND col2_i <=3  57%
AND col3_s1 = 'Y'  15%
AND col4_s1 = 'Y'  15%
AND col5_s6 not like '%str1%'  85%
AND col5_s6 not like '%str2%'  85%
AND col5_s6 not like '%str3%'  85%
AND col5_s6 not like '%str4%'  85%
.5 * .57 * .15^2 * .85^4 = 0.003
{code}
The new rules are not a panacea: they are still just guesses. However, they are unbiased guesses
based on the rules of probability which result in more conservative reductions of filters.
The result may be better plans in queries with large conjunctions (large number of expressions
AND'ed together.)
was:
Drill uses Calcite for query parsing and optimization. Drill uses Calcite's default selectivity
(reduction factor) rules to compute the number of rows removed by a filter.
The default rules appear to be overly aggressive in estimating reductions. In a production
use case, an input with 4 billion rows was estimated to return just 40K rows from a filter.
That is, the filter estimated a 1/1,000,000 reduction in rows. As it turns out, the actual
reduction was closer to 1/2.
The result was that the planner compared the expected 40K rows against another input of 2.5
million rows, and decided the 40K rows would be best on the build side of a hash join. When
confronted with the actual 3 billion rows, the hash join ran out of memory.
The moral of the story is that, in Drill, it is worth being conservative when planning for
memoryintensive operations.
The (sanitized) filter is the following, annotated with (a guess at) the default reduction
factors in each term:
{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
'Value5','Value6','Value7','Value8','Value9')  25%
AND col2_i <=3  25%
AND col3_s1 = 'Y'  15%
AND col4_s1 = 'Y'  15%
AND col5_s6 not like '%str1%'  25%
AND col5_s6 not like '%str2%'  25%
AND col5_s6 not like '%str3%'  25%
AND col5_s6 not like '%str4%'  25%
{code}
Total reduction is something like:
{code}
.25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
{code}
Filter estimation is a known hard problem. In general, one needs statistics and other data,
and even then the estimates are just guesses.
Still it is possible to ensure that the defaults are at least unbiased. That is if we assume
that the probability of A LIKE B being 25%, then the probability of A NOT LIKE B should be
75%, not also 25%.
This JIRA suggests creating an experimental set of defaults based on the "core" Calcite defaults,
but with other reduction factors derived using the laws of probability. In particular:
 Operator  Revised  Explanation  Calcite Default
 =  0.15  Default in Calcite  0.15
 <>  0.85  1  p(=)  0.5
 <  0.425  p(<>) / 2  0.5
 >  0.425  p(<>) / 2  0.5
 <=  0.575  p(<) + p(=)  0.5
 >=  0.575  p(>) + p(=)  0.5
 LIKE  0.25  Default in Calcite  0.25
 NOT LIKE  0.75  1  p(LIKE)  0.25
 NOT NULL  0.90  Default in Calcite  0.90
 IS NULL  0.10  1  p(NOT NULL)  0.25
 IS TRUE  0.5  1 / 2  0.25
 IS FALSE  0.5  1 / 2  0.25
 IS NOT TRUE  0.55  1  p(IS TRUE)  p(IS NULL)  0.25
 IS NOT FALSE  0.55  1  p(IS FALSE)  p(IS NULL)  0.25
 A OR B  Varies  min(p(A) + p(B)  p(A ^ B), 0.5)  0.5
 IN (a)  0.15  p(=)  0.25
 x IN (a, b, c, ...)  Varies  p(x = a v x = b v x = c v ...)  0.5
 NOT A  Varies  1  p(A)  0.25
 BETWEEN a AND B  0.28  p(<=) / 2  0.25
The Calcite defaults were identified by inspection and verified by tests.
The probability of the IS NOT TRUE statement assumes the presence of nulls, while IS TRUE
does not. The rule for OR caps the reduction factor at 0.5 per standard practice.
With the revised rules, the example WHERE reduction becomes:
{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
'Value5','Value6','Value7','Value8','Value9')  50%
AND col2_i <=3  57%
AND col3_s1 = 'Y'  15%
AND col4_s1 = 'Y'  15%
AND col5_s6 not like '%str1%'  85%
AND col5_s6 not like '%str2%'  85%
AND col5_s6 not like '%str3%'  85%
AND col5_s6 not like '%str4%'  85%
.5 * .57 * .15^2 * .85^4 = 0.003
{code}
The new rules are not a panacea: they are still just guesses. However, they are unbiased guesses
based on the rules of probability which result in more conservative reductions of filters.
The result may be better plans in queries with large conjunctions (large number of expressions
AND'ed together.)
> Enhance default reduction factors in optimizer
> 
>
> Key: DRILL5254
> URL: https://issues.apache.org/jira/browse/DRILL5254
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.10
>
>
> Drill uses Calcite for query parsing and optimization. Drill uses Calcite's default selectivity
(reduction factor) rules to compute the number of rows removed by a filter.
> The default rules appear to be overly aggressive in estimating reductions. In a production
use case, an input with 4 billion rows was estimated to return just 40K rows from a filter.
That is, the filter estimated a 1/1,000,000 reduction in rows. As it turns out, the actual
reduction was closer to 1/2.
> The result was that the planner compared the expected 40K rows against another input
of 2.5 million rows, and decided the 40K rows would be best on the build side of a hash join.
When confronted with the actual 3 billion rows, the hash join ran out of memory.
> The moral of the story is that, in Drill, it is worth being conservative when planning
for memoryintensive operations.
> The (sanitized) filter is the following, annotated with (a guess at) the default reduction
factors in each term:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
> 'Value5','Value6','Value7','Value8','Value9')  25%
> AND col2_i <=3  25%
> AND col3_s1 = 'Y'  15%
> AND col4_s1 = 'Y'  15%
> AND col5_s6 not like '%str1%'  25%
> AND col5_s6 not like '%str2%'  25%
> AND col5_s6 not like '%str3%'  25%
> AND col5_s6 not like '%str4%'  25%
> {code}
> Total reduction is something like:
> {code}
> .25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
> {code}
> Filter estimation is a known hard problem. In general, one needs statistics and other
data, and even then the estimates are just guesses.
> Still it is possible to ensure that the defaults are at least unbiased. That is if we
assume that the probability of A LIKE B being 25%, then the probability of A NOT LIKE B should
be 75%, not also 25%.
> This JIRA suggests creating an experimental set of defaults based on the "core" Calcite
defaults, but with other reduction factors derived using the laws of probability. In particular:
>  Operator  Revised  Explanation  Calcite Default
>  =  0.15  Default in Calcite  0.15
>  <>  0.85  1  p(=)  0.5
>  <  0.425  p(<>) / 2  0.5
>  >  0.425  p(<>) / 2  0.5
>  <=  0.575  p(<) + p(=)  0.5
>  >=  0.575  p(>) + p(=)  0.5
>  LIKE  0.25  Default in Calcite  0.25
>  NOT LIKE  0.75  1  p(LIKE)  0.25
>  NOT NULL  0.90  Default in Calcite  0.90
>  IS NULL  0.10  1  p(NOT NULL)  0.25
>  IS TRUE  0.5  1 / 2  0.25
>  IS FALSE  0.5  1 / 2  0.25
>  IS NOT TRUE  0.55  1  p(IS TRUE)  p(IS NULL)  0.25
>  IS NOT FALSE  0.55  1  p(IS FALSE)  p(IS NULL)  0.25
>  A OR B  Varies  min(p(A) + p(B)  p(A ^ B), 0.5)  0.5
>  IN (a)  0.15  p(=)  0.25
>  x IN (a, b, c, ...)  Varies  p(x = a v x = b v x = c v ...)  0.25
>  NOT A  Varies  1  p(A)  0.25
>  BETWEEN a AND B  0.28  p(<=) / 2  0.25
> The Calcite defaults were identified by inspection and verified by tests.
> The probability of the IS NOT TRUE statement assumes the presence of nulls, while IS
TRUE does not. The rule for OR caps the reduction factor at 0.5 per standard practice.
> With the revised rules, the example WHERE reduction becomes:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
> 'Value5','Value6','Value7','Value8','Value9')  50%
> AND col2_i <=3  57%
> AND col3_s1 = 'Y'  15%
> AND col4_s1 = 'Y'  15%
> AND col5_s6 not like '%str1%'  85%
> AND col5_s6 not like '%str2%'  85%
> AND col5_s6 not like '%str3%'  85%
> AND col5_s6 not like '%str4%'  85%
> .5 * .57 * .15^2 * .85^4 = 0.003
> {code}
> The new rules are not a panacea: they are still just guesses. However, they are unbiased
guesses based on the rules of probability which result in more conservative reductions of
filters. The result may be better plans in queries with large conjunctions (large number of
expressions AND'ed together.)

This message was sent by Atlassian JIRA
(v6.3.15#6346)
