drill-issues mailing list archives

From "Gautam Kumar Parai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-4743) HashJoin's not fully parallelized in query plan
Date Thu, 28 Jul 2016 00:28:20 GMT

     [ https://issues.apache.org/jira/browse/DRILL-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gautam Kumar Parai updated DRILL-4743:
--------------------------------------
    Description: 
The underlying problem is that filter selectivity is underestimated for queries with complicated
predicates, e.g. deeply nested AND/OR predicates. This leads to under-parallelization of the major
fragment doing the join.

To really resolve this problem, we need table/column statistics to estimate the selectivity correctly.
However, when statistics are absent, or when the existing statistics are insufficient to produce
a correct selectivity estimate, this change will serve as a workaround.

For now, the fix is to provide options that control the lower and upper bounds of the filter
selectivity estimate. The user can set the following options. Each can be varied between
0 and 1, and the min selectivity must always be less than or equal to the max selectivity.
{code}planner.filter.min_selectivity_estimate_factor 
planner.filter.max_selectivity_estimate_factor 
{code} 
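
As a rough illustration of how such bounds could act on an estimate, the sketch below clamps a raw selectivity between a lower and an upper bound. It is not Drill's planner code: the function name clamp_selectivity and the sample values are assumptions made for this example, and only the two option names come from the description above.

```python
def clamp_selectivity(raw, min_factor, max_factor):
    """Bound a raw filter selectivity estimate to [min_factor, max_factor].

    Hypothetical helper for illustration only; it is not part of Drill.
    min_factor and max_factor stand in for
    planner.filter.min_selectivity_estimate_factor and
    planner.filter.max_selectivity_estimate_factor.
    """
    # Both bounds must lie in [0, 1], with min <= max.
    if not 0.0 <= min_factor <= max_factor <= 1.0:
        raise ValueError("need 0 <= min_factor <= max_factor <= 1")
    return max(min_factor, min(raw, max_factor))


# A deeply nested predicate might yield a very small raw estimate (assumed
# value here); the lower bound keeps the estimate from dropping below it.
bounded = clamp_selectivity(raw=0.001, min_factor=0.5, max_factor=1.0)
```

With the assumed values, the raw estimate is raised to the lower bound; an estimate already inside the bounds is left unchanged.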

When a query is run with 'explain plan including all attributes for ', the estimated ROWCOUNT
of the FILTER operator should be capped according to these options. The estimated ROWCOUNT of
downstream operators is not directly controlled by these options. However, it may change as a
result of dependencies between operators. The FILTER operator only operates on the input of its
immediate upstream operator (e.g. SCAN, AGG). If two different filters are present in the same
plan, they might have different selectivities based on their immediate upstream operator's ROWCOUNT.
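
To make the dependence on the immediate upstream operator concrete, here is a minimal sketch that assumes a filter's estimated ROWCOUNT is its input ROWCOUNT multiplied by the bounded selectivity. That formula, the function name estimate_filter_rowcount, and all numbers are assumptions for illustration, not a statement of how Drill computes its estimates.

```python
def estimate_filter_rowcount(upstream_rowcount, raw_selectivity,
                             min_factor, max_factor):
    """Estimate a filter's output rows from its immediate input only.

    Hypothetical model: output rows = input rows * bounded selectivity.
    """
    # Bound the raw selectivity to [min_factor, max_factor].
    selectivity = max(min_factor, min(raw_selectivity, max_factor))
    return upstream_rowcount * selectivity


# Two filters in one plan, each fed by a different upstream operator
# (assumed input row counts).
scan_rows = 1_000_000.0  # filter sitting directly above a SCAN
agg_rows = 5_000.0       # filter sitting above an AGG

filter_over_scan = estimate_filter_rowcount(scan_rows, 0.001, 0.5, 1.0)
filter_over_agg = estimate_filter_rowcount(agg_rows, 0.001, 0.5, 1.0)
```

Under this model the same option values give each filter its own estimate, because each one sees only the row count of the operator directly beneath it.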

  was:
The underlying problem is filter selectivity under-estimate for a query with complicated predicates
e.g. deeply nested and/or predicates. This leads to under parallelization of the major fragment
doing the join. 

To really resolve this problem we need table/column statistics to correctly estimate the selectivity.
However, in the absence of statistics OR even when existing statistics are insufficient to
get a correct estimate of selectivity this will serve as a workaround.

For now, the fix is to provide options for controlling the lower and upper bounds for filter
selectivity. The user can use the following options. The selectivity can be varied between
0 and 1 with min selectivity always less than or equal to max selectivity.
{code}planner.filter.min_selectivity_estimate_factor 
planner.filter.max_selectivity_estimate_factor 
{code} 

When using 'explain plan including all attributes for ' it should cap the estimated ROWCOUNT
based on these options. Estimated ROWCOUNT of operators downstream is not directly controlled
by these options. However, they may change as a result of dependency between different operators.


> HashJoin's not fully parallelized in query plan
> -----------------------------------------------
>
>                 Key: DRILL-4743
>                 URL: https://issues.apache.org/jira/browse/DRILL-4743
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Gautam Kumar Parai
>            Assignee: Gautam Kumar Parai
>              Labels: doc-impacting
>             Fix For: 1.8.0
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
