drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4188) Change the default value of planner.enable_hash_single_key to false
Date Fri, 11 Dec 2015 21:40:46 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053615#comment-15053615 ]

Jacques Nadeau commented on DRILL-4188:

What do you think about trying to enhance the costing to address this rather than disabling the
plan? For example, we could do basic cardinality estimates/approximations depending on the
number of columns included in the aggregation. That way we still get the benefits in the scenarios
you described at the end of your description without having the problems described in the
first part.
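
To make the suggestion concrete, here is a rough sketch of what a column-count-based
approximation could look like. This is not Drill's costing code: the function names, the
ndv_fraction default, and the skew penalty formula are all hypothetical and chosen only for
illustration. The idea is that, with no column statistics, a key built from more columns is
assumed to have more distinct values (capped at the row count), and a distribution key with
fewer distinct values is charged a small skew penalty so that plans no longer tie on cost.

```python
def estimate_distinct_keys(row_count, num_key_columns, ndv_fraction=0.1):
    """Guess the number of distinct values of a composite key without column stats.

    Assumption (hypothetical): each column has about row_count * ndv_fraction
    distinct values, and columns combine multiplicatively, capped at row_count.
    """
    if row_count <= 0 or num_key_columns < 1:
        raise ValueError("row_count must be positive and num_key_columns >= 1")
    per_column_ndv = max(1.0, row_count * ndv_fraction)
    return min(float(row_count), per_column_ndv ** num_key_columns)


def hash_distribution_cost(row_count, num_key_columns, num_fragments):
    """Per-fragment row cost of a hash distribution, inflated by a skew penalty.

    The penalty grows as the estimated number of distinct keys shrinks relative
    to the number of fragments. The formula is illustrative only.
    """
    ndv = estimate_distinct_keys(row_count, num_key_columns)
    base_cost = row_count / num_fragments
    skew_penalty = 1.0 + num_fragments / ndv
    return base_cost * skew_penalty


# Compare distributing on one join column versus both join columns.
single_key_cost = hash_distribution_cost(row_count=1000, num_key_columns=1, num_fragments=10)
all_keys_cost = hash_distribution_cost(row_count=1000, num_key_columns=2, num_fragments=10)
print("single column:", single_key_cost, "all columns:", all_keys_cost)
```

Under these assumptions the all-columns plan costs slightly less than any single-column plan,
which breaks the tie, while a single-column plan could still win when the child is already
distributed on that column and the redistribution cost is avoided.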

> Change the default value of planner.enable_hash_single_key to false
> -------------------------------------------------------------------
>                 Key: DRILL-4188
>                 URL: https://issues.apache.org/jira/browse/DRILL-4188
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
> The planner.enable_hash_single_key flag is used by the HashJoin and MergeJoin plans to
do hash distribution on both sides of the join when it is a multi-column join (e.g., T1.a1 =
T2.a2 AND T1.b1 = T2.b2). The default value of this parameter is True, which means that
Drill will generate multiple plans, each with hash distribution on only one column. The final
plan is chosen based on costing.
> However, due to the lack of column statistics, this approach is problematic: if all plans
cost the same, we could end up picking the first column for hash distribution, and if this
column has a low number of distinct values, there could be substantial skew in the distribution.
> Doing the hash distribution on all columns should be the default, so I propose changing
planner.enable_hash_single_key to False. The scenario where we might still want single-column
hash distribution is when the join is done after some other operation (e.g., window function,
grouped aggregation) where the child already does a hash distribution on one column that is
part of the join. For those cases, however, we may want to selectively enable this flag.
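
The skew concern in the description can be illustrated with a small simulation. This is only
a sketch and does not use Drill's hash function or exchange operators: the fragment_for and
max_fragment_load helpers, the CRC32-based hashing, and the synthetic rows are assumptions
made for the example. Column a1 has two distinct values and column b1 has 1000, so hashing on
a1 alone can feed at most two fragments, while hashing on (a1, b1) spreads rows across all of
them.

```python
import zlib
from collections import Counter


def fragment_for(key, num_fragments):
    """Map a distribution key to a fragment using a deterministic hash (CRC32 here)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_fragments


def max_fragment_load(rows, key_columns, num_fragments):
    """Return the row count of the busiest fragment for the given distribution key."""
    loads = Counter(
        fragment_for(tuple(row[col] for col in key_columns), num_fragments)
        for row in rows
    )
    return max(loads.values())


# Synthetic data: a1 has 2 distinct values, b1 has 1000 distinct values.
rows = [{"a1": i % 2, "b1": i % 1000} for i in range(10_000)]

single_column_load = max_fragment_load(rows, ["a1"], num_fragments=8)
all_columns_load = max_fragment_load(rows, ["a1", "b1"], num_fragments=8)
print("busiest fragment, hash on a1:", single_column_load)
print("busiest fragment, hash on (a1, b1):", all_columns_load)
```

With only two distinct a1 values, the busiest fragment receives at least half of the rows no
matter how many fragments exist, which is the skew that distributing on all join columns is
meant to avoid.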

This message was sent by Atlassian JIRA