drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jinfengni <...@git.apache.org>
Subject [GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...
Date Tue, 12 Sep 2017 18:56:23 GMT
Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @vvysotskyi , what I'm thinking is not just comparing the ratio of column count vs ratio
of row count.  
    
    Let's take a step back. This SwapHashJoinVisitor is trying to correct the join decision
made by either VolcanoPlanner or HepPlanner, in the sense the original join order might need
put bigger dataset on the build side. The swap is needed because we want to avoid high memory
requirement on hash join operator. The question is how do we define "big dataset". For now,
SwapHashJoinVisitor simply uses rowCount, which is not sufficient. 
    
    In stead, we probably should add a method `getSelfMemCost` to all `Prel` node. For non-blocking
operator, it's simply returning either 0 or some constant (to hold one single batch). For
non-blocking operator such as HashJoin, it will return a value proportional to rowCount X
columnCount (more precisely, total number of bytes per row, considering different column data
type). 
    
    Same as existing method of `computeSelfCost`, we need `getCumulativeMemCost` which will
return the cumulative cost for child nodes rooted at one `Prel` node.  With this `getSelfMemCost`
and `getCumulativeMemCost` defined for HashJoin, and a HashJoin with input1, input2 as inputs,
we could estimate cumulative memory cost for HashJoin(input1, input2), and HashJoin(input2,
input1), and use that as criteria to decide whether we have to switch them. 
    
    This idea is not trying to adjust the row count estimation. In stead, it's trying to change
the criteria where we may think it's necessary to swap, based on the observation that we want
to do swap only when we want to reduce memory requirement for a query. 
    
    Will the above idea work? If it could not address this issue, it's probably fine to go
with what you proposed. Before we go that option, please give some thoughts about the above
idea. 
    



---

Mime
View raw message