drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-1162) 25 way join ended up with OOM
Date Sun, 10 Sep 2017 06:04:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160235#comment-16160235

ASF GitHub Bot commented on DRILL-1162:

Github user jinfengni commented on the issue:

    This OOM problem exposes two problems. The first one is in planning time, where we choose
a sub-optimal plan, due to the inaccurate estimation of row count because of missing of appropriate
statistics. The second one is in execution time, where we may need understand whether Drill
uses too much memory and whether spill to disk is an option. I think the two are complementary
to each other; even when we have spill to disk for hash join, if planner choose a sub-optimal
plan, the query still could take long, long time to complete.
    Looks like the PR is addressing the 1st issue. I agree that the root cause is row count
estimation, which is more appropriate to defer to the enhancement of statistics support. For
swapJoin logic, the proposal of getMaxRowCount() seems to be in the line of adjusting row
count estimation. I like better the idea of combining row count + column count, which was
essentially adopted in swapInput() by LoptJoinOptmizeRule.
    For HashTable build side cost, hash table only has to hold the join key. However, since
hash join is a blocking operator, it has hold all the records in the build side, meaning total
memory requirement (for both hash table + non-join key columns) depends on row count and column
count. Therefore, the cost model of hash join should reflect that. Can we use similar idea
in SwapHashJoinVisitor?
    One further improvement would be to modify HashJoinPrel.computeSelfCost(). Today we only
consider join key width, and it makes sense to adjust that logic, by considering the total
column counts in build side. Such logic could be extracted into a common places, then SwapHashJoinVisitor
could call the same shared logic to decide whether it's cost-wise optimal to swap the input
sides. Thoughts?

> 25 way join ended up with OOM
> -----------------------------
>                 Key: DRILL-1162
>                 URL: https://issues.apache.org/jira/browse/DRILL-1162
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow, Query Planning & Optimization
>            Reporter: Rahul Challapalli
>            Assignee: Volodymyr Vysotskyi
>            Priority: Critical
>             Fix For: Future
>         Attachments: error.log, oom_error.log
> git.commit.id.abbrev=e5c2da0
> The below query results in 0 results being returned 
> {code:sql}
> select count(*) from `lineitem1.parquet` a 
> inner join `part.parquet` j on a.l_partkey = j.p_partkey 
> inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey 
> inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey 
> inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey = m.ps_suppkey

> inner join `customer.parquet` n on k.o_custkey = n.c_custkey 
> inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey 
> inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey 
> inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey 
> inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice 
> inner join `lineitem2.parquet` f on a.l_comment = f.l_comment 
> inner join `lineitem2.parquet` g on a.l_shipdate = g.l_shipdate 
> inner join `lineitem2.parquet` h on a.l_commitdate = h.l_commitdate 
> inner join `lineitem2.parquet` i on a.l_receiptdate = i.l_receiptdate 
> inner join `lineitem2.parquet` o on a.l_receiptdate = o.l_receiptdate 
> inner join `lineitem2.parquet` p on a.l_receiptdate = p.l_receiptdate 
> inner join `lineitem2.parquet` q on a.l_receiptdate = q.l_receiptdate 
> inner join `lineitem2.parquet` r on a.l_receiptdate = r.l_receiptdate 
> inner join `lineitem2.parquet` s on a.l_receiptdate = s.l_receiptdate 
> inner join `lineitem2.parquet` t on a.l_receiptdate = t.l_receiptdate 
> inner join `lineitem2.parquet` u on a.l_receiptdate = u.l_receiptdate 
> inner join `lineitem2.parquet` v on a.l_receiptdate = v.l_receiptdate 
> inner join `lineitem2.parquet` w on a.l_receiptdate = w.l_receiptdate 
> inner join `lineitem2.parquet` x on a.l_receiptdate = x.l_receiptdate;
> {code}
> However when we remove the last 'inner join' and run the query it returns '716372534'.
Since the last inner join is similar to the one's before it, it should match some records
and return the data appropriately.
> The logs indicated that it actually returned 0 results. Attached the log file.

This message was sent by Atlassian JIRA

View raw message