drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-1162) 25 way join ended up with OOM
Date Tue, 08 Aug 2017 17:03:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118636#comment-16118636

Paul Rogers commented on DRILL-1162:

The challenge here is that we are looking for a quick bug fix solution to a problem which
appears to be quite difficult. So, there is no good answer.

On the one hand, it seems that the planner cannot correctly predict the size of the two sides
of the join. I am leery of a simple solution that just flips the sides. That is likely to
fix the current issue, but break many queries that now work. This is true because, here, we
are only worried about this one query. But, Drill has to handle all queries, even those that
this ticket is not concerned with. Flipping the sides is likely to cause regressions, which
will cause us to back out the change, and put us back where we started.

Is there a principled way to make the decision to flip? Perhaps based on the analysis above
about the effect of cascaded joins?

Second, none of this addresses the real issue: that the hash join operator uses too much memory
(heap in one scenario, direct in another.) There is no analysis of why we exhaust each resource,
so it is not possible to identify if any particular hack is likely to solve the issue.

If direct memory exhaustion is caused by excessive hash join table size, then spill-to-disk
may solve it. But, if we are building large tables unnecessarily (we've seen this in other
cases), then smarter planning rules, or better run-time adjustment, may be needed.

If the heap is exhausted, then we have no understanding of why that should be so. What is
using heap? The hash tables use direct memory. Do we understand why heap was exhausted?

Without understanding these fundamentals, we are only hacking and, IMHO, one hack is as good
as another; they are just random shots in the dark. If we have very limited time to fix a
deep issue, then hacking is all we can do, of course.

> 25 way join ended up with OOM
> -----------------------------
>                 Key: DRILL-1162
>                 URL: https://issues.apache.org/jira/browse/DRILL-1162
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow, Query Planning & Optimization
>            Reporter: Rahul Challapalli
>            Assignee: Volodymyr Vysotskyi
>            Priority: Critical
>             Fix For: Future
>         Attachments: error.log, oom_error.log
> git.commit.id.abbrev=e5c2da0
> The below query results in 0 results being returned 
> {code:sql}
> select count(*) from `lineitem1.parquet` a 
> inner join `part.parquet` j on a.l_partkey = j.p_partkey 
> inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey 
> inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey 
> inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey = m.ps_suppkey

> inner join `customer.parquet` n on k.o_custkey = n.c_custkey 
> inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey 
> inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey 
> inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey 
> inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice 
> inner join `lineitem2.parquet` f on a.l_comment = f.l_comment 
> inner join `lineitem2.parquet` g on a.l_shipdate = g.l_shipdate 
> inner join `lineitem2.parquet` h on a.l_commitdate = h.l_commitdate 
> inner join `lineitem2.parquet` i on a.l_receiptdate = i.l_receiptdate 
> inner join `lineitem2.parquet` o on a.l_receiptdate = o.l_receiptdate 
> inner join `lineitem2.parquet` p on a.l_receiptdate = p.l_receiptdate 
> inner join `lineitem2.parquet` q on a.l_receiptdate = q.l_receiptdate 
> inner join `lineitem2.parquet` r on a.l_receiptdate = r.l_receiptdate 
> inner join `lineitem2.parquet` s on a.l_receiptdate = s.l_receiptdate 
> inner join `lineitem2.parquet` t on a.l_receiptdate = t.l_receiptdate 
> inner join `lineitem2.parquet` u on a.l_receiptdate = u.l_receiptdate 
> inner join `lineitem2.parquet` v on a.l_receiptdate = v.l_receiptdate 
> inner join `lineitem2.parquet` w on a.l_receiptdate = w.l_receiptdate 
> inner join `lineitem2.parquet` x on a.l_receiptdate = x.l_receiptdate;
> {code}
> However when we remove the last 'inner join' and run the query it returns '716372534'.
Since the last inner join is similar to the one's before it, it should match some records
and return the data appropriately.
> The logs indicated that it actually returned 0 results. Attached the log file.

This message was sent by Atlassian JIRA

View raw message