drill-dev mailing list archives

From Jinfeng Ni <jinfengn...@gmail.com>
Subject Re: OOM : Direct buffer memory
Date Sat, 14 May 2016 00:07:35 GMT
When a drillbit runs into an OOM and has not yet recovered from it
(that is, released the allocated memory), I don't see how that drillbit
could continue to serve the other concurrent queries, unless those
queries do not need any direct memory at all. On the other hand,
if the cause of a query failure is not an OOM, it makes sense not to let
that failure fail all the other running queries.
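
To make the accounting argument in this thread concrete, here is a minimal sketch of a root allocator with per-query child allocators. The names (RootAllocator, ChildAllocator, OutOfMemory, demo) and the numbers are hypothetical and are not Drill's actual classes or limits. It only models limit checks; it does not model Netty's internal fragmentation or allocations that bypass accounting, both of which are raised in the quoted replies below.

```python
class OutOfMemory(Exception):
    """Stand-in for an out-of-memory failure in this toy model."""


class RootAllocator:
    """Hypothetical root allocator that owns the total direct-memory budget."""

    def __init__(self, total_limit):
        self.total_limit = total_limit
        self.allocated = 0

    def child(self, limit):
        # Each query gets its own child allocator with its own limit.
        return ChildAllocator(self, limit)


class ChildAllocator:
    """Hypothetical per-query allocator."""

    def __init__(self, root, limit):
        self.root = root
        self.limit = limit
        self.allocated = 0

    def allocate(self, size):
        # Per-query check: the query must stay within its own limit.
        if self.allocated + size > self.limit:
            raise OutOfMemory("query limit exceeded")
        # Global check: per-query limits may sum to more than the total,
        # so the root has to be consulted on every allocation.
        if self.root.allocated + size > self.root.total_limit:
            raise OutOfMemory("total direct memory exceeded")
        self.allocated += size
        self.root.allocated += size

    def release_all(self):
        # Return everything this query holds to the root budget.
        self.root.allocated -= self.allocated
        self.allocated = 0


def demo():
    root = RootAllocator(total_limit=100)
    q1 = root.child(limit=80)
    q2 = root.child(limit=80)  # 80 + 80 exceeds the total of 100

    q1.allocate(70)
    try:
        q2.allocate(40)  # within q2's limit, but not within the total
        first_attempt = "ok"
    except OutOfMemory as exc:
        first_attempt = str(exc)

    q1.release_all()  # memory is recovered
    q2.allocate(40)   # the same request now succeeds
    return first_attempt, root.allocated
```

In this model, a second query that needs memory cannot proceed while the first still holds its allocation, and the same request goes through once that memory is released, which matches the observation that the failing queries ran successfully afterwards.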



On Fri, May 13, 2016 at 4:53 PM, rahul challapalli
<challapallirahul@gmail.com> wrote:
> @jinfeng & @hakim
>
> For (1) I will raise a jira.
>
> For (2) I am arguing that we shouldn't fail the other concurrent queries
> when one query hits an OOM, especially when the fragments related to other
> queries themselves succeeded (need to check on this). We should handle the
> OOM case in a better way where we do not end up closing the channel.
> Thoughts?
>
> - Rahul
>
> On Fri, May 13, 2016 at 4:35 PM, Abdel Hakim Deneche <adeneche@maprtech.com>
> wrote:
>
>> 1. you are right, the root allocator should prevent an allocation that
>> exceeds the total available memory, but I'm not sure if all allocations in
>> the rpc layer go through Drill's accountor. Also Netty internal
>> fragmentation could cause this issue even though we are still below our
>> memory limit.
>> 2. unfortunately, when a channel is closed we don't get back most of the
>> acknowledgements for the messages that were sent through that channel, and
>> we are forced to fail any query that's still waiting for an ack from that
>> channel. The more queries are running in parallel, the more chances a large
>> number of them will be affected by this.
>>
>> On Fri, May 13, 2016 at 4:22 PM, rahul challapalli <
>> challapallirahul@gmail.com> wrote:
>>
>> > 1. This looks like a bug with the allocator unless there is a reason for
>> > not enforcing a limit (total direct memory available) on the memory
>> > allocated to all the fragments.
>> > 2. This looks like a bigger problem as we are unnecessarily failing all the
>> > other queries as a result of one fragment causing OOM. It makes sense if
>> > the drillbit was unresponsive after a fragment hit an OOM. But I was able
>> > to connect to that specific drillbit after the failures and ran the same
>> > failing queries successfully.
>> >
>> > - Rahul
>> >
>> > On Fri, May 13, 2016 at 4:06 PM, Abdel Hakim Deneche <
>> > adeneche@maprtech.com>
>> > wrote:
>> >
>> > > 1. you are getting this error because the Drillbit is running out of direct
>> > > memory. It's thrown by Netty when it couldn't allocate a new chunk of
>> > > direct memory from the system. I know for each query, the allocator will
>> > > enforce the query's limit. But I'm not sure we actually properly compute
>> > > those limits to not exceed the total direct memory limit.
>> > > 2. when we hit a channel closed exception, all fragments that were
>> > > transmitting on that channel will most likely fail even though they didn't
>> > > run out of memory. It's hard to tell where the memory went without more
>> > > information about the queries you were trying to run
>> > >
>> > > On Fri, May 13, 2016 at 3:45 PM, rahul challapalli <
>> > > challapallirahul@gmail.com> wrote:
>> > >
>> > > > Drillers,
>> > > >
>> > > > I was executing 20 queries using 10 concurrent clients on an 8 node
>> > > > cluster. First 10 queries succeed and the remaining 10 queries fail with
>> > > > "ChannelClosedException". The logs suggested that all the fragments running
>> > > > on one node hit an "java.lang.OutOfMemoryError: Direct buffer memory". 2
>> > > > questions here.
>> > > >    1. Can someone explain why we are even seeing this error. Shouldn't the
>> > > > allocator detect this condition upfront?
>> > > >    2. Why did all the fragments fail. Where did the memory go?
>> > > >
>> > > > - Rahul
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > > Abdelhakim Deneche
>> > >
>> > > Software Engineer
>> > >
>> > >   <http://www.mapr.com/>
>> > >
