ignite-dev mailing list archives

From Dmitry Karachentsev <dkarachent...@gridgain.com>
Subject Re: Grid hang on compute
Date Thu, 08 Dec 2016 07:34:14 GMT

Opened a ticket to support Yakov's proposal.

On 08.12.2016 4:03, Dmitriy Setrakyan wrote:
> Is there any way we can detect this and prevent it from happening? Or perhaps
> start rejecting jobs if they can potentially block the system?
>
> On Wed, Dec 7, 2016 at 8:11 AM, Yakov Zhdanov <yzhdanov@apache.org> wrote:
>> The proper solution here is to apply communication backpressure per policy
>> (SYSTEM or PUBLIC) rather than at a single point, as it is now. I think we
>> can achieve this by having two queues per communication session or, which
>> looks a bit easier to implement, by having separate connections.
>> As a workaround, you can increase the limit. Setting it to 0 (which disables
>> the limit) may lead to an OOME on the sender or receiver side.
>> --Yakov
>> 2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachentsev@gridgain.com>:
>>> Igniters!
>>> I recently ran into a debatable issue that looks like a bug. The scenario
>>> is as follows:
>>> 1) Start two data nodes with some cache.
>>> 2) From one node, asynchronously post a large number of jobs to the other.
>>> Those jobs perform some cache operations.
>>> 3) The grid hangs almost immediately: all threads are sleeping except the
>>> public ones, which are waiting for responses.
>>> This happens because all cache and job messages are queued on
>>> communication and limited by the default queue limit (1024). It looks like
>>> jobs are waiting for cache responses that cannot be received because of
>>> this limit. It is hard to diagnose and inconvenient (as far as I know, the
>>> docs say nothing about restrictions on using cache ops from compute jobs).
>>> So, my question is: should we try to solve this, or is it enough to update
>>> the documentation with a recommendation to disable the queue limit for
>>> such cases?
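
The difference between a single shared limit and the per-policy backpressure proposed in the thread can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use any Ignite API, and every name in it (`BackpressureSketch`, `Channel`, `SharedChannel`, `PerPolicyChannel`, `systemAcceptedAfterJobFlood`) is hypothetical. PUBLIC messages are assumed to stand for job requests and SYSTEM messages for cache responses, and a message is simply refused when its queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model only: none of these types exist in Ignite.
class BackpressureSketch {
    enum Policy { PUBLIC, SYSTEM }

    interface Channel {
        /** Returns false when the message is refused because a limit is reached. */
        boolean offer(Policy policy);
    }

    /** One limit shared by all messages (the "single point" case). */
    static class SharedChannel implements Channel {
        private final BlockingQueue<Policy> queue;

        SharedChannel(int limit) {
            queue = new ArrayBlockingQueue<>(limit);
        }

        @Override public boolean offer(Policy policy) {
            return queue.offer(policy); // non-blocking: false if the queue is full
        }
    }

    /** A separate limit for each policy (the "two queues" case). */
    static class PerPolicyChannel implements Channel {
        private final BlockingQueue<Policy> publicQueue;
        private final BlockingQueue<Policy> systemQueue;

        PerPolicyChannel(int limitPerPolicy) {
            publicQueue = new ArrayBlockingQueue<>(limitPerPolicy);
            systemQueue = new ArrayBlockingQueue<>(limitPerPolicy);
        }

        @Override public boolean offer(Policy policy) {
            return (policy == Policy.SYSTEM ? systemQueue : publicQueue).offer(policy);
        }
    }

    /** Floods the channel with job messages, then tries to pass one cache response. */
    static boolean systemAcceptedAfterJobFlood(Channel channel, int jobs) {
        for (int i = 0; i < jobs; i++)
            channel.offer(Policy.PUBLIC); // jobs beyond the limit are refused

        return channel.offer(Policy.SYSTEM);
    }

    public static void main(String[] args) {
        int limit = 1024;  // the default limit mentioned in the thread
        int jobs = 10_000; // arbitrary "big number of jobs"

        System.out.println("shared limit, SYSTEM accepted: "
            + systemAcceptedAfterJobFlood(new SharedChannel(limit), jobs));
        System.out.println("per-policy limit, SYSTEM accepted: "
            + systemAcceptedAfterJobFlood(new PerPolicyChannel(limit), jobs));
    }
}
```

In this model the shared channel refuses the SYSTEM message once job messages have filled the queue, which mirrors the hang described above, while the per-policy channel still accepts it. A real implementation would block or pause reading rather than refuse messages, so the sketch shows only why separating the limits removes the dependency between job traffic and cache responses.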
