ignite-dev mailing list archives

From Yakov Zhdanov <yzhda...@apache.org>
Subject Re: Grid hang on compute
Date Wed, 07 Dec 2016 16:11:07 GMT
The proper solution here is to apply communication backpressure per policy
(SYSTEM or PUBLIC) rather than at a single point, as it is now. I think we can
achieve this with two queues per communication session or, which looks a bit
easier to implement, with separate connections.
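
To illustrate the idea only, here is a minimal standalone sketch of a shared
limit versus a per-policy limit. It is not Ignite code: the class names
(PerPolicyBackpressure, SharedLimit, PerPolicyLimit), the tryEnqueue method and
the limit of 4 are all hypothetical, and plain bounded queues stand in for the
communication session.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PerPolicyBackpressure {
    /** Hypothetical stand-in for the two policies named in the message. */
    enum Policy { SYSTEM, PUBLIC }

    /** Single point of backpressure: one bounded queue shared by every policy. */
    static final class SharedLimit {
        private final BlockingQueue<String> queue;

        SharedLimit(int limit) {
            queue = new ArrayBlockingQueue<>(limit);
        }

        /** Returns false when the shared queue is full, whatever the policy. */
        boolean tryEnqueue(Policy policy, String msg) {
            return queue.offer(msg);
        }
    }

    /** Backpressure per policy: each policy gets its own bounded queue. */
    static final class PerPolicyLimit {
        private final Map<Policy, BlockingQueue<String>> queues = new EnumMap<>(Policy.class);

        PerPolicyLimit(int limit) {
            for (Policy p : Policy.values())
                queues.put(p, new ArrayBlockingQueue<>(limit));
        }

        /** Returns false only when the queue of this particular policy is full. */
        boolean tryEnqueue(Policy policy, String msg) {
            return queues.get(policy).offer(msg);
        }
    }

    public static void main(String[] args) {
        int limit = 4; // assumed small limit, for demonstration only
        SharedLimit shared = new SharedLimit(limit);
        PerPolicyLimit split = new PerPolicyLimit(limit);

        // Fill both models with PUBLIC (job) messages up to the limit.
        for (int i = 0; i < limit; i++) {
            shared.tryEnqueue(Policy.PUBLIC, "job-" + i);
            split.tryEnqueue(Policy.PUBLIC, "job-" + i);
        }

        // A SYSTEM (cache response) message is rejected by the shared queue
        // but still accepted when each policy has its own queue.
        System.out.println("shared accepts SYSTEM: " + shared.tryEnqueue(Policy.SYSTEM, "cache-response"));
        System.out.println("split accepts SYSTEM: " + split.tryEnqueue(Policy.SYSTEM, "cache-response"));
        System.out.println("split accepts PUBLIC: " + split.tryEnqueue(Policy.PUBLIC, "job-extra"));
    }
}
```

With separate queues, a flood of PUBLIC job messages can still be throttled
without starving the SYSTEM messages those jobs depend on.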

As a workaround, you can increase the limit. Setting it to 0 may lead to an
OOME on the sender or receiver side.
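
A configuration sketch of that workaround, assuming the limit in question is
the one exposed as messageQueueLimit on TcpCommunicationSpi. The value 8192 and
the class name RaisedQueueLimitNode are arbitrary illustrations, not
recommendations; it needs ignite-core on the classpath.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class RaisedQueueLimitNode {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Assumed example value: larger than the default of 1024 mentioned in
        // the thread, but still bounded. 0 would disable the limit and risk an OOME.
        commSpi.setMessageQueueLimit(8192);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        // Start a node with the adjusted communication SPI; it stops on close.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Node started: " + ignite.name());
        }
    }
}
```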

--Yakov

2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachentsev@gridgain.com>:

> Igniters!
>
> I recently ran into a questionable issue that looks like a bug. The scenario
> is as follows:
>
> 1) Start two data nodes with some cache.
>
> 2) From one node, submit a large number of jobs to the other in async mode.
> Those jobs perform some cache operations.
>
> 3) The grid hangs almost immediately, and all threads are sleeping except
> the public ones, which are waiting for responses.
>
> This happens because all cache and job messages are queued in the
> communication layer, and that queue is limited to a default size (1024). It
> looks like the jobs are waiting for cache responses that cannot be received
> because of this limit. This is hard to diagnose and inconvenient (as far as I
> know, the docs state no restrictions on using cache operations from compute
> jobs).
>
> So my question is: should we try to solve this, or is it enough to update
> the documentation with a recommendation to disable the queue limit in such
> cases?
>
>
