"threads on all the other clients block for a period of time" - how long is this period of time?

It definitely makes sense to try a more recent version of Ignite.

The thread in the dump you posted should only be waiting for all data nodes, which are usually server nodes, so it's not obvious how this relates to a client leaving.

Ilya Kasnacheev

On Tue, Apr 23, 2019 at 8:50 PM Matt Nohelty <nolt2232@gmail.com> wrote:

What period of time are you asking about?  We deploy fairly regularly, so our application servers (i.e. the Ignite clients) get restarted at least weekly, which triggers a disconnect and reconnect event for each.  We have not noticed any issues during our regular release process, but in that case we are shutting down the Ignite clients gracefully with Ignite#close.  However, it's also possible that something bad happens on an application server, causing it to crash.  This is the scenario where we've seen blocking across the cluster.  We'd obviously like our application servers to be as independent of one another as possible, and it's problematic if an issue on one server is allowed to ripple across all of them.

I should have mentioned it in my initial post but we are currently using version 2.4.  I received the following response on my Stack Overflow post:  "When topology changes, partition map exchange is triggered internally. It blocks all operations on the cluster. Also in old versions ongoing rebalancing was cancelled. But in the latest versions client connection/disconnection doesn't affect some processes like this. So, it's worth trying the most fresh release"

This comment also mentions PME, so it sounds like you are both referring to the same behavior.  However, it also states that client connect/disconnect events do not trigger PME in the more recent versions of Ignite.  Can anyone confirm that this is true and, if so, in which version this change was made?

Thank you very much for the help.  

On Tue, Apr 23, 2019 at 10:00 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com> wrote:

What's the period of time?

When a client disconnects, the topology changes, which triggers a partition map exchange (PME). All further operations are delayed until the PME has finished.

Avoid having short-lived clients.

Ilya Kasnacheev

On Tue, Apr 23, 2019 at 3:40 AM Matt Nohelty <nolt2232@gmail.com> wrote:

I already posted this question on Stack Overflow here https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected but this mailing list is probably more appropriate.

We use Apache Ignite for caching and are seeing some unexpected behavior across all of the clients of the cluster when one of the clients fails. The Ignite cluster itself has three servers, and there are approximately 12 application servers connecting to that cluster as clients. The cluster has persistence disabled and many of the caches have near caching enabled.

What we are seeing is that when one of the clients fails (out of memory, high CPU, network connectivity, etc.), threads on all the other clients block for a period of time. During these times, the Ignite servers themselves seem fine, but I see things like the following in the logs:

Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]

The topology itself is clearly changing when a client connects/disconnects but is there anything happening internally inside the cluster that could cause blocking on other clients? I would expect re-balancing of data when a server disconnects but not a client.
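
To line up the blocking with topology changes, the "Topology snapshot" lines above can be scanned for versions where the client count drops while the server count stays the same. The following is only a rough sketch using the standard library: the class and method names (TopologyLogScan, clientDropVersions) are made up for illustration, and the regular expression assumes the log format is exactly as quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ignite: scans log lines for client departures.
class TopologyLogScan {
    // Captures ver, servers and clients from a "Topology snapshot [...]" line
    // (format assumed from the two log lines quoted in this message).
    private static final Pattern SNAPSHOT =
        Pattern.compile("Topology snapshot \\[ver=(\\d+), servers=(\\d+), clients=(\\d+)");

    /** Returns topology versions at which clients decreased and servers were unchanged. */
    static List<Long> clientDropVersions(List<String> logLines) {
        List<Long> versions = new ArrayList<>();
        int prevServers = -1;
        int prevClients = -1;
        for (String line : logLines) {
            Matcher m = SNAPSHOT.matcher(line);
            if (!m.find()) {
                continue; // not a topology snapshot line
            }
            long ver = Long.parseLong(m.group(1));
            int servers = Integer.parseInt(m.group(2));
            int clients = Integer.parseInt(m.group(3));
            if (prevClients >= 0 && servers == prevServers && clients < prevClients) {
                versions.add(ver);
            }
            prevServers = servers;
            prevClients = clients;
        }
        return versions;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]",
            "Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]");
        System.out.println(clientDropVersions(lines)); // prints [124]
    }
}
```

Comparing the timestamps of the reported versions with the times at which application threads were seen blocked would show whether the two actually coincide.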

From a thread dump, I see many threads stuck in the following state:

java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000000078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
at java.lang.Iterable.forEach(Iterable.java:74)
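
For anyone reading the dump: TIMED_WAITING (parking) on a CountDownLatch$Sync is simply what a thread looks like while it sits in a timed CountDownLatch.await, here apparently waiting for query replies. The stand-alone sketch below reproduces that state with the JDK only. It is not Ignite code; the class name LatchParkDemo, the method observeWaiterState and the 30-second timeout are arbitrary choices for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration only: shows the thread state produced by a timed latch await.
class LatchParkDemo {
    /** Starts a thread that awaits the latch with a timeout and returns its state once parked. */
    static Thread.State observeWaiterState(CountDownLatch latch) {
        Thread waiter = new Thread(() -> {
            try {
                // Timed await: the thread parks until countDown() or the timeout.
                latch.await(30, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "waiter");
        waiter.setDaemon(true); // do not keep the JVM alive if the latch is never released
        waiter.start();

        // Poll (for at most about 5 seconds) until the waiter has actually parked.
        Thread.State state = waiter.getState();
        for (int i = 0; i < 500 && state != Thread.State.TIMED_WAITING; i++) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            state = waiter.getState();
        }
        return state;
    }

    public static void main(String[] args) {
        CountDownLatch replies = new CountDownLatch(1); // one "reply" still outstanding
        System.out.println(observeWaiterState(replies));
        replies.countDown(); // the reply arrives and the waiter is released
    }
}
```

The waiter stays parked until something counts the latch down or the timeout expires, so the state by itself only says that a reply has not yet arrived, not why.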

Any ideas, suggestions, or further avenues to investigate would be much appreciated.