I already posted this question to Stack Overflow here: https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected, but this mailing list is probably more appropriate.
We use Apache Ignite for caching and are seeing some unexpected behavior across all of the clients of the cluster when one of the clients fails. The Ignite cluster itself has three servers, and approximately 12 servers connect to that cluster as clients. The cluster has persistence disabled, and many of the caches have near caching enabled.
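For reference, the client-side setup looks roughly like the following sketch (the cache name and near-cache eviction size here are placeholders, not our actual values):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.eviction.lru.LruEvictionPolicyFactory;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.NearCacheConfiguration;

// Sketch of how each client node is configured: client mode on,
// persistence left at its default (disabled), near cache enabled.
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientMode(true);

CacheConfiguration<String, Object> cacheCfg =
    new CacheConfiguration<>("exampleCache"); // placeholder name

NearCacheConfiguration<String, Object> nearCfg = new NearCacheConfiguration<>();
nearCfg.setNearEvictionPolicyFactory(new LruEvictionPolicyFactory<>(100_000)); // placeholder size
cacheCfg.setNearConfiguration(nearCfg);

cfg.setCacheConfiguration(cacheCfg);
Ignite ignite = Ignition.start(cfg);
```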
What we are seeing is that when one of the clients fails (out of memory, high CPU, loss of network connectivity, etc.), threads on all of the other clients block for a period of time. During these periods the Ignite servers themselves seem fine, but I see entries like the following in their logs:
Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
The topology is clearly changing as clients connect and disconnect, but is there anything happening internally inside the cluster that could cause blocking on the other clients? I would expect rebalancing of data when a server node leaves, but not when a client does.
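To try to correlate the stalls with topology changes, I've been registering a local discovery-event listener on the clients to log exactly when nodes join, leave, or fail. This is a sketch; note that these event types have to be enabled via IgniteConfiguration.setIncludeEventTypes(...) for the listener to fire:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

public class TopologyLogger {
    // Log each join/leave/failure with its timestamp and topology version
    // so the changes can be lined up with the client-side thread stalls.
    static void register(Ignite ignite) {
        ignite.events().localListen(evt -> {
            DiscoveryEvent de = (DiscoveryEvent)evt;
            System.out.printf("%d %s node=%s client=%b topVer=%d%n",
                de.timestamp(), de.name(), de.eventNode().id(),
                de.eventNode().isClient(), de.topologyVersion());
            return true; // keep listening
        }, EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}
```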
From a thread dump, I see many threads stuck in the following state:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
Any ideas, suggestions, or further avenues to investigate would be much appreciated.