ignite-user mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: What happens when a client gets disconnected
Date Thu, 25 Apr 2019 09:21:20 GMT

"threads on all the other clients block for a period of time" - how long is
this period of time?

It definitely makes sense to try a more recent version of Ignite.

The thread dump that you have shown should only be waiting for replies
from all data nodes, which are usually server nodes, so it's not obvious
how it is related to a client leaving.
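
As a rough illustration of the wait pattern visible in that stack trace (a timed CountDownLatch.await reached from awaitAllReplies), here is a minimal sketch using only the JDK. It is not Ignite's actual code: the class name, method signature, and node counts are hypothetical and only show why the calling thread appears as TIMED_WAITING (parking) until every expected reply has arrived.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitAllRepliesSketch {
    /**
     * Simulates a reducer that expects one reply per data node.
     * Returns true if all replies arrive before the timeout, false otherwise.
     * (Hypothetical helper, not part of the Ignite API.)
     */
    static boolean awaitAllReplies(int expectedNodes, int replyingNodes, long timeoutMs)
            throws InterruptedException {
        CountDownLatch replies = new CountDownLatch(expectedNodes);

        // Each "node" that is still alive sends its reply from its own thread.
        for (int i = 0; i < replyingNodes; i++) {
            Thread t = new Thread(replies::countDown, "node-" + i);
            t.setDaemon(true);
            t.start();
        }

        // While parked here, the calling thread is reported as
        // TIMED_WAITING (parking) in a thread dump.
        return replies.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("all nodes reply: " + awaitAllReplies(3, 3, 2000));
        System.out.println("one node silent: " + awaitAllReplies(3, 2, 200));
    }
}
```

In this sketch the latch counts data-node replies only, so a wait like this would not by itself be held up by a client leaving.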

Ilya Kasnacheev

On Tue, Apr 23, 2019 at 20:50, Matt Nohelty <nolt2232@gmail.com> wrote:

> What period of time are you asking about?  We deploy fairly regularly so
> our application servers (i.e. the Ignite clients) get restarted at least
> weekly which will trigger a disconnect and reconnect event for each.  We
> have not noticed any issues during our regular release process but in this
> case we are shutting down the Ignite clients gracefully with Ignite#close.
> However, it's also possible that something bad happens on an application
> server, causing it to crash.  This is the scenario where we've seen
> blocking across the cluster.  We'd obviously like our application servers
> to be as independent of one another as possible and it's problematic if an
> issue on one server is allowed to ripple across all of them.
> I should have mentioned it in my initial post but we are currently using
> version 2.4.  I received the following response on my Stack Overflow post:
> "When topology changes, partition map exchange is triggered internally. It
> blocks all operations on the cluster. Also in old versions ongoing
> rebalancing was cancelled. But in the latest versions client
> connection/disconnection doesn't affect some processes like this. So, it's
> worth trying the most fresh release"
> This comment also mentions PME so it sounds like you both are referencing
> the same behavior.  However, this comment also states that client
> connect/disconnect events do not trigger PME in the more recent versions
> of Ignite.  Can anyone confirm that this is true, and if so, which version
> was this change made in?
> Thank you very much for the help.
> On Tue, Apr 23, 2019 at 10:00 AM Ilya Kasnacheev <
> ilya.kasnacheev@gmail.com> wrote:
>> Hello!
>> What's the period of time?
>> When client disconnects, topology will change, which will trigger waiting
>> for PME, which will delay all further operations until PME is finished.
>> Avoid having short-lived clients.
>> Regards,
>> --
>> Ilya Kasnacheev
>> On Tue, Apr 23, 2019 at 03:40, Matt Nohelty <nolt2232@gmail.com> wrote:
>>> I already posted this question to Stack Overflow here
>>> https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
>>> but this mailing list is probably more appropriate.
>>> We use Apache Ignite for caching and are seeing some unexpected behavior
>>> across all of the clients of the cluster when one of the clients fails. The
>>> Ignite cluster itself has three servers and there are approximately 12
>>> servers connecting to that cluster as clients. The cluster has persistence
>>> disabled and many of the caches have near caching enabled.
>>> What we are seeing is that when one of the clients fails (out of memory,
>>> high CPU, network connectivity, etc.), threads on all the other clients
>>> block for a period of time. During these times, the Ignite servers
>>> themselves seem fine but I see things like the following in the logs:
>>> Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>>> Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>>> The topology itself is clearly changing when a client
>>> connects/disconnects but is there anything happening internally inside the
>>> cluster that could cause blocking on other clients? I would expect
>>> re-balancing of data when a server disconnects but not a client.
>>> From a thread dump, I see many threads stuck in the following state:
>>> java.lang.Thread.State: TIMED_WAITING (parking)
>>> at sun.misc.Unsafe.park(Native Method)
>>> - parking to wait for <0x000000078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
>>> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>>> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>>> at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
>>> at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
>>> at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
>>> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
>>> at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
>>> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
>>> at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
>>> at java.lang.Iterable.forEach(Iterable.java:74)...
>>> Any ideas, suggestions, or further avenues to investigate would be much
>>> appreciated.
