ignite-user mailing list archives

From DLopez <d.lope...@gmail.com>
Subject Re: One failing node stalling the whole cluster
Date Sun, 05 Jun 2016 19:18:08 GMT
Hi Dennis,
I agree that it shouldn't happen, but I have been able to reproduce it
consistently on other machines, and the only "connection" between them is
that they share the Ignite replicated caches.

One machine basically reads from several caches and assembles the data to be
returned. I can have 25 clients requesting data and everything is fine. The
other one is a different application that basically fills the replicated
caches from the DB but receives no direct requests. Someone forgot to guard a
batch job in this second application, so it can be run many times, eventually
consuming all the memory in this second application. The strange thing is
that when the second application starts GCing like crazy, the first one gets
slower and slower until it stops answering requests. If I kill -9 the second
application, the first one immediately goes back to normal behaviour and can
serve 25 simultaneous requests again with normal response times. I can
restart the second application and repeat the same thing, and the behaviour
is the same.

So I can tell you it's not an application code or garbage collection issue in
the first app. The batch job in the second app, which we run manually for
this test, is not replicated and does nothing related to Ignite; it does not
even use the replicated caches.

The only thing I can think of that would explain this behaviour is the sync
process in the Ignite caches slowing down or stalling the reading of values.
As the second app starts experiencing GC issues, it slows down the Ignite
sync process, which then affects the other apps reading the caches. So I was
wondering whether the sync mechanism might hold some kind of lock on the
caches that would prevent reading from them.

I'll see if I can reproduce it in a small-scale experiment, as well as
testing with Ignite 1.6.
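
For such a small-scale experiment, one thing worth measuring is the worst-case
read latency on the reading node while the other node is under memory
pressure. Below is a minimal sketch of a timing helper for that. The class
ReadLatencyProbe and its method are hypothetical names made up for
illustration, not part of Ignite, and the read is passed in as a plain
Supplier so the helper makes no assumptions about the cache API.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of Ignite): times repeated reads and
// keeps the slowest one, so a stall shows up as a large maximum.
final class ReadLatencyProbe {
    private ReadLatencyProbe() {}

    /**
     * Calls read.get() the given number of times and returns the
     * duration of the slowest call, in nanoseconds.
     */
    static long maxLatencyNanos(Supplier<?> read, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long worst = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            read.get();                          // the read under test
            long elapsed = System.nanoTime() - start;
            if (elapsed > worst) {
                worst = elapsed;
            }
        }
        return worst;
    }
}
```

In the real test the supplier would wrap a read on one of the replicated
caches, for example () -> cache.get(key), and the maximum would be compared
before and after the second application starts GCing heavily.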

Thanks for your input

View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p5432.html