ignite-user mailing list archives

From Denis Magda <dma...@gridgain.com>
Subject Re: One failing node stalling the whole cluster
Date Fri, 03 Jun 2016 12:58:20 GMT
Hi Daniel,

Actually, a failure of one machine shouldn't lead to a whole-cluster shutdown unless your
application code was also running on the other nodes and killed them through long GC pauses
or for some other reason.

My first suggestion is to tune garbage collection appropriately:
https://apacheignite.readme.io/v1.6/docs/jvm-and-system-tuning#jvm-tuning-for-clusters-with-on_heap-caches

and to track the GC logs so you can adjust the settings if needed:
https://apacheignite.readme.io/v1.6/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
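
For reference, detailed GC logging on a Java 7/8 JVM is usually switched on with flags along
these lines (an illustrative sketch only; the gc.log path is a placeholder, and the pages linked
above are the place to check the exact recommended settings):

    -Xloggc:/path/to/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps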

If the issue still happens, please share the GC logs and the logs from all the nodes with us.
We will probably be able to pinpoint the problem on your side.

—
Denis

> On Jun 2, 2016, at 11:21 AM, Daniel López <d.lopez.j@gmail.com> wrote:
> 
> Hi there,
> 
> We are using Ignite 1.5.0 and we are experiencing a strange issue where one node stalls
> the other nodes in the cluster. We are using CacheMode.REPLICATED caches to store data on
> heap on several nodes to improve latency.
> In one of the latest upgrades someone introduced a bug in the system that could cause
> one node to consume too much memory and start having GC issues. Sh*t happens :).
> The problem, however, is that when this node starts to crawl due to heavy GC usage, it
> starts spitting these logs:
> 
> |Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=0,
queueSize=24, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0
lim=32768 cap=32768], recovery=GridNioRecoveryDescriptor [acked=284144, resendCnt=0, rcvCnt=284230,
reserved=true, lastAck=284224, nodeLeft=false, node=TcpDiscoveryNode [id=1109a421-ec72-4534-99c4-df5d7e4f6136,
addrs=[x.y.z4, 127.0.0.1], sockAddrs=[machine.env/x.y.z4:3808, /x.y.z:3808, /127.0.0.1:3808],
discPort=3808, order=32, intOrder=17, lastExchangeTime=1464854224616,
loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], connected=true, connectCnt=0,
queueLimit=5120], super=GridNioSessionImpl [locAddr=/x.y.z3:47100, rmtAddr=/x.y.z4:33450,
createTime=1464854224707, closeTime=0, bytesSent=10838655, bytesRcvd=221207982, sndSchedTime=1464854575433,
lastSndTime=1464854575433, lastRcvTime=1464854575433, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@63f44e30, directMode=true], GridConnectionBytesVerifyFilter],
accepted=true]]
> WARN |2016-06-02T08:03:02,575||TcpCommunicationSpi|Closing NIO session because of unhandled
exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Conexión reinicializada por la
máquina remota]
> 
> And the other nodes in the cluster start to produce these other logs, and access to the
> cache slows down or pauses greatly:
> 
> GridCachePartitionExchangeManager|Failed to send partitions full message [node=TcpDiscoveryNode
[id=913ea465-ed45-4ec9-a4b7-d2c5f9c57a2e, a....
> TcpDiscoverySpi|Failed to ping node (status check will be initiated): .... 
> GridDiscoveryManager|Node FAILED: TcpDiscoveryNode [id=...
> 
> That the node with the GC issues stops working is "normal", even if undesired, but what
> really worries us is that it causes the other nodes in the cluster to stop being able to use
> the replicated caches, so one node can bring down the whole cluster.
> 
> If we stop the offending node, the others go back to normal behaviour and work as fast
> as always.
> 
> We are going to solve the application bug, of course, but is there any configuration
> setting that we can tweak so one bug in one machine does not bring the whole cluster to a
> halt?
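
As an aside on tunables in this area: IgniteConfiguration.failureDetectionTimeout bounds how
long a node may stay unresponsive before the rest of the cluster drops it. A minimal Java
sketch, assuming Ignite 1.5+; the 5-second value is made up and would need to be validated
against the GC logs mentioned earlier in the thread:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class FailureTimeoutSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Illustrative value: treat a node that is unresponsive for
            // 5 seconds as failed (the setting is in milliseconds).
            cfg.setFailureDetectionTimeout(5000);
            Ignition.start(cfg);
        }
    }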
> 
> Ignite is configured to use TcpDiscoverySpi with a TcpDiscoveryVmIpFinder that holds a list
> of addresses (11 nodes per set currently).
> Each node has 29 caches configured like this:
>         cacheConfiguration.setCacheMode(CacheMode.REPLICATED);
>         cacheConfiguration.setCopyOnRead(false);
>         cacheConfiguration.setEagerTtl(false);
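
For context, the setup described above, a TcpDiscoveryVmIpFinder with a static address list
plus replicated on-heap caches, can be wired together roughly as in the following minimal
sketch; the addresses, port range, and cache name are placeholders, not the actual configuration:

    import java.util.Arrays;

    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class NodeStartupSketch {
        public static void main(String[] args) {
            // Static IP finder listing the known node addresses (placeholders).
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList("x.y.z.3:47500..47509", "x.y.z.4:47500..47509"));

            TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
            discoverySpi.setIpFinder(ipFinder);

            // One of the replicated, on-heap caches (the name is a placeholder).
            CacheConfiguration<String, Object> cacheCfg = new CacheConfiguration<>("exampleCache");
            cacheCfg.setCacheMode(CacheMode.REPLICATED);
            cacheCfg.setCopyOnRead(false);
            cacheCfg.setEagerTtl(false);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discoverySpi);
            cfg.setCacheConfiguration(cacheCfg);

            Ignition.start(cfg);
        }
    }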
> 
> Thanks,
> D.
> 
> PS: Yes, we'll have to try the latest Ignite version, but we wanted to know first if there
> is any configuration setting that might help, before having to migrate and restart the
> whole testing process.

