ignite-user mailing list archives

From "Abhishek Gupta (BLOOMBERG/ 731 LEX)" <agupta...@bloomberg.net>
Subject Re: Grid suddenly went in bad state
Date Tue, 24 Sep 2019 17:45:57 GMT
Hello Folks,
      I would really appreciate any suggestions you could provide on the issue described below.


Thanks,
Abhishek

From: user@ignite.apache.org At: 09/20/19 15:11:33 To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state


Find attached the logs from 3 of the nodes and their GC graphs. The logs from the other nodes
look pretty much the same. 

Some questions:
1. What could be the trigger for the "Partition states validation has failed for group" warning in
node 1? It seems to have come on suddenly.
2. If you look at the logs, there seems to be a change in coordinator:
   3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
   3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]

What is curious is that this seems to suggest a client node was the coordinator. Is that by design?
Are clients allowed to be coordinators? (See the sketch after these questions for how I am checking
which server node is currently the oldest.)


3. It just seems like the grid went into a tailspin, as shown in the logs for node 1. Any help
in understanding what triggered this series of events would be much appreciated.
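
For (1), if it helps, I can cross-check partition state across copies with "control.sh --cache idle_verify"
once the grid is quiet. For (2), here is a minimal sketch of the check I run to print which server node
is currently the oldest in the topology, i.e. the node I would expect to act as coordinator. The class
name and config path are placeholders, not our actual code:

// Sketch only: prints the oldest server node in the current topology.
// With ZooKeeper discovery I would expect this server node, not a client,
// to act as coordinator.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterNode;

public class CoordinatorCheck {
    public static void main(String[] args) {
        // "ignite-config.xml" is a placeholder for our actual Spring config.
        try (Ignite ignite = Ignition.start("ignite-config.xml")) {
            ClusterNode oldest = ignite.cluster().forServers().forOldest().node();

            System.out.println("Oldest server node: id=" + oldest.id()
                + ", order=" + oldest.order()
                + ", addrs=" + oldest.addresses()
                + ", client=" + oldest.isClient());
        }
    }
}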


Thanks,
Abhishek


From: user@ignite.apache.org At: 09/20/19 05:24:59 To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state

Hi,

Could you please also attach the logs for the other nodes? And what version of Ignite are you
currently using?

Also, you've mentioned high GC activity; would it be possible to provide GC logs?
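
If full GC logs were not enabled (on JDK 8 they are usually turned on with -Xloggc:<file>
-XX:+PrintGCDetails -XX:+PrintGCDateStamps), even coarse per-collector counters sampled from the
running JVMs would help. A minimal sketch of reading those via the standard management API (class
name is just an example):

// Sketch: dump cumulative GC counts and times for each collector in this JVM.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsProbe {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("GC " + gc.getName()
                + ": collections=" + gc.getCollectionCount()
                + ", totalTimeMs=" + gc.getCollectionTime());
        }
    }
}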

Regards,
Igor
On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) <agupta726@bloomberg.net>
wrote:

Hello,
      I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300 GB on each node. The grid
seemed to be performing normally until, at one point, it started logging the "Partition states validation
has failed for group" warning - see the attached log file. This kept happening for about 3 minutes
and then stopped (see line 85 in the attached log file). Just then a client seems to have
connected (see line 135, where the connection was accepted). But soon after, it kept logging the
exception below. After a while (~1 hour), it started logging "Partition states validation
has failed for group" again (line 284).


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] GridDhtPartitionsExchangeFuture - Completed partition exchange [localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, client=true], done=true], topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, evt=NODE_JOINED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] TcpCommunicationSpi - Failed
to read data from remote connection (will wait for 2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182)
~[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
        at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
~[?:1.8.0_172]
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
~[ignite-core-2.7.5-0-2.jar:2.7.5]
        ... 3 more
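
For context, a minimal sketch of the kind of server configuration described above -- roughly 300 GB
data region per node with ZooKeeper discovery (as the ZookeeperClusterNode entries indicate). The
region name, ZooKeeper connection string and root path below are placeholders, not our actual values:

// Illustrative only: ~300 GB off-heap data region per node + ZooKeeper discovery,
// along the lines of the setup described above (Ignite 2.7.x).
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ServerConfigSketch {
    public static IgniteConfiguration config() {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default")                                  // placeholder name
            .setMaxSize(300L * 1024 * 1024 * 1024);              // 300 GB per node

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region);

        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi()
            .setZkConnectionString("zk1:2181,zk2:2181,zk3:2181") // placeholder
            .setZkRootPath("/apacheIgnite")                      // placeholder
            .setSessionTimeout(30_000);                          // the logs show connLossTimeout=30000

        return new IgniteConfiguration()
            .setDataStorageConfiguration(storage)
            .setDiscoverySpi(zkSpi);
    }
}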


After a lot of these exceptions and warnings, the node started throwing the error below (a client
had started ingestion using a data streamer). The same exceptions were seen on all the nodes:

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115]  - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
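
As an aside, the SYSTEM_WORKER_BLOCKED report above comes from Ignite's critical-failure detection;
per the log it is in the handler's ignoredFailureTypes, so it is logged but does not by itself stop
the node. A minimal sketch (illustrative, not our exact config) of the two settings involved in 2.7.x:

// Illustrative only: the blocked-worker detection threshold and the failure
// handler shown in the log above (tryStop=false, timeout=0).
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlingSketch {
    public static IgniteConfiguration config() {
        return new IgniteConfiguration()
            // How long a system worker (e.g. a data-streamer stripe) may miss its
            // heartbeat before SYSTEM_WORKER_BLOCKED is reported; long GC pauses
            // can trip this even if the worker is only stalled.
            .setSystemWorkerBlockedTimeout(60_000)
            .setFailureHandler(new StopNodeOrHaltFailureHandler(false, 0));
    }
}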


There was very high GC activity around that time (~15:10), which would explain the
SYSTEM_WORKER_BLOCKED errors. But this kept happening until the node went into a tailspin with the
logs below:


2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, state=Disconnected]
2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, retryTimeout=2000, connLossTimeout=30000, path=/apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, remainingWaitTime=20335]
2019-09-19 17:34:27.550 [WARN ] [data-streamer-stripe-41-#96] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001, state=Disconnected]
2019-09-19 17:34:27.550 [ERROR] [data-streamer-stripe-41-#96] ZookeeperClient - Operation failed with unexpected error, close ZooKeeper client: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:792) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.ignite.spi.discovery.zk.internal.ZookeeperClient.createIfNeeded(ZookeeperClient.java:443)
[ignite-zookeeper-2.7.5.jar:2.7.5]
        at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveMultipleParts(ZookeeperDiscoveryImpl.java:939)
[ignite-zookeeper-2.7.5.jar:2.7.5]
        at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveCustomMessage(ZookeeperDiscoveryImpl.java:671)
[ignite-zookeeper-2.7.5.jar:2.7.5]
        at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.sendCustomMessage(ZookeeperDiscoveryImpl.java:629)
[ignite-zookeeper-2.7.5.jar:2.7.5]
        at org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi.sendCustomEvent(ZookeeperDiscoverySpi.java:414)
[ignite-zookeeper-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.sendCustomEvent(GridDiscoveryManager.java:2207)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.binary.BinaryMetadataTransport.requestMetadataUpdate(BinaryMetadataTransport.java:170)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.addMeta(CacheObjectBinaryProcessorImpl.java:469)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$2.addMeta(CacheObjectBinaryProcessorImpl.java:186)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.binary.BinaryContext.updateMetadata(BinaryContext.java:1332)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.serializeTo(BinaryObjectBuilderImpl.java:369)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.build(BinaryObjectBuilderImpl.java:190)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at com.bloomberg.fdns.cass.ignition.datareader.CassStreamReceiver.receive(CassStreamReceiver.java:42)
[datagrid-cass-dependencies.jar:?]
        at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:137)
[ignite-core-2.7.5-0-2.jar:2.7.5]
        at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:400)
[ignite-core-2.7.5-0-2.jar:2.7.5]
....

2019-09-19 17:34:27.686 [WARN ] [data-streamer-stripe-41-#96] ZookeeperDiscoveryImpl - Connection to Zookeeper server is lost, local node SEGMENTED.
2019-09-19 17:34:27.687 [WARN ] [disco-event-worker-#175] GridDiscoveryManager - Local node SEGMENTED: ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=true, client=false]
2019-09-19 17:34:27.688 [ERROR] [disco-event-worker-#175]  - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SEGMENTATION, err=null]]
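
For completeness, here is a minimal, illustrative sketch of the kind of StreamReceiver the
CassStreamReceiver stack trace above points to (this is not its actual code). Building BinaryObjects
inside the receiver is what pulls the data-streamer stripes into binary-metadata registration, which
is sent as a discovery custom message -- over ZooKeeper here -- and that would explain why the
ZookeeperClient errors appear on data-streamer-stripe threads once the session is lost:

// Illustrative only -- not the actual CassStreamReceiver. A receiver that
// rebuilds incoming entries as BinaryObjects on the server side; a new field
// or type schema makes build() register binary metadata through discovery.
import java.util.Collection;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.binary.BinaryObjectBuilder;
import org.apache.ignite.stream.StreamReceiver;

public class BinaryBuildingReceiver implements StreamReceiver<String, BinaryObject> {
    @Override
    public void receive(IgniteCache<String, BinaryObject> cache,
                        Collection<Map.Entry<String, BinaryObject>> entries) {
        // Runs on the server node that owns the partitions for these entries.
        Ignite ignite = Ignition.localIgnite();

        for (Map.Entry<String, BinaryObject> e : entries) {
            BinaryObjectBuilder builder = ignite.binary().builder(e.getValue());

            // Placeholder field; any schema change triggers a metadata update.
            builder.setField("ingestTs", System.currentTimeMillis());

            cache.put(e.getKey(), builder.build());
        }
    }
}

On the client side such a receiver would be attached with IgniteDataStreamer#receiver(...) before
the addData() calls.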


Any help in understanding what triggered this series of events would be much appreciated.


Thanks,
Abhishek

