ignite-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Voronkin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever
Date Thu, 04 Oct 2018 13:48:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Pavel Voronkin updated IGNITE-9793:
-----------------------------------
    Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#FF0000}724{color} > 910 > 655 > ...* ****node
failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 and all subsequent
nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98],
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110,
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, addrs=ArrayList [10.116.215.3],
sockAddrs=HashSet [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73,
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=155, offline=5]

topology rolled to version 188.

Coordinator rejects node 724 form joining the cluster:

2018-09-24 15:17:27.491 [INFO ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611]
 2018-09-24 15:17:27.491 [WARN ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Close incoming connection, unknown node [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, igniteInstanceName=DPL_GRID%DplGridNodeName,
finished=false, hashCode=632168435, interrupted=false, runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
[idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, bytesSent0=18, select=true,
super=]DirectNioClientWorker [super=], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768
cap=32768], readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, outRecovery=null,
super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611, createTime=1537791447487,
closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1537791447487,
lastSndTime=1537791447487, lastRcvTime=1537791447487, readsPaused=false, filterChain=FilterChain[filters=[,
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]] 

 

Node 724 SEGMENTED before activation starts:

2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList
[10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110,
intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79,
isClient=false]

2018-09-24 15:18:07.145 [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
Critical system error detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]

Coordinator started activating topology without nodes 724 and 931.

2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]

But node 910 which is next to node 724 haven't received activation message.

Hovever 655 which is next after 910 received ChangeStateMessage:

2018-09-24 15:21:08.430 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]

Coordinator received all expected messages except 910, that was:

2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]

 

GridDhtPartitionsExchangeFuture hangs forever:

09-24 15:33:46.784 [WARN ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. Dumping pending objects that might be the cause:

 

So one node in the ring missed the message, hovever all other nodes in topology got it, how
is that possible?

 

  was:
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator) -> 601 > {color:#ff0000}724{color} > 910 > 655-* *> ...* **node
failed

*665(coordinator) -> 601 -> 910 -> 655-*  *> ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 and all subsequent
nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98],
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110,
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, addrs=ArrayList [10.116.215.3],
sockAddrs=HashSet [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73,
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=155, offline=5]

topology rolled to version 188.

Coordinator rejects node 724 form joining the cluster:

2018-09-24 15:17:27.491 [INFO ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611]
 2018-09-24 15:17:27.491 [WARN ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Close incoming connection, unknown node [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, igniteInstanceName=DPL_GRID%DplGridNodeName,
finished=false, hashCode=632168435, interrupted=false, runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
[idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, bytesSent0=18, select=true,
super=]DirectNioClientWorker [super=], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768
cap=32768], readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, outRecovery=null,
super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611, createTime=1537791447487,
closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1537791447487,
lastSndTime=1537791447487, lastRcvTime=1537791447487, readsPaused=false, filterChain=FilterChain[filters=[,
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]] 

 

Node 724 SEGMENTED before activation starts:

2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList
[10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110,
intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79,
isClient=false]

2018-09-24 15:18:07.145 [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
Critical system error detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]

Coordinator started activating topology without nodes 724 and 931.

2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]

But node 910 which is next to node 724 haven't received activation message.

Hovever 655 which is next after 910 received ChangeStateMessage:

2018-09-24 15:21:08.430 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]

Coordinator received all expected messages except 910, that was:

2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]

 

GridDhtPartitionsExchangeFuture hangs forever:

09-24 15:33:46.784 [WARN ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. Dumping pending objects that might be the cause:

 

So one node in the ring missed the message, hovever all other nodes in topology got it, how
is that possible?

 


> Deactivation, segmentation of one node, activation may lead to hang activation forever
> --------------------------------------------------------------------------------------
>
>                 Key: IGNITE-9793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9793
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: Pavel Voronkin
>            Priority: Major
>
> There is topology with ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator)* *> 601 > {color:#FF0000}724{color} > 910 > 655 > ...* ****node
failed
> *665(coordinator) > 601 > 910 > 655 > ...* **activated
> During activation node 910 haven't received StateChangedMessage, hovever 655 and all
subsequent nodes received and responded to coordinator.
> So coordinator expects to have 154 messages but received only 153 that is why activation
hangs.
> Details bellow:
>  
> Coordinator deactivated:
> 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> nodes 601, 724, 910, 655 were deactivated:
> 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> Spi on coordinator received node 724 failed message:
> 2018-09-24 15:17:00.220 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98],
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110,
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=156, offline=4]
> topology rolled to version 187, then another node 931 failed:
> 2018-09-24 15:17:00.466 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, addrs=ArrayList [10.116.215.3],
sockAddrs=HashSet [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73,
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=155, offline=5]
> topology rolled to version 188.
> Coordinator rejects node 724 form joining the cluster:
> 2018-09-24 15:17:27.491 [INFO ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611]
>  2018-09-24 15:17:27.491 [WARN ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Close incoming connection, unknown node [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, igniteInstanceName=DPL_GRID%DplGridNodeName,
finished=false, hashCode=632168435, interrupted=false, runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
[idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, bytesSent0=18, select=true,
super=]DirectNioClientWorker [super=], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768
cap=32768], readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, outRecovery=null,
super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611, createTime=1537791447487,
closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1537791447487,
lastSndTime=1537791447487, lastRcvTime=1537791447487, readsPaused=false, filterChain=FilterChain[filters=[,
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]] 
>  
> Node 724 SEGMENTED before activation starts:
> 2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
Node is out of topology (probably, due to short-time network problems).
>  2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList
[10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110,
intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79,
isClient=false]
> 2018-09-24 15:18:07.145 [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
Critical system error detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
> Coordinator started activating topology without nodes 724 and 931.
> 2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]
> But node 910 which is next to node 724 haven't received activation message.
> Hovever 655 which is next after 910 received ChangeStateMessage:
> 2018-09-24 15:21:08.430 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]
> Coordinator received all expected messages except 910, that was:
> 2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]
>  
> GridDhtPartitionsExchangeFuture hangs forever:
> 09-24 15:33:46.784 [WARN ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. Dumping pending objects that might be the cause:
>  
> So one node in the ring missed the message, hovever all other nodes in topology got it,
how is that possible?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message