ignite-issues mailing list archives

From "Pavel Voronkin (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever
Date Tue, 27 Nov 2018 12:14:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Pavel Voronkin resolved IGNITE-9793.
------------------------------------
    Resolution: Duplicate

Duplicate of IGNITE-10417

> Deactivation, segmentation of one node, activation may lead to hang activation forever
> --------------------------------------------------------------------------------------
>
>                 Key: IGNITE-9793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9793
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: Pavel Voronkin
>            Priority: Major
>         Attachments: 601_gc_server_memory.log.0.current.7z, 724_gc_server_memory.log.0.current.7z,
910_gc_server_memory.log.0.current.7z
>
>
> There is a topology with a ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator) > 601 > {color:#ff0000}724{color} > 910 > 655 > ...* node
failed
> *665(coordinator) > 601 > 910 > 655 > ...* activated
> During activation node 910 did not receive the StateChangedMessage; however, 655 and all
subsequent nodes received it and responded to the coordinator.
> So the coordinator expects 154 messages but has received only 153, which is why the activation
hangs.
> Details below:
> *Coordinator deactivated:*
> 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> *nodes 601, 724, 910, 655 were deactivated:*
> 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1,
client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> *Spi on coordinator received node 724 failed message:*
> 2018-09-24 15:17:00.220 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98],
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110,
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=156, offline=4]
> *topology rolled to version 187, then another node 931 failed:*
> 2018-09-24 15:17:00.466 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, addrs=ArrayList [10.116.215.3],
sockAddrs=HashSet [grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73,
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
^-- Baseline [id=6, size=160, online=155, offline=5]
> *topology rolled to version 188.*
> *Coordinator rejects node 724 from joining the cluster:*
> 2018-09-24 15:17:27.491 [INFO ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611]
>  2018-09-24 15:17:27.491 [WARN ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Close incoming connection, unknown node [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, igniteInstanceName=DPL_GRID%DplGridNodeName,
finished=false, hashCode=632168435, interrupted=false, runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
[idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, bytesSent0=18, select=true,
super=]DirectNioClientWorker [super=], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768
cap=32768], readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, outRecovery=null,
super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, rmtAddr=/10.116.206.98:41611, createTime=1537791447487,
closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1537791447487,
lastSndTime=1537791447487, lastRcvTime=1537791447487, readsPaused=false, filterChain=FilterChain[filters=[,
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]] 
> *Node 724 SEGMENTED before activation starts:*
> 2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
Node is out of topology (probably, due to short-time network problems).
>  2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList
[10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110,
intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79,
isClient=false]
> 2018-09-24 15:18:07.145 [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
Critical system error detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
> *Node 724 stopped:*
> 2018-09-24 15:18:09.931 [INFO ][node-stopper][o.a.i.i.IgniteKernal%DPL_GRID%DplGridNodeName] 
> *Coordinator started activating topology without nodes 724 and 931:*
> 2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]
> *Node 601 received activation message:*
> 2018-09-24 15:21:08.430 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]
> *But node 910, which is next after node 724, missed the ChangeStateMessage activation message
from 601.*
> *Nevertheless, node 601 had established a connection to node 910:*
> 2018-09-21 14:29:35.843 [INFO ][grid-nio-worker-tcp-comm-2-#122%DPL_GRID%DplGridNodeName%|#122%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/10.116.214.83:59739, rmtAddr=grid910.domain/10.116.214.210:47100]
> *However 655, which is next after 910, received the ChangeStateMessage:*
> 2018-09-24 15:21:08.816 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion
[topVer=188, minorTopVer=1]]
> *Coordinator received all expected single messages except from 910; the last one received was:*
> 2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]
> *GridDhtPartitionsExchangeFuture hangs forever:*
> 2018-09-24 15:33:46.784 [WARN ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1],
node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. Dumping pending objects that might be the cause:
> So one node in the ring missed the message; however, all other nodes in the topology got it.
How is that possible?
> GC logs for nodes 601, 724 and 910 are attached. There is nothing unusual in the GC logs for
nodes 601 or 910.
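The hang mechanism described above can be sketched as a countdown over expected participants. This is a deliberately simplified illustration (not Ignite's actual ExchangeLatchManager; the class name, node ids and participant count here are invented for brevity): the coordinator's latch completes only when every expected node has acknowledged, so a single node that never received the state-change message blocks completion forever.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of a server-side exchange latch: the coordinator
// tracks the set of participants it still expects an ack from. If one
// participant never acks (e.g. it missed the ChangeStateMessage), the
// latch -- and therefore the activation -- never completes.
public class ServerLatchSketch {
    private final Set<String> pending;

    public ServerLatchSketch(Set<String> participants) {
        this.pending = new HashSet<>(participants);
    }

    /** Called when a single-node ack arrives at the coordinator. */
    public void ack(String nodeId) {
        pending.remove(nodeId);
    }

    /** The exchange future may finish only when this returns true. */
    public boolean isCompleted() {
        return pending.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> participants = new HashSet<>();
        participants.add("601");
        participants.add("910");
        participants.add("655");

        ServerLatchSketch latch = new ServerLatchSketch(participants);
        latch.ack("601");
        latch.ack("655");
        // Node 910 never received the activation message, so it never
        // acks; the latch stays incomplete and activation hangs.
        System.out.println(latch.isCompleted()); // prints "false"
    }
}
```

In the ticket the participant set has 160 entries and the coordinator is stuck at 153 of 154 expected single messages, but the failure mode is the same as in this three-node sketch.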



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
