ignite-issues mailing list archives

From "Semen Boikov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-1758) Clients don't survive during massive servers shutdown
Date Mon, 09 Nov 2015 13:31:11 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991651#comment-14991651 ]

Semen Boikov edited comment on IGNITE-1758 at 11/9/15 1:30 PM:
---------------------------------------------------------------

Created a test which restarts only server nodes; it fails from time to time with this assertion:
{noformat}
[11:51:24]W:		 [org.apache.ignite:ignite-core] java.lang.AssertionError: Invalid node order:
TcpDiscoveryNode [id=005cd5de-f1f1-435c-8ac4-f4474c28d000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503],
discPort=47503, order=0, intOrder=61, lastExchangeTime=1446724284050, loc=false, ver=1.5.0#20151105-sha1:94119c29,
isClient=false]
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:51)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:48)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:3362)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9176)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9149)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:616)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleNodes(TcpDiscoveryNodesRing.java:128)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl.notifyDiscovery(ServerImpl.java:1260)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl.access$2700(ServerImpl.java:157)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddFinishedMessage(ServerImpl.java:3685)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2157)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5600)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

The assertion fails because, when the NodeAddFinished event is received for some node, there
are nodes with a lower internal order which have not yet received NodeAddFinished.
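
The condition described above can be illustrated with a minimal, self-contained sketch. This is not the Ignite code: the `Node` class and the `unfinishedBefore` helper are hypothetical, and the sketch assumes that `order` stays 0 until NodeAddFinished has been processed for a node (which matches `order=0, intOrder=61` in the trace).

```java
import java.util.ArrayList;
import java.util.List;

class NodeOrderInvariant {
    /** Simplified stand-in for a discovery node (hypothetical, not TcpDiscoveryNode). */
    static final class Node {
        final String id;
        final long order;    // assumed to be 0 until NodeAddFinished is processed
        final long intOrder; // internal order, assigned when the node is added to the ring

        Node(String id, long order, long intOrder) {
            this.id = id;
            this.order = order;
            this.intOrder = intOrder;
        }
    }

    /**
     * Returns ids of nodes whose internal order is lower than that of the node
     * for which NodeAddFinished just arrived, but which still have order == 0.
     * A non-empty result corresponds to the situation that trips the assertion.
     */
    static List<String> unfinishedBefore(List<Node> ring, long finishedIntOrder) {
        List<String> res = new ArrayList<>();
        for (Node n : ring) {
            if (n.intOrder < finishedIntOrder && n.order == 0)
                res.add(n.id);
        }
        return res;
    }

    public static void main(String[] args) {
        List<Node> ring = new ArrayList<>();
        ring.add(new Node("n60", 60, 60)); // already finished
        ring.add(new Node("n61", 0, 61));  // NodeAddFinished never processed
        ring.add(new Node("n62", 0, 62));  // NodeAddFinished arrives for this node

        System.out.println(unfinishedBefore(ring, 62)); // prints [n61]
    }
}
```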

Debugged this failure and found that a node can get an IO error while trying to send a message
to the next node and add that node to its failed list, even though the next node is still alive
and can process messages from other nodes. To fix this issue, the failedNodes collection must
be passed on to the next nodes so that it is consistent across all nodes.
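
A rough sketch of the proposed idea, under stated assumptions: each ring message carries a copy of the sender's failed-node set, and the receiver merges it into its own. The `RingNode` and `Message` classes and their methods are hypothetical illustrations, not the actual ServerImpl change.

```java
import java.util.Set;
import java.util.TreeSet;

class FailedNodesPropagation {
    /** Hypothetical ring message that carries the sender's view of failed nodes. */
    static final class Message {
        final String payload;
        final Set<String> failedNodes;

        Message(String payload, Set<String> failedNodes) {
            this.payload = payload;
            this.failedNodes = new TreeSet<>(failedNodes); // defensive copy
        }
    }

    /** Hypothetical ring member that tracks which nodes it considers failed. */
    static final class RingNode {
        final String id;
        final Set<String> failed = new TreeSet<>();

        RingNode(String id) {
            this.id = id;
        }

        /** An IO error while sending to the next node marks it as failed locally. */
        void onSendError(String nextId) {
            failed.add(nextId);
        }

        /** Every outgoing message includes the current failed-node set. */
        Message newMessage(String payload) {
            return new Message(payload, failed);
        }

        /** The receiver merges the sender's failed set so both views converge. */
        void onMessage(Message msg) {
            failed.addAll(msg.failedNodes);
        }
    }

    public static void main(String[] args) {
        RingNode a = new RingNode("A");
        RingNode c = new RingNode("C");

        a.onSendError("B");                           // A could not reach B
        c.onMessage(a.newMessage("NodeAddFinished")); // A skips B and sends to C

        System.out.println(a.failed.equals(c.failed)); // prints true
    }
}
```

Without the merge in `onMessage`, node C in this sketch would keep treating B as alive while A treats it as failed, which is the kind of inconsistency described above.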


was (Author: sboikov):
Created a test which restarts only server nodes; it fails from time to time with this assertion:
{noformat}
[11:51:24]W:		 [org.apache.ignite:ignite-core] java.lang.AssertionError: Invalid node order:
TcpDiscoveryNode [id=005cd5de-f1f1-435c-8ac4-f4474c28d000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503],
discPort=47503, order=0, intOrder=61, lastExchangeTime=1446724284050, loc=false, ver=1.5.0#20151105-sha1:94119c29,
isClient=false]
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:51)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:48)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:3362)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9176)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9149)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:616)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleNodes(TcpDiscoveryNodesRing.java:128)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl.notifyDiscovery(ServerImpl.java:1260)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl.access$2700(ServerImpl.java:157)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddFinishedMessage(ServerImpl.java:3685)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2157)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5600)
[11:51:24]W:		 [org.apache.ignite:ignite-core] 	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

The assertion fails because, when the NodeAddFinished event is received for some node, there
are nodes with a lower internal order which have not yet received NodeAddFinished.

Debugged this failure and found that sometimes a TcpDiscoveryNodeAddedMessage was missed during
topology changes and the new coordinator did not handle it. From the logs I could not say exactly
why the message was missed. If the size of PendingMessages is increased, this assertion does not
reproduce, but the test sometimes hangs on continuous query start.

> Clients don't survive during massive servers shutdown
> -----------------------------------------------------
>
>                 Key: IGNITE-1758
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1758
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: ignite-1.4
>            Reporter: Denis Magda
>            Assignee: Semen Boikov
>            Priority: Blocker
>             Fix For: 1.5
>
>         Attachments: ignite-1758-test.patch
>
>
> There is a real-world use case.
> Start a sensible number of servers and clients.
> Perform cache operations under a transaction.
> Stop half of the servers. Clients must survive and keep executing their transactions.
> Did the following test:
> - Started 14 servers and 14 clients;
> - Clients execute transactional put operations;
> - Stopped 7 servers.
> Getting different assertions on the client side.
> {noformat}
> [15:47:33,401][ERROR][tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18][TcpDiscoverySpi]
Runtime error caught during grid runnable execution: IgniteSpiThread [name=tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18]
> java.lang.AssertionError: lastVer=29, newVer=32, locNode=TcpDiscoveryNode [id=80f14def-9d49-43a0-96bc-6b83aedb3008,
addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:0], discPort=0, order=26, intOrder=0, lastExchangeTime=1445428036418,
loc=true, ver=1.4.1#19700101-sha1:00000000, isClient=true], msg=TcpDiscoveryNodeFailedMessage
[failedNodeId=3020dc65-ed3e-426f-8784-5bb766961003, order=4, warning=null, super=TcpDiscoveryAbstractMessage
[sndNodeId=10c5cfe9-df07-4dfe-a5c0-460087aa9001, id=eed3e3a8051-008a978d-28cc-4f0c-8728-4a815f858000,
verifierNodeId=800cf998-828e-4f56-af6a-c2760c5ed008, topVer=32, pendingIdx=0, isClient=false]]
> 	at org.apache.ignite.spi.discovery.tcp.ClientImpl.updateTopologyHistory(ClientImpl.java:720)
> 	at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$2700(ClientImpl.java:118)
> 	at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeFailedMessage(ClientImpl.java:1812)
> 	at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1543)
> 	at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1467)
> 	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {noformat}
> {noformat}
> java.lang.AssertionError: Missed message future [rcvCnt=141, acked=0, desc=GridNioRecoveryDescriptor
[acked=0, resendCnt=0, rcvCnt=0, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
[id=6090f64b-e019-440b-9d0e-c3642bd3a006, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503],
discPort=47503, order=3, intOrder=3, lastExchangeTime=1445428027468, loc=false, ver=1.4.1#19700101-sha1:00000000,
isClient=false], connected=false, connectCnt=1, queueLimit=5120]]
> 	at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.ackReceived(GridNioRecoveryDescriptor.java:181)
> 	at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.onHandshake(GridNioRecoveryDescriptor.java:251)
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2331)
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2084)
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1978)
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1914)
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1880)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1066)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1214)
> 	at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.publish(GridClockSyncProcessor.java:305)
> 	at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.access$800(GridClockSyncProcessor.java:54)
> 	at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor$TimeCoordinator.body(GridClockSyncProcessor.java:382)
> 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
