ignite-issues mailing list archives

From "Andrey Gura (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-4003) Slow or faulty client can stall the whole cluster.
Date Sun, 25 Dec 2016 19:19:58 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15776941#comment-15776941
] 

Andrey Gura commented on IGNITE-4003:
-------------------------------------

At the moment there are several problems. I think they all share a common root cause: a
race condition on the {{clients}} map and {{GridNioRecoveryDescriptor}}. The following
behaviours lead to infinite connection retries to the remote node:

# The client was removed from the {{clients}} map on the local node but still exists on the
remote node. As a result, the remote node rejects the connection with the "Received incoming
connection when already connected to this node, rejecting" message.
# The remote node rejects the connection because it is already trying to connect to the local
node but cannot reserve the recovery descriptor. It rejects the connection with the same
"Received incoming connection when already connected to this node, rejecting" message.
# The remote node does not reject the connection but does not send {{RecoveryLastReceivedMessage}},
because the recovery descriptor cannot be reserved in the {{tryReserve}} call.
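
The three cases above can be illustrated with a small toy model. This is only a sketch under
stated assumptions, not the actual {{TcpCommunicationSpi}} code: the classes {{Node}} and
{{Descriptor}}, the {{connectingTo}} set, the {{Outcome}} enum and the {{connectWithRetries}}
helper are all hypothetical names introduced for illustration. It shows that as long as the
state on the remote side is not reconciled, every retry ends the same way.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model only: none of these types are Ignite classes. */
public class HandshakeRaceSketch {
    /** Result of one incoming connection attempt in this model. */
    enum Outcome { ACCEPTED, REJECTED_ALREADY_CONNECTED, ACCEPTED_WITHOUT_ACK }

    /** Hypothetical stand-in for a recovery descriptor: one holder at a time. */
    static final class Descriptor {
        private final AtomicBoolean reserved = new AtomicBoolean();

        boolean tryReserve() { return reserved.compareAndSet(false, true); }

        void release() { reserved.set(false); }
    }

    /** Hypothetical node state: a clients map plus one descriptor per peer. */
    static final class Node {
        final Map<UUID, Object> clients = new ConcurrentHashMap<>();
        final Set<UUID> connectingTo = ConcurrentHashMap.newKeySet();
        final Map<UUID, Descriptor> descriptors = new ConcurrentHashMap<>();

        Descriptor descriptor(UUID peer) {
            return descriptors.computeIfAbsent(peer, id -> new Descriptor());
        }

        Outcome onIncoming(UUID from) {
            // Case 1: a stale entry for the peer is still present in the map.
            if (clients.containsKey(from))
                return Outcome.REJECTED_ALREADY_CONNECTED;

            boolean reserved = descriptor(from).tryReserve();

            // Case 2: our own outgoing attempt is pending and the descriptor is taken.
            if (!reserved && connectingTo.contains(from))
                return Outcome.REJECTED_ALREADY_CONNECTED;

            // Case 3: the connection is kept, but the handshake reply is never sent.
            if (!reserved)
                return Outcome.ACCEPTED_WITHOUT_ACK;

            clients.put(from, new Object());
            return Outcome.ACCEPTED;
        }
    }

    /** Returns the attempt number that succeeded, or -1 if all attempts failed. */
    static int connectWithRetries(Node remote, UUID localId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (remote.onIncoming(localId) == Outcome.ACCEPTED)
                return attempt;
        }
        return -1;
    }

    public static void main(String[] args) {
        UUID localId = UUID.randomUUID();
        Node remote = new Node();

        // Case 1: the remote node still holds a stale client for the local node.
        remote.clients.put(localId, new Object());
        System.out.println(connectWithRetries(remote, localId, 5)); // -1

        // Once the stale entry is gone, the first attempt succeeds.
        remote.clients.remove(localId);
        System.out.println(connectWithRetries(remote, localId, 5)); // 1
    }
}
```

The retry loop is bounded here only to keep the example finite. In the model, nothing inside
the loop changes the remote state, which is why an unbounded loop would never terminate.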



> Slow or faulty client can stall the whole cluster.
> --------------------------------------------------
>
>                 Key: IGNITE-4003
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4003
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache, general
>    Affects Versions: 1.7
>            Reporter: Vladimir Ozerov
>            Assignee: Andrey Gura
>            Priority: Critical
>             Fix For: 2.0
>
>
> Steps to reproduce:
> 1) Start two server nodes and put some data into the cache.
> 2) Start a client from a Docker subnet that is not visible from the outside. The client will
join the cluster.
> 3) Try to put something into the cache or start another node to force a rebalance.
> The cluster is stuck at this point. Root cause: the servers are constantly trying to establish
an outgoing connection to the client, but fail because the Docker subnet is not visible from
the outside. This may stop virtually all cluster operations.
> Typical thread dump:
> {code}
> org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left
the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode
[id=a15d74c2-1ec2-4349-9640-aeacd70d8714, addrs=[127.0.0.1, 172.17.0.6], sockAddrs=[/127.0.0.1:0,
/127.0.0.1:0, /172.17.0.6:0], discPort=0, order=7241, intOrder=3707, lastExchangeTime=1474096941045,
loc=false, ver=1.5.23#20160526-sha1:259146da, isClient=true], topic=T4 [topic=TOPIC_CACHE,
id1=949732fd-1360-3a58-8d9e-0ff6ea6182cc, id2=a15d74c2-1ec2-4349-9640-aeacd70d8714, id3=2],
msg=GridContinuousMessage [type=MSG_EVT_NOTIFICATION, routineId=7e13c48e-6933-48b2-9f15-8d92007930db,
data=null, futId=null], policy=2]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1129)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1347)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1227)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1198)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1180)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:841)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:800)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:787)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:91)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:412)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:343)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:250)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3476)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtForceKeysFuture$MiniFuture.onResult(GridDhtForceKeysFuture.java:548)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtForceKeysFuture.onResult(GridDhtForceKeysFuture.java:207)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.processForceKeyResponse(GridDhtPreloader.java:636)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.access$1000(GridDhtPreloader.java:81)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.onMessage(GridDhtPreloader.java:202)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.onMessage(GridDhtPreloader.java:200)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$MessageHandler.apply(GridDhtPreloader.java:877)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$MessageHandler.apply(GridDhtPreloader.java:859)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:582)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:280)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:204)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:80)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:163)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1058)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:836)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$1700(GridIoManager.java:104)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:799)
[ignite-core-1.5.23.jar:1.5.23]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
> Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote
node: TcpDiscoveryNode [id=a15d74c2-1ec2-4349-9640-aeacd70d8714, addrs=[127.0.0.1, 172.17.0.6],
sockAddrs=[/127.0.0.1:0, /127.0.0.1:0, /172.17.0.6:0], discPort=0, order=7241, intOrder=3707,
lastExchangeTime=1474096941045, loc=false, ver=1.5.23#20160526-sha1:259146da, isClient=true]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1986)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1926)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1124)
[ignite-core-1.5.23.jar:1.5.23]
> 	... 32 common frames omitted
> Caused by: org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node
still alive?). Make sure that each GridComputeTask and GridCacheTransaction has a timeout
set in order to prevent parties from waiting forever in case of network issues [nodeId=a15d74c2-1ec2-4349-9640-aeacd70d8714,
addrs=[/172.17.0.6:47100, /127.0.0.1:47100]]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2489)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2130)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2024)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1960)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1926)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1124)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1347)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1227)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1198)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1180)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:841)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:800)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:787)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:91)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:412)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:343)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:250)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3476)
[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$MiniFuture.onResult(GridDhtLockFuture.java:1213)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onResult(GridDhtLockFuture.java:529)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.processDhtLockResponse(GridDhtTransactionalCacheAdapter.java:639)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.access$100(GridDhtTransactionalCacheAdapter.java:89)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$5.apply(GridDhtTransactionalCacheAdapter.java:151)
~[ignite-core-1.5.23.jar:1.5.23]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$5.apply(GridDhtTransactionalCacheAdapter.java:149)
~[ignite-core-1.5.23.jar:1.5.23]
> 	... 12 common frames omitted
> 	Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address:
/172.17.0.6:47100
> 		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2494)
~[ignite-core-1.5.23.jar:1.5.23]
> 		... 35 common frames omitted
> 	Caused by: java.net.SocketTimeoutException: null
> 		at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
> 		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2353)
> 		... 35 common frames omitted
> 	Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address:
/127.0.0.1:47100
> 		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2494)
~[ignite-core-1.5.23.jar:1.5.23]
> 		... 35 common frames omitted
> 	Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected
[expected=a15d74c2-1ec2-4349-9640-aeacd70d8714, rcvd=48cccf25-7c29-4048-bd52-704acdb552e6]
> 		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2604)
> 		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2361)
> 		... 35 common frames omitted
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
