ignite-issues mailing list archives

From "Semen Boikov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-1003) Communication issues when running client node in separate subnetwork
Date Tue, 09 Jun 2015 14:57:01 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579022#comment-14579022 ]

Semen Boikov commented on IGNITE-1003:
--------------------------------------

Did some testing with one server and one client, and found one suspicious place in the server
thread dump at the moment when the client complains about the exchange timeout:
{noformat}
"grid-nio-worker-0-#67%null%" prio=10 tid=0x00007ff3888ce800 nid=0x1824 runnable [0x00007ff30dfbd000]
   java.lang.Thread.State: RUNNABLE
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	- locked <0x00000000ed988a28> (a java.net.SocksSocketImpl)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
	at java.net.Socket.connect(Socket.java:579)
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1097)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:541)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:470)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:433)
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.pingNode(TcpDiscoverySpi.java:346)
	at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.tryFailNode(GridDiscoveryManager.java:1459)
	at org.apache.ignite.internal.managers.GridManagerAdapter$1.tryFailNode(GridManagerAdapter.java:484)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2.onDisconnected(TcpCommunicationSpi.java:256)
	at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onExceptionCaught(GridNioFilterChain.java:253)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onExceptionCaught(GridNioCodecFilter.java:74)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onExceptionCaught(GridConnectionBytesVerifyFilter.java:65)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onExceptionCaught(GridNioServer.java:1985)
	at org.apache.ignite.internal.util.nio.GridNioFilterChain.onExceptionCaught(GridNioFilterChain.java:157)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:1521)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1346)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1275)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1159)
	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:108)
	at java.lang.Thread.run(Thread.java:722)
{noformat}

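Here the NIO worker hangs in tryFailNode() (it is stuck in the socket connect issued by pingNode()),
so communication I/O handled by that worker is blocked. tryFailNode() needs to be moved off the
NIO worker thread.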
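
A minimal sketch of that idea, not the actual Ignite change: the disconnect callback hands the blocking check to a separate single-thread executor, so the thread that invoked the callback returns immediately. The class OffloadedFailCheck, the FailureChecker interface and the "fail-check-worker" thread name are hypothetical and exist only for this illustration; the sleep stands in for the slow socket connect seen in the dump.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OffloadedFailCheck {
    /** Hypothetical stand-in for the blocking ping behind tryFailNode(). */
    public interface FailureChecker {
        void tryFailNode(String nodeId) throws Exception;
    }

    /** Dedicated daemon thread, so a slow ping never runs on the I/O thread. */
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "fail-check-worker");
        t.setDaemon(true);
        return t;
    });

    private final FailureChecker checker;

    public OffloadedFailCheck(FailureChecker checker) {
        this.checker = checker;
    }

    /** Meant to be called from the I/O worker; only enqueues the check and returns. */
    public void onDisconnected(String nodeId) {
        pool.execute(() -> {
            try {
                checker.tryFailNode(nodeId);
            }
            catch (Exception e) {
                System.err.println("Failure check for " + nodeId + " failed: " + e);
            }
        });
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(1);

        // Simulate a ping that blocks for a while, like a connect to an unreachable host.
        OffloadedFailCheck listener = new OffloadedFailCheck(nodeId -> {
            Thread.sleep(500);
            done.countDown();
        });

        long start = System.nanoTime();
        listener.onDisconnected("client-1");
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("onDisconnected returned after " + elapsedMs + " ms");

        // Wait for the background check so the demo exits cleanly.
        done.await(5, TimeUnit.SECONDS);
        listener.shutdown();
    }
}
```

With a single-thread executor the checks for different nodes still run one after another, but they only delay each other and no longer delay communication I/O. Whether one thread is enough, or a bounded pool is needed, depends on how many nodes can disconnect at once.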

> Communication issues when running client node in separate subnetwork
> --------------------------------------------------------------------
>
>                 Key: IGNITE-1003
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1003
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: sprint-4
>            Reporter: Valentin Kulichenko
>            Priority: Blocker
>             Fix For: sprint-5
>
>         Attachments: client.zip, server.zip, test.xml
>
>
> Test is the following:
> * Run 8 server nodes on one box.
> * Start and stop a client node in a loop on a different box in a different subnetwork (e.g.,
> over VPN).
> On one of the iterations the node join process will hang for several minutes due to timeouts
> in the initial partition exchange. At some point communication between some of the server nodes
> stops working: messages wait in the queue until the connection is closed and the messages are recovered.
> Attached are the configuration file used to run the test and logs with communication debug
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
