Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id A0FB4200CE1 for ; Thu, 31 Aug 2017 18:11:25 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id A82EC16A5F3; Thu, 31 Aug 2017 16:11:11 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id C95C016A5E2 for ; Thu, 31 Aug 2017 18:11:10 +0200 (CEST) Received: (qmail 31714 invoked by uid 500); 31 Aug 2017 16:11:04 -0000 Mailing-List: contact issues-help@ignite.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@ignite.apache.org Delivered-To: mailing list issues@ignite.apache.org Received: (qmail 31699 invoked by uid 99); 31 Aug 2017 16:11:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 31 Aug 2017 16:11:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 5224BC0598 for ; Thu, 31 Aug 2017 16:11:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 4NuAaHVGT27s for ; Thu, 31 Aug 2017 16:11:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id F3CFB5FBBA for ; Thu, 31 Aug 2017 16:11:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 03A8EE0E29 for ; Thu, 31 Aug 2017 16:11:01 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 431E324157 for ; Thu, 31 Aug 2017 16:11:00 +0000 (UTC) Date: Thu, 31 Aug 2017 16:11:00 +0000 (UTC) From: "Ilya Kasnacheev (JIRA)" To: issues@ignite.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (IGNITE-6071) Client may detect necessity for reconnect for too long MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Thu, 31 Aug 2017 16:11:25 -0000 [ https://issues.apache.org/jira/browse/IGNITE-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149185#comment-16149185 ] Ilya Kasnacheev commented on IGNITE-6071: ----------------------------------------- I don't understand what were the circumstances that led to the problem, however I have reproduced it as a wider issue. I've wrote a letter to dev@ titled "Hopeless looping in TcpCommunicationSpi" > Client may detect necessity for reconnect for too long > ------------------------------------------------------ > > Key: IGNITE-6071 > URL: https://issues.apache.org/jira/browse/IGNITE-6071 > Project: Ignite > Issue Type: Bug > Affects Versions: 2.1 > Reporter: Yakov Zhdanov > Assignee: Ilya Kasnacheev > > There was a GC pause on client that caused servers to drop client due to inability to establish TCP communication connection. Then it took some time for client to detect that it has been dropped. During that time client many times attempted to connect to server which can be seen in the logs. After client detected its drop and reconnected servers fired node added event and no log flood can be found any more. > We need to find out why client was reconnecting via communication and did not detect the drop for such a long time. > I hope this can be reproduced in test: > * start 2 servers > * start client > * suspend all client threads with Thread.suspend() - just filter threads of current JVM by name and suspend ones belonging to the client. > {noformat} > [10:12:24,785][WARNING][disco-event-worker-#71%null%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=dd71479c-41ba-443e-b25c-3803a2a94f4f, addrs=[10.44.3.14, 127.0.0.1], sockAddrs=[/127.0.0.1:0, XXX.com/10.44.3.14:0], discPort=0, order=2, intOrder=2, lastExchangeTime=1502269008673, loc=false, ver=2.1.1#20170618-sha1:09ce29e0, isClient=true] > [10:12:24,785][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] Topology snapshot [ver=5, servers=2, clients=1, CPUs=144, heap=76.0GB] > [10:12:24,794][INFO][exchange-worker-#72%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], crd=false, evt=12, node=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1502269944782, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], evtNode=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1502269944782, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], customEvt=null] > [10:12:24,813][INFO][exchange-worker-#72%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], crd=false] > [10:12:24,819][INFO][exchange-worker-#72%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=5, minorTopVer=0], evt=NODE_FAILED, node=dd71479c-41ba-443e-b25c-3803a2a94f4f] > [10:12:28,344][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52474] > [10:12:28,348][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52482] > [10:12:28,356][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52506] > [10:12:28,362][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52522] > [10:12:28,368][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52538] > [10:12:28,374][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52554] > [10:12:28,380][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52570] > [10:12:28,386][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52586] > [10:12:28,392][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52602] > [10:12:28,397][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52618] > [10:12:28,402][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52634] > [10:12:28,407][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52650] > [10:12:28,412][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:52666] > ... > [10:18:32,684][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:43604] > [10:18:32,690][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:43620] > [10:18:32,695][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.44.3.11:47100, rmtAddr=/10.44.3.14:43636] > [10:18:42,831][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=2e80b0f0-21db-451d-a264-34ba16e00ffa, addrs=[10.44.3.14, 127.0.0.1], sockAddrs=[/127.0.0.1:0, gbrdsr000002837.intranet.barcapint.com/10.44.3.14:0], discPort=0, order=6, intOrder=5, lastExchangeTime=1502270322805, loc=false, ver=2.1.1#20170618-sha1:09ce29e0, isClient=true] > [10:18:42,832][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] Topology snapshot [ver=6, servers=2, clients=2, CPUs=144, heap=90.0GB] > [10:18:42,833][INFO][exchange-worker-#72%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=false, evt=10, node=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1502270322815, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], evtNode=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1502270322815, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], customEvt=null] > [10:18:42,851][INFO][exchange-worker-#72%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=false] > [10:18:42,855][INFO][exchange-worker-#72%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_JOINED, node=2e80b0f0-21db-451d-a264-34ba16e00ffa] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)