Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id E7EA2200C8E for ; Thu, 8 Jun 2017 18:52:29 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id E68ED160BC3; Thu, 8 Jun 2017 16:52:29 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E67D1160BE7 for ; Thu, 8 Jun 2017 18:52:28 +0200 (CEST) Received: (qmail 22084 invoked by uid 500); 8 Jun 2017 16:52:28 -0000 Mailing-List: contact issues-help@ignite.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@ignite.apache.org Delivered-To: mailing list issues@ignite.apache.org Received: (qmail 22071 invoked by uid 99); 8 Jun 2017 16:52:28 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 08 Jun 2017 16:52:28 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id AF4611A001D for ; Thu, 8 Jun 2017 16:52:27 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.001 X-Spam-Level: X-Spam-Status: No, score=-100.001 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 9qEwjKclOW0K for ; Thu, 8 Jun 2017 16:52:25 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 30F2F5FB98 for ; Thu, 8 Jun 2017 16:52:25 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 2DFA9E0DA4 for ; Thu, 8 Jun 2017 16:52:24 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 3561221E12 for ; Thu, 8 Jun 2017 16:52:20 +0000 (UTC) Date: Thu, 8 Jun 2017 16:52:20 +0000 (UTC) From: "Alexei Scherbakov (JIRA)" To: issues@ignite.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (IGNITE-5457) Weird discovery behavior on split brain. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Thu, 08 Jun 2017 16:52:30 -0000 [ https://issues.apache.org/jira/browse/IGNITE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-5457: -------------------------------------- Description: I observe buggy behavior in case of simulated split brain. Nodes in DataCenter1 (where coordinator is located) are slowly leave grid, while nodes in DataCenter2 stay in grid forever. In logs I see multiple attemps to kick coordinator by communcation by socket timeout, but number of nodes does not change. Note what my failureDetectionTimeout is significantly higher than communication socket timeout. Looks like coordinator cannot be kicked from topology by TcpCommuncationSpi. {noformat} 19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000] 19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster [rmtNode=TcpDiscoveryNode [id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1496936257121, loc=false, ver=1.10.3#20170604-sha1:30521a17, isClient=false]] org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] ... 6 common frames omitted Caused by: java.net.SocketTimeoutException: null at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884) ... 6 common frames omitted 19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - Metrics for local node (to disable set 'metricsLogFrequency' to 0) ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200] ^-- H/N/C [hosts=144, nodes=160, CPUs=8064] ^-- CPU [cur=0.2%, avg=2.37%, GC=0%] ^-- PageMemory [pages=604144] ^-- Heap [used=33396MB, free=49.04%, comm=65536MB] ^-- Non heap [used=171MB, free=-1%, comm=173MB] ^-- Public thread pool [active=0, idle=0, qSize=0] ^-- System thread pool [active=0, idle=0, qSize=0] ^-- Outbound messages queue [size=0] {noformat} was: I observe buggy behavior in case of simulated split brain. Nodes in DataCenter1 (where coordinator is located) are slowly leave grid, while nodes in DataCenter2 stay in grid forever. In logs I see multiple attemps to kick coordinator by communcation by socket timeout, but number of nodes does not change. Note what my failureDetectionTimeout is significantly higher than communication socket timeout. Looks like coordinator cannot be kicked from topology by TcpCommuncationSpi. {noformat} 19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=131ms, reason='timeout'] 19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000] 19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster [rmtNode=TcpDiscoveryNode [id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1496936257121, loc=false, ver=1.10.3#20170604-sha1:30521a17, isClient=false]] org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] ... 6 common frames omitted Caused by: java.net.SocketTimeoutException: null at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884) ... 6 common frames omitted 19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=130ms, reason='timeout'] 19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=211ms, reason='timeout'] 19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - Metrics for local node (to disable set 'metricsLogFrequency' to 0) ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200] ^-- H/N/C [hosts=144, nodes=160, CPUs=8064] ^-- CPU [cur=0.2%, avg=2.37%, GC=0%] ^-- PageMemory [pages=604144] ^-- Heap [used=33396MB, free=49.04%, comm=65536MB] ^-- Non heap [used=171MB, free=-1%, comm=173MB] ^-- Public thread pool [active=0, idle=0, qSize=0] ^-- System thread pool [active=0, idle=0, qSize=0] ^-- Outbound messages queue [size=0] {noformat} > Weird discovery behavior on split brain. > ---------------------------------------- > > Key: IGNITE-5457 > URL: https://issues.apache.org/jira/browse/IGNITE-5457 > Project: Ignite > Issue Type: Bug > Components: general > Affects Versions: 2.0 > Reporter: Alexei Scherbakov > Priority: Critical > Fix For: 2.2 > > > I observe buggy behavior in case of simulated split brain. > Nodes in DataCenter1 (where coordinator is located) are slowly leave grid, > while nodes in DataCenter2 stay in grid forever. > In logs I see multiple attemps to kick coordinator by communcation by socket timeout, but number of nodes does not change. > Note what my failureDetectionTimeout is significantly higher than communication socket timeout. > Looks like coordinator cannot be kicked from topology by TcpCommuncationSpi. > {noformat} > 19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000] > 19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster [rmtNode=TcpDiscoveryNode [id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1496936257121, loc=false, ver=1.10.3#20170604-sha1:30521a17, isClient=false]] > org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null] > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10] > ... 6 common frames omitted > Caused by: java.net.SocketTimeoutException: null > at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118) > at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884) > ... 6 common frames omitted > 19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - > Metrics for local node (to disable set 'metricsLogFrequency' to 0) > ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200] > ^-- H/N/C [hosts=144, nodes=160, CPUs=8064] > ^-- CPU [cur=0.2%, avg=2.37%, GC=0%] > ^-- PageMemory [pages=604144] > ^-- Heap [used=33396MB, free=49.04%, comm=65536MB] > ^-- Non heap [used=171MB, free=-1%, comm=173MB] > ^-- Public thread pool [active=0, idle=0, qSize=0] > ^-- System thread pool [active=0, idle=0, qSize=0] > ^-- Outbound messages queue [size=0] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)