Date: Mon, 22 Feb 2016 21:43:18 +0000 (UTC)
From: "ASF subversion and git services (JIRA)"
To: issues@geode.incubator.apache.org
Reply-To: dev@geode.incubator.apache.org
Subject: [jira] [Commented] (GEODE-870) 2 locators connecting simultaneously both think they are the coordinator even after one is kicked out as a surprise member

    [ https://issues.apache.org/jira/browse/GEODE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157743#comment-15157743 ]

ASF subversion and git services commented on GEODE-870:
-------------------------------------------------------

Commit f7dd4fdf4893d0535f259526f214715e79f62ebc in incubator-geode's branch refs/heads/feature/GEODE-870 from [~ukohlmeyer]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=f7dd4fd ]

GEODE-870: Handling multiple concurrent locator restarts. Elder locator nomination

> 2 locators connecting simultaneously both think they are the coordinator even after one is kicked out as a surprise member
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-870
>                 URL: https://issues.apache.org/jira/browse/GEODE-870
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Udo Kohlmeyer
>            Assignee: Udo Kohlmeyer
>
> The scenario is to permanently remove a locator from the distributed system.
> Steps to reproduce:
> Start 3 locators
> Start 2 servers
> Stop locator 1
> Stop locators 2 and 3
> Reconfigure locators 2 and 3 without locator 1
> Restart locators 2 and 3
>
> Both locators think they are the coordinator:
>
> locator-2 log messages:
> [info 2015/09/14 15:47:13.844 PDT locator-2 tid=0x1] Membership: lead member is now 192.168.2.7(server-1:67247):37028
> [info 2015/09/14 15:47:13.850 PDT locator-2 tid=0x46] GemFire failure detection is now monitoring 192.168.2.7(server-1:67247):37028
> [info 2015/09/14 15:47:13.850 PDT locator-2 tid=0x1] This member, 192.168.2.7(locator-2:67411:locator):64755, is becoming group coordinator.
> [info 2015/09/14 15:47:13.854 PDT locator-2 tid=0x1] Membership: sending new view [[192.168.2.7(locator-2:67411:locator):64755|28] [192.168.2.7(server-1:67247):37028/7081, 192.168.2.7(server-2:67265):43233/7082, 192.168.2.7(locator-2:67411:locator):64755/7072]] (3 mbrs)
> [info 2015/09/14 15:47:13.866 PDT locator-2 tid=0x1] Admitting member <192.168.2.7(server-1:67247):37028>. Now there are 1 non-admin member(s).
> [info 2015/09/14 15:47:13.867 PDT locator-2 tid=0x1] Admitting member <192.168.2.7(server-2:67265):43233>. Now there are 2 non-admin member(s).
> [info 2015/09/14 15:47:13.867 PDT locator-2 tid=0x1] Admitting member <192.168.2.7(locator-2:67411:locator):64755>. Now there are 3 non-admin member(s).
> [info 2015/09/14 15:47:13.869 PDT locator-2 tid=0x1] Membership: Finished view processing viewID = 28
> [info 2015/09/14 15:47:15.178 PDT locator-2 tid=0x1] Starting server location for Distribution Locator on boglesbymac[9092]
>
> locator-3 log messages:
> [info 2015/09/14 15:47:13.846 PDT locator-3 tid=0x1] Membership: lead member is now 192.168.2.7(server-1:67247):37028
> [info 2015/09/14 15:47:13.852 PDT locator-3 tid=0x47] GemFire failure detection is now monitoring 192.168.2.7(server-1:67247):37028
> [info 2015/09/14 15:47:13.853 PDT locator-3 tid=0x1] This member, 192.168.2.7(locator-3:67410:locator):9461, is becoming group coordinator.
> [info 2015/09/14 15:47:13.855 PDT locator-3 tid=0x1] Membership: sending new view [[192.168.2.7(locator-3:67410:locator):9461|28] [192.168.2.7(server-1:67247):37028/7081, 192.168.2.7(server-2:67265):43233/7082, 192.168.2.7(locator-3:67410:locator):9461/7073]] (3 mbrs)
> [info 2015/09/14 15:47:13.868 PDT locator-3 tid=0x1] Admitting member <192.168.2.7(server-1:67247):37028>. Now there are 1 non-admin member(s).
> [info 2015/09/14 15:47:13.868 PDT locator-3 tid=0x1] Admitting member <192.168.2.7(server-2:67265):43233>. Now there are 2 non-admin member(s).
> [info 2015/09/14 15:47:13.869 PDT locator-3 tid=0x1] Admitting member <192.168.2.7(locator-3:67410:locator):9461>. Now there are 3 non-admin member(s).
> [info 2015/09/14 15:47:13.870 PDT locator-3 tid=0x1] Membership: Finished view processing viewID = 28
> [info 2015/09/14 15:47:15.213 PDT locator-3 tid=0x1] Starting server location for Distribution Locator on boglesbymac[9093]
>
> Both server logs show locator-3 being admitted, then expired:
> [finest 2015/09/14 15:47:13.888 PDT server-1 tid=0x71] Membership: Received message from surprise member: <192.168.2.7(locator-3:67410:locator):9461>. My view number is 28 it is 28
> [finest 2015/09/14 15:47:13.888 PDT server-1 tid=0x71] Membership: Processing surprise addition <192.168.2.7(locator-3:67410:locator):9461>
> [info 2015/09/14 15:47:13.889 PDT server-1 tid=0x71] Admitting member <192.168.2.7(locator-3:67410:locator):9461>. Now there are 4 non-admin member(s).
> [info 2015/09/14 15:47:13.896 PDT server-1 tid=0x5c] Member 192.168.2.7(locator-2:67411:locator):64755 is equivalent or in the same redundancy zone.
> [info 2015/09/14 15:47:13.900 PDT server-1 tid=0x73] Member 192.168.2.7(locator-3:67410:locator):9461 is equivalent or in the same redundancy zone.
> [info 2015/09/14 15:49:03.791 PDT server-1 tid=0x4d] Membership: expiring membership of surprise member <192.168.2.7(locator-3:67410:locator):9461>
> [finest 2015/09/14 15:49:03.791 PDT server-1 tid=0x4d] Membership: destroying < 192.168.2.7(locator-3:67410:locator):9461 >
> [finest 2015/09/14 15:49:03.792 PDT server-1 tid=0x4d] Membership: added shunned member < 192.168.2.7(locator-3:67410:locator):9461 >
> [finest 2015/09/14 15:49:03.792 PDT server-1 tid=0x4d] Membership: dispatching uplevel departure event for < 192.168.2.7(locator-3:67410:locator):9461 >
> [info 2015/09/14 15:49:03.793 PDT server-1 tid=0x4d] Member at 192.168.2.7(locator-3:67410:locator):9461 unexpectedly left the distributed cache: not seen in membership view in 100000ms
>
> AFAICT from the logs, locator-3 has no idea it's not the coordinator.
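
The conflict in the locator logs above can be spotted mechanically: both locators announce a view with the same view ID (28) but name themselves as its coordinator. The following is a minimal sketch of such a check, not part of Geode. The class name `ViewConflictCheck`, its method names, and the regular expression are hypothetical, and the pattern assumes the "sending new view [[coordinator|viewId] ...]" layout exactly as it appears in the quoted logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log analysis; it does not use any Geode API.
public class ViewConflictCheck {
    // Captures the coordinator and the view ID from
    // "sending new view [[<coordinator>|<viewId>] [...]]" (assumed log layout).
    private static final Pattern NEW_VIEW =
        Pattern.compile("sending new view \\[\\[(.+?)\\|(\\d+)\\]");

    /** Maps each view ID to the distinct coordinators that announced it. */
    public static Map<Integer, List<String>> coordinatorsByViewId(List<String> logLines) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = NEW_VIEW.matcher(line);
            if (!m.find()) {
                continue; // not a view announcement
            }
            int viewId = Integer.parseInt(m.group(2));
            List<String> coordinators = result.computeIfAbsent(viewId, k -> new ArrayList<>());
            if (!coordinators.contains(m.group(1))) {
                coordinators.add(m.group(1));
            }
        }
        return result;
    }

    /** Returns the view IDs that were announced by more than one coordinator. */
    public static List<Integer> conflictingViewIds(List<String> logLines) {
        List<Integer> conflicts = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : coordinatorsByViewId(logLines).entrySet()) {
            if (e.getValue().size() > 1) {
                conflicts.add(e.getKey());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // The two "sending new view" lines from the locator-2 and locator-3 logs.
        List<String> lines = Arrays.asList(
            "[info 2015/09/14 15:47:13.854 PDT locator-2 tid=0x1] Membership: sending new view [[192.168.2.7(locator-2:67411:locator):64755|28] [192.168.2.7(server-1:67247):37028/7081, 192.168.2.7(server-2:67265):43233/7082, 192.168.2.7(locator-2:67411:locator):64755/7072]] (3 mbrs)",
            "[info 2015/09/14 15:47:13.855 PDT locator-3 tid=0x1] Membership: sending new view [[192.168.2.7(locator-3:67410:locator):9461|28] [192.168.2.7(server-1:67247):37028/7081, 192.168.2.7(server-2:67265):43233/7082, 192.168.2.7(locator-3:67410:locator):9461/7073]] (3 mbrs)");
        System.out.println(conflictingViewIds(lines)); // prints [28]
    }
}
```

Run against the combined locator logs, a check like this would flag view 28, which is the symptom described in this issue. It only detects the conflict after the fact and says nothing about how the membership layer should resolve it.
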
> The process is still alive, and its locator thread is still alive:
> "Distribution Locator on boglesbymac[9093]" daemon prio=5 tid=0x00007ffd5e90e000 nid=0x7003 runnable [0x00000001127c8000]
>    java.lang.Thread.State: RUNNABLE
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:530)
>     at java.net.ServerSocket.accept(ServerSocket.java:498)
>     at com.gemstone.org.jgroups.stack.tcpserver.TcpServer.run(TcpServer.java:246)
>     at com.gemstone.org.jgroups.stack.tcpserver.TcpServer$2.run(TcpServer.java:196)
>
> Also, if a client connects to it, it will still return the servers in response to a findAllServers request.
> This code:
> private void dumpServers() {
>   PoolImpl pool = (PoolImpl) PoolManager.find("pool");
>   AutoConnectionSourceImpl connectionSource = (AutoConnectionSourceImpl) pool.getConnectionSource();
>   List<InetSocketAddress> knownLocators = pool.getLocators();
>   ArrayList<ServerLocation> allServers = connectionSource.findAllServers(); // message to locator
>   System.out.println("Locator " + knownLocators + " knows about the following " + (allServers == null ? 0 : allServers.size()) + " servers:");
>   if (allServers == null) {
>     return; // nothing to list; avoids a NullPointerException in the loop below
>   }
>   for (ServerLocation server : allServers) {
>     System.out.println("\t" + server);
>   }
> }
> Dumps this output from locator-3:
> Locator [localhost/127.0.0.1:9093] knows about the following 2 servers:
>     192.168.2.7:40402
>     192.168.2.7:40401

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)