geode-issues mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-870) 2 locators connecting simultaneously both think they are the coordinator even after one is kicked out as a surprise member
Date Wed, 24 Feb 2016 23:41:18 GMT

    [ https://issues.apache.org/jira/browse/GEODE-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166352#comment-15166352 ]

ASF subversion and git services commented on GEODE-870:
-------------------------------------------------------

Commit 83a6dc31eb16d26175608532f7c6ea8105e64e95 in incubator-geode's branch refs/heads/feature/GEODE-870
from [~ukohlmeyer]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=83a6dc3 ]

GEODE-870: Handling multiple concurrent locator restarts. Elder locator nomination


> 2 locators connecting simultaneously both think they are the coordinator even after one
is kicked out as a surprise member
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-870
>                 URL: https://issues.apache.org/jira/browse/GEODE-870
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Udo Kohlmeyer
>            Assignee: Udo Kohlmeyer
>
> The scenario is to permanently remove a locator from the distributed system.
> Steps to reproduce:
> 1. Start 3 locators
> 2. Start 2 servers
> 3. Stop locator 1
> 4. Stop locators 2 and 3
> 5. Reconfigure locators 2 and 3 without locator 1
> 6. Restart locators 2 and 3
> Both locators think they are the coordinator:
> locator-2 log messages:
> [info 2015/09/14 15:47:13.844 PDT locator-2 <main> tid=0x1] Membership: lead member
is now 192.168.2.7(server-1:67247)<v3>:37028
> [info 2015/09/14 15:47:13.850 PDT locator-2 <FD_SOCK Ping thread> tid=0x46] GemFire
failure detection is now monitoring 192.168.2.7(server-1:67247)<v3>:37028
> [info 2015/09/14 15:47:13.850 PDT locator-2 <main> tid=0x1] This member, 192.168.2.7(locator-2:67411:locator)<ec>:64755,
is becoming group coordinator.
> [info 2015/09/14 15:47:13.854 PDT locator-2 <main> tid=0x1] Membership: sending
new view [[192.168.2.7(locator-2:67411:locator)<ec><v28>:64755|28] [192.168.2.7(server-1:67247)<v3>:37028/7081,
192.168.2.7(server-2:67265)<v4>:43233/7082, 192.168.2.7(locator-2:67411:locator)<ec><v28>:64755/7072]]
(3 mbrs)
> [info 2015/09/14 15:47:13.866 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(server-1:67247)<v3>:37028>.
Now there are 1 non-admin member(s).
> [info 2015/09/14 15:47:13.867 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(server-2:67265)<v4>:43233>.
Now there are 2 non-admin member(s).
> [info 2015/09/14 15:47:13.867 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(locator-2:67411:locator)<ec><v28>:64755>.
Now there are 3 non-admin member(s).
> [info 2015/09/14 15:47:13.869 PDT locator-2 <main> tid=0x1] Membership: Finished
view processing viewID = 28
> [info 2015/09/14 15:47:15.178 PDT locator-2 <main> tid=0x1] Starting server location
for Distribution Locator on boglesbymac[9092]
> locator-3 log messages:
> [info 2015/09/14 15:47:13.846 PDT locator-3 <main> tid=0x1] Membership: lead member
is now 192.168.2.7(server-1:67247)<v3>:37028
> [info 2015/09/14 15:47:13.852 PDT locator-3 <FD_SOCK Ping thread> tid=0x47] GemFire
failure detection is now monitoring 192.168.2.7(server-1:67247)<v3>:37028
> [info 2015/09/14 15:47:13.853 PDT locator-3 <main> tid=0x1] This member, 192.168.2.7(locator-3:67410:locator)<ec>:9461,
is becoming group coordinator.
> [info 2015/09/14 15:47:13.855 PDT locator-3 <main> tid=0x1] Membership: sending
new view [[192.168.2.7(locator-3:67410:locator)<ec><v28>:9461|28] [192.168.2.7(server-1:67247)<v3>:37028/7081,
192.168.2.7(server-2:67265)<v4>:43233/7082, 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461/7073]]
(3 mbrs)
> [info 2015/09/14 15:47:13.868 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(server-1:67247)<v3>:37028>.
Now there are 1 non-admin member(s).
> [info 2015/09/14 15:47:13.868 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(server-2:67265)<v4>:43233>.
Now there are 2 non-admin member(s).
> [info 2015/09/14 15:47:13.869 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>.
Now there are 3 non-admin member(s).
> [info 2015/09/14 15:47:13.870 PDT locator-3 <main> tid=0x1] Membership: Finished
view processing viewID = 28
> [info 2015/09/14 15:47:15.213 PDT locator-3 <main> tid=0x1] Starting server location
for Distribution Locator on boglesbymac[9093]
> Both server logs show locator-3 being admitted, then expired:
> [finest 2015/09/14 15:47:13.888 PDT server-1 <P2P message reader@233ba812> tid=0x71]
Membership: Received message from surprise member: <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>.
My view number is 28 it is 28
> [finest 2015/09/14 15:47:13.888 PDT server-1 <P2P message reader@233ba812> tid=0x71]
Membership: Processing surprise addition <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>
> [info 2015/09/14 15:47:13.889 PDT server-1 <P2P message reader@233ba812> tid=0x71]
Admitting member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>. Now
there are 4 non-admin member(s).
> [info 2015/09/14 15:47:13.896 PDT server-1 <Pooled High Priority Message Processor
4> tid=0x5c] Member 192.168.2.7(locator-2:67411:locator)<ec><v28>:64755 is
equivalent or in the same redundancy zone.
> [info 2015/09/14 15:47:13.900 PDT server-1 <Pooled High Priority Message Processor
5> tid=0x73] Member 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 is equivalent
or in the same redundancy zone.
> [info 2015/09/14 15:49:03.791 PDT server-1 <Timer-4> tid=0x4d] Membership: expiring
membership of surprise member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>
> [finest 2015/09/14 15:49:03.791 PDT server-1 <Timer-4> tid=0x4d] Membership: destroying
< 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
> [finest 2015/09/14 15:49:03.792 PDT server-1 <Timer-4> tid=0x4d] Membership: added
shunned member < 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
> [finest 2015/09/14 15:49:03.792 PDT server-1 <Timer-4> tid=0x4d] Membership: dispatching
uplevel departure event for < 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
> [info 2015/09/14 15:49:03.793 PDT server-1 <Timer-4> tid=0x4d] Member at 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461
unexpectedly left the distributed cache: not seen in membership view in 100000ms
> AFAICT from the logs, locator-3 has no idea it's not the coordinator. The process is still
alive, and its locator thread is still alive:
> "Distribution Locator on boglesbymac[9093]" daemon prio=5 tid=0x00007ffd5e90e000 nid=0x7003
runnable [0x00000001127c8000]
> java.lang.Thread.State: RUNNABLE
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at com.gemstone.org.jgroups.stack.tcpserver.TcpServer.run(TcpServer.java:246)
> at com.gemstone.org.jgroups.stack.tcpserver.TcpServer$2.run(TcpServer.java:196)
> Also, if a client connects to it, it will still answer a findAllServers request with the list of servers.
> This code:
> private void dumpServers() {
>   PoolImpl pool = (PoolImpl) PoolManager.find("pool");
>   AutoConnectionSourceImpl connectionSource =
>       (AutoConnectionSourceImpl) pool.getConnectionSource();
>   List<InetSocketAddress> knownLocators = pool.getLocators();
>   // findAllServers() sends a message to the locator; the result is null-checked below
>   ArrayList<ServerLocation> allServers = connectionSource.findAllServers();
>   System.out.println("Locator " + knownLocators + " knows about the following "
>       + (allServers == null ? 0 : allServers.size()) + " servers:");
>   if (allServers != null) { // avoid a NullPointerException when nothing is returned
>     for (ServerLocation server : allServers) {
>       System.out.println("\t" + server);
>     }
>   }
> }
> Dumps this output from locator-3:
> Locator [localhost/127.0.0.1:9093] knows about the following 2 servers:
> 192.168.2.7:40402
> 192.168.2.7:40401
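
The duplicate coordinator claims in the locator-2 and locator-3 logs quoted above can be spotted mechanically. The following is only a minimal sketch for checking collected logs, not part of Geode and not the fix from the commit. The class name CoordinatorClaims and the method findClaimants are hypothetical, and the sketch assumes each log entry sits on a single line (the wrapping in this message comes from the email) and that the message text matches the "is becoming group coordinator" wording shown in the report.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every member that announced itself as coordinator.
class CoordinatorClaims {
    // Assumed message format, copied from the log lines quoted in the report.
    private static final Pattern CLAIM =
        Pattern.compile("This member, (\\S+),\\s+is becoming group coordinator\\.");

    // Returns the distinct member IDs that claimed the coordinator role, in log order.
    static Set<String> findClaimants(List<String> logLines) {
        Set<String> claimants = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = CLAIM.matcher(line);
            if (m.find()) {
                claimants.add(m.group(1));
            }
        }
        return claimants;
    }

    public static void main(String[] args) {
        // The two claims from the report, joined onto single lines.
        List<String> lines = new ArrayList<>();
        lines.add("[info 2015/09/14 15:47:13.850 PDT locator-2 <main> tid=0x1] "
            + "This member, 192.168.2.7(locator-2:67411:locator)<ec>:64755, is becoming group coordinator.");
        lines.add("[info 2015/09/14 15:47:13.853 PDT locator-3 <main> tid=0x1] "
            + "This member, 192.168.2.7(locator-3:67410:locator)<ec>:9461, is becoming group coordinator.");

        Set<String> claimants = findClaimants(lines);
        if (claimants.size() > 1) {
            System.out.println("Multiple coordinators claimed: " + claimants);
        }
    }
}
```

Fed the merged logs of all locators from one restart, more than one claimant in the result points to the situation described in this issue. A legitimate coordinator handover later in the same log would also show up, so the result needs reading alongside the view IDs.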



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
