geode-issues mailing list archives

From Juan José Ramos Cassella (JIRA) <j...@apache.org>
Subject [jira] [Assigned] (GEODE-6977) WAN Auto Discovery Failure
Date Wed, 17 Jul 2019 14:26:00 GMT

     [ https://issues.apache.org/jira/browse/GEODE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan José Ramos Cassella reassigned GEODE-6977:
-----------------------------------------------

    Assignee: Juan José Ramos Cassella

> WAN Auto Discovery Failure
> --------------------------
>
>                 Key: GEODE-6977
>                 URL: https://issues.apache.org/jira/browse/GEODE-6977
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership, wan
>            Reporter: Juan José Ramos Cassella
>            Assignee: Juan José Ramos Cassella
>            Priority: Major
>
> The topology for the test is 4 clusters with 4 locators and 4 servers each; the sites are connected in a ring topology (cluster1 knows about cluster2, cluster2 knows about cluster3, cluster3 knows about cluster4, and cluster4 knows about cluster1).
> After startup finished, it was found that one locator (out of sixteen) didn't know about another locator in another site.
> According to the docs:
> {noformat}
> Discovery for Multi-Site Systems
> Each Geode cluster in a WAN configuration uses locators to discover remote clusters as
well as local members.
> Each locator in a WAN configuration defines a unique distributed-system-id property that
identifies the local cluster to which it belongs. A locator uses the remote-locators property
to define the addresses of one or more locators in remote clusters to use for WAN distribution.
> When a locator starts up, it contacts each locator that is configured in the remote-locators
property to exchange information about the available locators and gateway receivers in the
cluster. The locator also shares information about locators and gateway receivers in any other
Geode clusters that have connected to the cluster. Connected clusters can then use the shared
gateway receiver information to distribute region events according to their configured gateway
senders.
> Each time a new locator starts up or an existing locator shuts down, the changed information
is broadcast to other connected Geode clusters.
> {noformat}
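> As an illustration of how each site in the ring above is wired together, here is a minimal sketch of starting a cluster1 locator with the {{distributed-system-id}} and {{remote-locators}} properties pointing at cluster2. It uses the public {{Locator}} API; the host names and ports are hypothetical, and the actual test starts its locators through the test framework rather than like this:
> {noformat}
> import java.io.File;
> import java.io.IOException;
> import java.util.Properties;
> 
> import org.apache.geode.distributed.ConfigurationProperties;
> import org.apache.geode.distributed.Locator;
> 
> public class Cluster1LocatorSketch {
>   public static void main(String[] args) throws IOException {
>     Properties props = new Properties();
>     // Identifies the local cluster (cluster1 in the ring).
>     props.setProperty(ConfigurationProperties.DISTRIBUTED_SYSTEM_ID, "1");
>     // Locators of the next cluster in the ring (cluster2); host/port are made up.
>     props.setProperty(ConfigurationProperties.REMOTE_LOCATORS, "cluster2-host[10334]");
>     // Locators of the local cluster.
>     props.setProperty(ConfigurationProperties.LOCATORS, "cluster1-host[10334]");
> 
>     // Starts an embedded locator and its distributed system on port 10334.
>     Locator.startLocatorAndDS(10334, new File("locator.log"), props);
>   }
> }
> {noformat}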
> I've added some extra logging to the {{LocatorMembershipListenerImpl}} class and reproduced the issue from scratch. In this particular run, {{locator_1_1}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in {{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The member {{locator_1_1}} is the last one to start up:
> {noformat}
> [info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234> tid=0x15]
Locator started on 10.32.111.135[22488]
> [info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759> tid=0x17]
Locator started on 10.32.109.20[23639]
> [info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840> tid=0x17]
Locator started on 10.32.109.20[24645]
> [info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306> tid=0x15]
Locator started on 10.32.111.135[23280]
> [info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863> tid=0x15]
Locator started on 10.32.109.20[29961]
> [info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780> tid=0x15]
Locator started on 10.32.109.20[27708]
> [info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860> tid=0x15]
Locator started on 10.32.109.20[22615]
> [info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797> tid=0x17]
Locator started on 10.32.109.20[25209]
> [info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766> tid=0x15]
Locator started on 10.32.109.20[24073]
> [info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246> tid=0x15]
Locator started on 10.32.111.135[28694]
> [info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849> tid=0x15]
Locator started on 10.32.109.20[20182]
> [info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314> tid=0x15]
Locator started on 10.32.111.135[22655]
> [info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229> tid=0x15]
Locator started on 10.32.111.135[28293]
> [info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297> tid=0x15]
Locator started on 10.32.111.135[22923]
> [info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326> tid=0x15]
Locator started on 10.32.111.135[26144]
> [info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225> tid=0x15]
Locator started on 10.32.111.135[23955]
> {noformat}
> After {{locator_1_1}} joins the cluster, every locator informs the other locators about
this new addition (and also about the other locators known so far), as expected.
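> For context, the stack trace further down shows that this exchange runs on a background thread that opens a plain TCP connection to each remote locator. The following is a hypothetical, simplified sketch of that one-shot fan-out (not Geode's actual code); the 1-second connect timeout and the debug-only failure handling mirror what is described later in this ticket:
> {noformat}
> import java.io.ObjectOutputStream;
> import java.io.Serializable;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> import java.util.List;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> 
> // Hypothetical sketch: each known locator is told about the new locator exactly once,
> // with a short connect timeout and no retry.
> public class LocatorJoinFanOutSketch {
>   private static final int CONNECT_TIMEOUT_MS = 1000; // the 1-second timeout discussed below
> 
>   private final ExecutorService executor = Executors.newCachedThreadPool();
> 
>   public void locatorJoined(List<InetSocketAddress> knownLocators, Serializable joinMessage) {
>     for (InetSocketAddress remote : knownLocators) {
>       executor.submit(() -> {
>         try (Socket socket = new Socket()) {
>           socket.connect(remote, CONNECT_TIMEOUT_MS); // SocketTimeoutException if the peer is busy or paused
>           new ObjectOutputStream(socket.getOutputStream()).writeObject(joinMessage);
>         } catch (Exception e) {
>           // Mirrors the observed behavior: the failure is only logged at debug level,
>           // and the remote locator never learns about the new member.
>           System.err.println("could not exchange locator information with " + remote + ": " + e);
>         }
>       });
>     }
>   }
> }
> {noformat}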
> The logs below show that all locators tried to share the addition of locator {{locator_1_1
rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> /locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT <Thread-26>
tid=0x7c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT <Thread-24>
tid=0x6c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT <Thread-44>
tid=0xa7] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT <Thread-25>
tid=0x6d] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT <Thread-64>
tid=0xa2] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT <Thread-29>
tid=0x78] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT <Thread-42>
tid=0x93] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT <Thread-23>
tid=0x76] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT <Thread-25>
tid=0x6c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT <Thread-30>
tid=0x7f] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT <Thread-23>
tid=0x74] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT <Thread-25>
tid=0x77] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT <Thread-53>
tid=0xb1] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT <Thread-28>
tid=0xa1] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955]
Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> {noformat}
> One second later, every locator also logs that it failed to share the information about {{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> ./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT <Thread-26>
tid=0x7c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT <Thread-24>
tid=0x6c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT <Thread-44>
tid=0xa7] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT <Thread-25>
tid=0x6d] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT <Thread-64>
tid=0xa2] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT <Thread-29>
tid=0x78] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT <Thread-42>
tid=0x93] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT <Thread-23>
tid=0x76] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT <Thread-25>
tid=0x6c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT <Thread-30>
tid=0x7f] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT <Thread-23>
tid=0x74] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT <Thread-25>
tid=0x77] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT <Thread-53>
tid=0xb1] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT <Thread-28>
tid=0xa1] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955
with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> {noformat}
> This is also backed up by the fact that {{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}) is the only locator that never logs receiving the {{LocatorJoinMessage}} for {{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}:
> {noformat}
> ##### grep -lR "LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]"
. | sort
> ./locatorgemfire_1_1_6225/system.log
> ./locatorgemfire_1_2_6229/system.log
> ./locatorgemfire_1_3_6234/system.log
> ./locatorgemfire_1_4_6246/system.log
> ./locatorgemfire_2_1_8759/system.log
> ./locatorgemfire_2_2_8766/system.log
> ./locatorgemfire_2_3_8780/system.log
> ./locatorgemfire_2_4_8797/system.log
> ./locatorgemfire_3_1_6297/system.log
> ./locatorgemfire_3_2_6306/system.log
> ./locatorgemfire_3_4_6326/system.log
> ./locatorgemfire_4_1_8840/system.log
> ./locatorgemfire_4_2_8849/system.log
> ./locatorgemfire_4_3_8860/system.log
> ./locatorgemfire_4_4_8863/system.log
> {noformat}
> The exception when trying to exchange the locator information {{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}}
with {{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on every single
locator:
> {noformat}
> java.net.SocketTimeoutException: connect timed out
> 	at java.net.PlainSocketImpl.socketConnect(Native Method)
> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 	at java.net.Socket.connect(Socket.java:589)
> 	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
> 	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
> 	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
> 	at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
> 	at org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
> {noformat}
> Even though there are no errors in the logs for {{locator_3_3}} within the relevant time frame, the statistics show that the CPU is at 100% and that no samples were taken for around 5 seconds, probably due to a short JVM pause or resource contention; that is most likely why the connections from the other locators failed. There were also no further additions or deletions of locators within any of the clusters afterwards (which would have triggered another round of messages from every locator to every other known locator), so in the end it is "expected" that {{locator_3_3}} didn't have the information about {{locator_1_1}} in its internal map of known locators.
> There's no retry logic within {{LocatorJoinMessage.locatorJoined}}, and the 1-second timeout seems too short: any locator might be busy for a short period of time and, as a result, the information about the new locator will be missed until the next locator membership change. This ticket is to change that: increase the timeout and retry a couple of times by default.
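> A rough sketch of the proposed direction is below: a longer connect timeout plus a bounded number of attempts with a small pause between them. The timeout, retry count and names are placeholders, not the final implementation:
> {noformat}
> import java.io.ObjectOutputStream;
> import java.io.Serializable;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> 
> // Hypothetical retry wrapper around the per-locator exchange: a longer connect
> // timeout and a few attempts, instead of a single 1-second try.
> public class LocatorJoinRetrySketch {
>   private static final int CONNECT_TIMEOUT_MS = 5000; // placeholder: larger than the current 1 second
>   private static final int MAX_ATTEMPTS = 3;          // placeholder: "retry a couple of times by default"
>   private static final long RETRY_PAUSE_MS = 1000;
> 
>   public boolean exchangeWithRetries(InetSocketAddress remote, Serializable joinMessage) {
>     for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
>       try (Socket socket = new Socket()) {
>         socket.connect(remote, CONNECT_TIMEOUT_MS);
>         new ObjectOutputStream(socket.getOutputStream()).writeObject(joinMessage);
>         return true; // the remote locator received the join information
>       } catch (Exception e) {
>         if (attempt == MAX_ATTEMPTS) {
>           return false; // give up only after the configured number of attempts
>         }
>         try {
>           Thread.sleep(RETRY_PAUSE_MS); // give a busy or briefly paused locator time to recover
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>           return false;
>         }
>       }
>     }
>     return false;
>   }
> }
> {noformat}
> With something like this, a locator that is briefly unresponsive (as {{locator_3_3}} was during its roughly 5-second pause) would still receive the join information on a later attempt.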



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
