ignite-user mailing list archives

From Kristian Rosenvold <krosenv...@apache.org>
Subject Misconfigured Ip6 address on node causes cluster outage on node join
Date Thu, 07 Sep 2017 07:48:30 GMT
I've just spent the better part of a week diagnosing our production
cluster. It turned out someone (tm) had turned on IP6 router advertisement
without actually enabling IP6 routing, hence our servers in one datacenter
started picking up IP6 addresses that were not known or resolvable in the
other datacenter (they could reach one another within the same datacenter
on IP6). This would cause total cluster failures upon reconfigurations that
happen after rogue IP6 nodes had joined the cluster.

While most of this is our own fault, there is at least one patch I'd
like to make, but am somewhat unsure of the implications:

Our "ground zero" failure started with an NPE in https://github.com/apache/

I see that this class as well as TcpDiscoverySharedFsIpFinder,
TcpDiscoveryZookeeperIpFinder and TcpDiscoveryGoogleStorageIpFinder all
have an implicit assumption that addr.getAddress() cannot be null (which
was our case upon reconfiguring of an already running cluster).
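
To illustrate the assumption in question, here is a minimal sketch using
only java.net (no Ignite classes). The host name and port are made up for
the example, and the helper hostAddressOrNull is hypothetical. It shows
that an unresolved InetSocketAddress returns null from getAddress(), so
any unguarded addr.getAddress().someCall() would throw an NPE:

```java
import java.net.InetSocketAddress;

class UnresolvedAddressDemo {
    /**
     * Hypothetical helper: returns the textual IP of addr, or null when the
     * address is unresolved (in which case getAddress() itself is null).
     */
    static String hostAddressOrNull(InetSocketAddress addr) {
        return addr.isUnresolved() ? null : addr.getAddress().getHostAddress();
    }

    public static void main(String[] args) {
        // createUnresolved() performs no DNS lookup, so getAddress() is null.
        // Host name and port are illustrative only.
        InetSocketAddress unresolved =
            InetSocketAddress.createUnresolved("node-b.other-dc.invalid", 47500);

        // An IP literal needs no lookup and is always resolved.
        InetSocketAddress resolved = new InetSocketAddress("127.0.0.1", 47500);

        System.out.println("unresolved.isUnresolved() = " + unresolved.isUnresolved());
        System.out.println("unresolved.getAddress()   = " + unresolved.getAddress());
        System.out.println("resolved host address     = " + hostAddressOrNull(resolved));
    }
}
```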

Reading the Ignite code, I see various places asserting that getAddress()
can return null, but the discovery code does not seem to find a null check
necessary. I am assuming this might be because this code is dealing with
addresses from the local host.

I can see two possible solutions to this:

A) Remove any address for which isUnresolved() is true, simply ignoring it,
possibly with a warning.
B) Fail hard with an error message indicating misconfiguration, because
obviously it's not happening to all of you guys?
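
As a rough sketch of what the two options might look like (plain Java, not
the actual IP finder code; the class and method names here are
hypothetical, and java.util.logging stands in for whatever logger the real
code uses):

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.logging.Logger;

class UnresolvedAddressPolicy {
    private static final Logger LOG =
        Logger.getLogger(UnresolvedAddressPolicy.class.getName());

    /** Option A: keep only resolved addresses and warn about the rest. */
    static List<InetSocketAddress> dropUnresolved(Collection<InetSocketAddress> addrs) {
        List<InetSocketAddress> resolved = new ArrayList<>();
        for (InetSocketAddress addr : addrs) {
            if (addr.isUnresolved())
                LOG.warning("Ignoring unresolved address: " + addr);
            else
                resolved.add(addr);
        }
        return resolved;
    }

    /** Option B: fail fast on the first unresolved address. */
    static void requireResolved(Collection<InetSocketAddress> addrs) {
        for (InetSocketAddress addr : addrs) {
            if (addr.isUnresolved())
                throw new IllegalArgumentException(
                    "Unresolved address (possible misconfiguration): " + addr);
        }
    }
}
```

Option A keeps the cluster running at the cost of silently shrinking the
address list; option B surfaces the misconfiguration immediately but turns
one bad address into a hard failure.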


P.S. I'll file a JIRA and a patch if necessary.
