On 25 June 2015 at 07:28, Round, Mark <Mark.Round@sky.uk> wrote:
> I have a 5-node Zookeeper 3.4.6 cluster across 3 data centres (2
> zookeepers in each “main” DC, and a 5th in a 3rd DC for quorum). I see that
> the two nodes in one DC have regular “issues” where they get kicked out of
> the cluster and the ZooKeeperServer process stops for a few minutes until
> the node rejoins. I’d like to know a couple of things, if someone could
> please point me in the direction of the relevant docs I’d greatly
> appreciate it.
>
> 1.) Is it expected behaviour that when a node is kicked from the cluster,
> it will not be allowed to re-join for a period ? From the logs below I can
> see that re-establishing a valid cluster took around 15 minutes.
>
I don't think so.
> 2.) It appears that the leader closes connections to the affected followers
> after a “transaction timeout” occurs. Where would I find out what this
> timeout is ? Is this the same thing as a session timeout (e.g. The default
> of 20 * tickTime) ?
>
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496
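As a rough sketch of how these timeouts relate to the config (not taken from the thread): my reading of LearnerHandler is that the leader's read timeout on a follower connection is tickTime * initLimit during the initial sync and tickTime * syncLimit afterwards, which is separate from client session timeouts (bounded by default to 2 * tickTime and 20 * tickTime). The function name and the example values below are hypothetical, and the multipliers should be checked against the linked source for your version.

```python
def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Derive the timeouts discussed above from zoo.cfg-style settings.

    Assumption: the leader-to-follower read timeouts are tickTime * initLimit
    (initial sync) and tickTime * syncLimit (steady state). The session
    bounds are the documented defaults for minSessionTimeout and
    maxSessionTimeout.
    """
    return {
        "init_timeout_ms": tick_time_ms * init_limit,
        "sync_timeout_ms": tick_time_ms * sync_limit,
        "min_session_timeout_ms": 2 * tick_time_ms,
        "max_session_timeout_ms": 20 * tick_time_ms,
    }


if __name__ == "__main__":
    # Example values only (assumed): tickTime=2000, initLimit=10, syncLimit=5.
    for name, value in quorum_timeouts_ms(2000, 10, 5).items():
        print(f"{name}: {value}")
```

If that reading is right, a follower in a remote DC that stays silent for longer than tickTime * syncLimit would be dropped by the leader well before any client session timeout comes into play.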
> 3.) Where can I find the definition of the different fields in the
> election log messages (I.e. What are “n.round”, “n.zxid”, “n.state” and so
> on) ?
Not sure if there's a better source than the source:
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L687
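For what it's worth, here is a small sketch of how I read those fields, modelled on the Notification class and the vote comparison in FastLeaderElection. The Python names (Notification, supersedes) are mine, the field comments are my interpretation, and the comparison ignores quorum weights, so treat it as an illustration rather than the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    leader: int          # n.leader: server id the sender is voting for
    zxid: int            # n.zxid: last zxid of the proposed leader
    election_round: int  # n.round: sender's election round (logical clock)
    state: str           # n.state: sender's state, e.g. LOOKING or FOLLOWING
    sid: int             # n.sid: server id of the sender
    peer_epoch: int      # n.peerEpoch: epoch of the proposed leader


def supersedes(new: Notification, cur: Notification) -> bool:
    """Return True if the vote in `new` should replace the vote in `cur`.

    Assumption: votes are ordered by epoch, then zxid, then server id,
    with the higher value winning at each step.
    """
    return (new.peer_epoch, new.zxid, new.leader) > (
        cur.peer_epoch,
        cur.zxid,
        cur.leader,
    )


if __name__ == "__main__":
    mine = Notification(leader=1, zxid=0x100000005, election_round=3,
                        state="LOOKING", sid=1, peer_epoch=1)
    theirs = Notification(leader=2, zxid=0x100000009, election_round=3,
                          state="LOOKING", sid=2, peer_epoch=1)
    print(supersedes(theirs, mine))
```

With equal epochs, the vote carrying the higher zxid wins, and the server id only breaks ties.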
-rgs