zookeeper-user mailing list archives

From Kuba Lekstan <kueb...@gmail.com>
Subject cluster/ephemeral nodes inconsistency
Date Thu, 13 Nov 2014 16:25:19 GMT
Hello,

Some details:
We have a 5-node cluster, which we use for configuration distribution and
monitoring active instances of our applications. Each application creates
its own ephemeral node, so we know which apps are alive, how many of them
there are and what they are doing.

The problem happened on 4th November: the first time around 4AM, the
second time around 12PM.
The first time it was the middle of the night when I got woken up; the
support guys told me that something was wrong with config distribution.

First I checked the apps for errors but didn't find anything interesting,
then I looked at what was in ZooKeeper (using node-zk-browser).
I noticed that there were 3 ephemeral nodes which had been created on 1st Nov
(while the oldest application had been started on 3rd Nov). I could read
their data but was not able to delete them - I was getting a NONODE exception.

I thought wtf - why can't I delete these nodes? Something very bad must have
happened to ZK.

So I sshed to the leader and tried to read these nodes using the CLI, but I
was not able to - the leader was telling me that such nodes don't exist.
After this I started to ssh to the rest of the nodes in the cluster, trying
to read these nodes on each of them. Finally I found the one server which did
let me read the data of these nodes.
Because of the inconsistency I decided to restart it. The restart did help,
everything went back to a normal state. The ephemeral nodes disappeared.
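
The per-server check described above can be sketched in a few lines of
Python. This is only an illustration, not what was actually run: the server
names, the paths and the disagreeing_paths helper are all made up, and it
assumes you have already collected, for each server separately, the set of
znode paths that server reports (for example by pointing the CLI at one
server at a time).

```python
from typing import Dict, List, Set


def disagreeing_paths(views: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return {path: [servers that report it]} for every path that is
    NOT reported by all servers."""
    all_servers = set(views)
    all_paths = set().union(*views.values())
    result = {}
    for path in sorted(all_paths):
        holders = sorted(s for s, paths in views.items() if path in paths)
        if set(holders) != all_servers:
            result[path] = holders
    return result


# Hypothetical data: zk3 still serves a node the other servers do not have.
views = {
    "zk1": {"/apps/a"},
    "zk2": {"/apps/a"},
    "zk3": {"/apps/a", "/apps/stale"},
}
print(disagreeing_paths(views))
```

A follower may lag slightly behind the leader, so a short-lived difference is
not necessarily a problem. A node that stays visible on only one server for
days, as in this case, is a different matter.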

A similar situation happened at 12PM, but this time I had a lot more time to
look at what was wrong. The second time the problem was about 3 ephemeral
nodes which were created on 1st Nov (again?). This time I dug a bit deeper and
looked into the logs and the 4 letter commands - but could not find anything
interesting except that all 3 of these nodes were created under different
session ids, and ZK had no hosts connected under these session ids.
The solution was similar to the one from 4AM, but this time I also deleted all
files in the ZK data directory.
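
For anyone who wants to repeat the session id check, here is a rough Python
sketch. It is an assumption-laden illustration, not the exact procedure used
above: the helper names are invented, it assumes the "cons" 4 letter command
reports each connection with a field of the form sid=0x..., and it assumes
the ephemeralOwner values were collected beforehand (for example from the
CLI "stat" output). The exact "cons" output format can differ between ZK
versions.

```python
import re
import socket
from typing import Iterable, Set


def four_letter_word(host: str, port: int, word: str, timeout: float = 5.0) -> str:
    """Send a 4 letter command (e.g. "cons") to one server, return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def connected_session_ids(cons_output: str) -> Set[int]:
    """Extract session ids from "cons" output (assumed field: sid=0x...)."""
    return {int(m, 16) for m in re.findall(r"sid=0x([0-9a-fA-F]+)", cons_output)}


def orphaned_owners(ephemeral_owners: Iterable[int], live: Set[int]) -> Set[int]:
    """Ephemeral owners that have no live connection in the given set."""
    return {owner for owner in ephemeral_owners if owner not in live}
```

Note that "cons" only lists the connections of the server it is sent to, so
the live session ids have to be gathered from every server in the cluster
and merged before calling orphaned_owners.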

Oddly enough, the problem happened twice on the same ZK node; the final
solution was to clear the ZK data directory. After clearing the directory the
problem didn't happen again.

I tried to look for a solution or similar problems. I found posts where
people were complaining about ephemeral nodes not being removed after
the client session gets closed, but I was not able to find posts about ZK
not being consistent.

What do you think about this? Can we do something to fix this?

Sorry for my English, I did my best. :)

Thanks, Kuba.
