zookeeper-user mailing list archives

From German Blanco <german.blanco.bla...@gmail.com>
Subject Re: cluster/ephemeral nodes inconsistency
Date Thu, 13 Nov 2014 17:11:46 GMT
Hello,

Which version of ZooKeeper are you using?

On Thu, Nov 13, 2014 at 5:25 PM, Kuba Lekstan <kuebzky@gmail.com> wrote:

> Hello,
>
> Some details:
> We have a 5-node cluster, which we use for configuration distribution and
> monitoring active instances of our applications. Each application creates
> its own ephemeral node, so we know which apps are alive, how many of them
> there are, and what they are doing.
>
> The problem happened on 4th November, the first time around 4AM and the
> second time around 12PM.
> The first time was in the middle of the night: I got woken up because the
> support guys told me that something was wrong with config distribution.
>
> First I checked the apps for errors but didn't find anything interesting,
> then I looked at what was in ZooKeeper (using node-zk-browser).
> I noticed that there were 3 ephemeral nodes which had been created on 1st Nov
> (while the oldest application was started on 3rd Nov). I could read their
> data but was not able to delete them - I was getting a NONODE exception.
>
> I thought wtf - why can't I delete these nodes? Something very bad must
> have happened to ZK.
>
> So I sshed to the leader and tried to read these nodes using the CLI, but I
> was not able to - the leader was telling me that these nodes don't exist.
> After this I sshed to the rest of the servers in the cluster and tried
> to read these nodes there. Finally I found the server which did let me read
> the data of these nodes.
> Because of the inconsistency I decided to restart it. The restart helped:
> everything went back to normal and the ephemeral nodes disappeared.
>
> A similar situation happened at 12PM, but this time I had a lot more time to
> look at what was wrong. The second time the problem was about 3 ephemeral
> nodes which were created on 1st Nov (again?). This time I dug a bit deeper
> and looked into the logs and the 4-letter commands, but could not find
> anything interesting except that all 3 nodes were created under different
> session IDs, and ZK had no hosts connected under these session IDs.
> The solution was similar to the one from 4AM, but this time I also deleted
> all files in the ZK data directory.
>
> Oddly enough, the problem happened twice on the same ZK node; the final
> solution was to clear the ZK data directory. After clearing the directory
> the problem didn't happen again.
>
> I tried to look for a solution or similar problems. I found posts where
> people were complaining about ephemeral nodes not being removed after
> the client session gets closed, but I was not able to find posts about ZK
> not being consistent.
>
> What do you think about this? Can we do something to fix this?
>
> Sorry for my English, I was doing my best. :)
>
> Thanks, Kuba.
>
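
The manual check described in the quoted message (reading the same nodes on every server and looking at the 4-letter commands) could be sketched roughly as below, in Python. This is only an illustration, not a tested tool: the function names and the server names `zk1`..`zk3` are hypothetical, and how you collect the set of paths each server reports (CLI, a client library connected to one server at a time) is left open. The only ZooKeeper behaviour assumed is that a server answers a four-letter command such as `cons` or `dump` sent as plain text on its client port and then closes the connection.

```python
import socket


def four_letter_word(host, port, word, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. 'cons', 'dump') to one
    server and return its plain-text reply. Hypothetical helper; it needs
    a reachable server with that command enabled."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def find_inconsistent_paths(views):
    """Compare what each server reports.

    views: {server_name: set of znode paths that server returned}.
    Returns {path: sorted list of servers that did NOT report it} for
    every path that is not visible on all servers.
    """
    all_paths = set().union(*views.values()) if views else set()
    result = {}
    for path in sorted(all_paths):
        missing = sorted(s for s, paths in views.items() if path not in paths)
        if missing:
            result[path] = missing
    return result


# Example with made-up data: one server does not see a node the others do.
views = {
    "zk1": {"/apps/a", "/apps/b"},
    "zk2": {"/apps/a"},
    "zk3": {"/apps/a", "/apps/b"},
}
report = find_inconsistent_paths(views)
```

Any path that shows up in the report is visible on some servers but not on others, which is the situation described in the quoted message; the raw `cons` / `dump` output from each server can then be inspected by hand for the session IDs that own those nodes.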
