zookeeper-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: cluster/ephemeral nodes inconsistency
Date Thu, 15 Jan 2015 00:29:07 GMT
Can you provide more info about the ZooKeeper deployment? Are you running
any other applications alongside the ZooKeeper servers on the same nodes? I
remember seeing this issue when a ZooKeeper server suffers from GC (GC
pauses longer than the session timeout).

Looking at the timestamps of the operations at the individual ZK servers will
also help in triaging this issue. Can you attach/paste the changes
that happened to this znode from the transaction logs? (You can use
ZkLogFormatter or the ZKGrep
<https://issues.apache.org/jira/browse/HELIX-356> tool we wrote in Helix.)
We might be able to understand the sequence of operations.
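
A minimal sketch of the filtering step, for anyone without the Helix tools at
hand. It assumes the transaction log has already been dumped to text, one
transaction per line, and that each line carries a `zxid 0x<hex>` field; that
layout is an assumption and is not shown anywhere in this thread. The function
name `changes_for_znode` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed field layout: each dumped transaction line contains "zxid 0x<hex>".
ZXID_RE = re.compile(r"\bzxid (0x[0-9a-fA-F]+)")


def changes_for_znode(lines: Iterable[str], path: str) -> List[Tuple[int, str]]:
    """Return (zxid, line) pairs for lines that mention `path`, ordered by zxid."""
    # Match the exact path only: reject longer names, parents and child paths.
    path_re = re.compile(r"(?<![\w/.\-])" + re.escape(path) + r"(?![\w/.\-])")
    hits = []
    for line in lines:
        zxid = ZXID_RE.search(line)
        if zxid and path_re.search(line):
            hits.append((int(zxid.group(1), 16), line.rstrip("\n")))
    # Sorting by zxid gives the order in which the ensemble committed the changes.
    return sorted(hits)
```

Running this over the dumped log of each server and comparing the results
side by side should show whether every server applied the same sequence of
changes to the znode, and at which zxid they start to differ.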



On Wed, Jan 14, 2015 at 1:30 PM, Flavio Junqueira <
fpjunqueira@yahoo.com.invalid> wrote:

> Also, what was the last operation that changed the messed up znode and
> when has the operation been executed?
>
> -Flavio
>
> > On 14 Jan 2015, at 12:40, Flavio Junqueira <fpjunqueira@yahoo.com> wrote:
> >
> > But you do observe the session being closed, yes? And the ephemeral can
> > be listed with getChildren but you can't get it with getData, is that right?
> >
> > -Flavio
> >
> >
> > > On Wednesday, January 14, 2015 11:42 AM, Kuba Lekstan <kuebzky@gmail.com> wrote:
> >
> >
> > > German, today it happened on our secondary cluster, which consists of 3
> > > nodes; the leader didn't see the node but the two followers did.
> > >
> > > Flavio, I browsed the logs but was unable to find anything interesting;
> > > only setData operations were issued.
> > >
> > > The problematic znode was last modified on 13 Jan 2015 at 17:xx; we
> > > noticed the issue on 14 Jan 2015 at 11:xx.
> >
> > 2015-01-14 10:52 GMT+01:00 Flavio Junqueira
> <fpjunqueira@yahoo.com.invalid <mailto:fpjunqueira@yahoo.com.invalid>>:
> >
> > > > Hi there,
> > > > I suggest a couple of things here:
> > > > - Use LogFormatter to look into the transaction logs to check the
> > > >   operations that are actually coming across.
> > > > - It would be nice to be able to reproduce it outside your app,
> > > >   ideally as a JUnit test, so that we can start working on it.
> > > > I vaguely remember coming across such a problem, but I'll need to dig
> > > > into it. Does anyone on this list recall a similar problem?
> > > > -Flavio
> > >
> > >      On Wednesday, January 14, 2015 9:14 AM, Kuba Lekstan <
> > > kuebzky@gmail.com <mailto:kuebzky@gmail.com>> wrote:
> > >
> > >
> > >
> > >  German do you have any idea what might be causing these? Today same
> issue
> > > had happen.
> > >
> > > 2014-11-21 5:42 GMT+01:00 Yogesh Patil <patyogesh@gmail.com <mailto:
> patyogesh@gmail.com>>:
> > >
> > > > > Hi Zookeepers,
> > > > > I have also been experiencing a similar problem since yesterday. I
> > > > > have a pretty similar setup, with ephemeral znodes in place for a
> > > > > keep-alive kind of function. I too see that, in spite of the ZK
> > > > > session going down, the ephemeral znodes still live.
> > > > >
> > > > > I am using ZK 3.5.0.
> > > > >
> > > > > Any solution/fix for this type of issue?
> > > >
> > > >
> > > > --
> > > > Sincerely,
> > > >
> > > > *Yogesh Patil*
> > > >
> > > >
> > > >
> > > > > On Thu, Nov 13, 2014 at 2:10 PM, Kuba Lekstan <kuebzky@gmail.com> wrote:
> > > >
> > > > > Sorry, forgot to mention. Version: 3.4.6.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > 2014-11-13 18:11 GMT+01:00 German Blanco <
> > > german.blanco.blanco@gmail.com <mailto:german.blanco.blanco@gmail.com>
> > > > >:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > which version of Zookeeper are you using?
> > > > > >
> > > > > > > On Thu, Nov 13, 2014 at 5:25 PM, Kuba Lekstan <kuebzky@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > > Some details:
> > > > > > > > We have a 5-node cluster, which we use for configuration
> > > > > > > > distribution and for monitoring active instances of our
> > > > > > > > applications. Each application creates its own ephemeral node,
> > > > > > > > so we know which apps are alive, how many of them there are,
> > > > > > > > and what they are doing.
> > > > > > >
> > > > > > > > The problem happened on 4th November, the first time around
> > > > > > > > 4AM and the second time around 12PM.
> > > > > > > > The first time it was the middle of the night when I got woken
> > > > > > > > up; the support guys told me that something was wrong with
> > > > > > > > config distribution.
> > > > > > >
> > > > > > > > First I checked the apps for errors but didn't find anything
> > > > > > > > interesting, then I looked at what was in ZooKeeper (using
> > > > > > > > node-zk-browser).
> > > > > > > > I noticed that there were 3 ephemeral nodes which had been
> > > > > > > > created on 1st Nov (while the oldest application was started
> > > > > > > > on 3rd Nov). I could read their data but was not able to
> > > > > > > > delete them - I was getting a NONODE exception.
> > > > > > >
> > > > > > > > I thought wtf - why can't I delete these nodes? Something very
> > > > > > > > bad must have happened to ZK.
> > > > > > >
> > > > > > > > So I SSHed into the leader and tried to read these nodes using
> > > > > > > > the CLI, but I was not able to - the leader was telling me
> > > > > > > > that such nodes don't exist.
> > > > > > > > After this I started to SSH into the rest of the nodes in the
> > > > > > > > cluster and tried to read these nodes there. Finally I found
> > > > > > > > the server which did let me read the data of these nodes.
> > > > > > > > Because of the inconsistency I decided to restart it. The
> > > > > > > > restart did help; everything went back to a normal state. The
> > > > > > > > ephemeral nodes disappeared.
> > > > > > >
> > > > > > > > A similar situation happened at 12PM, but this time I had a
> > > > > > > > lot more time to look at what was wrong. The second time the
> > > > > > > > problem was about 3 ephemeral nodes which had been created on
> > > > > > > > 1st Nov (again?). This time I dug a bit deeper and looked into
> > > > > > > > the logs and 4 letter commands - but could not find anything
> > > > > > > > interesting, except that all 3 of these nodes were created
> > > > > > > > under different session IDs, and ZK had no hosts connected
> > > > > > > > under these session IDs.
> > > > > > > > The solution was similar to the one from 4AM, but this time I
> > > > > > > > deleted all files in the ZK data directory.
> > > > > > >
> > > > > > > > Oddly enough, the problem happened twice on the same ZK node;
> > > > > > > > the final solution was to clear the ZK data directory. After
> > > > > > > > clearing the directory the problem didn't happen again.
> > > > > > >
> > > > > > > > I tried to look for a solution or similar problems. I found
> > > > > > > > posts where people were complaining about ephemeral nodes not
> > > > > > > > being removed after the client session gets closed, but I was
> > > > > > > > not able to find posts about ZK not being consistent.
> > > > > > >
> > > > > > > > What do you think about this? Can we do something to fix this?
> > > > > > >
> > > > > > > Sorry for my english, I was doing my best. :)
> > > > > > >
> > > > > > > Thanks, Kuba.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>
