MIME-Version: 1.0
In-Reply-To: <9B6081EB-8AC6-45E3-9036-063E9210B6AF@yahoo.com>
References: <77166901.1003416.1421239240841.JavaMail.yahoo@jws10630.mail.bf1.yahoo.com>
 <9B6081EB-8AC6-45E3-9036-063E9210B6AF@yahoo.com>
Date: Wed, 14 Jan 2015 16:29:07 -0800
Subject: Re: cluster/ephemeral nodes inconsistency
From: kishore g
To: "user@zookeeper.apache.org"
Cc: Flavio Junqueira
Content-Type: multipart/alternative; boundary=047d7b87501cb5861c050ca5f162

--047d7b87501cb5861c050ca5f162
Content-Type: text/plain; charset=UTF-8

Can you provide more info about the ZooKeeper deployment? Are you running
any other applications alongside the ZooKeeper servers on the same nodes?
I remember seeing this issue when a ZooKeeper server suffers from GC
(GC pauses longer than the session timeout). Looking at the timestamps of
the operations at the individual ZK servers will also help in triaging
this issue.

Can you attach/paste the changes that happened to this znode from the
transaction logs? (You can use ZkLogFormatter or the ZKGrep tool we wrote
in Helix.) We might be able to understand the sequence of operations.

On Wed, Jan 14, 2015 at 1:30 PM, Flavio Junqueira
<fpjunqueira@yahoo.com.invalid> wrote:

> Also, what was the last operation that changed the messed-up znode, and
> when was the operation executed?
>
> -Flavio
>
> > On 14 Jan 2015, at 12:40, Flavio Junqueira wrote:
> >
> > But you do observe the session being closed, yes? And the ephemeral can
> > be listed with getChildren but you can't get it with getData, is that
> > right?
> >
> > -Flavio
> >
> >
> > On Wednesday, January 14, 2015 11:42 AM, Kuba Lekstan wrote:
> >
> >
> > German, today it happened on our secondary cluster, which consists of
> > 3 nodes. The leader didn't see the node but the two followers did.
> >
> > Flavio, I browsed the logs but was unable to find anything interesting;
> > only setData operations were issued.
> >
> > The problematic znode was last modified on 13 Jan 2015 at 17:xx; we
> > noticed the issue on 14 Jan 2015 at 11:xx.
> >
> > 2015-01-14 10:52 GMT+01:00 Flavio Junqueira:
> >
> > > Hi there,
> > > I suggest a couple of things here:
> > > - Use LogFormatter to look into the transaction logs to check the
> > >   operations that are actually coming across.
> > > - It would be nice to be able to reproduce it outside your app,
> > >   ideally as a JUnit test, so that we can start working on it.
> > > I vaguely remember coming across such a problem, but I'll need to dig
> > > into it. Does anyone on this list recall a similar problem?
> > > -Flavio
> > >
> > > On Wednesday, January 14, 2015 9:14 AM, Kuba Lekstan
> > > <kuebzky@gmail.com> wrote:
> > >
> > >
> > > German, do you have any idea what might be causing this? Today the
> > > same issue happened.
> > >
> > > 2014-11-21 5:42 GMT+01:00 Yogesh Patil <patyogesh@gmail.com>:
> > >
> > > > Hi Zookeepers,
> > > > I have also been experiencing a similar problem since yesterday. I
> > > > have a pretty similar setup, with ephemeral znodes in place for a
> > > > keep-alive kind of function. I too see that, in spite of the ZK
> > > > session going down, the ephemeral znodes still LIVE.
> > > >
> > > > I am using ZK 3.5.0.
> > > >
> > > > Is there any solution/fix for this type of issue?
> > > >
> > > >
> > > > --
> > > > Sincerely,
> > > >
> > > > *Yogesh Patil*
> > > >
> > > >
> > > > On Thu, Nov 13, 2014 at 2:10 PM, Kuba Lekstan wrote:
> > > >
> > > > > Sorry, forgot to mention. Version: 3.4.6.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > 2014-11-13 18:11 GMT+01:00 German Blanco
> > > > > <german.blanco.blanco@gmail.com>:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > which version of ZooKeeper are you using?
> > > > > >
> > > > > > On Thu, Nov 13, 2014 at 5:25 PM, Kuba Lekstan wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > A bit of detail:
> > > > > > > We have a 5-node cluster, which we use for configuration
> > > > > > > distribution and for monitoring active instances of our
> > > > > > > applications. Each application creates its own ephemeral
> > > > > > > node, so we know which apps are alive, how many of them
> > > > > > > there are, and what they are doing.
> > > > > > >
> > > > > > > The problem happened on 4th November: the first time at
> > > > > > > around 4AM, the second time at around 12PM.
> > > > > > > The first time it was the middle of the night when I got
> > > > > > > woken up; the support guys told me that something was wrong
> > > > > > > with config distribution.
> > > > > > >
> > > > > > > First I checked the apps for errors but didn't find anything
> > > > > > > interesting, then I looked at what was in ZooKeeper (using
> > > > > > > node-zk-browser).
> > > > > > > I noticed that there were 3 ephemeral nodes which had been
> > > > > > > created on 1st Nov (while the oldest application was started
> > > > > > > on 3rd Nov). I could read their data but was not able to
> > > > > > > delete them: I was getting a NONODE exception.
> > > > > > >
> > > > > > > I thought wtf - why can't I delete these nodes, something
> > > > > > > very bad must have happened with ZK.
> > > > > > >
> > > > > > > So I SSHed into the leader and tried to read these nodes
> > > > > > > using the CLI, but I was not able to: the leader was telling
> > > > > > > me that these nodes don't exist.
> > > > > > > After this I started to SSH into the rest of the nodes in
> > > > > > > the cluster, trying to read these nodes. Finally I found the
> > > > > > > server that did let me read the data of these nodes.
> > > > > > > Because of the inconsistency I decided to restart it. The
> > > > > > > restart helped and everything went back to a normal state.
> > > > > > > The ephemeral nodes disappeared.
> > > > > > >
> > > > > > > A similar situation happened at 12PM, but this time I had a
> > > > > > > lot more time to look at what was wrong. The second time the
> > > > > > > problem was about 3 ephemeral nodes which had been created
> > > > > > > on 1st Nov (again?). This time I dug a bit deeper and looked
> > > > > > > into the logs and the four-letter commands, but could not
> > > > > > > find anything interesting, except that all 3 of these nodes
> > > > > > > had been created under different session IDs and ZK had no
> > > > > > > hosts connected under those session IDs.
> > > > > > > The solution was similar to the one from 4AM, but this time
> > > > > > > I deleted all files in the ZK data directory.
> > > > > > >
> > > > > > > Oddly enough, the problem happened twice on the same ZK
> > > > > > > node; the final solution was to clear the ZK data directory.
> > > > > > > After clearing the directory the problem didn't happen
> > > > > > > again.
> > > > > > >
> > > > > > > I tried to look for solutions or similar problems. I found
> > > > > > > posts where people were complaining about ephemeral nodes
> > > > > > > not being removed after the client session gets closed, but
> > > > > > > I was not able to find posts about ZK not being consistent.
> > > > > > >
> > > > > > > What do you think about this? Can we do something to fix
> > > > > > > this?
> > > > > > >
> > > > > > > Sorry for my English, I was doing my best. :)
> > > > > > >
> > > > > > > Thanks, Kuba.
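
For anyone retracing the per-server checks described in the original report (SSHing to each server and comparing what it sees), here is a minimal Python sketch; it is not something posted in this thread. It sends one four-letter command to each server's client port and collects the raw replies so they can be compared side by side. The function names and the host names in the usage note are hypothetical. `cons` and `stat` are standard ZooKeeper four-letter commands, but newer releases may require them to be whitelisted in the server configuration, so treat this as a starting point only.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send one four-letter command to a ZooKeeper server and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                # The server closes the connection once the reply is complete.
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def collect_replies(servers, command="cons"):
    """Return {"host:port": reply text or error string} for every (host, port) pair."""
    replies = {}
    for host, port in servers:
        key = f"{host}:{port}"
        try:
            replies[key] = four_letter_word(host, port, command)
        except OSError as exc:
            # Keep going so that one unreachable server does not hide the others.
            replies[key] = f"ERROR: {exc}"
    return replies
```

A call such as `collect_replies([("zk1.example.com", 2181), ("zk2.example.com", 2181)], "cons")` (hypothetical hosts) returns one text blob per server; searching each blob for the session ID that owns the stale ephemeral node shows which servers still list that session.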
--047d7b87501cb5861c050ca5f162--