Message-ID: <285164.50664.qm@web65512.mail.ac4.yahoo.com>
Date: Sun, 5 Apr 2009 20:25:10 -0700 (PDT)
From: Andrew Purtell
Reply-To: apurtell@apache.org
Subject: Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally
To: hbase-dev@hadoop.apache.org
In-Reply-To: <82b0992a0904051514i3fe11d27g8925a6ab4aeeaeaa@mail.gmail.com>

That's an unfortunate side effect of some aspect of the ZK
implementation, I suppose. HBase clients, regionservers, and
masters with watches on ephemeral nodes will have to treat their
disappearance as advisory only and check back once or twice
before taking any recovery actions. That lengthens recovery time
beyond what would be necessary without this wrinkle, which is
unfortunate.

Just to be clear, by restart you are talking about re-initializing
the ZK wrapper only, correct? It should not be necessary to
restart everything on a node to deal with an expired ZK session,
right?

> From: Nitay
>
> The master did not respond correctly to a SessionExpired
> event. I don't think there's a ZK bug. This is like
> HBASE-1232. Both the master and regionserver got a
> SessionExpired event. The bug I fixed for Ryan was just
> with the client getting a SessionExpired. Andrew's
> cluster shows us that it's just as likely for the master/
> RS to get this event.
>
> The only thing you can do on a SessionExpired event is to
> completely restart the node. SessionExpired means your
> ZooKeeper handle is dead, and your ephemeral nodes will go
> away. Since every server in HBase has some ephemeral
> node that indicates its liveness (e.g. /hbase/master,
> /hbase/rs/...), the node has to completely restart.
>
> HBASE-1232, HBASE-1311, and HBASE-1312 are all the same
> problem, just with three different points of view (client,
> RS, master).
>
> On Sun, Apr 5, 2009 at 2:32 PM, Ryan Rawson wrote:
>
> > ZK keeps the node up as long as the session is still
> > valid.
> > So the question is:
> > - did the master not respond correctly to an expired
> > session?
> > - is there a ZK bug (HOPE NOT!)
> >
> > -ryan
> >
> > On Sun, Apr 5, 2009 at 2:22 PM, Andrew Purtell wrote:
> > > ZooKeeper: Master's ephemeral node went away
> > > while it was still up and functioning normally
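
For illustration only, the "treat disappearance as advisory and
check back once or twice" idea from the reply above might be
sketched as follows. This is not HBase or ZooKeeper code:
confirm_node_gone, node_exists, rechecks, and delay_seconds are
hypothetical names, and node_exists stands in for whatever
existence check the ZK wrapper actually provides.

```python
import time
from typing import Callable


def confirm_node_gone(node_exists: Callable[[], bool],
                      rechecks: int = 2,
                      delay_seconds: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if the node is still missing after every recheck.

    node_exists is a hypothetical callback (an assumption of this sketch)
    that asks the coordination service whether the ephemeral node is
    present right now.
    """
    for _ in range(rechecks):
        # Wait before looking again, so a transient disappearance
        # has a chance to resolve itself.
        sleep(delay_seconds)
        if node_exists():
            # The node is present again: do not start recovery.
            return False
    # Still missing after all rechecks: the caller may start recovery.
    return True
```

A watcher would call this when it is told the node vanished and
begin recovery only if it returns True. The cost is the one noted
above: recovery is delayed by roughly rechecks * delay_seconds.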