From: Jonathan Hsieh
Date: Tue, 20 Mar 2012 00:45:47 -0700
Subject: Re: Does the rolling-restart.sh script work?
To: dev@hbase.apache.org

I got it -- ZK 3.4.0 included
https://issues.apache.org/jira/browse/ZOOKEEPER-1059 which changed stat to
exit cleanly instead of throwing an NPE. A Java program exits with return
code 1 (the failure case) if main throws an exception. Looking at this ZK
code, an NPE would percolate out:

https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736
https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980

This means that previously ZKM (ZooKeeperMain) would exit with return code 1,
and after the fix it exits with return code 0.

Seems like we need a new mechanism to check whether the /hbase/master zk node
has expired. Suggestions on how to deal with this? Maybe we could have
something dump cluster stats to determine whether the masters and backup
masters are down?

Jon.

On Tue, Mar 20, 2012 at 12:36 AM, Jonathan Hsieh wrote:

> I'm trying to test HBASE-5589 -- to see if I can add an API call to
> HMasterInterface and do a rolling-restart / upgrade on a live cluster, which
> led me down another rabbit hole.
>
> I'm wondering how the rolling-restart.sh script worked in the past (I can
> spend more time setting up an older version to test this, but figured I'd
> ask).
>
> I'm getting stuck when bin/rolling-restart.sh tries to wait until the
> Master ZNode expires.
> In this particular case, the script seems to hang
> there forever (even after the /hbase/master ephemeral node expires).
>
> Here's the code in the script:
> ----
> # make sure the master znode has been deleted before continuing
> zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
> if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
> zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
> if [ "$zmaster" == "null" ]; then zmaster="master"; fi
> zmaster=$zparent/$zmaster
> echo -n "Waiting for Master ZNode ${zmaster} to expire"
> while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
>   echo -n "."
>   sleep 1
> done
> echo #force a newline
> ----
>
> The problem is that 'bin/hbase zkcli stat /hbase/master ...' seems to
> always return with $? == 0 regardless of whether the znode is present!
> I've checked with Patrick Hunt (ZK committer) and this is the
> expected behavior. The only non-zero retcodes are for abnormal exits
> (exceptions thrown).
>
> Here's the ZK code I was looking through:
>
> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736
> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980
>
> Thoughts?
>
> Jon.
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
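
On the question of a new mechanism for the znode check, here is a minimal sketch of one possible approach: look at what zkcli prints instead of its return code. It assumes the ZK CLI reports a missing znode with the text "Node does not exist", which should be verified against the ZK version in use. The function names and the ZK_STAT_CMD override are hypothetical and not part of rolling-restart.sh.

```shell
#!/usr/bin/env bash
# Sketch only: decide from the CLI's output, not its exit status, whether a znode is gone.
# ASSUMPTION: zkcli prints "Node does not exist" for a missing znode (check your ZK version).
# ZK_STAT_CMD is a hypothetical hook so the check can be tried without a live cluster.

znode_missing() {
  local znode="$1"
  # Merge stderr into stdout because the message may be written to either stream.
  ${ZK_STAT_CMD:-bin/hbase zkcli stat} "$znode" 2>&1 | grep -q "Node does not exist"
}

wait_for_znode_expiry() {
  local znode="$1" max_tries="${2:-60}" delay="${3:-1}" tries=0
  echo -n "Waiting for Master ZNode ${znode} to expire"
  until znode_missing "$znode"; do
    tries=$((tries + 1))
    # Cap the number of checks so the script cannot hang forever.
    if [ "$tries" -ge "$max_tries" ]; then
      echo " gave up after ${tries} checks"
      return 1
    fi
    echo -n "."
    sleep "$delay"
  done
  echo # force a newline
}
```

In the script, a call such as wait_for_znode_expiry "$zmaster" would stand in for the existing while loop. The retry cap is a design choice for this sketch: if the message text ever changes again, the script fails after a bounded wait instead of spinning indefinitely.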