hbase-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: Does the rolling-restart.sh script work?
Date Tue, 20 Mar 2012 07:45:47 GMT
I got it --

ZK 3.4.0 included
https://issues.apache.org/jira/browse/ZOOKEEPER-1059 which changed stat
on a missing znode to exit cleanly instead of throwing an NPE.  Java
programs exit with ret code 1 (failure case) if main throws an
exception.  Looking at this ZK code, an NPE would have percolated out:
https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736

https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980

This means that previously ZKM would exit with ret code 1 when the znode
was missing, and after the fix it exits with ret code 0 either way, so the
script's while loop never terminates.

Seems like we need a new mechanism to check whether the /hbase/master zk
node has expired.

Suggestions on how to deal with this?  Maybe we could have something dump
cluster stats to determine whether the masters and backup masters are down?
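
One option, as a rough sketch only: decide from the text that zkcli prints
rather than from its exit code. This assumes zkcli reports a missing path
with a "Node does not exist" message (what I'd expect from ZK 3.4.x, but it
should be verified against the version in use). The function names and the
ZKCLI_STAT override below are hypothetical, added only so the loop can be
tried without a live cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if the captured zkcli output says the
# znode is absent. ASSUMPTION: zkcli prints "Node does not exist" for a
# missing path; check this against your ZooKeeper version.
znode_missing() {
  # $1: combined stdout+stderr of `hbase zkcli stat <path>`
  printf '%s\n' "$1" | grep -q "Node does not exist"
}

# Hypothetical wrapper around the wait loop from rolling-restart.sh.
# ZKCLI_STAT is an assumed override hook, not an existing HBase setting.
wait_for_znode_expiry() {
  local znode="$1"
  local stat_cmd="${ZKCLI_STAT:-bin/hbase zkcli stat}"
  # Loop until the output (not the exit code) says the znode is gone.
  until znode_missing "$($stat_cmd "$znode" 2>&1)"; do
    echo -n "."
    sleep 1
  done
  echo # force a newline
}
```

In the script this would be called as `wait_for_znode_expiry "$zmaster"`.
One caveat with this approach: if zkcli cannot reach the quorum at all, the
message never appears and the loop keeps waiting, so a timeout is probably
worth adding.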

Jon.

On Tue, Mar 20, 2012 at 12:36 AM, Jonathan Hsieh <jon@cloudera.com> wrote:

> I'm trying to test HBASE-5589 -- to see if I can add an API call to
> HMasterInterface and do a rolling-restart / upgrade on a live cluster, which
> led me down another rabbit hole.
>
> I'm wondering how rolling-restart.sh script worked in the past (I can
> spend more time setting up an older version to test this, but figured I'd
> ask).
>
> I'm getting stuck when the bin/rolling-restart.sh tries to wait until the
> Master ZNode expires.  In this particular case, the script seems to hang
> there forever (even after the /hbase/master ephemeral node expires).
>
> Here's the code in the script:
> ----
> # make sure the master znode has been deleted before continuing
>     zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
>     if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
>     zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
>     if [ "$zmaster" == "null" ]; then zmaster="master"; fi
>     zmaster=$zparent/$zmaster
>     echo -n "Waiting for Master ZNode ${zmaster} to expire"
>     while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
>       echo -n "."
>       sleep 1
>     done
>     echo #force a newline
> ----
>
> The problem is that 'bin/hbase zkcli stat /hbase/master ...' seems to
> always return with $? == 0 regardless of whether the znode is present or
> not!  I've checked with Patrick Hunt (ZK committer) and this is the
> expected behavior.  The only non-zero retcodes are for abnormal exits
> (exceptions thrown).
>
> Here's the ZK code I was looking through
>
> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736
>
>
> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980
>
>
> Thoughts?
>
> Jon.
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
