hbase-dev mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: HBase master startup and the "Unable to read additional data from server sessionid 0x0" zk error.
Date Tue, 06 Dec 2011 02:05:05 GMT
1. In 0.92 it should recover right away from those errors.

2. It happened to us, it's fine.

I might add that you don't need to stop zookeeper when stopping HBase.
Our ZK ensembles have hundreds of days of uptime.

J-D

On Mon, Dec 5, 2011 at 5:10 AM, Mikael Sitruk <mikael.sitruk@gmail.com> wrote:
> Hi
>
> I would like to share my findings on the "Unable to read
> additional data from server sessionid 0x0" ZK error, which prevented the
> HBase Master from starting.
>
> I have a cluster of 10 RS and a ZK quorum of 3 machines.
> I use a script to start the cluster: hdfs, mapreduce, zk quorum, HBMaster
> and finally HBRS.
>
> Using the script, everything started except HBase.
>
> While checking the log I found that a ZK exception was thrown during
> startup:
> 2011-12-05 00:05:34,622 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
>        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1069)
>        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:142)
>        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:102)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1083)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
>        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:223)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1064)
>        ... 5 more
>
> Googling on the subject did not provide enough insight for my problem.
>
> I checked ZK, and from the shell I got the same kind of exception, so I
> reinstalled ZK and checked it from the command line; everything was OK.
> I thought HBase would then be fine too, but no! Again I got the
> same behavior (HMaster failed), although this time ZK was stable from the
> command line (zkCli).
>
> I continued with several experiments, then I found the sequence of
> operations that causes the problem!
> If I start the ZK quorum in an order that does not begin with the ZK leader
> (the one with myid containing 1), followed by the other ZK servers, and then
> immediately start the HBase master, the HBase master fails to start with the
> error above.
> I added a 10-second wait to the script between the ZK start and the HBase
> start, and it resolved the problem.
>
> I suppose the reason for the problem is that when another ZK server is
> started before the leader, the ZK quorum begins a consensus round to
> elect a new leader, which may take several seconds. During this time the ZK
> quorum is not available and HBMaster fails to start.
>
> So I have several questions:
> 1. Is there a way for HBase to detect this situation at startup and
> wait 10 seconds before trying to reconnect?
> 2. Suppose that HBase is in the middle of some work and a ZK failure
> occurs (some nodes fail, but n/2+1 ZK servers remain) and the
> election protocol begins. Will HBase be OK, or will it begin a shutdown
> sequence? My understanding is that HBase should be OK as long as there is
> a ZK quorum available; it may just need to reconnect, but should not
> shut down or become inaccessible.
>
>
> Regards,
> Mikael.S
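
As an alternative to the fixed 10-second wait described in the quoted message, a start script could poll each quorum member with ZooKeeper's `stat` four-letter command and start the HBase master only once a majority of servers report a mode. The following is a minimal Python sketch, not something proposed in the thread. It assumes the default client port 2181 and that four-letter commands are enabled on the servers; the function names, timeouts and polling interval are hypothetical choices for illustration.

```python
import socket
import time


def stat_reports_serving(stat_output):
    """Return True if the reply to ZooKeeper's 'stat' command shows a serving node.

    A node that is serving requests prints a 'Mode: ...' line (for example
    leader or follower); a node still waiting on leader election does not.
    """
    return any(line.startswith("Mode:") for line in stat_output.splitlines())


def query_stat(host, port=2181, timeout=2.0):
    """Send the 'stat' four-letter command and return the reply ('' on failure)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"stat")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        # Connection refused, timeout, unknown host: treat as "not serving yet".
        return ""


def wait_for_quorum(hosts, probe=query_stat, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until a majority of hosts report a mode, or give up after `timeout` seconds."""
    needed = len(hosts) // 2 + 1  # majority, e.g. 2 of 3
    deadline = clock() + timeout
    while True:
        serving = sum(1 for h in hosts if stat_reports_serving(probe(h)))
        if serving >= needed:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A start script could call `wait_for_quorum(["zk1", "zk2", "zk3"])` (placeholder host names) after launching ZooKeeper and start the HBase master only if it returns `True`. Compared with a fixed sleep, this waits no longer than the election actually takes and fails explicitly if the quorum never forms.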
