hbase-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE: Queries on Zookeeper failure and RegionServer restartup
Date Tue, 20 Sep 2011 17:15:34 GMT
Have you looked at this:
http://hbase.apache.org/book.html#zookeeper

Inline...

-----Original Message-----
From: Stuti Awasthi [mailto:stutiawasthi@hcl.com] 
Sent: Tuesday, September 20, 2011 9:32 AM
To: user@hbase.apache.org
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Hi David,

Thanks for your response. A few things are still not clear to me:

1. Odd number of nodes in your zookeeper ensemble.
Why is this required? Can you please explain with an example? Does it mean that if I run
zookeeper on 3 nodes and 1 of them fails, the cluster will still work, but if 2 of the 3
fail, the cluster will be down?

Buttler> Yes, this is correct.

2. " you do realize that you have to have a majority of zookeeper nodes alive for zookeeper
to work,"
Please explain this.

Buttler> Zookeeper needs a quorum of nodes.  The algorithm that zookeeper uses defines
a quorum as a simple majority.  I.e. more than half.  If you have 4 nodes, and 2 die, then
you have only 2 nodes alive, which is exactly half, not "more than half".  Zookeeper will
then assume that it can no longer function.  Therefore, the advice in the book is to have
an odd number of nodes so that you will never be in the case of having "exactly" half of your
nodes working.
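
The majority rule described above can be checked with a little arithmetic. The following is a minimal sketch, not ZooKeeper code: the function names quorum_size, tolerated_failures and has_quorum are made up for illustration, and the only assumption is the "more than half" definition of a quorum given above.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of live nodes that is strictly more than half."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can die while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, alive: int) -> bool:
    """True if the live nodes form a simple majority of the ensemble."""
    return alive >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print quorum size and tolerated failures for small ensembles.
    for n in range(1, 6):
        print(f"nodes={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this rule 2 nodes tolerate no failures (the same as 1 node), and 4 nodes tolerate only 1 failure (the same as 3 nodes), which is why an even-sized ensemble adds a machine without adding fault tolerance.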



Thanks


-----Original Message-----
From: Buttler, David [mailto:buttler1@llnl.gov] 
Sent: Tuesday, September 20, 2011 9:08 PM
To: user@hbase.apache.org
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Wait, you do realize that you have to have a majority of zookeeper nodes alive for zookeeper
to work, right?  That means that you get lower reliability with two nodes than one node: if
either node goes down, zookeeper will give up.  This also implies that you need to have an
odd number of nodes in your zookeeper ensemble.

Also, hbase requires synchronized time across the cluster.  You can't rely on the built-in
clocks to keep time synchronized to a close enough delta over a reasonable period of time
(e.g. after a month things will fall apart).  Luckily this is a solved problem: ntp
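
To make the time requirement concrete, here is a minimal sketch of the comparison implied by the error in the original question ("Time difference of 352381ms > max allowed of 30000ms"). It is not HBase's implementation: the function names are hypothetical, the 30000 ms limit is simply the value printed in that log, and the master timestamp is an invented value chosen to reproduce the reported difference.

```python
MAX_SKEW_MS = 30_000  # limit as printed in the log message, not read from any config


def clock_skew_ms(master_time_ms: int, regionserver_time_ms: int) -> int:
    """Absolute difference between the two clocks, in milliseconds."""
    return abs(master_time_ms - regionserver_time_ms)


def is_rejected(skew_ms: int, max_skew_ms: int = MAX_SKEW_MS) -> bool:
    """A server is rejected when its skew exceeds the allowed maximum."""
    return skew_ms > max_skew_ms


if __name__ == "__main__":
    regionserver_time = 1316500563205          # timestamp from the serverName in the log
    master_time = regionserver_time + 352381   # hypothetical master clock
    skew = clock_skew_ms(master_time, regionserver_time)
    print(f"skew={skew}ms rejected={is_rejected(skew)}")
```

A difference of 352381 ms is almost six minutes, far beyond what ntp-synchronized machines would normally show.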

Dave


-----Original Message-----
From: Stuti Awasthi [mailto:stutiawasthi@hcl.com] 
Sent: Tuesday, September 20, 2011 4:40 AM
To: user@hbase.apache.org
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Hi Ramkrishna,
Thanks for the reply. I set the system date and rechecked; now the region servers are starting.

Thanks
Stuti

-----Original Message-----
From: Ramkrishna S Vasudevan [mailto:ramakrishnas@huawei.com] 
Sent: Tuesday, September 20, 2011 1:56 PM
To: user@hbase.apache.org
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Regarding the ClockOutOfSyncException, just check that all nodes in your cluster have the same
time set.  This problem occurs when there are time differences between nodes.

Best Regards
Ram

-----Original Message-----
From: Stuti Awasthi [mailto:stutiawasthi@hcl.com]
Sent: Tuesday, September 20, 2011 1:28 PM
To: user@hbase.apache.org
Subject: Queries on Zookeeper failure and RegionServer restartup

Hi all,

I have a 2-node cluster. I run a RegionServer and Zookeeper on both nodes, the Master on one
node and a Backup Master on the other.

Here is what I did: I stopped Zookeeper on 1 node, and after that I was unable to access HBase.

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper
but the connection closes immediately. This could be a sign that the server has too many connections
(30 is the default).
Consider inspecting your ZK server logs for that error and then make sure you are reusing
HBaseConfiguration as often as you can. See HTable's javadoc for more information.

Queries :

1. If the cluster becomes inaccessible when one zookeeper node goes down,
then why do we run multiple zookeeper nodes?

2. Is there some way to keep the cluster accessible when only one of the
zookeeper nodes is working?

Another test:
If I stop the RegionServer and Master on 1 node, the Backup Master becomes Master and I can
access the HBase cluster. But when I try to restart the RegionServer on the node where I shut
it down, it gives me the following error. How can I fix this?

2011-09-20 12:06:03,647 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=master,60020,1316500563205, load=(requests=0, regions=0, usedHeap=22, maxHeap=993): Unhandled exception: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master. Time difference of 352381ms > max allowed of 30000ms
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master. Time difference of 352381ms > max allowed of 30000ms
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1515)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master. Time difference of 352381ms > max allowed of 30000ms
        at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:181)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:129)
        at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:615)


Your input would be appreciated.

Thanks
Stuti


