hbase-user mailing list archives

From: "ac@hsk.hk" <...@hsk.hk>
Subject: Re: A region server stopped (timeout after trying to connect local Zookeeper)
Date: Wed, 21 Nov 2012 23:13:41 GMT
Hi,

Here are my HBase configuration and the tests I ran:

1) ${HBASE_HOME}/conf/hbase-site.xml (a sketch for checking what HBase actually resolves from it follows this block)
<property>
<name>hbase.ZooKeeper.quorum</name>
<value>m146,m145,m143</value>
</property>

<property>
<name>zookeeper.session.timeout</name>
<value>60000</value>
</property>


2) ${HBASE_HOME}/conf/hbase-env.sh
export HBASE_MANAGES_ZK=false


3) I used "${ZK_HOME}/bin/zkCli.sh -server m145,m146,m143" to test the connection, and it worked
(the same check done with the ZooKeeper Java client is sketched after the output):
[zk: m145,m146,m143(CONNECTED) 0]
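
The same check can also be done from this host with the ZooKeeper Java client directly (a rough sketch, assuming the zookeeper jar is on the classpath and the quorum listens on the default client port 2181):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectCheck {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Same connect string and session timeout as in hbase-site.xml above.
        ZooKeeper zk = new ZooKeeper("m145:2181,m146:2181,m143:2181", 60000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        // Count down once the client reaches SyncConnected.
                        if (event.getState() == Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    }
                });
        boolean ok = connected.await(30, TimeUnit.SECONDS);
        System.out.println((ok ? "connected: " : "NOT connected: ") + zk.getState());
        zk.close();
    }
}

Running this on the RegionServer host should print "connected" if the quorum itself is reachable from there.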


4) From the logs, I found that the connectString was odd: the RegionServer did not use the
"hbase.ZooKeeper.quorum" setting in conf/hbase-site.xml; it seemed to always use the default
and try to connect to "localhost:2181" in this distributed cluster (a simplified sketch of how
that default connect string gets assembled follows the log excerpt):

	2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=60000 watcher=regionserver:60020
	...
	2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate
a login configura$
	...
	2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect java.net.ConnectException:
Connection refused
	...  (remark: it retried the above 3 times, then hit the FATAL error below)
       
	2012-11-21 17:21:57,846 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
Received unexpected KeeperException, re-throwing exception 
	...
	2012-11-21 17:21:57,847 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server ...
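
As far as I understand (this is only a simplified illustration, not the actual HBase code), the connect string is assembled from hbase.zookeeper.quorum (default "localhost") and hbase.zookeeper.property.clientPort (default 2181), which would give exactly the localhost:2181 seen in the log whenever the quorum setting is not picked up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowConnectString {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Defaults as shipped in hbase-default.xml.
        String quorum = conf.get("hbase.zookeeper.quorum", "localhost");
        int clientPort = conf.getInt("hbase.zookeeper.property.clientPort", 2181);

        // Simplified assembly of the host:port list passed to the ZooKeeper client.
        StringBuilder sb = new StringBuilder();
        for (String host : quorum.split(",")) {
            if (sb.length() > 0) sb.append(",");
            sb.append(host.trim()).append(":").append(clientPort);
        }
        System.out.println("connectString=" + sb);
    }
}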



Please help.
 
Thanks





On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:

> Hi,
> 
> What do you have in your HBase configuration? Are you passing the names
> of the quorum servers?
> $ cat conf/hbase-site.xml
> ......
>  </property>
>    <property>
>      <name>hbase.zookeeper.quorum</name>
>      <value>cube,latitude,node3</value>
>      <description>Comma separated list of servers in the ZooKeeper Quorum.
>      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>      By default this is set to localhost for local and pseudo-distributed modes
>      of operation. For a fully-distributed setup, this should be set to a full
>      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
> hbase-env.sh
>      this is the list of servers which we will start/stop ZooKeeper on.
>      </description>
>    </property>
> .....
> 
> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>> Hi,
>> 
>> 
>> I have the following line in /etc/hosts on all servers; should I keep it,
>> comment it out, or ...?
>> 
>> 127.0.0.1       localhost
>> 
>> Please help.
>> 
>> Thanks
>> 
>> 
>> 
>> On 21 Nov 2012, at 7:16 PM, ac@hsk.hk wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> Please help!!
>>> 
>>> HBase version: 0.94
>>> ZooKeeper: 3.4.4
>>> 
>>> One of the region servers stopped very quickly after HBase was started:
>>> 
>>> ### Checked jps after the HBase cluster was started and could find the
>>> HRegionServer process (*** there is no ZooKeeper instance running on
>>> this server ***)
>>> $ jps
>>> 24767 Jps
>>> 18418 TaskTracker
>>> 24678 HRegionServer
>>> 18156 DataNode
>>> 
>>> ### Waited a while and checked jps again; the HRegionServer process was gone
>>> $ jps
>>> 18418 TaskTracker
>>> 24784 Jps
>>> 18156 DataNode
>>> 
>>> 
>>> ### Here are the settings in hbase-site.xml (enabled
>>> hbase.cluster.distributed, set up 3 ZooKeeper servers, timeout = 60000)
>>> <property>
>>> <name>hbase.cluster.distributed</name>
>>> <value>true</value>
>>> </property>
>>> 
>>> <property>
>>> <name>hbase.ZooKeeper.quorum</name>
>>> <value>m146,m145,m143</value>
>>> </property>
>>> 
>>> <property>
>>> <name>zookeeper.session.timeout</name>
>>> <value>60000</value>
>>> </property>
>>> 
>>> 
>>> ### hbase-env.sh also tells HBase not to manage a local instance of
>>> ZooKeeper
>>> export HBASE_MANAGES_ZK=false
>>> 
>>> 
>>> ### This server can connect to the 3 ZooKeeper servers:
>>> ./zkCli.sh -server m145,m146,m143   ==>  [zk: m145,m146,m143(CONNECTED) 0]
>>> 
>>> 
>>> ### Checked the HBase log file and found something odd; it seemed the
>>> RegionServer tried to connect to a local ZooKeeper
>>> 2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>> watcher=regionserver:60020
>>> 
>>> 2012-11-21 17:31:33,254 WARN
>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>> ZooKeeper exception:
>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>> 
>>> 2012-11-21 17:31:33,254 INFO org.apache.hadoop.hbase.util.RetryCounter:
>>> Sleeping 2000ms before retry #1...
>>> 2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client
>>> session timed out, have not heard from server in 60010ms for sessionid
>>> 0x0, closing socket connection and attempting reconnect
>>> 
>>> 2012-11-21 17:32:33,362 WARN
>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>> ZooKeeper exception:
>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>> 
>>> ......
>>> 
>>> 2012-11-21 17:34:33,570 ERROR
>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
>>> failed after 3 retries
>>> 2012-11-21 17:34:33,571 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>> regionserver:60020 Unable to set watcher on znode /hbase/master
>>> 2012-11-21 17:34:33,573 ERROR
>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>> Received unexpected KeeperException, re-throwing exception
>>> 2012-11-21 17:34:33,573 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>> ......
>>> 2012-11-21 17:34:33,576 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>> loaded coprocessors are: []
>>> 
>>> 2012-11-21 17:34:36,580 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>> m144,60020,1353490232962: Initialization of RS failed.  Hence aborting
>>> RS.
>>> java.io.IOException: Received the shutdown message while waiting.
>>> 	at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
>>> 	at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
>>> 	at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
>>> 	at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
>>> 	at java.lang.Thread.run(Thread.java:662)
>>> 2012-11-21 17:34:36,581 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>> loaded coprocessors are: []
>>> 
>>> 
>>> Please help!
>>> QUESTION: Is this a bug, or do I need to check something else?
>>> 
>>> Thanks
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 

