hbase-user mailing list archives

From "ac@hsk.hk" ...@hsk.hk>
Subject Re: A region server stopped (timeout after trying to connect local Zookeeper)
Date Thu, 22 Nov 2012 00:14:22 GMT
Hi 

I changed the order of the ZooKeeper hosts in the value of hbase.zookeeper.quorum from "m146,m145,m143"
to "m143,m145,m146", raised the timeout from 60000 to 70000, and commented out the lzo property. It
works now. Here is the diff:

1) $ diff hbase-site.xml hbase-site.xml.xxx 
41,44c41,43
< 
< <property> 
< <name>hbase.zookeeper.quorum</name> 
< <value>m143,m145,m146</value> 
---
> <property>
> <name>hbase.ZooKeeper.quorum</name>
> <value>m146,m145,m143</value>
49c48,55
< <value>70000</value>
---
> <value>60000</value>
> </property>
> 
> <!--
> /**
> <property>
> <name>hbase.regionserver.codecs</name>
> <value>lzo,gz</value>
50a57,58
> **/
> -->

The above is the only change made today.


2) hbase log:
2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=m145:2181,m143:2181,m146:2181 sessionTimeout=70000 watcher=regionserver:6$
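
To check which quorum a region server actually picked up, the connectString and sessionTimeout fields can be pulled out of this log line. Below is a minimal Python sketch, assuming the wrapped log line has been re-joined into one string. The function name parse_zk_connection and the regular expression are illustrative only and are not part of HBase or ZooKeeper.

```python
import re

# Matches the two fields of interest in the ZooKeeper client's
# "Initiating client connection" log line.
PATTERN = re.compile(
    r"connectString=(?P<connect>\S+)\s+sessionTimeout=(?P<timeout>\d+)"
)


def parse_zk_connection(line):
    """Return (list of host:port strings, timeout in ms), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("connect").split(","), int(match.group("timeout"))


# The log line from this message, re-joined (the trailing "$" is the
# terminal's truncation marker and is kept as-is).
line = (
    "2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: "
    "Initiating client connection, "
    "connectString=m145:2181,m143:2181,m146:2181 sessionTimeout=70000 "
    "watcher=regionserver:6$"
)

hosts, timeout = parse_zk_connection(line)
print(hosts)    # ['m145:2181', 'm143:2181', 'm146:2181']
print(timeout)  # 70000
```

If the first element printed is localhost:2181 on a distributed cluster, the quorum setting was probably not picked up.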


I don't know why, but it works now. It seems that HBase somehow could not read hbase-site.xml
correctly.
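
A possible explanation, consistent with the diff above: the old file names the property hbase.ZooKeeper.quorum, while the new one names it hbase.zookeeper.quorum. Hadoop-style configuration keys are case-sensitive, so the mis-cased key would never be looked up and the documented default of localhost would apply, which matches the earlier connectString=localhost:2181. The Python sketch below is a simplified model of that lookup, not HBase's own code. The helper names and the trimmed XML samples are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

EXPECTED_KEY = "hbase.zookeeper.quorum"
DEFAULT_QUORUM = "localhost"  # default stated in the property's description


def load_properties(xml_text):
    """Collect <property> name/value pairs, keeping the key's case as written."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def effective_quorum(props):
    """Case-sensitive lookup that falls back to the default when the key is absent."""
    return props.get(EXPECTED_KEY, DEFAULT_QUORUM)


def miscased_keys(props):
    """Keys that match the expected name only when case is ignored."""
    return [k for k in props if k != EXPECTED_KEY and k.lower() == EXPECTED_KEY.lower()]


# Trimmed stand-ins for the old and new hbase-site.xml (assumed layout).
OLD = """<configuration>
<property>
<name>hbase.ZooKeeper.quorum</name>
<value>m146,m145,m143</value>
</property>
</configuration>"""

NEW = """<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>m143,m145,m146</value>
</property>
</configuration>"""

print(effective_quorum(load_properties(OLD)))  # localhost
print(miscased_keys(load_properties(OLD)))     # ['hbase.ZooKeeper.quorum']
print(effective_quorum(load_properties(NEW)))  # m143,m145,m146
```

Under this reading, the host order and the timeout change would be incidental, and the key's spelling would be what made the difference.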


Thanks




On 22 Nov 2012, at 7:51 AM, Jean-Marc Spaggiari wrote:

> Can you run jps on your master and look at the logs too?
> 
> Another thing: can you try with hbase.zookeeper.quorum instead of
> hbase.ZooKeeper.quorum?
> 
> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>> Hi,
>> 
>> Here are my HBase configuration and test:
>> 
>> 1) {$HBASE_HOME}hbase/conf/hbase-site.xml
>> <property>
>> <name>hbase.ZooKeeper.quorum</name>
>> <value>m146,m145,m143</value>
>> </property>
>> 
>> <property>
>> <name>zookeeper.session.timeout</name>
>> <value>60000</value>
>> </property>
>> 
>> 
>> 2) {$HBASE_HOME}hbase/conf/hbase-env.sh
>> export HBASE_MANAGES_ZK=false
>> 
>> 
>> 3) I used " {$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143"  to test the
>> connection, it worked
>> [zk: m145,m146,m143(CONNECTED) 0]
>> 
>> 
>> 4) from the logs, I found that the connectString was odd, the RegionServer
>> did not use the setting of "hbase.ZooKeeper.quorum" in conf/hbase-site.xml,
>> it seemed that it always used the default and tried to connect
>> "localhost:2181" in the distributed cluster:
>> 
>> 	2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection, connectString=localhost:2181 sessionTimeout=60000
>> watcher=regionserver:60020
>> 	...
>> 	2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server localhost/127.0.0.1:2181. Will not attempt to
>> authenticate using SASL (Unable to locate a login configura$
>> 	...
>> 	2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
>> for server null, unexpected error, closing socket connection and attempting
>> reconnect java.net.ConnectException: Connection refused
>> 	...  (remark: it tried above 3 times, then had FATAL error as follows)
>> 
>> 	2012-11-21 17:21:57,846 ERROR
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>> Received unexpected KeeperException, re-throwing exception
>> 	...
>> 	2012-11-21 17:21:57,847 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> ...
>> 
>> 
>> 
>> Please help.
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> 
>> On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:
>> 
>>> Hi,
>>> 
>>> What do you have on your HBase configuration? Are you passing the name
>>> of the Quorum servers?
>>> $ cat conf/hbase-site.xml
>>> ......
>>> </property>
>>>   <property>
>>>     <name>hbase.zookeeper.quorum</name>
>>>     <value>cube,latitude,node3</value>
>>>     <description>Comma separated list of servers in the ZooKeeper
>>> Quorum.
>>>     For example,
>>> "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>>>     By default this is set to localhost for local and pseudo-distributed
>>> modes
>>>     of operation. For a fully-distributed setup, this should be set to a
>>> full
>>>     list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
>>> hbase-env.sh
>>>     this is the list of servers which we will start/stop ZooKeeper on.
>>>     </description>
>>>   </property>
>>> .....
>>> 
>>> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>>>> Hi,
>>>> 
>>>> 
>>>> I have the following line in /etc/hosts in all servers, should I keep it
>>>> or
>>>> comment it out or ...?
>>>> 
>>>> 127.0.0.1       localhost
>>>> 
>>>> Please help.
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>> 
>>>> On 21 Nov 2012, at 7:16 PM, ac@hsk.hk wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> Please help!!
>>>>> 
>>>>> HBase version: 0.94
>>>>> ZooKeeper: 3.4.4
>>>>> 
>>>>> One of the region servers stopped very quickly after HBase was
>>>>> started:
>>>>> 
>>>>> ### Checked jps after the HBase cluster was started and could find the
>>>>> HRegionServer process (*** there is no ZooKeeper instance running
>>>>> on
>>>>> this server ***)
>>>>> $ jps
>>>>> 24767 Jps
>>>>> 18418 TaskTracker
>>>>> 24678 HRegionServer
>>>>> 18156 DataNode
>>>>> 
>>>>> ### Waited a while and checked jps again; the HRegionServer process was gone
>>>>> $ jps
>>>>> 18418 TaskTracker
>>>>> 24784 Jps
>>>>> 18156 DataNode
>>>>> 
>>>>> 
>>>>> ### Here is the setting in hbase-site.xml ( enabled
>>>>> hbase.cluster.distributed, set up 3 ZooKeepers, timeout= 60000)
>>>>> <property>
>>>>> <name>hbase.cluster.distributed</name>
>>>>> <value>true</value>
>>>>> </property>
>>>>> 
>>>>> <property>
>>>>> <name>hbase.ZooKeeper.quorum</name>
>>>>> <value>m146,m145,m143</value>
>>>>> </property>
>>>>> 
>>>>> <property>
>>>>> <name>zookeeper.session.timeout</name>
>>>>> <value>60000</value>
>>>>> </property>
>>>>> 
>>>>> 
>>>>> ### hbase-env.sh also tells HBASE not to manage local instance of
>>>>> ZooKeeper
>>>>> export HBASE_MANAGES_ZK=false
>>>>> 
>>>>> 
>>>>> ###This server can connect to the 3 ZooKeepers,
>>>>> ./zkCli.sh -server m145,m146,m143  	==>  [zk: m145,m146,m143(CONNECTED)
>>>>> 0]
>>>>> 
>>>>> 
>>>>> ### Checked the HBase log file and found something odd: it seemed that it
>>>>> tried
>>>>> to connect to a local ZooKeeper
>>>>> 2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>>> watcher=regionserver:60020
>>>>> 
>>>>> 2012-11-21 17:31:33,254 WARN
>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>> ZooKeeper exception:
>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>> 
>>>>> 2012-11-21 17:31:33,254 INFO org.apache.hadoop.hbase.util.RetryCounter:
>>>>> Sleeping 2000ms before retry #1...
>>>>> 2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client
>>>>> session timed out, have not heard from server in 60010ms for sessionid
>>>>> 0x0, closing socket connection and attempting reconnect
>>>>> 
>>>>> 2012-11-21 17:32:33,362 WARN
>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>> ZooKeeper exception:
>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>> 
>>>>> ......
>>>>> 
>>>>> 2012-11-21 17:34:33,570 ERROR
>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
>>>>> exists
>>>>> failed after 3 retries
>>>>> 2012-11-21 17:34:33,571 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>>> regionserver:60020 Unable to set watcher on znode /hbase/master
>>>>> 2012-11-21 17:34:33,573 ERROR
>>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>>>> Received unexpected KeeperException, re-throwing exception
>>>>> 2012-11-21 17:34:33,573 FATAL
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>> server
>>>>> ......
>>>>> 2012-11-21 17:34:33,576 FATAL
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>>> loaded coprocessors are: []
>>>>> 
>>>>> 2012-11-21 17:34:36,580 FATAL
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>> server
>>>>> m144,60020,1353490232962: Initialization of RS failed.  Hence aborting
>>>>> RS.
>>>>> java.io.IOException: Received the shutdown message while waiting.
>>>>> 	at
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
>>>>> 	at
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
>>>>> 	at
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
>>>>> 	at
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
>>>>> 	at java.lang.Thread.run(Thread.java:662)
>>>>> 2012-11-21 17:34:36,581 FATAL
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>>> loaded coprocessors are: []
>>>>> 
>>>>> 
>>>>> Please help!
>>>>> QUESTION: Is it a bug, or do I need to check something else?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 

