hbase-user mailing list archives

From "ac@hsk.hk" ...@hsk.hk>
Subject Re: A region server stopped (timeout after trying to connect local Zookeeper)
Date Thu, 22 Nov 2012 01:53:19 GMT
Hi JM, 

Thank you!

It is case sensitive indeed. Simply changing the 'Z' to 'z' in the property name brings back
ALL RegionServers (and an uppercase 'Z' could bring them all down too). I spent a few hours
looking at other areas and hadn't realized this 'Z' effect.
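
For anyone who hits the same symptom, here is a minimal sketch of a check that could have caught this earlier. It is a standalone Python script, not something HBase ships: it parses the text of an hbase-site.xml and reports property names that match an expected name only when case is ignored. The KNOWN_KEYS set and the function name find_case_mismatches are my own assumptions for illustration, and the set lists only the property names that appear in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: a small hand-picked set of property names taken from this thread.
# It is not a complete list of HBase settings.
KNOWN_KEYS = {
    "hbase.zookeeper.quorum",
    "hbase.cluster.distributed",
    "zookeeper.session.timeout",
    "hbase.regionserver.codecs",
}


def find_case_mismatches(xml_text, known_keys=KNOWN_KEYS):
    """Return (name_in_file, expected_name) pairs that differ only by case."""
    by_lower = {key.lower(): key for key in known_keys}
    root = ET.fromstring(xml_text)
    mismatches = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        expected = by_lower.get(name.lower())
        # Flag only names that would match if case were ignored.
        if expected is not None and name != expected:
            mismatches.append((name, expected))
    return mismatches


# Sample built from the configuration posted in this thread.
SAMPLE = """<configuration>
  <property>
    <name>hbase.ZooKeeper.quorum</name>
    <value>m146,m145,m143</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for found, expected in find_case_mismatches(SAMPLE):
        print(f"{found!r} should probably be {expected!r}")
```

Properties inside an XML comment (like the lzo block I commented out) are skipped by the parser, so they are not reported.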

Thanks again.
 

On 22 Nov 2012, at 8:39 AM, Jean-Marc Spaggiari wrote:

> I think the MAIN difference is the uppercase on the property... Seems
> that hbase-site.xml is case sensitive (which seems to be normal in
> Java and unix world).
> 
> You might want to retry by putting back the uppercase to see if this
> was the issue.
> 
> JM
> 
> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>> Hi
>> 
>> I changed the order of ZooKeepers in the value of hbase.zookeeper.quorum,
>> from "m146,m145,m143" to "m143,m145,m146", set timeout from 60000 to 70000,
>> and commented out the lzo property. It works now; here is the diff:
>> 
>> 1) $ diff hbase-site.xml hbase-site.xml.xxx
>> 41,44c41,43
>> <
>> < <property>
>> < <name>hbase.zookeeper.quorum</name>
>> < <value>m143,m145,m146</value>
>> ---
>>> <property>
>>> <name>hbase.ZooKeeper.quorum</name>
>>> <value>m146,m145,m143</value>
>> 49c48,55
>> < <value>70000</value>
>> ---
>>> <value>60000</value>
>>> </property>
>>> 
>>> <!--
>>> /**
>>> <property>
>>> <name>hbase.regionserver.codecs</name>
>>> <value>lzo,gz</value>
>> 50a57,58
>>> **/
>>> -->
>> 
>> The above is the only change made today.
>> 
>> 
>> 2) hbase log:
>> 2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection, connectString=m145:2181,m143:2181,m146:2181
>> sessionTimeout=70000 watcher=regionserver:6$
>> 
>> 
>> I don't know why but it works now. It seems that hbase somehow could not
>> read in hbase-site.xml correctly.
>> 
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> On 22 Nov 2012, at 7:51 AM, Jean-Marc Spaggiari wrote:
>> 
>>> Can you do JPS on your master and look at the logs too?
>>> 
>>> Another thing, can you try with hbase.zookeeper.quorum instead of
>>> hbase.ZooKeeper.quorum?
>>> 
>>> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>>>> Hi,
>>>> 
>>>> Here are my HBase configuration and test:
>>>> 
>>>> 1) {$HBASE_HOME}hbase/conf/hbase-site.xml
>>>> <property>
>>>> <name>hbase.ZooKeeper.quorum</name>
>>>> <value>m146,m145,m143</value>
>>>> </property>
>>>> 
>>>> <property>
>>>> <name>zookeeper.session.timeout</name>
>>>> <value>60000</value>
>>>> </property>
>>>> 
>>>> 
>>>> 2) {$HBASE_HOME}hbase/conf/hbase-env.sh
>>>> export HBASE_MANAGES_ZK=false
>>>> 
>>>> 
>>>> 3) I used " {$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143"  to test the
>>>> connection, it worked
>>>> [zk: m145,m146,m143(CONNECTED) 0]
>>>> 
>>>> 
>>>> 4) from the logs, I found that the connectString was odd, the RegionServer
>>>> did not use the setting of "hbase.ZooKeeper.quorum" in conf/hbase-site.xml,
>>>> it seemed that it always used the default and tried to connect
>>>> "localhost:2181" in the distributed cluster:
>>>> 
>>>> 	2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>> watcher=regionserver:60020
>>>> 	...
>>>> 	2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server localhost/127.0.0.1:2181. Will not attempt to
>>>> authenticate using SASL (Unable to locate a login configura$
>>>> 	...
>>>> 	2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
>>>> for server null, unexpected error, closing socket connection and attempting
>>>> reconnect java.net.ConnectException: Connection refused
>>>> 	...  (remark: it tried above 3 times, then had FATAL error as follows)
>>>> 
>>>> 	2012-11-21 17:21:57,846 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>>> Received unexpected KeeperException, re-throwing exception
>>>> 	...
>>>> 	2012-11-21 17:21:57,847 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>> server
>>>> ...
>>>> 
>>>> 
>>>> 
>>>> Please help.
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> What do you have on your HBase configuration? Are you passing the name
>>>>> of the Quorum servers?
>>>>> $ cat conf/hbase-site.xml
>>>>> ......
>>>>> </property>
>>>>>  <property>
>>>>>    <name>hbase.zookeeper.quorum</name>
>>>>>    <value>cube,latitude,node3</value>
>>>>>    <description>Comma separated list of servers in the ZooKeeper Quorum.
>>>>>    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>>>>>    By default this is set to localhost for local and pseudo-distributed modes
>>>>>    of operation. For a fully-distributed setup, this should be set to a full
>>>>>    list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
>>>>>    this is the list of servers which we will start/stop ZooKeeper on.
>>>>>    </description>
>>>>>  </property>
>>>>> .....
>>>>> 
>>>>> 2012/11/21, ac@hsk.hk <ac@hsk.hk>:
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> I have the following line in /etc/hosts in all servers, should I keep it or
>>>>>> comment it out or ...?
>>>>>> 
>>>>>> 127.0.0.1       localhost
>>>>>> 
>>>>>> Please help.
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 21 Nov 2012, at 7:16 PM, ac@hsk.hk wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> 
>>>>>>> Please help!!
>>>>>>> 
>>>>>>> HBase version: 0.94
>>>>>>> ZooKeeper: 3.4.4
>>>>>>> 
>>>>>>> One of the region servers stopped very quickly after HBase is
>>>>>>> started:
>>>>>>> 
>>>>>>> ### Checked JPS after the HBase cluster was started, could find the
>>>>>>> HRegionServer process (*** there is no ZooKeeper instance running on
>>>>>>> this server ***)
>>>>>>> $ jps
>>>>>>> 24767 Jps
>>>>>>> 18418 TaskTracker
>>>>>>> 24678 HRegionServer
>>>>>>> 18156 DataNode
>>>>>>> 
>>>>>>> ### Waited a while and checked JPS again, the HRegionServer process was gone
>>>>>>> $ jps
>>>>>>> 18418 TaskTracker
>>>>>>> 24784 Jps
>>>>>>> 18156 DataNode
>>>>>>> 
>>>>>>> 
>>>>>>> ### Here is the setting in hbase-site.xml ( enabled
>>>>>>> hbase.cluster.distributed, set up 3 ZooKeepers, timeout= 60000)
>>>>>>> <property>
>>>>>>> <name>hbase.cluster.distributed</name>
>>>>>>> <value>true</value>
>>>>>>> </property>
>>>>>>> 
>>>>>>> <property>
>>>>>>> <name>hbase.ZooKeeper.quorum</name>
>>>>>>> <value>m146,m145,m143</value>
>>>>>>> </property>
>>>>>>> 
>>>>>>> <property>
>>>>>>> <name>zookeeper.session.timeout</name>
>>>>>>> <value>60000</value>
>>>>>>> </property>
>>>>>>> 
>>>>>>> 
>>>>>>> ### hbase-env.sh also tells HBase not to manage a local instance of
>>>>>>> ZooKeeper
>>>>>>> export HBASE_MANAGES_ZK=false
>>>>>>> 
>>>>>>> 
>>>>>>> ### This server can connect to the 3 ZooKeepers:
>>>>>>> ./zkCli.sh -server m145,m146,m143  	==>  [zk: m145,m146,m143(CONNECTED) 0]
>>>>>>> 
>>>>>>> 
>>>>>>> ### Checked the HBase log file and found something odd: it seemed that
>>>>>>> it tried to connect to a local ZooKeeper
>>>>>>> 2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper:
>>>>>>> Initiating
>>>>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>>>>> watcher=regionserver:60020
>>>>>>> 
>>>>>>> 2012-11-21 17:31:33,254 WARN
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>> ZooKeeper exception:
>>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>>>> 
>>>>>>> 2012-11-21 17:31:33,254 INFO
>>>>>>> org.apache.hadoop.hbase.util.RetryCounter:
>>>>>>> Sleeping 2000ms before retry #1...
>>>>>>> 2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client
>>>>>>> session timed out, have not heard from server in 60010ms for sessionid
>>>>>>> 0x0, closing socket connection and attempting reconnect
>>>>>>> 
>>>>>>> 2012-11-21 17:32:33,362 WARN
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>> ZooKeeper exception:
>>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>>>> 
>>>>>>> ......
>>>>>>> 
>>>>>>> 2012-11-21 17:34:33,570 ERROR
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
>>>>>>> exists
>>>>>>> failed after 3 retries
>>>>>>> 2012-11-21 17:34:33,571 WARN
>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>>>>> regionserver:60020 Unable to set watcher on znode /hbase/master
>>>>>>> 2012-11-21 17:34:33,573 ERROR
>>>>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
>>>>>>> regionserver:60020
>>>>>>> Received unexpected KeeperException, re-throwing exception
>>>>>>> 2012-11-21 17:34:33,573 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>>>> server
>>>>>>> ......
>>>>>>> 2012-11-21 17:34:33,576 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
>>>>>>> abort:
>>>>>>> loaded coprocessors are: []
>>>>>>> 
>>>>>>> 2012-11-21 17:34:36,580 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>>>> server m144,60020,1353490232962: Initialization of RS failed.  Hence
>>>>>>> aborting RS.
>>>>>>> java.io.IOException: Received the shutdown message while waiting.
>>>>>>> 	at
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
>>>>>>> 	at
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
>>>>>>> 	at
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
>>>>>>> 	at
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
>>>>>>> 	at java.lang.Thread.run(Thread.java:662)
>>>>>>> 2012-11-21 17:34:36,581 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
>>>>>>> abort:
>>>>>>> loaded coprocessors are: []
>>>>>>> 
>>>>>>> 
>>>>>>> Please help!
>>>>>>> QUESTION: Is it a bug, or do I need to check something else?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
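
On the log side of this thread, the giveaway was connectString=localhost:2181 on a server that was supposed to use a remote quorum. The following is a rough sketch, in Python, of how that line could be checked automatically. The function names connect_hosts and looks_like_default_quorum are hypothetical, and the regular expression assumes the log format shown in the excerpts quoted above.

```python
import re

# Assumption: the connect string appears as "connectString=host:port,host:port"
# followed by whitespace, as in the log excerpts quoted in this thread.
CONNECT_RE = re.compile(r"connectString=(\S+)")


def connect_hosts(log_line):
    """Return the host names from a 'connectString=...' log line, or []."""
    match = CONNECT_RE.search(log_line)
    if match is None:
        return []
    # Each entry looks like "host:port"; keep only the host part.
    return [entry.split(":")[0] for entry in match.group(1).split(",")]


def looks_like_default_quorum(log_line):
    """True when every host in the connect string is the local machine."""
    hosts = connect_hosts(log_line)
    return bool(hosts) and all(h in ("localhost", "127.0.0.1") for h in hosts)


if __name__ == "__main__":
    broken = ("2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: "
              "Initiating client connection, connectString=localhost:2181 "
              "sessionTimeout=60000 watcher=regionserver:60020")
    working = ("2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: "
               "Initiating client connection, "
               "connectString=m145:2181,m143:2181,m146:2181 sessionTimeout=70000")
    for line in (broken, working):
        print(connect_hosts(line), looks_like_default_quorum(line))
```

A localhost-only connect string is normal in standalone or pseudo-distributed mode, so this check is only meaningful when hbase.cluster.distributed is true and ZooKeeper runs on other hosts.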

