hbase-user mailing list archives

From eluiggi <eduardolui...@gmail.com>
Subject ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
Date Mon, 17 Nov 2014 19:21:52 GMT
Hi,

I have an HBase (0.96.1.1-cdh5.0.2) cluster on AWS, managed by Cloudera, with
4 region servers and 1 ZooKeeper server. The ZooKeeper server runs on
the same node as the HBase master. The problem is that 3 of the 4 region
servers are down because they cannot connect to ZooKeeper. The only
region server that stays up is the one running on the same node as the
master and ZooKeeper. Below is the relevant section of the log from one of
the failing region servers.

2014-11-14 15:46:59,871 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=ip-10-146-188-157.ec2.internal:2181
sessionTimeout=60000 watcher=regionserver:60020,
quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase
2014-11-14 15:46:59,915 INFO
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process 
identifier=regionserver:60020 connecting to ZooKeeper
ensemble=ip-10-146-188-157.ec2.internal:2181
2014-11-14 15:46:59,920 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:47:00,649 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
thread: Shutdownhook:regionserver60020
2014-11-14 15:47:59,948 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60041ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:48:00,067 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:48:00,072 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 1000ms before retry #0...
2014-11-14 15:48:01,067 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:49:00,123 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60057ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:49:00,224 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:49:00,224 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 2000ms before retry #1...
2014-11-14 15:49:01,224 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:50:00,259 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60035ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:50:00,360 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:50:00,360 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 4000ms before retry #2...
2014-11-14 15:50:01,360 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:51:00,408 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60048ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:51:00,509 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:51:00,509 INFO org.apache.hadoop.hbase.util.RetryCounter:
Sleeping 8000ms before retry #3...
2014-11-14 15:51:01,509 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:52:00,559 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 60051ms for sessionid 0x0, closing
socket connection and attempting reconnect
2014-11-14 15:52:00,659 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:52:00,660 ERROR
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
failed after 4 attempts
2014-11-14 15:52:00,661 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181,
baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,687 ERROR
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020,
quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Received
unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,692 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
0.0.0.0,60020,1415998019646: Unexpected exception during initialization,
aborting
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
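
For reference, the sleep intervals logged above (1000, 2000, 4000, 8000 ms
for retries #0 to #3) double on each attempt, which is consistent with an
exponential backoff. The following is only a minimal sketch of such a
schedule, not HBase's actual RetryCounter code; the function name
backoff_schedule and its parameters are hypothetical.

```python
from typing import List


def backoff_schedule(base_ms: int, retries: int) -> List[int]:
    """Return sleep intervals (ms) that double on each retry, starting at base_ms."""
    return [base_ms * (2 ** attempt) for attempt in range(retries)]


if __name__ == "__main__":
    # Reproduce the shape of the "Sleeping ...ms before retry #N..." log lines.
    for attempt, sleep_ms in enumerate(backoff_schedule(1000, 4)):
        print(f"Sleeping {sleep_ms}ms before retry #{attempt}...")
```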

The ZooKeeper-related portion of hbase-site.xml is:
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>
<property>
  <name>zookeeper.znode.rootserver</name>
  <value>root-region-server</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
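
As a sanity check, the quorum and client port above should combine into the
connectString that appears in the region server log
(ip-10-146-188-157.ec2.internal:2181). Here is a minimal sketch of that
derivation, assuming the fragment is wrapped in a <configuration> root
element and that the port defaults to 2181 when unset; the helper name
zk_connect_string is hypothetical and is not part of HBase.

```python
import xml.etree.ElementTree as ET

# The two relevant properties from the hbase-site.xml fragment above.
FRAGMENT = """
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
"""


def zk_connect_string(properties_xml: str) -> str:
    """Build a host:port[,host:port...] string from hbase-site.xml properties."""
    # The fragment has no root element, so wrap it before parsing.
    root = ET.fromstring("<configuration>" + properties_xml + "</configuration>")
    props = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    hosts = [h.strip() for h in props["hbase.zookeeper.quorum"].split(",")]
    return ",".join(f"{host}:{port}" for host in hosts)


if __name__ == "__main__":
    print(zk_connect_string(FRAGMENT))
```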

The /etc/hosts file on each of the nodes is:
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6


Following suggestions in other threads, I have removed the limit on the
number of connections, increased the timeout value, and explicitly added the
hosts to /etc/hosts on the region server and master nodes. None of these has
helped so far.
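
One further check that can be run from a failing region server node is to
open a plain TCP connection to the ZooKeeper port and send the "ruok"
four-letter command, to which a healthy server replies "imok". The sketch
below assumes the command is enabled (newer ZooKeeper releases may require
it to be whitelisted); the function name zk_ruok is hypothetical.

```python
import socket
from typing import Optional


def zk_ruok(host: str, port: int = 2181, timeout: float = 5.0) -> Optional[str]:
    """Send ZooKeeper's "ruok" command; return the reply, or None on any failure.

    None covers name-resolution errors, refused connections, and timeouts,
    all of which are raised as OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:  # server closed the connection after replying
                    break
                chunks.append(data)
            return b"".join(chunks).decode("ascii", errors="replace")
    except OSError:
        return None
```

Calling zk_ruok("ip-10-146-188-157.ec2.internal") on a failing node and
getting None would point at name resolution, a firewall or security group
rule, or the address ZooKeeper is listening on, rather than at HBase itself.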

Any help will be greatly appreciated.



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/ConnectionLossException-KeeperErrorCode-ConnectionLoss-for-hbase-master-tp4066034.html
Sent from the HBase User mailing list archive at Nabble.com.