hbase-user mailing list archives

From Ameya Kantikar <am...@groupon.com>
Subject Under Heavy Write Load + Replication On : Brings All My Region Servers Dead
Date Thu, 18 Apr 2013 05:38:17 GMT
I am running HBase 0.94.2 from Cloudera CDH 4.2 on a 10-machine cluster.

Under heavy write load with replication enabled, all of my region servers
go down.
I checked, and the Cloudera version I am using already includes the patch
for HBASE-2611, so I am not sure what is going on. Here is the stack trace:

2013-04-18 01:47:33,423 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving relevance-hbase5-snc1.snc1,60020,1366247910200's hlogs to my queue

2013-04-18 01:47:33,424 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper:  The multi list size is: 1

2013-04-18 01:47:33,425 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:925)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:901)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:538)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1457)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:585)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


Followed by

2013-04-18 01:47:36,043 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server relevance-hbase2-snc1.snc1,60020,1366247745434: Writing replication status


With replication turned off, everything seems fine. I can reproduce this
bug almost every time I run my write-heavy job.
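
For anyone who wants to check whether their own region server logs show the
same sequence, here is a rough Python sketch. It looks for the WARN from
copyQueuesFromRSUsingMulti carrying a NotEmptyException and reports any FATAL
"ABORTING region server" entry that follows within a short window. The
function names and the 10-second window are my own assumptions for
illustration, not anything from HBase, and the script only pattern-matches
the log text quoted above; it does not diagnose the cause.

```python
import re
import sys
from datetime import datetime

# Matches the start of a log entry, e.g. "2013-04-18 01:47:33,425 WARN".
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (INFO|DEBUG|WARN|ERROR|FATAL)\b"
)


def parse_entries(text):
    """Split log text into (timestamp, level, message) tuples.

    Lines that do not start with a timestamp (wrapped text, stack frames)
    are appended to the message of the preceding entry.
    """
    entries = []
    for line in text.splitlines():
        match = ENTRY_START.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append([ts, match.group(2), line[match.end():].strip()])
        elif entries and line.strip():
            entries[-1][2] += " " + line.strip()
    return [tuple(entry) for entry in entries]


def find_failover_aborts(text, window_seconds=10):
    """Return (warn_time, abort_time, abort_message) for each FATAL abort
    that follows a NotEmptyException failover WARN within the window.

    window_seconds is an arbitrary illustrative default.
    """
    hits = []
    last_warn = None
    for ts, level, msg in parse_entries(text):
        if (level == "WARN"
                and "copyQueuesFromRSUsingMulti" in msg
                and "NotEmptyException" in msg):
            last_warn = ts
        elif level == "FATAL" and "ABORTING region server" in msg:
            if last_warn is not None:
                gap = (ts - last_warn).total_seconds()
                if 0 <= gap <= window_seconds:
                    hits.append((last_warn, ts, msg))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_aborts.py /path/to/regionserver.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for warn_ts, abort_ts, message in find_failover_aborts(handle.read()):
            print(warn_ts, "->", abort_ts, message)
```

Running it against each region server's log should show whether every abort
is preceded by the same failed queue move, or only some of them.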


Here is the complete log:

http://pastebin.com/da0m475T



Any ideas?


Ameya
