hbase-user mailing list archives

From "amit.mor.mail@gmail.com" <amit.mor.m...@gmail.com>
Subject RS crash upon replication
Date Wed, 22 May 2013 20:27:21 GMT
Hi,

This is bad, and it has happened twice. I took my replication-slave cluster
offline and ran a fairly massive Merge operation on it. After a couple of
hours the merge finished and I brought the cluster back online. At that
point the replication-master RS machines crashed (see the first crash log,
http://pastebin.com/1msNZ2tH), with the first exception being:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
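
To make the missing znode easier to read, here is a minimal sketch that splits the path from the exception into its parts and decodes the log file name. It assumes the layout <base>/<region server>/<queue id>/<log file>; that layout and the helper name split_replication_znode are my own assumptions for illustration, not something taken from the HBase code.

```python
from urllib.parse import unquote

# Znode path copied verbatim from the NoNodeException above.
ZNODE = (
    "/hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/"
    "1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/"
    "va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719"
)


def split_replication_znode(path):
    """Split a replication znode path into its last three components.

    Hypothetical helper: assumes the layout
    <base>/<region server>/<queue id>/<log file>.
    """
    base, server, queue_id, log_file = path.rsplit("/", 3)
    return {
        "base": base,
        "server": server,
        "queue_id": queue_id,
        # The log file name is URL-encoded in the path (%2C is a comma).
        "log_file": unquote(log_file),
    }


parts = split_replication_znode(ZNODE)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Read this way, the region server component carries the timestamp 1369233253404, while the log file name carries 1369042378287 for the same host.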

Before restarting the crashed RSs, I issued a 'stop_replication' command and
then started the RSs again. They came up fine, but as soon as I issued
'start_replication' they crashed again. The second crash log,
http://pastebin.com/8Nb5epJJ, has the same initial exception
(org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode). I have restarted the crashed region servers
without replication and currently all is well, but I need to start
replication asap.

Does anyone have an idea what's going on and how I can solve it?

Thanks,
Amit
