hbase-user mailing list archives

From "amit.mor.mail@gmail.com" <amit.mor.m...@gmail.com>
Subject Re: RS crash upon replication
Date Wed, 22 May 2013 20:46:25 GMT
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]
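
For comparison, here is a rough Python sketch that pulls apart the znode path from the NoNode exception quoted below and the path listed above. It assumes the usual layout /hbase/replication/rs/<owner>/<queue>/<hlog>, and that a recovered queue is named "<peer id>-<dead server>-<dead server>..." with a peer id that contains no hyphen. The function name and dict keys are made up for illustration and are not part of HBase.

```python
import re
from urllib.parse import unquote

# A server name looks like "host,port,startcode"; in a recovered queue name,
# each one is preceded by "-" (or sits at the start of the remainder).
DEAD_SERVER = re.compile(r"(?:^|-)([^,]+,\d+,\d+)")


def parse_replication_znode(path):
    """Split a replication znode path into its parts (illustrative helper)."""
    parts = path.split("/")
    rs_index = parts.index("rs")
    owner = parts[rs_index + 1]
    queue = parts[rs_index + 2] if len(parts) > rs_index + 2 else None
    hlog = unquote(parts[rs_index + 3]) if len(parts) > rs_index + 3 else None

    peer_id, dead_servers = None, []
    if queue is not None:
        # Assumption: the peer id itself has no hyphen (it is "1" here).
        peer_id, _, rest = queue.partition("-")
        dead_servers = DEAD_SERVER.findall(rest)

    return {
        "owner": owner,
        "start_code": int(owner.rsplit(",", 1)[1]),
        "peer_id": peer_id,
        "dead_servers": dead_servers,
        "hlog": hlog,
    }


# Path from the NoNode exception in the first crash log.
failing = (
    "/hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/"
    "1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/"
    "va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719"
)
# Path I listed in the ZooKeeper shell.
current = "/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1"

failing_info = parse_replication_znode(failing)
current_info = parse_replication_znode(current)

for name, info in (("failing", failing_info), ("current", current_info)):
    print(name, info)

# Different start codes mean the two paths belong to different incarnations
# of the same host.
print("same incarnation:", failing_info["start_code"] == current_info["start_code"])
```

If those assumptions hold, the path in the exception belongs to an earlier incarnation of va-p-hbase-01-c (start code 1369233253404) and names a recovered queue for peer 1, whereas the znode listed above belongs to the current incarnation (start code 1369249873379) and only has the plain queue 1.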

I'm on hbase-0.94.2-cdh4.2.1

Thanks


On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:

> Also, what version of HBase are you running?
>
>
> On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
>
> > Basically,
> >
> > You had va-p-hbase-02 crash. That caused all the replication-related data
> > in ZooKeeper to be moved to va-p-hbase-01, which took over replicating
> > 02's logs. Each region server also maintains an in-memory copy of what's
> > in ZK. It seems that when you start up 01, it tries to replicate the 02
> > logs underneath, but fails because that data is not in ZK. This is
> > somewhat weird...
> >
> > Can you open the ZooKeeper shell and do
> >
> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >
> > And give the output ?
> >
> >
> > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <
> > amit.mor.mail@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> This is bad ... and it happened twice: I had taken my replication-slave
> >> cluster offline and performed quite a massive Merge operation on it.
> >> After a couple of hours it finished and I brought the cluster back
> >> online. At the same time, the replication-master RS machines crashed
> >> (see first crash http://pastebin.com/1msNZ2tH) with the first exception
> >> being:
> >>
> >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> >>
> >> Before restarting the crashed RSs, I issued a 'stop_replication'
> >> command, then fired up the RSs again. They started OK, but once I
> >> issued 'start_replication' they crashed again. The second crash log
> >> http://pastebin.com/8Nb5epJJ has the same initial exception
> >> (org.apache.zookeeper.KeeperException$NoNodeException:
> >> KeeperErrorCode = NoNode). I've started the crashed region servers again
> >> without replication and currently all is well, but I need to start
> >> replication asap.
> >>
> >> Does anyone have an idea what's going on and how I can solve it?
> >>
> >> Thanks,
> >> Amit
> >>
> >
> >
>
