hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: RS crash upon replication
Date Wed, 22 May 2013 21:20:22 GMT
Basically:

ls /hbase/rs and what do you see for va-p-02-d ?


On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com> wrote:

> Can you do ls /hbase/rs and see what you get for 02-d - instead of looking
> in /replication/, could you look in /hbase/replication/rs - I want to see
> if the timestamps are matching or not ?
>
> Varun
>
>
> On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I see - so it looks okay - there's just a lot of deep nesting in there - if
>> you look into these nodes by doing ls, you should see a bunch of
>> WALs which still need to be replicated...
>>
>> Varun
>>
>>
>> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com> wrote:
>>
>>> 2013-05-22 15:31:25,929 WARN
>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>> ZooKeeper exception:
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired for
>>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>>>
>>> 01->[01->02->02]->01
>>>
>>> Looks like a bunch of cascading failures causing this deep nesting...
>>>
>>>
>>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
>>>
>>>> empty return:
>>>>
>>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
>>>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>> []
>>>>
>>>>
>>>>
>>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com> wrote:
>>>>
>>>> > Do an "ls", not a "get", here and give the output ?
>>>> >
>>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>> >
>>>> >
>>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <
>>>> > amit.mor.mail@gmail.com> wrote:
>>>> >
>>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
>>>> > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>> > >
>>>> > > cZxid = 0x60281c1de
>>>> > > ctime = Wed May 22 15:11:17 EDT 2013
>>>> > > mZxid = 0x60281c1de
>>>> > > mtime = Wed May 22 15:11:17 EDT 2013
>>>> > > pZxid = 0x60281c1de
>>>> > > cversion = 0
>>>> > > dataVersion = 0
>>>> > > aclVersion = 0
>>>> > > ephemeralOwner = 0x0
>>>> > > dataLength = 0
>>>> > > numChildren = 0
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> > >
>>>> > > > What does this command show you ?
>>>> > > >
>>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>> > > >
>>>> > > > Cheers
>>>> > > >
>>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <
>>>> > > > amit.mor.mail@gmail.com> wrote:
>>>> > > >
>>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>>>> > > > > [1]
>>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
>>>> > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>> > > > > []
>>>> > > > >
>>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
>>>> > > > >
>>>> > > > > Thanks
>>>> > > > >
>>>> > > > >
>>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
>>>> > > > >
>>>> > > > > > Also what version of HBase are you running ?
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
>>>> > > > > >
>>>> > > > > > > Basically,
>>>> > > > > > >
>>>> > > > > > > You had va-p-hbase-02 crash - that caused all the replication
>>>> > > > > > > related data in zookeeper to be moved to va-p-hbase-01, which
>>>> > > > > > > took over replicating 02's logs. Now, each region server also
>>>> > > > > > > maintains an in-memory state of what's in ZK; it seems like
>>>> > > > > > > when you start up 01, it's trying to replicate the 02 logs
>>>> > > > > > > underneath but it's failing to because that data is not in ZK.
>>>> > > > > > > This is somewhat weird...
>>>> > > > > > >
>>>> > > > > > > Can you open the ZooKeeper shell and do
>>>> > > > > > >
>>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>>>> > > > > > >
>>>> > > > > > > And give the output ?
>>>> > > > > > >
>>>> > > > > > >
>>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
>>>> > > > > > >
>>>> > > > > > >> Hi,
>>>> > > > > > >>
>>>> > > > > > >> This is bad ... and it happened twice: I had my
>>>> > > > > > >> replication-slave cluster offlined. I performed quite a
>>>> > > > > > >> massive Merge operation on it, and after a couple of hours it
>>>> > > > > > >> had finished and I returned it back online. At the same time,
>>>> > > > > > >> the replication-master RS machines crashed (see the first
>>>> > > > > > >> crash: http://pastebin.com/1msNZ2tH) with the first exception
>>>> > > > > > >> being:
>>>> > > > > > >>
>>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException:
>>>> > > > > > >> KeeperErrorCode = NoNode for
>>>> > > > > > >> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>> > > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>>>> > > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>>>> > > > > > >>
>>>> > > > > > >> Before restarting the crashed RS's, I applied a
>>>> > > > > > >> 'stop_replication' command, then fired up the RS's again.
>>>> > > > > > >> They started o.k., but once I hit 'start_replication' they
>>>> > > > > > >> crashed once again. The second crash log
>>>> > > > > > >> (http://pastebin.com/8Nb5epJJ) has the same initial exception
>>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
>>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed region
>>>> > > > > > >> servers again without replication and currently all is well,
>>>> > > > > > >> but I need to start replication asap.
>>>> > > > > > >>
>>>> > > > > > >> Does anyone have an idea what's going on and how I can solve it ?
>>>> > > > > > >>
>>>> > > > > > >> Thanks,
>>>> > > > > > >> Amit
>>>> > > > > > >>
>>>> > > > > > >
>>>> > > > > > >
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>
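For anyone decoding the znode names quoted in this thread: in 0.94-era replication state, a region server znode is <host>,<port>,<startcode> (startcode being the server's start time in epoch milliseconds, which is what the "timestamps matching" check above compares), and a recovered queue under /hbase/replication/rs is the peer id followed by the chain of dead servers that previously owned it. A minimal sketch to pull those names apart - the parse_server/parse_queue helpers are illustrative, not HBase API:

```python
import re
from datetime import datetime, timezone

# Znode naming assumed here (HBase 0.94-era replication state in ZK):
#   region server:   <host>,<port>,<startcode>
#   recovered queue: <peerId>-<deadRS>-<deadRS>-...   (one dead server is
#                    appended each time the queue is claimed after a crash)
# parse_server/parse_queue are illustrative helpers, not HBase API.

def parse_server(name):
    """Split '<host>,<port>,<startcode>' into (host, port, startcode)."""
    host, port, start = re.fullmatch(r"(.+),(\d+),(\d+)", name).groups()
    return host, int(port), int(start)

def parse_queue(znode):
    """Split '<peerId>-<deadRS>-...' into (peer_id, [dead servers]).

    Assumes a simple numeric peer id, as in this thread.
    """
    peer_id, rest = znode.split("-", 1)
    # each dead server ends in ',port,startcode', followed by '-' or end
    servers = re.findall(r"(.+?),(\d+),(\d+)(?:-|$)", rest)
    return peer_id, [(h, int(p), int(s)) for h, p, s in servers]

# The recovered queue from the session-expired warning in this thread.
queue = ("1-va-p-hbase-01-c,60020,1369042378287"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")

peer, dead = parse_queue(queue)
for host, port, start in dead:
    # startcode in epoch milliseconds -> human-readable start time
    when = datetime.fromtimestamp(start / 1000, tz=timezone.utc)
    print(peer, host, port, when.isoformat())
```

Decoding that queue yields peer 1 with the dead-server chain 01-c, 02-c, 02-d, matching the 01->[01->02->02]->01 cascade described earlier in the thread.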
