hbase-user mailing list archives

From Amit Mor <amit.mor.m...@gmail.com>
Subject Re: RS crash upon replication
Date Wed, 22 May 2013 22:00:13 GMT
Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar, and the
HBASE-8207 issue is present in the source code, namely:

String[] parts = peerClusterZnode.split("-");
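
To make the effect on our host names concrete, here is a minimal sketch. It is not the HBase source and not the HBASE-8207 patch; the class and method names are hypothetical. It assumes the recovered-queue znode name has the form <peerId>-<server>-<server>..., that the peer id contains no hyphen, and that each server is written as <host>,<port>,<startcode>, which matches the znode names in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not the HBase source and not the HBASE-8207 patch.
final class RecoveredQueueZnodeDemo {

    // The approach quoted above: every hyphen is treated as a separator.
    static String[] naiveSplit(String peerClusterZnode) {
        return peerClusterZnode.split("-");
    }

    // Hypothetical hyphen-tolerant parse. Assumes "<peerId>-<server>-<server>...",
    // a peer id without hyphens, and servers shaped like "<host>,<port>,<startcode>",
    // so a server name ends at the first hyphen after its second comma.
    static List<String> parseDeadServers(String peerClusterZnode) {
        List<String> servers = new ArrayList<>();
        int firstHyphen = peerClusterZnode.indexOf('-');
        if (firstHyphen < 0) {
            return servers; // plain peer id, so not a recovered queue
        }
        int start = firstHyphen + 1;
        while (start < peerClusterZnode.length()) {
            int firstComma = peerClusterZnode.indexOf(',', start);
            int secondComma = firstComma < 0 ? -1 : peerClusterZnode.indexOf(',', firstComma + 1);
            int end = secondComma < 0 ? -1 : peerClusterZnode.indexOf('-', secondComma + 1);
            if (end < 0) {
                end = peerClusterZnode.length(); // last (or malformed) entry: take the rest
            }
            servers.add(peerClusterZnode.substring(start, end));
            start = end + 1;
        }
        return servers;
    }

    public static void main(String[] args) {
        // Znode name taken from the zkcli listing quoted in this thread.
        String znode = "1-va-p-hbase-02-e,60020,1369042377129"
                + "-va-p-hbase-02-c,60020,1369042377731"
                + "-va-p-hbase-02-d,60020,1369233252475";
        System.out.println("naive parts: " + naiveSplit(znode).length);
        for (String server : parseDeadServers(znode)) {
            System.out.println("dead server: " + server);
        }
    }
}
```

With the znode name from my zkcli listing, the naive split yields 16 fragments such as "va", "p" and "hbase", while the comma-aware parse recovers the three server names intact.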


On Thu, May 23, 2013 at 12:42 AM, Amit Mor <amit.mor.mail@gmail.com> wrote:

> Yes, indeed - hyphens are part of the host name (annoying legacy stuff in
> my company). It's hbase-0.94.2-cdh4.2.1. I have no idea whether 0.94.6 was
> backported by Cloudera into their flavor of 0.94.2, but
> the mysterious occurrence of the percent sign in zkcli (ls
> /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> might be a sign of such a problem. How deep should my rmr in zkcli go (an
> example would be most welcome :) ? I have no serious problem running
> copyTable with a time period corresponding to the outage and then starting
> the sync back up again. One question though: how did it cause a crash?
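
On the percent sign mentioned above: %2C is the URL-encoded form of a comma, which suggests that the last path component is a URL-encoded server name followed by a timestamp rather than a symptom of the hyphen problem. A minimal Java sketch of the decoding, assuming the name was encoded as UTF-8 (the class and method names are hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative sketch only: decodes the WAL znode name seen in zkcli.
final class WalNameDecodeDemo {

    // Decodes percent-escapes such as %2C (comma), assuming UTF-8 encoding.
    static String decode(String encodedName) {
        try {
            return URLDecoder.decode(encodedName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Last path component from the zkcli listing quoted above.
        System.out.println(decode("va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"));
    }
}
```

Decoded, the name reads va-p-hbase-02-e,60020,1369042377129.1369227474895, i.e. the same host,port,startcode triple that appears in the znode names.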
>
>
> On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I believe there were cascading failures which produced these deep nodes
>> containing still-to-be-replicated WAL(s). I suspect there is either a
>> parsing bug or something else that is preventing the replication source
>> from working. Also, which version are you using, and does it include
>> https://issues.apache.org/jira/browse/HBASE-8207 ? I ask since you use
>> hyphens in your paths. One way to get back up is to delete these nodes,
>> but then you lose the data in these WAL(s)...
>>
>>
>> On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.mail@gmail.com>
>> wrote:
>>
>> >  va-p-hbase-02-d,60020,1369249862401
>> >
>> >
>> > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <varun@pinterest.com>
>> > wrote:
>> >
>> > > Basically
>> > >
>> > > ls /hbase/rs and what do you see for va-p-02-d ?
>> > >
>> > >
>> > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com> wrote:
>> > >
>> > > > Can you do ls /hbase/rs and see what you get for 02-d - instead of looking
>> > > > in /replication/, could you look in /hbase/replication/rs - I want to see
>> > > > if the timestamps are matching or not ?
>> > > >
>> > > > Varun
>> > > >
>> > > >
>> > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com> wrote:
>> > > >
>> > > >> I see - so it looks okay - there's just a lot of deep nesting in there.
>> > > >> If you look into these nodes by doing ls, you should see a bunch of
>> > > >> WAL(s) which still need to be replicated...
>> > > >>
>> > > >> Varun
>> > > >>
>> > > >>
>> > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com> wrote:
>> > > >>
>> > > >>> 2013-05-22 15:31:25,929 WARN
>> > > >>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>> > > >>> ZooKeeper exception:
>> > > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >>> KeeperErrorCode = Session expired for
>> > > >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > >>>
>> > > >>> 01->[01->02->02]->01
>> > > >>>
>> > > >>> Looks like a bunch of cascading failures causing this deep nesting...
>> > > >>>
>> > > >>>
>> > > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
>> > > >>>
>> > > >>>> empty return:
>> > > >>>>
>> > > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
>> > > >>>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> []
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com> wrote:
>> > > >>>>
>> > > >>>> > Do an "ls" not a get here and give the output ?
>> > > >>>> >
>> > > >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> >
>> > > >>>> >
>> > > >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
>> > > >>>> >
>> > > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
>> > > >>>> > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > >
>> > > >>>> > > cZxid = 0x60281c1de
>> > > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
>> > > >>>> > > mZxid = 0x60281c1de
>> > > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
>> > > >>>> > > pZxid = 0x60281c1de
>> > > >>>> > > cversion = 0
>> > > >>>> > > dataVersion = 0
>> > > >>>> > > aclVersion = 0
>> > > >>>> > > ephemeralOwner = 0x0
>> > > >>>> > > dataLength = 0
>> > > >>>> > > numChildren = 0
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > >>>> > >
>> > > >>>> > > > What does this command show you ?
>> > > >>>> > > >
>> > > >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > > >
>> > > >>>> > > > Cheers
>> > > >>>> > > >
>> > > >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
>> > > >>>> > > >
>> > > >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > >>>> > > > > [1]
>> > > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
>> > > >>>> > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > > > > []
>> > > >>>> > > > >
>> > > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
>> > > >>>> > > > >
>> > > >>>> > > > > Thanks
>> > > >>>> > > > >
>> > > >>>> > > > >
>> > > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
>> > > >>>> > > > >
>> > > >>>> > > > > > Also what version of HBase are you running ?
>> > > >>>> > > > > >
>> > > >>>> > > > > >
>> > > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
>> > > >>>> > > > > >
>> > > >>>> > > > > > > Basically,
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > You had va-p-hbase-02 crash - that caused all the replication-related
>> > > >>>> > > > > > > data in ZooKeeper to be moved to va-p-hbase-01, which took over
>> > > >>>> > > > > > > replicating 02's logs. Each region server also maintains an in-memory
>> > > >>>> > > > > > > state of what's in ZK. It seems that when you start up 01, it tries to
>> > > >>>> > > > > > > replicate the 02 logs underneath but fails because that data
>> > > >>>> > > > > > > is not in ZK. This is somewhat weird...
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > Can you open the ZooKeeper shell and do
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > And give the output ?
>> > > >>>> > > > > > >
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
>> > > >>>> > > > > > >
>> > > >>>> > > > > > >> Hi,
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> This is bad ... and it has happened twice: I had taken my
>> > > >>>> > > > > > >> replication-slave cluster offline. I performed quite a massive Merge
>> > > >>>> > > > > > >> operation on it; after a couple of hours it finished and I brought the
>> > > >>>> > > > > > >> cluster back online. At the same time, the replication-master RS
>> > > >>>> > > > > > >> machines crashed (see first crash http://pastebin.com/1msNZ2tH) with
>> > > >>>> > > > > > >> the first exception being:
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> > > >>>> > > > > > >> NoNode for
>> > > >>>> > > > > > >> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>> > > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > >>>> > > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>> > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Before restarting the crashed RSs, I issued a 'stop_replication'
>> > > >>>> > > > > > >> command and then fired up the RSs again. They started o.k., but once I
>> > > >>>> > > > > > >> hit 'start_replication' they crashed again. The second crash log
>> > > >>>> > > > > > >> (http://pastebin.com/8Nb5epJJ) has the same initial exception
>> > > >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
>> > > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed region servers
>> > > >>>> > > > > > >> again without replication and currently all is well, but I need to
>> > > >>>> > > > > > >> start replication asap.
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Does anyone have an idea what's going on and how I can solve it?
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Thanks,
>> > > >>>> > > > > > >> Amit
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >
>> > > >>>> > > > > > >
>> > > >>>> > > > > >
>> > > >>>> > > > >
>> > > >>>> > > >
>> > > >>>> > >
>> > > >>>> >
>> > > >>>>
>> > > >>>
>> > > >>>
>> > > >>
>> > > >
>> > >
>> >
>>
>
>
