hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: RS crash upon replication
Date Thu, 23 May 2013 07:33:16 GMT
Actually, it seems like something else was wrong here - the servers came up
just fine on trying again - so I could not really reproduce the issue.

Amit: Did you try patching 8207 ?

Varun


On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <hv.csuoa@gmail.com> wrote:

> That sounds like a bug for sure. Could you create a jira with logs/znode
> dump/steps to reproduce it?
>
> Thanks,
> himanshu
>
>
> On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <varun@pinterest.com> wrote:
>
> > It seems I can reproduce this - I did a few rolling restarts and got
> > screwed with NoNode exceptions - I am running 0.94.7 which has the fix but
> > my nodes don't contain hyphens - nodes are no longer coming back up...
> >
> > Thanks
> > Varun
> >
> >
> > On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <hv.csuoa@gmail.com> wrote:
> >
> > > I'd suggest patching the code with HBASE-8207; cdh4.2.1 doesn't have it.
> > >
> > > With hyphens in the name, ReplicationSource gets confused and tries to set
> > > data in a znode which doesn't exist.
> > >
> > > Thanks,
> > > Himanshu
> > >
> > >
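A minimal sketch of the ambiguity Himanshu describes, assuming a recovered-queue znode name has the form peerId-server1-server2-... with each server written as host,port,startcode (as in the paths quoted in this thread). This is not the HBase implementation or the HBASE-8207 patch; the function names are hypothetical. It only shows why splitting on every hyphen breaks once hostnames contain hyphens, while anchoring on the ",port,startcode" suffix does not.

```python
import re

# Queue znode name taken from the zkcli path quoted in this thread.
QUEUE = (
    "1-va-p-hbase-02-e,60020,1369042377129"
    "-va-p-hbase-02-c,60020,1369042377731"
    "-va-p-hbase-02-d,60020,1369233252475"
)


def naive_split(queue_name):
    """Hypothetical naive parser: treats every '-' as a separator."""
    peer_id, *servers = queue_name.split("-")
    return peer_id, servers


# A server name ends with ",<port>,<startcode>" followed by '-' or end of string.
SERVER_RE = re.compile(r"(.+?),(\d+),(\d+)(?:-|$)")


def comma_aware_split(queue_name):
    """Hypothetical parser that uses the ',port,startcode' suffix as the anchor."""
    peer_id, _, rest = queue_name.partition("-")
    servers = [",".join(m.groups()) for m in SERVER_RE.finditer(rest)]
    return peer_id, servers


if __name__ == "__main__":
    print(naive_split(QUEUE))
    print(comma_aware_split(QUEUE))
```

With the queue name above, the naive split returns hostname fragments such as "va", "p" and "hbase" rather than three server names, so any znode path rebuilt from those fragments would point at a node that does not exist.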
> > > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> > >
> > > > yes, indeed - hyphens are part of the host name (annoying legacy stuff in
> > > > my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
> > > > backported by Cloudera into their flavor of 0.94.2, but
> > > > the mysterious occurrence of the percent sign in zkcli (ls
> > > > /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> > > > might be a sign for such problem. How deep should my rmr in zkcli (an
> > > > example would be most welcomed :) be ? I have no serious problem running
> > > > copyTable with a time period corresponding to the outage and then to start
> > > > the sync back again. One question though, how did it cause a crash ?
> > > >
> > > >
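On the percent sign: %2C is the URL percent-encoding of a comma, so the last path component reads as a server name with its commas encoded, followed by a numeric suffix. A minimal sketch that decodes it, assuming the layout host,port,startcode.suffix seen in the path above. The helper name is hypothetical, and this thread does not say what the suffix means.

```python
from urllib.parse import unquote


def decode_log_znode(name):
    """Decode a percent-encoded log znode name into its visible parts.

    Assumes the layout 'host,port,startcode.suffix' once decoded.
    """
    decoded = unquote(name)  # "%2C" becomes ","
    server, _, suffix = decoded.rpartition(".")
    host, port, startcode = server.split(",")
    return {
        "host": host,
        "port": int(port),
        "startcode": int(startcode),
        "suffix": suffix,
    }


if __name__ == "__main__":
    print(decode_log_znode("va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"))
```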
> > > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <varun@pinterest.com> wrote:
> > > >
> > > > > I believe there were cascading failures which got these deep nodes
> > > > > containing still to be replicated WAL(s) - I suspect there is either some
> > > > > parsing bug or something which is causing the replication source to not
> > > > > work - also which version are you using - does it have
> > > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens in
> > > > > your paths. One way to get back up is to delete these nodes but then you
> > > > > lose data in these WAL(s)...
> > > > >
> > > > >
> > > > > On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> > > > >
> > > > > >  va-p-hbase-02-d,60020,1369249862401
> > > > > >
> > > > > >
> > > > > > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > >
> > > > > > > Basically
> > > > > > >
> > > > > > > ls /hbase/rs and what do you see for va-p-02-d ?
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > >
> > > > > > > > Can you do ls /hbase/rs and see what you get for 02-d - instead of
> > > > > > > > looking in /replication/, could you look in /hbase/replication/rs - I
> > > > > > > > want to see if the timestamps are matching or not ?
> > > > > > > >
> > > > > > > > Varun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > > >
> > > > > > > >> I see - so looks okay - there's just a lot of deep nesting in there -
> > > > > > > >> if you look into these nodes by doing ls - you should see a bunch of
> > > > > > > >> WAL(s) which still need to be replicated...
> > > > > > > >>
> > > > > > > >> Varun
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> 2013-05-22 15:31:25,929 WARN
> > > > > > > > >>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > > > > > > > >>> ZooKeeper exception:
> > > > > > > > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > >>> KeeperErrorCode = Session expired for
> > > > > > > > >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> > > > > > > > >>>
> > > > > > > > >>> 01->[01->02->02]->01
> > > > > > > > >>>
> > > > > > > > >>> Looks like a bunch of cascading failures causing this deep nesting...
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> > > > > > > > >>>
> > > > > > > > >>>> empty return:
> > > > > > > > >>>>
> > > > > > > > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > > > > >>>> []
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> > Do an "ls" not a get here and give the output ?
> > > > > > > > >>>> >
> > > > > > > > >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > > > >>>> >
> > > > > > > >>>> >
> > > > > > > > >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> > > > > > > > >>>> >
> > > > > > > > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > > > > >>>> > >
> > > > > > > > >>>> > > cZxid = 0x60281c1de
> > > > > > > > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
> > > > > > > > >>>> > > mZxid = 0x60281c1de
> > > > > > > > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
> > > > > > > > >>>> > > pZxid = 0x60281c1de
> > > > > > > > >>>> > > cversion = 0
> > > > > > > > >>>> > > dataVersion = 0
> > > > > > > > >>>> > > aclVersion = 0
> > > > > > > > >>>> > > ephemeralOwner = 0x0
> > > > > > > > >>>> > > dataLength = 0
> > > > > > > > >>>> > > numChildren = 0
> > > > > > > >>>> > >
> > > > > > > >>>> > >
> > > > > > > >>>> > >
> > > > > > > > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > >>>> > >
> > > > > > > > >>>> > > > What does this command show you ?
> > > > > > > > >>>> > > >
> > > > > > > > >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > > > > >>>> > > >
> > > > > > > > >>>> > > > Cheers
> > > > > > > >>>> > > >
> > > > > > > > >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> > > > > > > > >>>> > > >
> > > > > > > > >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > > > > > > >>>> > > > > [1]
> > > > > > > > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > > > > >>>> > > > > []
> > > > > > > > >>>> > > > >
> > > > > > > > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
> > > > > > > > >>>> > > > >
> > > > > > > > >>>> > > > > Thanks
> > > > > > > >>>> > > > >
> > > > > > > >>>> > > > >
> > > > > > > > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > > > >>>> > > > >
> > > > > > > > >>>> > > > > > Also what version of HBase are you running ?
> > > > > > > > >>>> > > > > >
> > > > > > > > >>>> > > > > >
> > > > > > > > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
> > > > > > > >>>> > > > > >
> > > > > > > > >>>> > > > > > > Basically,
> > > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > > You had va-p-hbase-02 crash - that caused all the replication related data
> > > > > > > > >>>> > > > > > > in zookeeper to be moved to va-p-hbase-01 and have it take over for
> > > > > > > > >>>> > > > > > > replicating 02's logs. Now each region server also maintains an in-memory
> > > > > > > > >>>> > > > > > > state of what's in ZK, it seems like when you start up 01, it's trying to
> > > > > > > > >>>> > > > > > > replicate the 02 logs underneath but it's failing to because that data is
> > > > > > > > >>>> > > > > > > not in ZK. This is somewhat weird...
> > > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > > Can you open the zookeeper shell and do
> > > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > > And give the output ?
> > > > > > > >>>> > > > > > >
> > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> > > > > > > > >>>> > > > > > >
> > > > > > > > >>>> > > > > > >> Hi,
> > > > > > > > >>>> > > > > > >>
> > > > > > > > >>>> > > > > > >> This is bad ... and happened twice: I had my replication-slave cluster
> > > > > > > > >>>> > > > > > >> offlined. I performed quite a massive Merge operation on it and after a
> > > > > > > > >>>> > > > > > >> couple of hours it had finished and I returned it back online. At the same
> > > > > > > > >>>> > > > > > >> time, the replication-master RS machines crashed (see first crash
> > > > > > > > >>>> > > > > > >> http://pastebin.com/1msNZ2tH) with the first exception being:
> > > > > > > > >>>> > > > > > >>
> > > > > > > > >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for
> > > > > > > > >>>> > > > > > >> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> > > > > > > > >>>> > > > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> > > > > > > > >>>> > > > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > >>>> > > > > > >>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> > > > > > > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> > > > > > > > >>>> > > > > > >>
> > > > > > > > >>>> > > > > > >> Before restarting the crashed RS's, I have applied a 'stop_replication'
> > > > > > > > >>>> > > > > > >> cmd. Then fired up the RS's again. They've started o.k. but once I've hit
> > > > > > > > >>>> > > > > > >> 'start_replication' they have crashed once again. The second crash log
> > > > > > > > >>>> > > > > > >> http://pastebin.com/8Nb5epJJ has the same initial exception
> > > > > > > > >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
> > > > > > > > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crash region servers again
> > > > > > > > >>>> > > > > > >> without replication and currently all is well, but I need to start
> > > > > > > > >>>> > > > > > >> replication asap.
> > > > > > > > >>>> > > > > > >>
> > > > > > > > >>>> > > > > > >> Does anyone have an idea what's going on and how can I solve it ?
> > > > > > > > >>>> > > > > > >>
> > > > > > > > >>>> > > > > > >> Thanks,
> > > > > > > > >>>> > > > > > >> Amit
> > > > > > > >>>> > > > > > >>
> > > > > > > >>>> > > > > > >
> > > > > > > >>>> > > > > > >
> > > > > > > >>>> > > > > >
> > > > > > > >>>> > > > >
> > > > > > > >>>> > > >
> > > > > > > >>>> > >
> > > > > > > >>>> >
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
