hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: RS crash upon replication
Date Wed, 22 May 2013 21:32:00 GMT
I believe there were cascading failures, which produced these deeply nested
nodes containing WAL(s) that still need to be replicated. I suspect there is
either a parsing bug or something else that is stopping the replication
source from working. Also, which version are you using, and does it have
https://issues.apache.org/jira/browse/HBASE-8207? I ask because you use
hyphens in your paths. One way to get back up is to delete these nodes, but
then you lose the data in these WAL(s)...
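
To illustrate why hyphens in hostnames matter here, below is a minimal sketch. It assumes the recovered-queue znode name is the peer id followed by the dead servers' names, all joined with "-", which is what the path quoted in this thread looks like. The helpers naive_split and split_queue are hypothetical names made up for illustration; they are not HBase APIs and this is not the HBASE-8207 patch itself.

```python
import re

# Queue znode name copied from the log line quoted in this thread.
QUEUE = ("1-va-p-hbase-01-c,60020,1369042378287"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")


def naive_split(queue_name):
    """Split on every hyphen, as a parser that assumes
    hyphen-free hostnames might do (hypothetical)."""
    return queue_name.split("-")


# A server name is host,port,startcode. The host may contain hyphens,
# but it never contains a comma, so the commas anchor the match.
SERVER = re.compile(r"-([^,]+,\d+,\d+)")


def split_queue(queue_name):
    """Return (peer_id, [server names]) while tolerating hyphens in hosts.
    Assumes the peer id itself contains no hyphen."""
    peer_id, sep, rest = queue_name.partition("-")
    return peer_id, SERVER.findall(sep + rest)


if __name__ == "__main__":
    print(naive_split(QUEUE))   # hostnames are chopped into fragments
    print(split_queue(QUEUE))   # peer id plus three intact server names
```

With the naive split, "va-p-hbase-02-c" falls apart into "va", "p", "hbase", "02" and "c,60020,...", so none of the fragments is a valid server name. Anchoring on the ",port,startcode" suffix recovers the three servers in the chain.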


On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:

>  va-p-hbase-02-d,60020,1369249862401
>
>
> On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <varun@pinterest.com>
> wrote:
>
> > Basically
> >
> > ls /hbase/rs and what do you see for va-p-02-d ?
> >
> >
> > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com>
> wrote:
> >
> > > Can you do ls /hbase/rs and see what you get for 02-d - instead of looking
> > > in /replication/, could you look in /hbase/replication/rs - I want to see
> > > if the timestamps are matching or not ?
> > >
> > > Varun
> > >
> > >
> > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com>
> > wrote:
> > >
> > >> I see - so looks okay - there's just a lot of deep nesting in there -
> > >> if you look into these nodes by doing ls - you should see a bunch of
> > >> WAL(s) which still need to be replicated...
> > >>
> > >> Varun
> > >>
> > >>
> > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com
> > >wrote:
> > >>
> > >>> 2013-05-22 15:31:25,929 WARN
> > >>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > >>> ZooKeeper exception:
> > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >>> KeeperErrorCode = Session expired for
> > >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> > >>>
> > >>> 01->[01->02->02]->01
> > >>>
> > >>> Looks like a bunch of cascading failures causing this deep nesting...
> > >>>
> > >>>
> > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com
> > >wrote:
> > >>>
> > >>>> empty return:
> > >>>>
> > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
> > >>>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >>>> []
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com
> >
> > >>>> wrote:
> > >>>>
> > >>>> > Do an "ls" not a get here and give the output ?
> > >>>> >
> > >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >>>> >
> > >>>> >
> > >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <
> > >>>> > amit.mor.mail@gmail.com> wrote:
> > >>>> >
> > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
> > >>>> > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >>>> > >
> > >>>> > > cZxid = 0x60281c1de
> > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
> > >>>> > > mZxid = 0x60281c1de
> > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
> > >>>> > > pZxid = 0x60281c1de
> > >>>> > > cversion = 0
> > >>>> > > dataVersion = 0
> > >>>> > > aclVersion = 0
> > >>>> > > ephemeralOwner = 0x0
> > >>>> > > dataLength = 0
> > >>>> > > numChildren = 0
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com>
> > >>>> wrote:
> > >>>> > >
> > >>>> > > > What does this command show you ?
> > >>>> > > >
> > >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >>>> > > >
> > >>>> > > > Cheers
> > >>>> > > >
> > >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <
> > >>>> > > > amit.mor.mail@gmail.com> wrote:
> > >>>> > > >
> > >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > >>>> > > > > [1]
> > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >>>> > > > > []
> > >>>> > > > >
> > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
> > >>>> > > > >
> > >>>> > > > > Thanks
> > >>>> > > > >
> > >>>> > > > >
> > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com>
> > >>>> > > > > wrote:
> > >>>> > > > >
> > >>>> > > > > > Also what version of HBase are you running ?
> > >>>> > > > > >
> > >>>> > > > > >
> > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com>
> > >>>> > > > > > wrote:
> > >>>> > > > > >
> > >>>> > > > > > > Basically,
> > >>>> > > > > > >
> > >>>> > > > > > > You had va-p-hbase-02 crash - that caused all the
> > >>>> > > > > > > replication related data in zookeeper to be moved to
> > >>>> > > > > > > va-p-hbase-01 and have it take over for replicating 02's
> > >>>> > > > > > > logs. Now each region server also maintains an in-memory
> > >>>> > > > > > > state of what's in ZK. It seems like when you start up 01,
> > >>>> > > > > > > it's trying to replicate the 02 logs underneath but it's
> > >>>> > > > > > > failing to because that data is not in ZK. This is
> > >>>> > > > > > > somewhat weird...
> > >>>> > > > > > >
> > >>>> > > > > > > Can you open the zookeeper shell and do
> > >>>> > > > > > >
> > >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > >>>> > > > > > >
> > >>>> > > > > > > And give the output ?
> > >>>> > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <
> > >>>> > > > > > > amit.mor.mail@gmail.com> wrote:
> > >>>> > > > > > >
> > >>>> > > > > > >> Hi,
> > >>>> > > > > > >>
> > >>>> > > > > > >> This is bad ... and happened twice: I had my replication-slave
> > >>>> > > > > > >> cluster offlined. I performed quite a massive Merge operation on
> > >>>> > > > > > >> it and after a couple of hours it had finished and I returned it
> > >>>> > > > > > >> back online. At the same time, the replication-master RS machines
> > >>>> > > > > > >> crashed (see first crash http://pastebin.com/1msNZ2tH) with the
> > >>>> > > > > > >> first exception being:
> > >>>> > > > > > >>
> > >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > >>>> > > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> > >>>> > > > > > >>
> > >>>> > > > > > >> Before restarting the crashed RS's, I have applied a
> > >>>> > > > > > >> 'stop_replication' cmd. Then fired up the RS's again. They've
> > >>>> > > > > > >> started o.k. but once I've hit 'start_replication' they have
> > >>>> > > > > > >> crashed once again. The second crash log
> > >>>> > > > > > >> http://pastebin.com/8Nb5epJJ has the same initial exception
> > >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
> > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed region
> > >>>> > > > > > >> servers again without replication and currently all is well,
> > >>>> > > > > > >> but I need to start replication asap.
> > >>>> > > > > > >>
> > >>>> > > > > > >> Does anyone have an idea what's going on and how I can solve
> > >>>> > > > > > >> it ?
> > >>>> > > > > > >>
> > >>>> > > > > > >> Thanks,
> > >>>> > > > > > >> Amit
> > >>>> > > > > > >>
> > >>>> > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
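
The timestamp comparison asked for in the thread (the entry under /hbase/rs versus the names under /hbase/replication/rs) can be sketched as follows. This is only an illustration: it relies on the server-name layout host,port,startcode, and the helpers parse_server_name and is_older_incarnation are hypothetical, not HBase or ZooKeeper APIs. The two server names are copied from the thread.

```python
def parse_server_name(server_name):
    """Split 'host,port,startcode' into its three parts.
    rsplit keeps any hyphens or dots in the host intact."""
    host, port, startcode = server_name.rsplit(",", 2)
    return host, int(port), int(startcode)


def is_older_incarnation(queued, live):
    """True when `queued` names the same host and port as `live`
    but carries an earlier start code (a previous process)."""
    q_host, q_port, q_start = parse_server_name(queued)
    l_host, l_port, l_start = parse_server_name(live)
    return (q_host, q_port) == (l_host, l_port) and q_start < l_start


# Reported from `ls /hbase/rs` in the thread.
LIVE = "va-p-hbase-02-d,60020,1369249862401"
# Last server in the nested queue name under /hbase/replication/rs.
QUEUED = "va-p-hbase-02-d,60020,1369233252475"

if __name__ == "__main__":
    print(is_older_incarnation(QUEUED, LIVE))
```

Here the start codes differ, so the 02-d entry in the queue name refers to an earlier incarnation of that region server rather than the one currently registered under /hbase/rs.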
