hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: RS crash upon replication
Date Thu, 23 May 2013 16:53:01 GMT
But wouldn't a CopyTable between timestamps bring you back? Since the
mutations are all timestamp based, we should be okay - basically doing a
CopyTable that supersedes the downtime interval?
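(Aside, a sketch of that idea: CopyTable takes millisecond timestamps, so a window that supersedes the downtime interval can be computed up front. The table name, peer quorum address, and outage times below are hypothetical placeholders, not taken from this thread.)

```python
from datetime import datetime, timezone

def copytable_cmd(table, peer_adr, outage_start, outage_end, slack_ms=60_000):
    """Build a CopyTable invocation whose time range supersedes the
    downtime interval, padded by `slack_ms` on each side."""
    to_ms = lambda dt: int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)
    start = to_ms(outage_start) - slack_ms
    end = to_ms(outage_end) + slack_ms
    return ("hbase org.apache.hadoop.hbase.mapreduce.CopyTable "
            f"--starttime={start} --endtime={end} "
            f"--peer.adr={peer_adr} {table}")

# Hypothetical three-hour outage window:
cmd = copytable_cmd("mytable", "va-p-zookeeper-01-c:2181:/hbase",
                    datetime(2013, 5, 22, 19, 0), datetime(2013, 5, 22, 22, 0))
print(cmd)
```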


On Thu, May 23, 2013 at 9:48 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> fwiw stop_replication is a kill switch, not a general way to start and
> stop replicating, and start_replication may put you in an inconsistent
> state:
>
> hbase(main):001:0> help 'stop_replication'
> Stops all the replication features. The state in which each
> stream stops in is undetermined.
> WARNING:
> start/stop replication is only meant to be used in critical load
> situations.
>
> On Thu, May 23, 2013 at 1:17 AM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> > No, the servers came out fine just because, after the crash (the RS's -
> > the masters were still running), I immediately pulled the brakes with
> > stop_replication. Then I started the RS's and they came back fine (not
> > replicating). Once I hit 'start_replication' again, they crashed for the
> > second time. Eventually I deleted the heavily nested replication znodes
> > and the 'start_replication' succeeded. I didn't patch 8207 because I'm on
> > CDH with Cloudera Manager's Parcels thing, and I'm still trying to figure
> > out how to replace their jars with mine in a clean and non-intrusive way.
> >
> >
> > On Thu, May 23, 2013 at 10:33 AM, Varun Sharma <varun@pinterest.com> wrote:
> >
> >> Actually, it seems like something else was wrong here - the servers came
> >> up just fine on trying again - so I could not really reproduce the issue.
> >>
> >> Amit: Did you try patching 8207 ?
> >>
> >> Varun
> >>
> >>
> >> On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <hv.csuoa@gmail.com> wrote:
> >>
> >> > That sounds like a bug for sure. Could you create a jira with
> >> > logs/znode dump/steps to reproduce it?
> >> >
> >> > Thanks,
> >> > himanshu
> >> >
> >> >
> >> > On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> >
> >> > > It seems I can reproduce this - I did a few rolling restarts and got
> >> > > screwed with NoNode exceptions. I am running 0.94.7, which has the
> >> > > fix, but my nodes don't contain hyphens - nodes are no longer coming
> >> > > back up...
> >> > >
> >> > > Thanks
> >> > > Varun
> >> > >
> >> > >
> >> > > On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <hv.csuoa@gmail.com> wrote:
> >> > >
> >> > > > I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it.
> >> > > >
> >> > > > With hyphens in the name, ReplicationSource gets confused and tries
> >> > > > to set data in a znode which doesn't exist.
> >> > > >
> >> > > > Thanks,
> >> > > > Himanshu
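(A sketch of why the hyphens matter, inferred from the znode paths in this thread rather than from the HBase source: a recovered queue's znode name chains the peer id and the dead servers' names with '-' as the separator, so hostnames that themselves contain '-' make that separator ambiguous.)

```python
import re

# Recovered-queue znode name, shaped like the paths seen in this thread:
queue = ("1-va-p-hbase-02-e,60020,1369042377129"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")

# Naive parse: treat every '-' as the separator. The hyphenated
# hostnames shatter into far more pieces than 1 peer id + 3 servers.
print(len(queue.split("-")))  # 16

# A hyphen-tolerant parse must anchor on the 'host,port,startcode'
# shape of a server name instead (illustration only, not HBase's code):
peer_id, rest = queue.split("-", 1)
servers = re.findall(r"(?:^|-)([a-z0-9.-]+?,\d+,\d+)", rest)
print(peer_id, servers)  # '1' plus the three dead servers
```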
> >> > > >
> >> > > >
> >> > > > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> >> > > >
> >> > > > > Yes, indeed - hyphens are part of the host name (annoying legacy
> >> > > > > stuff in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea
> >> > > > > if 0.94.6 was backported by Cloudera into their flavor of 0.94.2,
> >> > > > > but the mysterious occurrence of the percent sign in zkcli (ls
> >> > > > > /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> >> > > > > might be a sign of such a problem. How deep should my rmr in
> >> > > > > zkcli be (an example would be most welcome :)? I have no serious
> >> > > > > problem running copyTable with a time period corresponding to the
> >> > > > > outage and then starting the sync back again. One question,
> >> > > > > though: how did it cause a crash?
> >> > > > >
> >> > > > >
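(On the percent sign itself: it is ordinary URL escaping, not corruption. The WAL file name stored under that znode has its commas encoded as %2C; decoding the name from the path above, for example:)

```python
from urllib.parse import unquote

# WAL-file znode name from the path quoted above; %2C is an escaped comma.
znode = "va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"
wal = unquote(znode)
print(wal)  # va-p-hbase-02-e,60020,1369042377129.1369227474895
```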
> >> > > > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > >
> >> > > > > > I believe there were cascading failures which got these deep
> >> > > > > > nodes containing still-to-be-replicated WAL(s). I suspect there
> >> > > > > > is either some parsing bug or something else causing the
> >> > > > > > replication source to not work. Also, which version are you
> >> > > > > > using - does it have
> >> > > > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you
> >> > > > > > use hyphens in your paths? One way to get back up is to delete
> >> > > > > > these nodes, but then you lose data in these WAL(s)...
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> >> > > > > >
> >> > > > > > >  va-p-hbase-02-d,60020,1369249862401
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > >
> >> > > > > > > > Basically,
> >> > > > > > > >
> >> > > > > > > > ls /hbase/rs - and what do you see for va-p-02-d?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > Can you do ls /hbase/rs and see what you get for 02-d?
> >> > > > > > > > > Instead of looking in /replication/, could you look in
> >> > > > > > > > > /hbase/replication/rs - I want to see if the timestamps
> >> > > > > > > > > are matching or not?
> >> > > > > > > > >
> >> > > > > > > > > Varun
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > >> I see - so that looks okay - there's just a lot of deep
> >> > > > > > > > >> nesting in there. If you look into these nodes by doing
> >> > > > > > > > >> ls, you should see a bunch of WAL(s) which still need to
> >> > > > > > > > >> be replicated...
> >> > > > > > > > >>
> >> > > > > > > > >> Varun
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > > >>
> >> > > > > > > > >>> 2013-05-22 15:31:25,929 WARN
> >> > > > > > > > >>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > > > > > > >>> Possibly transient ZooKeeper exception:
> >> > > > > > > > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > >>> KeeperErrorCode = Session expired for
> >> > > > > > > > >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >> > > > > > > > >>>
> >> > > > > > > > >>> 01->[01->02->02]->01
> >> > > > > > > > >>>
> >> > > > > > > > >>> Looks like a bunch of cascading failures causing this
> >> > > > > > > > >>> deep nesting...
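(A toy model of that nesting, paraphrased from the znode paths in this thread rather than taken from the HBase source: each time a queue's current owner dies, the survivor that claims it appends the dead server's name, so cascading failures produce ever-deeper names.)

```python
def claim_queue(queue_name, dead_server):
    """A survivor adopts a dead server's replication queue; the chain of
    previous owners is recorded in the queue's znode name."""
    return f"{queue_name}-{dead_server}"

q = "1"                     # plain queue for peer id 1
q = claim_queue(q, "02-e")  # 02-e dies; its queue is adopted
q = claim_queue(q, "02-c")  # the adopter dies too...
q = claim_queue(q, "02-d")  # ...and again: cascading failures
print(q)  # 1-02-e-02-c-02-d
```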
> >> > > > > > > > >>>
> >> > > > > > > > >>>
> >> > > > > > > > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> >> > > > > > > > >>>
> >> > > > > > > > >>>> empty return:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> []
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> > Do an "ls", not a get, here and give the output?
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > cZxid = 0x60281c1de
> >> > > > > > > > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
> >> > > > > > > > >>>> > > mZxid = 0x60281c1de
> >> > > > > > > > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
> >> > > > > > > > >>>> > > pZxid = 0x60281c1de
> >> > > > > > > > >>>> > > cversion = 0
> >> > > > > > > > >>>> > > dataVersion = 0
> >> > > > > > > > >>>> > > aclVersion = 0
> >> > > > > > > > >>>> > > ephemeralOwner = 0x0
> >> > > > > > > > >>>> > > dataLength = 0
> >> > > > > > > > >>>> > > numChildren = 0
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > > What does this command show you?
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > Cheers
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >> > > > > > > > >>>> > > > > [1]
> >> > > > > > > > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > > > > []
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > Thanks
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > > Also, what version of HBase are you running?
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > > > Basically,
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > You had va-p-hbase-02 crash - that caused all the replication-related
> >> > > > > > > > >>>> > > > > > > data in zookeeper to be moved to va-p-hbase-01, which took over
> >> > > > > > > > >>>> > > > > > > replicating 02's logs. Now, each region server also maintains an
> >> > > > > > > > >>>> > > > > > > in-memory state of what's in ZK. It seems like when you start up 01,
> >> > > > > > > > >>>> > > > > > > it's trying to replicate the 02 logs underneath, but it's failing to
> >> > > > > > > > >>>> > > > > > > because that data is not in ZK. This is somewhat weird...
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > Can you open the zookeeper shell and do
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > And give the output?
> >> > > > > > > > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >> > > > > > > > >>>> > > > > > >> Hi,
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> This is bad ... and happened twice: I had my replication-slave
> >> > > > > > > > >>>> > > > > > >> cluster offlined. I performed quite a massive Merge operation on it,
> >> > > > > > > > >>>> > > > > > >> and after a couple of hours it had finished and I returned it back
> >> > > > > > > > >>>> > > > > > >> online. At the same time, the replication-master RS machines crashed
> >> > > > > > > > >>>> > > > > > >> (see first crash http://pastebin.com/1msNZ2tH) with the first
> >> > > > > > > > >>>> > > > > > >> exception being:
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode
> >> > > > > > > > >>>> > > > > > >> = NoNode for
> >> > > > > > > > >>>> > > > > > >> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >> > > > > > > > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> >> > > > > > > > >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Before restarting the crashed RS's, I applied a 'stop_replication'
> >> > > > > > > > >>>> > > > > > >> cmd, then fired up the RS's again. They started o.k., but once I hit
> >> > > > > > > > >>>> > > > > > >> 'start_replication' they crashed once again. The second crash log,
> >> > > > > > > > >>>> > > > > > >> http://pastebin.com/8Nb5epJJ, has the same initial exception
> >> > > > > > > > >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
> >> > > > > > > > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed region servers
> >> > > > > > > > >>>> > > > > > >> again without replication and currently all is well, but I need to
> >> > > > > > > > >>>> > > > > > >> start replication asap.
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Does anyone have an idea what's going on and how I can solve it?
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Thanks,
> >> > > > > > > > >>>> > > > > > >> Amit
