hbase-dev mailing list archives

From: lars hofhansl <la...@apache.org>
Subject: Re: Replication hosed after simple cluster restart
Date: Thu, 14 Mar 2013 03:48:39 GMT
Yeah, lemme sink the RC... We do have a fix.


Consider it sunk.

In the end there are some more issues to discuss anyway:
- Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can
actually lose data to replicate this way (one RS is shut down, another takes over and is
itself shut down) - unless I misunderstand.

- Should we stagger the attempts to move the queues, for example with a random wait between
0 and 10s, so that not all RSs try at the same time? (See the sketch below.)
- A test for this scenario? (That's probably tricky)
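
Something like this, as a rough sketch (the method and hook names here are
made up for illustration, not the actual ReplicationSourceManager code):

import java.util.Random;

public class StaggeredTakeover {
  private static final Random RANDOM = new Random();

  // Sleep a random 0-10s before trying to claim a dead RS's replication
  // queue, so surviving region servers don't all race for the same znodes.
  static void takeOverQueueWithJitter(String deadRsZnode) throws InterruptedException {
    long jitterMs = RANDOM.nextInt(10000); // random wait between 0 and 10s
    Thread.sleep(jitterMs);
    // claimQueue(deadRsZnode); // hypothetical hook into the existing takeover logic
  }
}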


-- Lars



________________________________
 From: Andrew Purtell <apurtell@apache.org>
To: "dev@hbase.apache.org" <dev@hbase.apache.org> 
Sent: Wednesday, March 13, 2013 8:22 PM
Subject: Re: Replication hosed after simple cluster restart
 
If Himanshu (?) can fix it quickly we should try to get it in here IMHO.

On Wednesday, March 13, 2013, Ted Yu wrote:

> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5, and the feature is off by default:
>
>     <property>
>       <name>hbase.zookeeper.useMulti</name>
>       <value>false</value>
>     </property>
>
> The fact that Lars was the first to report this problem suggests that no other
> user has tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <larsh@apache.org>
> wrote:
>
> > Hey no problem. It's cool that we found it in a test env. It's probably
> > quite hard to reproduce.
> > This is in 0.94.5 but this feature is off by default.
> >
> > What's the general thought here, should I kill the current 0.94.6 rc for
> > this?
> > My gut says: Yes.
> >
> >
> > I'm also a bit worried about these:
> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
> > java.io.EOFException
> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
> >
> > This happens after bouncing the cluster a second time, and these messages
> > repeat every 10s (for hours now). This is a separate problem, I think.
> >
> > -- Lars
> >
> >   ------------------------------
> > *From:* Himanshu Vashishtha <hvashish@cs.ualberta.ca>
> >
> > *To:* dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > *Cc:* Ted Yu <yuzhihong@gmail.com>
> > *Sent:* Wednesday, March 13, 2013 6:38 PM
> >
> > *Subject:* Re: Replication hosed after simple cluster restart
> >
> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
> > might not be able to move later on, resulting in bogus znodes.
> > I'll fix this asap. Weird that it didn't happen in my testing earlier.
> > Sorry about this.
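> >
> > For reference, a minimal sketch of the atomic move that multi is meant to
> > give us, assuming the stock ZooKeeper multi() API (the paths and names
> > here are illustrative, not the actual ReplicationZookeeper code):
> >
> > import java.util.ArrayList;
> > import java.util.List;
> > import org.apache.zookeeper.CreateMode;
> > import org.apache.zookeeper.KeeperException;
> > import org.apache.zookeeper.Op;
> > import org.apache.zookeeper.ZooDefs.Ids;
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class QueueMoveSketch {
> >   // Copy each log znode under our own queue and delete the originals in
> >   // one multi() call: either the whole queue moves or nothing changes,
> >   // so no bogus half-moved znodes are left behind.
> >   static void claimQueue(ZooKeeper zk, String deadRsQueue, String myQueue,
> >       List<String> logs) throws KeeperException, InterruptedException {
> >     List<Op> ops = new ArrayList<Op>();
> >     for (String log : logs) {
> >       byte[] position = zk.getData(deadRsQueue + "/" + log, false, null);
> >       ops.add(Op.create(myQueue + "/" + log, position,
> >           Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
> >       ops.add(Op.delete(deadRsQueue + "/" + log, -1));
> >     }
> >     ops.add(Op.delete(deadRsQueue, -1)); // remove the now-empty queue znode
> >     zk.multi(ops);
> >   }
> > }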
> >
> >
> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <larsh@apache.org>
> wrote:
> > > Sorry 0.94.6RC1
> > > (I complain about folks not reporting the version all the time, and
> then
> > I do it too)
> > >
> > >
> > >
> > > ________________________________
> > >  From: Ted Yu <yuzhihong@gmail.com>
> > > To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > > Sent: Wednesday, March 13, 2013 6:17 PM
> > > Subject: Re: Replication hosed after simple cluster restart
> > >
> > >
> > > Did this happen on 0.94.5 ?
> > >
> > > Thanks
> > >
> > >
> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <larsh@apache.org>
> wrote:
> > >
> > > We just ran into an interesting scenario. We restarted a cluster that
> > was set up as a replication source.
> > >>The stop went cleanly.
> > >>
> > >>Upon restart *all* regionservers aborted within a few seconds with
> > variations of these errors:
> > >>http://pastebin.com/3iQVuBqS
> > >>
> > >>This is scary!
> > >>
> > >>-- Lars
> >
> >
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)