hbase-dev mailing list archives

From Himanshu Vashishtha <hvash...@cs.ualberta.ca>
Subject Re: Replication hosed after simple cluster restart
Date Thu, 14 Mar 2013 03:26:03 GMT
Yes, a patch is available at https://issues.apache.org/jira/browse/HBASE-8099.

On Wed, Mar 13, 2013 at 8:22 PM, Andrew Purtell <apurtell@apache.org> wrote:
> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>
> On Wednesday, March 13, 2013, Ted Yu wrote:
>
>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>> HBASE-2611 Handle RS that fails while processing the failure of another one
>> (Himanshu Vashishtha)
>>
>> It went into 0.94.5
>> And the feature is off by default:
>>
>>     <name>hbase.zookeeper.useMulti</name>
>>     <value>false</value>
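
For reference, a minimal sketch of how a flag like this would be consulted with the standard Hadoop Configuration API; the class name and printed messages below are illustrative only, not HBase code. Only when the value is flipped to true in hbase-site.xml does the copyQueuesFromRSUsingMulti() path discussed here come into play.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Illustrative only: shows how hbase.zookeeper.useMulti is read.
    public class UseMultiFlagCheck {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Defaults to false, matching the hbase-default.xml entry quoted above.
        boolean useMulti = conf.getBoolean("hbase.zookeeper.useMulti", false);
        System.out.println(useMulti
            ? "Queue failover would use atomic ZooKeeper multi operations"
            : "Queue failover would move znodes one at a time");
      }
    }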
>>
>> The fact that Lars was the first to report the following problem suggests that
>> no other user has tried this feature.
>>
>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>
>> Cheers
>>
>> > On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <larsh@apache.org> wrote:
>>
>> > Hey no problem. It's cool that we found it in a test env. It's probably
>> > quite hard to reproduce.
>> > This is in 0.94.5 but this feature is off by default.
>> >
>> > What's the general thought here: should I kill the current 0.94.6 RC for
>> > this?
>> > My gut says: Yes.
>> >
>> >
>> > I'm also a bit worried about these:
>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> > java.io.EOFException
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>> >
>> > This happens after bouncing the cluster a 2nd time and these messages
>> > repeat every 10s (for hours now). This is a separate problem I think.
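
For context, this is the failure signature you get when a WAL's header cannot be read, for example from a zero-length file left behind by the restart; below is a minimal local-filesystem sketch that reproduces the same EOFException path. The file name is made up, and the zero-length-file cause is an assumption, not something confirmed in the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    // Illustrative only: an empty SequenceFile fails in Reader.init() when
    // readFully() tries to pull in the file header, like the trace above.
    public class EmptyWalEofDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path empty = new Path("/tmp/empty-wal-demo");  // hypothetical path
        fs.create(empty, true).close();                // zero bytes written
        try {
          new SequenceFile.Reader(fs, empty, conf);    // header read -> readFully()
        } catch (java.io.EOFException e) {
          System.out.println("Got the same EOFException: " + e);
        }
      }
    }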
>> >
>> > -- Lars
>> >
>> >   ------------------------------
>> > *From:* Himanshu Vashishtha <hvashish@cs.ualberta.ca>
>> >
>> > *To:* dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> > *Cc:* Ted Yu <yuzhihong@gmail.com>
>> > *Sent:* Wednesday, March 13, 2013 6:38 PM
>> >
>> > *Subject:* Re: Replication hosed after simple cluster restart
>> >
>> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list of queues that
>> > it might not actually be able to move later on, resulting in bogus znodes.
>> > I'll fix this ASAP. Weird that it didn't happen in my testing earlier.
>> > Sorry about this.
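
To make the failure mode concrete, here is a rough sketch of the idea behind a multi-based queue move: claim the dead region server's replication queues under the new owner and delete the originals in one atomic ZooKeeper multi() call. The znode layout and method below are simplified assumptions for illustration, not the actual HBase implementation; the point is that if the list a caller acts on and what was actually transferred ever disagree, stale (bogus) znodes are left behind for other region servers to trip over.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Simplified sketch of moving a dead RS's replication queues with multi().
    public class QueueMoveSketch {
      static void moveQueues(ZooKeeper zk, String deadRs, String newRs) throws Exception {
        String src = "/hbase/replication/rs/" + deadRs;   // assumed znode layout
        String dst = "/hbase/replication/rs/" + newRs;
        List<Op> ops = new ArrayList<Op>();
        for (String queue : zk.getChildren(src, false)) {
          byte[] data = zk.getData(src + "/" + queue, false, null);
          // Re-create the queue under the new owner, then delete the original.
          ops.add(Op.create(dst + "/" + queue + "-" + deadRs, data,
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
          ops.add(Op.delete(src + "/" + queue, -1));
        }
        // Either every create+delete succeeds or none do. Trouble starts when a
        // caller acts on a returned list of queues that this call never actually
        // transferred -- those source znodes become the bogus leftovers.
        zk.multi(ops);
      }
    }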
>> >
>> >
>> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <larsh@apache.org> wrote:
>> > > Sorry, 0.94.6RC1.
>> > > (I complain about folks not reporting the version all the time, and then
>> > > I do it too.)
>> > >
>> > >
>> > >
>> > > ________________________________
>> > >  From: Ted Yu <yuzhihong@gmail.com>
>> > > To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> > > Sent: Wednesday, March 13, 2013 6:17 PM
>> > > Subject: Re: Replication hosed after simple cluster restart
>> > >
>> > >
>> > > Did this happen on 0.94.5 ?
>> > >
>> > > Thanks
>> > >
>> > >
>> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <larsh@apache.org> wrote:
>> > >
>> > >> We just ran into an interesting scenario. We restarted a cluster that
>> > >> was set up as a replication source.
>> > >> The stop went cleanly.
>> > >>
>> > >> Upon restart *all* regionservers aborted within a few seconds with
>> > >> variations of these errors:
>> > >> http://pastebin.com/3iQVuBqS
>> > >>
>> > >> This is scary!
>> > >>
>> > >> -- Lars
>> >
>> >
>> >
>>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
