Date: Wed, 13 Mar 2013 20:48:39 -0700 (PDT)
From: lars hofhansl
To: "dev@hbase.apache.org"
Subject: Re: Replication hosed after simple cluster restart
Message-ID: <1363232919.19485.YahooMailNeo@web140606.mail.bf1.yahoo.com>
References: <1363223561.19602.YahooMailNeo@web140606.mail.bf1.yahoo.com> <1363224475.25762.YahooMailNeo@web140604.mail.bf1.yahoo.com> <1363225503.81869.YahooMailNeo@web140606.mail.bf1.yahoo.com>

Yeah, lemme sink the RC... We do have a fix.

Consider it sunk.

In the end there are some more issues to discuss anyway.
- Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can actually lose data to replicate this way (one RS is shut down, another takes over and is itself shut down) - unless I misunderstand.
- Should we stagger the attempts to move the queues, for example with a random wait between 0 and 10s, so that not all RSs try at the same time?
- A test for this scenario? (That's probably tricky)


-- Lars



________________________________
From: Andrew Purtell
To: "dev@hbase.apache.org"
Sent: Wednesday, March 13, 2013 8:22 PM
Subject: Re: Replication hosed after simple cluster restart

If Himanshu (?)
can fix it quickly we should try to get it in here IMHO.

On Wednesday, March 13, 2013, Ted Yu wrote:

> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5
> And the feature is off by default:
>
>     <name>hbase.zookeeper.useMulti</name>
>     <value>false</value>
>
> The fact that Lars first reported the following problem meant that no other
> user tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl wrote:
>
> > Hey no problem. It's cool that we found it in a test env. It's probably
> > quite hard to reproduce.
> > This is in 0.94.5 but this feature is off by default.
> >
> > What's the general thought here, should I kill the current 0.94.6 rc for
> > this?
> > My gut says: Yes.
> >
> >
> > I'm also a bit worried about these:
> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
> > java.io.EOFException
> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
> >
> > This happens after bouncing the cluster a 2nd time and these messages
> > repeat every 10s (for hours now). This is a separate problem I think.
> >
> > -- Lars
> >
> >  ------------------------------
> > *From:* Himanshu Vashishtha
> > *To:* dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > *Cc:* Ted Yu
> > *Sent:* Wednesday, March 13, 2013 6:38 PM
> > *Subject:* Re: Replication hosed after simple cluster restart
> >
> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
> > might not be able to move later on, resulting in bogus znodes.
> > I'll fix this asap.
> > Weird it didn't happen in my testing earlier.
> > Sorry about this.
> >
> >
> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl wrote:
> > > Sorry 0.94.6RC1
> > > (I complain about folks not reporting the version all the time, and then
> > > I do it too)
> > >
> > >
> > >
> > > ________________________________
> > > From: Ted Yu
> > > To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > > Sent: Wednesday, March 13, 2013 6:17 PM
> > > Subject: Re: Replication hosed after simple cluster restart
> > >
> > >
> > > Did this happen on 0.94.5?
> > >
> > > Thanks
> > >
> > >
> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl wrote:
> > >
> > > We just ran into an interesting scenario. We restarted a cluster that
> > > was set up as a replication source.
> > >>The stop went cleanly.
> > >>
> > >>Upon restart *all* regionservers aborted within a few seconds with
> > >>variations of these errors:
> > >>http://pastebin.com/3iQVuBqS
> > >>
> > >>This is scary!
> > >>
> > >>-- Lars
> >
> >
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
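
For reference, the staggering idea from the top of this thread (a random wait between 0 and 10s before a regionserver tries to move a dead server's queues) could look roughly like the sketch below. This is only an illustration, not HBase code: the class name, `pickStaggerMillis`, `claimAfterRandomWait`, and the `Runnable` standing in for the actual queue move are all made up for this example. It also assumes that an interrupt during the wait means the server is itself shutting down, in which case the attempt is skipped.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of HBase: delays a queue-takeover attempt
// by a random amount so that not all regionservers try at the same moment.
class StaggeredQueueTakeover {

    // Upper bound of the random wait suggested in the thread (10s).
    static final long MAX_STAGGER_MS = 10_000L;

    // Picks a wait uniformly from [0, maxMillis], both ends inclusive.
    static long pickStaggerMillis(long maxMillis) {
        return ThreadLocalRandom.current().nextLong(maxMillis + 1);
    }

    // Sleeps for a random stagger, then runs the attempt.
    // Returns false (without running the attempt) if the wait was interrupted,
    // which this sketch treats as "this server is shutting down too".
    static boolean claimAfterRandomWait(Runnable claimAttempt, long maxMillis) {
        try {
            TimeUnit.MILLISECONDS.sleep(pickStaggerMillis(maxMillis));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
        claimAttempt.run();
        return true;
    }

    public static void main(String[] args) {
        // Placeholder for the real queue move; a short bound keeps the demo quick.
        boolean attempted = claimAfterRandomWait(
                () -> System.out.println("attempting to take over queues"), 50L);
        System.out.println("attempted = " + attempted);
    }
}
```

A random wait only spreads the attempts out; it does not by itself address the clean-shutdown case raised above, where the server that takes over a queue is shut down shortly afterwards.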