From: lars hofhansl
To: Himanshu Vashishtha, "dev@hbase.apache.org"
Cc: Ted Yu
Date: Wed, 13 Mar 2013 18:45:03 -0700 (PDT)
Subject: Re: Replication hosed after simple cluster restart

Hey no problem. It's cool that we found it in a test env.
It's probably quite hard to reproduce.
This is in 0.94.5 but this feature is off by default.

What's the general thought here, should I kill the current 0.94.6 rc for this?
My gut says: Yes.

I'm also a bit worried about these:
2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10

This happens after bouncing the cluster a 2nd time and these messages repeat every 10s (for hours now). This is a separate problem I think.

-- Lars

________________________________
From: Himanshu Vashishtha
To: dev@hbase.apache.org; lars hofhansl
Cc: Ted Yu
Sent: Wednesday, March 13, 2013 6:38 PM
Subject: Re: Replication hosed after simple cluster restart

This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
might not be able to move later on, resulting in bogus znodes.
I'll fix this asap. Weird it didn't happen in my testing earlier.
Sorry about this.

On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl wrote:
> Sorry 0.94.6RC1
> (I complain about folks not reporting the version all the time, and then I do it too)
>
> ________________________________
> From: Ted Yu
> To: dev@hbase.apache.org; lars hofhansl
> Sent: Wednesday, March 13, 2013 6:17 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> Did this happen on 0.94.5 ?
>
> Thanks
>
> On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl wrote:
>
>> We just ran into an interesting scenario. We restarted a cluster that was setup as a replication source.
>> The stop went cleanly.
>>
>> Upon restart *all* regionservers aborted within a few seconds with variations of these errors:
>> http://pastebin.com/3iQVuBqS
>>
>> This is scary!
>>
>> -- Lars