Message-ID: <1368164834.16359.YahooMailNeo@web140606.mail.bf1.yahoo.com>
Date: Thu, 9 May 2013 22:47:14 -0700 (PDT)
From: lars hofhansl
Subject: Re: All region server died due to "Parent directory doesn't exist"
To: "dev@hbase.apache.org"

Nope. That does not appear to be the problem.

________________________________
From: Enis Söztutar
To: "dev@hbase.apache.org"; lars hofhansl
Sent: Thursday, May 9, 2013 10:01 PM
Subject: Re: All region server died due to "Parent directory doesn't exist"

But you see the zookeeper session timeout events in RS logs, and the master
says that zk session for the RS's has expired, right?

On Thu, May 9, 2013 at 9:25 PM, lars hofhansl wrote:

> Still looking. Stack and Himanshu are looking too (thanks again!).
>
> What I do know is that it has to do with the fencing mechanism during log
> splitting.
> Until I bounced HDFS and ZK (ZK probably being the culprit) each started
> RegionServer would immediately be fenced off (its log directory renamed).
> Probably by the SSH.
>
> It is not clear what caused the first RS to die.
> While there is no direct evidence, from the logs it looks like the log
> directory was just suddenly renamed.
>
> I'll spend more time in the logs and also watch for this happening again.
>
> We did find another misconfigured cluster that had some services pointed
> at this cluster. It does not look like that was actually a problem - there
> is no evidence in the logs that this actually caused a problem, but it made
> this deploy somewhat "special".
>
> -- Lars
>
> ________________________________
> From: Enis Söztutar
> To: "dev@hbase.apache.org"; lars hofhansl <larsh@apache.org>
> Sent: Thursday, May 9, 2013 6:10 PM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Were we able to find the root cause?
>
> On Thu, May 9, 2013 at 11:28 AM, lars hofhansl wrote:
>
> >Good news is that as far as I can tell no data was lost.
> >Eventually all logs were split and replayed.
> >
> >-- Lars
> >
> >----- Original Message -----
> >From: lars hofhansl
> >To: HBase Dev List
> >Cc:
> >Sent: Thursday, May 9, 2013 11:13 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Thanks Stack.
> >
> >I sent the logs.
> >Also, I have since bounced HDFS and ZK and the problem is gone now (I can
> >start RSs again and they stay up). Something got into a weird state.
> >
> >-- Lars
> >
> >________________________________
> >From: Stack
> >To: HBase Dev List; lars hofhansl <larsh@apache.org>
> >Sent: Thursday, May 9, 2013 10:34 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Want to send me a regionserver log Lars?
> >(off-list)
> >St.Ack
> >
> >On Thu, May 9, 2013 at 10:03 AM, lars hofhansl wrote:
> >
> >>Thanks Ted and Varun.
> >>
> >>Let me check on the .META. server.
> >>
> >>The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> >>over the following 10 minutes.
> >>So that would point to a general issue. I did not see any ZK issues but
> >>I'll double check.
> >>
> >>It is just interesting that even now, if I start an RS it aborts within
> >>a minute or two, because of this issue.
> >>
> >>-- Lars
> >>
> >>----- Original Message -----
> >>From: Ted Yu
> >>To: dev@hbase.apache.org
> >>Cc:
> >>Sent: Thursday, May 9, 2013 9:51 AM
> >>Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>
> >>Thanks Varun for sharing your experience.
> >>
> >>Lars:
> >>Was the server carrying .META. functioning properly around the time when
> >>you observed the problem?
> >>
> >>Cheers
> >>
> >>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma wrote:
> >>
> >>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> >>> cluster.
> >>> I am not sure if you are seeing the exact same issue though. We
> >>> did not have mass failures at the same time due to this.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma wrote:
> >>>
> >>> > Btw, I am not 100% sure but I have seen something like this before:
> >>> >
> >>> > 1) ZK connection flakiness causes ephemeral nodes to expire
> >>> > 2) Master detects failure and renames the logs into a splitting directory
> >>> >    - this is intentional so that in case that region server comes back
> >>> >      up, it cannot write to the logs being split
> >>> > 3) Region server dies because the log is renamed
> >>> >
> >>> > So, the yanking away of files is done by the HBase master and is expected
> >>> > if the master feels the server is dead. We found that the Region server
> >>> > logs DFS exceptions like crazy (1000s of them) in that case and we always
> >>> > suspected that this is some kind of DFS error but when we really went up to
> >>> > the point where it started, we found some zookeeper session issues.
> >>> >
> >>> > We had two cases of this - either super high load or NTP/no clock
> >>> > synchronization b/w the clusters causing this issue for us.
> >>> >
> >>> > Thanks
> >>> > Varun
> >>> >
> >>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl wrote:
> >>> >
> >>> >> Thanks Ted.
> >>> >> I'll do the same.
> >>> >>
> >>> >> ----- Original Message -----
> >>> >> From: Ted Yu
> >>> >> To: dev@hbase.apache.org; lars hofhansl
> >>> >> Cc:
> >>> >> Sent: Thursday, May 9, 2013 9:07 AM
> >>> >> Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>> >>
> >>> >> I went through the patch for HBASE-7824 one more time and didn't find
> >>> >> a direct correlation to the issue Lars reported.
> >>> >>
> >>> >> I am going over the other JIRAs in Lars' list.
> >>> >>
> >>> >> Cheers
> >>> >>
> >>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl wrote:
> >>> >>
> >>> >> > I will try. I do not think this is the issue, though.
> >>> >> >
> >>> >> > The master is up in my case.
> >>> >> > Right now the cluster is in a state where each region server aborts itself
> >>> >> > shortly after being started (which coincides with having its log directory
> >>> >> > renamed to ...-splitting).
> >>> >> >
> >>> >> > This is a test cluster and I could just start from scratch...
> >>> >> > This appears to be a serious enough problem, though, and I would like to
> >>> >> > track down the issue.
> >>> >> >
> >>> >> > -- Lars
> >>> >> >
> >>> >> > ----- Original Message -----
> >>> >> > From: Ted Yu
> >>> >> > To: "dev@hbase.apache.org"
> >>> >> > Cc: "dev@hbase.apache.org"
> >>> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >>> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>> >> >
> >>> >> > The config came from HBASE-7824.
> >>> >> >
> >>> >> > There are other JIRAs in Lars' list which are related to log splitting.
> >>> >> >
> >>> >> > I think more investigation is needed.
> >>> >> >
> >>> >> > Cheers
> >>> >> >
> >>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell wrote:
> >>> >> >
> >>> >> > > So that is HBASE-7824, right?
> >>> >> > >
> >>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu wrote:
> >>> >> > >
> >>> >> > >> hbase.master.wait.for.log.splitting
> >>> >> > >
> >>> >> > > --
> >>> >> > > Best regards,
> >>> >> > >
> >>> >> > >   - Andy
> >>> >> > >
> >>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >>> >> > > (via Tom White)
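
The fencing sequence Varun describes in the thread (ZK session expires, the master renames the region server's log directory to a -splitting directory, the region server then fails on its next write and dies) can be illustrated with a toy simulation. This is only a sketch under assumptions: `ToyFS`, `RegionServer`, `master_on_session_expired`, and the `/hbase/.logs/<name>` path layout are hypothetical stand-ins, not HBase or HDFS APIs, and it says nothing about what triggered the first rename in this incident.

```python
class ToyFS:
    """Hypothetical stand-in for HDFS: tracks only which directories exist."""

    def __init__(self):
        self.dirs = set()

    def mkdir(self, path):
        self.dirs.add(path)

    def rename(self, src, dst):
        # Moving the directory makes the old path disappear for its writer.
        self.dirs.remove(src)
        self.dirs.add(dst)

    def create_file(self, parent, name):
        if parent not in self.dirs:
            raise FileNotFoundError(f"Parent directory doesn't exist: {parent}")
        return f"{parent}/{name}"


class RegionServer:
    """Hypothetical region server that writes WAL files into its own log dir."""

    def __init__(self, fs, name):
        self.fs = fs
        self.log_dir = f"/hbase/.logs/{name}"  # illustrative path, an assumption
        self.aborted = False
        fs.mkdir(self.log_dir)

    def roll_log(self, seq):
        # Step 3: once the directory has been renamed, the write fails
        # and the server aborts itself.
        try:
            return self.fs.create_file(self.log_dir, f"wal.{seq}")
        except FileNotFoundError:
            self.aborted = True
            return None


def master_on_session_expired(fs, rs):
    """Step 2: the master fences the server by renaming its log directory."""
    splitting = rs.log_dir + "-splitting"
    fs.rename(rs.log_dir, splitting)
    return splitting


fs = ToyFS()
rs = RegionServer(fs, "rs1")
rs.roll_log(1)                       # succeeds while the directory exists
master_on_session_expired(fs, rs)    # step 1 (session expiry) is assumed to have happened
rs.roll_log(2)                       # fails, so rs.aborted becomes True
```

In this model a freshly started server would be fenced again as long as the master keeps treating it as dead, which matches the symptom of each restarted RS aborting shortly after start.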