hbase-dev mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: All region server died due to "Parent directory doesn't exist"
Date Thu, 09 May 2013 18:13:50 GMT
Thanks Stack.

I sent the logs.
Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again
and they stay up). Something got into a weird state.


-- Lars



________________________________
 From: Stack <stack@duboce.net>
To: HBase Dev List <dev@hbase.apache.org>; lars hofhansl <larsh@apache.org> 
Sent: Thursday, May 9, 2013 10:34 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"
 


Want to send me a regionserver log Lars? (off-list)
St.Ack



On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <larsh@apache.org> wrote:

>Thanks Ted and Varun.
>
>
>Let me check on the .META. server.
>
>
>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
>So that would point to a general issue. I did not see any ZK issues but I'll double check.
>
>
>It is just interesting that even now, if I start an RS, it aborts within a minute or two because of this issue.
>
>
>-- Lars
>
>
>----- Original Message -----
>From: Ted Yu <yuzhihong@gmail.com>
>To: dev@hbase.apache.org
>
>Cc:
>Sent: Thursday, May 9, 2013 9:51 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>Thanks Varun for sharing your experience.
>
>Lars:
>Was the server carrying .META. functioning properly around the time when
>you observed the problem?
>
>Cheers
>
>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
>> cluster. I am not sure if you are seeing the exact same issue though. We
>> did not have mass failures at the same time due to this.
>>
>> Thanks
>> Varun
>>
>>
>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <varun@pinterest.com> wrote:
>>
>> > Btw, I am not 100% sure, but I have seen something like this before:
>> >
>> > 1) ZK connection flakiness causes ephemeral nodes to expire
>> > 2) Master detects failure and renames the logs into a splitting directory
>> > - this is intentional so that in case that region server comes back up, it
>> > cannot write to the logs being split
>> > 3) Region server dies because the log is renamed
>> >
>> > So, the yanking away of files is done by the HBase master and is expected
>> > if the master feels the server is dead. We found that the region server
>> > logs DFS exceptions like crazy (1000s of them) in that case and we always
>> > suspected that this is some kind of DFS error, but when we traced back to
>> > the point where it started, we found some ZooKeeper session issues.
>> >
>> > We had two cases of this - either super high load or NTP/no clock
>> > synchronization b/w the clusters causing this issue for us.
>> >
>> > Thanks
>> > Varun
>> >
>> >
>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <larsh@apache.org> wrote:
>> >
>> >> Thanks Ted. I'll do the same.
>> >>
>> >>
>> >> ----- Original Message -----
>> >> From: Ted Yu <yuzhihong@gmail.com>
>> >> To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> >> Cc:
>> >> Sent: Thursday, May 9, 2013 9:07 AM
>> >> Subject: Re: All region server died due to "Parent directory doesn't
>> >> exist"
>> >>
>> >> I went through the patch for HBASE-7824 one more time and didn't find
>> >> direct correlation to the issue Lars reported.
>> >>
>> >> I am going over the other JIRAs in Lars' list.
>> >>
>> >> Cheers
>> >>
>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <larsh@apache.org> wrote:
>> >>
>> >> > I will try. I do not think this is the issue, though.
>> >> >
>> >> > The master is up in my case.
>> >> > Right now the cluster is in a state where each region server aborts itself
>> >> > shortly after being started (which coincides with having its log directory
>> >> > renamed to ...-splitting).
>> >> >
>> >> >
>> >> > This is a test cluster and I could just start from scratch... This appears
>> >> > to be a serious enough problem, though, and I would like to track down the
>> >> > issue.
>> >> >
>> >> > -- Lars
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> > From: Ted Yu <yuzhihong@gmail.com>
>> >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> >> > Sent: Thursday, May 9, 2013 2:04 AM
>> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
>> >> >
>> >> > The config came from HBASE-7824.
>> >> >
>> >> > There are other JIRAs in Lars' list which are related to log splitting.
>> >> >
>> >> > I think more investigation is needed.
>> >> >
>> >> > Cheers
>> >> >
>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurtell@apache.org> wrote:
>> >> >
>> >> > > So that is HBASE-7824, right?
>> >> > >
>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >> > >
>> >> > >> hbase.master.wait.for.log.splitting
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > >   - Andy
>> >> > >
>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> >> > > (via Tom White)
>> >> >
>> >> >
>> >>
>> >>
>> >
>>
>
>
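
For readers following the thread, the failure sequence Varun outlines in steps 1) to 3) can be sketched as a toy simulation. This is a minimal Python sketch under stated assumptions, not HBase code: FakeFS, RegionServer, Master, roll_log, on_session_expired and the /hbase/.logs/rs1 path are all hypothetical names chosen for illustration, and an in-memory dictionary stands in for HDFS. It only shows why a region server whose log directory has been renamed to ...-splitting would fail on its next log write with a missing-parent error.

```python
class FakeFS:
    """Toy in-memory filesystem (assumption): directory path -> file names."""

    def __init__(self):
        self.dirs = {}

    def mkdir(self, path):
        self.dirs.setdefault(path, [])

    def rename(self, src, dst):
        # Move the directory entry; the old path no longer exists afterwards.
        self.dirs[dst] = self.dirs.pop(src)

    def create(self, directory, name):
        if directory not in self.dirs:
            raise FileNotFoundError(
                f"Parent directory doesn't exist: {directory}")
        self.dirs[directory].append(name)


class RegionServer:
    """Hypothetical region server that writes log files into its own directory."""

    def __init__(self, fs, name):
        self.fs = fs
        self.log_dir = f"/hbase/.logs/{name}"  # illustrative path only
        self.alive = True
        self.abort_reason = None
        self.seq = 0
        fs.mkdir(self.log_dir)

    def roll_log(self):
        # Step 3: creating a new log file fails once the directory was renamed.
        try:
            self.fs.create(self.log_dir, f"wal.{self.seq}")
            self.seq += 1
        except FileNotFoundError as exc:
            self.alive = False
            self.abort_reason = str(exc)


class Master:
    """Hypothetical master reacting to an expired ZK session."""

    def __init__(self, fs):
        self.fs = fs

    def on_session_expired(self, rs):
        # Step 2: fence the presumed-dead server by renaming its log directory.
        self.fs.rename(rs.log_dir, rs.log_dir + "-splitting")


def simulate():
    fs = FakeFS()
    rs = RegionServer(fs, "rs1")
    master = Master(fs)
    rs.roll_log()                  # normal operation: the write succeeds
    master.on_session_expired(rs)  # step 1 happened: ephemeral node expired
    rs.roll_log()                  # the server is still running and tries to write
    return rs, fs


if __name__ == "__main__":
    rs, fs = simulate()
    print("alive:", rs.alive)
    print("abort reason:", rs.abort_reason)
    print("directories:", sorted(fs.dirs))
```

In this sketch the rename is what makes the server abort, which matches Varun's point that the DFS exceptions are a symptom and the place to look first is whatever made the master consider the server dead.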