hbase-dev mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: All region server died due to "Parent directory doesn't exist"
Date Thu, 09 May 2013 16:41:28 GMT
I meant no NTP/clock synchronization between the ZooKeeper quorum and the
HBase cluster. I am not sure if you are seeing the exact same issue though.
We did not have mass failures at the same time due to this.

Thanks
Varun


On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <varun@pinterest.com> wrote:

> Btw, I am not 100% sure, but I have seen something like this before:
>
> 1) ZK connection flakiness causes ephemeral nodes to expire
> 2) Master detects failure and renames the logs into a splitting directory
> - this is intentional so that in case that region server comes back up, it
> cannot write to the logs being split
> 3) Region server dies because the log is renamed
>
> So, the yanking away of files is done by the HBase master and is expected
> if the master believes the server is dead. We found that the region server
> logs DFS exceptions like crazy (1000s of them) in that case, and we always
> suspected some kind of DFS error, but when we traced back to the point
> where it started, we found ZooKeeper session issues.
>
> We had two cases of this: either super high load or missing NTP/clock
> synchronization b/w the clusters caused this issue for us.
>
> Thanks
> Varun
>
>
> On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <larsh@apache.org> wrote:
>
>> Thanks Ted. I'll do the same.
>>
>>
>> ----- Original Message -----
>> From: Ted Yu <yuzhihong@gmail.com>
>> To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> Cc:
>> Sent: Thursday, May 9, 2013 9:07 AM
>> Subject: Re: All region server died due to "Parent directory doesn't
>> exist"
>>
>> I went through the patch for HBASE-7824 one more time and didn't find a
>> direct correlation to the issue Lars reported.
>>
>> I am going over the other JIRAs in Lars' list.
>>
>> Cheers
>>
>> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <larsh@apache.org> wrote:
>>
>> > I will try. I do not think this is the issue, though.
>> >
>> > The master is up in my case.
>> > Right now the cluster is in a state where each region server aborts itself
>> > shortly after being started (which coincides with having its log directory
>> > renamed to ...-splitting).
>> >
>> >
>> > This is a test cluster and I could just start from scratch... This appears
>> > to be a serious enough problem, though, and I would like to track down the
>> > issue.
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ----- Original Message -----
>> > From: Ted Yu <yuzhihong@gmail.com>
>> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> > Sent: Thursday, May 9, 2013 2:04 AM
>> > Subject: Re: All region server died due to "Parent directory doesn't exist"
>> >
>> > The config came from hbase-7824.
>> >
>> > There are other JIRAs in Lars' list which are related to log splitting.
>> >
>> > I think more investigation is needed.
>> >
>> > Cheers
>> >
>> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurtell@apache.org> wrote:
>> >
>> > > So that is HBASE-7824, right?
>> > >
>> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > >
>> > >> hbase.master.wait.for.log.splitting
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > >   - Andy
>> > >
>> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> > > (via Tom White)
>> >
>> >
>>
>>
>
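
For anyone reading this thread later: the sequence Varun describes (the master
renames a server's log directory to ...-splitting, and the still-running
region server then fails on its next log write) can be reproduced in miniature.
The sketch below is only an illustration on a local temporary directory, not
HBase or HDFS code. The helper names fence_log_dir and roll_log, the directory
name regionserver-1, and the log.N file names are all made up for the example;
only the "-splitting" suffix comes from the thread.

```python
import os
import tempfile

SPLITTING_SUFFIX = "-splitting"  # suffix named in the thread


def fence_log_dir(log_dir):
    """Master side (hypothetical helper): rename the server's log directory."""
    fenced = log_dir + SPLITTING_SUFFIX
    os.rename(log_dir, fenced)
    return fenced


def roll_log(log_dir, seq):
    """Region-server side (hypothetical helper): create the next log file."""
    path = os.path.join(log_dir, "log.%d" % seq)
    with open(path, "x") as f:  # fails if the parent directory is gone
        f.write("edit %d\n" % seq)
    return path


def demo():
    with tempfile.TemporaryDirectory() as root:
        log_dir = os.path.join(root, "regionserver-1")  # made-up name
        os.mkdir(log_dir)
        roll_log(log_dir, 1)               # normal operation
        fenced = fence_log_dir(log_dir)    # master decides the server is dead
        kept = sorted(os.listdir(fenced))  # old logs now live under -splitting
        try:
            roll_log(log_dir, 2)           # the server is in fact still alive
            outcome = "write succeeded"
        except FileNotFoundError:
            outcome = "parent directory missing, server would abort"
        return outcome, kept


if __name__ == "__main__":
    print(demo())
```

On a real cluster the failure surfaces as DFS exceptions in the region server
log rather than a local FileNotFoundError, but the ordering is the same: the
rename happens first, and the server only notices on its next write.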
