hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: All region server died due to "Parent directory doesn't exist"
Date Fri, 10 May 2013 01:10:41 GMT
Were we able to find the root cause?


On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <larsh@apache.org> wrote:

> Good news is that as far as I can tell no data was lost.
> Eventually all logs were split and replayed.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: lars hofhansl <larsh@apache.org>
> To: HBase Dev List <dev@hbase.apache.org>
> Cc:
> Sent: Thursday, May 9, 2013 11:13 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Thanks Stack.
>
> I sent the logs.
> Also, I have since bounced HDFS and ZK and the problem is gone now (I can
> start RSs again and they stay up). Something got into a weird state.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Stack <stack@duboce.net>
> To: HBase Dev List <dev@hbase.apache.org>; lars hofhansl <larsh@apache.org>
> Sent: Thursday, May 9, 2013 10:34 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>
>
> Want to send me a regionserver log Lars? (off-list)
> St.Ack
>
>
>
> On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <larsh@apache.org> wrote:
>
> Thanks Ted and Varun.
> >
> >
> >Let me check on the .META. server.
> >
> >
> >The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> >over the following 10 minutes.
> >So that would point to a general issue. I did not see any ZK issues but
> >I'll double check.
> >
> >
> >It is just interesting that even now, if I start an RS it aborts within
> >a minute or two, because of this issue.
> >
> >
> >-- Lars
> >
> >
> >----- Original Message -----
> >From: Ted Yu <yuzhihong@gmail.com>
> >To: dev@hbase.apache.org
> >
> >Cc:
> >Sent: Thursday, May 9, 2013 9:51 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Thanks Varun for sharing your experience.
> >
> >Lars:
> >Was the server carrying .META. functioning properly around the time when
> >you observed the problem ?
> >
> >Cheers
> >
> >On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <varun@pinterest.com> wrote:
> >
> >> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> >> cluster. I am not sure if you are seeing the exact same issue though. We
> >> did not have mass failures at the same time due to this.
> >>
> >> Thanks
> >> Varun
> >>
> >>
> >> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <varun@pinterest.com> wrote:
> >>
> >> > Btw, I am not 100% sure, but I have seen something like this before:
> >> >
> >> > 1) ZK connection flakiness causes ephemeral nodes to expire
> >> > 2) Master detects failure and renames the logs into a splitting directory
> >> > - this is intentional so that in case that region server comes back up, it
> >> > cannot write to the logs being split
> >> > 3) Region server dies because the log is renamed
> >> >
> >> > So, the yanking away of files is done by the HBase master and is expected
> >> > if the master feels the server is dead. We found that the Region server
> >> > logs DFS exceptions like crazy (1000s of them) in that case and we always
> >> > suspected that this is some kind of DFS error, but when we really went up to
> >> > the point where it started, we found some zookeeper session issues.
> >> >
> >> > We had two cases of this - either super high load or NTP/no clock
> >> > synchronization b/w the clusters causing this issue for us.
> >> >
> >> > Thanks
> >> > Varun
> >> >
> >> >
> >> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <larsh@apache.org> wrote:
> >> >
> >> >> Thanks Ted. I'll do the same.
> >> >>
> >> >>
> >> >> ----- Original Message -----
> >> >> From: Ted Yu <yuzhihong@gmail.com>
> >> >> To: dev@hbase.apache.org; lars hofhansl <larsh@apache.org>
> >> >> Cc:
> >> >> Sent: Thursday, May 9, 2013 9:07 AM
> >> >> Subject: Re: All region server died due to "Parent directory doesn't exist"
> >> >>
> >> >> I went through the patch for HBASE-7824 one more time and didn't find
> >> >> direct correlation to the issue Lars reported.
> >> >>
> >> >> I am going over the other JIRAs in Lars' list.
> >> >>
> >> >> Cheers
> >> >>
> >> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <larsh@apache.org> wrote:
> >> >>
> >> >> > I will try. I do not think this is the issue, though.
> >> >> >
> >> >> > The master is up in my case.
> >> >> > Right now the cluster is in a state where each region server aborts itself
> >> >> > shortly after being started (which coincides with having its log directory
> >> >> > renamed to ...-splitting).
> >> >> >
> >> >> >
> >> >> > This is a test cluster and I could just start from scratch... This appears
> >> >> > to be a serious enough problem, though, and I would like to track down the
> >> >> > issue.
> >> >> >
> >> >> > -- Lars
> >> >> >
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> > From: Ted Yu <yuzhihong@gmail.com>
> >> >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >> >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
> >> >> >
> >> >> > The config came from hbase-7824.
> >> >> >
> >> >> > There are other JIRAs in Lars' list which are related to log splitting.
> >> >> >
> >> >> > I think more investigation is needed.
> >> >> >
> >> >> > Cheers
> >> >> >
> >> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurtell@apache.org> wrote:
> >> >> >
> >> >> > > So that is HBASE-7824, right?
> >> >> > >
> >> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> >> > >
> >> >> > >> hbase.master.wait.for.log.splitting
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Best regards,
> >> >> > >
> >> >> > >   - Andy
> >> >> > >
> >> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> >> > > (via Tom White)
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> >
> >
>
