hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Recovering hbase after a failure
Date Thu, 02 Oct 2014 22:02:48 GMT
I get that isDirectory is not atomic and not the best solution, but at
least it gives us a way to fail the operation without using the
deprecated API or altering FileSystem. Another possibility is that we
live with createNonRecursive until FileSystem becomes fully deprecated and
we can migrate to FileContext, perhaps for HBase 3.x? HBASE-11045 goes in
the opposite direction, but the discussion is in essence about the same
problem.
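
For illustration only, the race discussed in this thread can be sketched
on a local filesystem with java.nio.file. This is an analogy under stated
assumptions, not the HDFS or HBase code path: the class name, method names,
and the "hbase"/"table" paths are hypothetical, and Files.createFile merely
stands in for a create call that refuses to create missing parents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical local-filesystem analogy; not the Hadoop FileSystem API.
class NonRecursiveCreateSketch {

    // Recursive create: missing parent directories are silently recreated.
    static void recursiveCreate(Path file) throws IOException {
        Files.createDirectories(file.getParent());
        Files.createFile(file);
    }

    // Non-recursive create: a single call that fails when the parent is gone.
    static boolean nonRecursiveCreate(Path file) throws IOException {
        try {
            Files.createFile(file);
            return true;
        } catch (NoSuchFileException e) {
            return false; // parent directory missing: the caller can abort the flush
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("sketch");
        Path root = base.resolve("hbase");
        Path tableDir = root.resolve("table");
        Files.createDirectories(tableDir);

        // Check-then-act: the directory check passes...
        boolean parentSeen = Files.isDirectory(tableDir);
        // ...then a rename lands before the create (simulated in-line here).
        Files.move(root, base.resolve("hbase.moved"));

        // The non-recursive create detects the missing parent in the create call itself.
        boolean created = nonRecursiveCreate(tableDir.resolve("flush-1"));
        System.out.println("check passed: " + parentSeen
                + ", non-recursive create succeeded: " + created);

        // The recursive create brings back a fresh root holding only the new file.
        recursiveCreate(tableDir.resolve("flush-2"));
        System.out.println("root recreated: " + Files.isDirectory(root));
    }
}
```

The point of the sketch is that the earlier directory check says nothing
about the state of the tree at create time, whereas a create that will not
make parents fails in the same call that would otherwise have written the
file.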

thanks!
esteban.


--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 2:17 PM, Andrew Purtell <apurtell@apache.org> wrote:

> 14 if you count createNewFile :-)
>
> http://search-hadoop.com/m/282AcZLDAp1. Maybe you could tap Andrew or Colin
> on the shoulder, Esteban?
>
>
> On Thu, Oct 2, 2014 at 2:13 PM, Andrew Purtell <apurtell@apache.org> wrote:
>
> > It's not the round trip, it's the atomicity of the operation. Consider a
> > rename happening between the isDirectory call and the subsequent create
> > call. What would you have achieved by introducing the isDirectory check? I
> > skimmed the FileSystem javadoc for 2.4.1 and none of the 13 non-deprecated
> > create methods can provide the same semantics of createNonRecursive, shame.
> >
> >
> > On Thu, Oct 2, 2014 at 11:36 AM, Esteban Gutierrez <esteban@cloudera.com> wrote:
> >
> >> I'm not sure if we should use the deprecated API. Calling isDirectory
> >> shouldn't be that expensive in the NN, but it will add another RPC call per
> >> flush.
> >>
> >> esteban.
> >>
> >> --
> >> Cloudera, Inc.
> >>
> >>
> >> On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <apurtell@apache.org> wrote:
> >>
> >> > On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> >> >
> >> > > Also, once the original /hbase got mv'd, a few of the region servers did
> >> > > some flushes before they aborted. Those RSs actually created a new
> >> > > /hbase, with new table directories, but only containing the data from the
> >> > > flush.
> >> >
> >> >
> >> > Sounds like we should be creating flush files with createNonRecursive (even
> >> > though it's deprecated)
> >> >
> >> >
> >> > On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> >> >
> >> > > FWIW, in case something like this happens to someone else.
> >> > >
> >> > > To recover this, the first thing I tried was to just mv the /hbase
> >> > > directory back.   That doesn’t work.
> >> > >
> >> > > To get going again we had to completely shut down and restart.
> >> > >
> >> > > Also, once the original /hbase got mv'd, a few of the region servers did
> >> > > some flushes before they aborted. Those RSs actually created a new
> >> > > /hbase, with new table directories, but only containing the data from the
> >> > > flush.
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Buckley,Ron
> >> > > Sent: Thursday, October 02, 2014 2:09 PM
> >> > > To: hbase-user
> >> > > Subject: RE: Recovering hbase after a failure
> >> > >
> >> > > Nick,
> >> > >
> >> > > Good ideas. Compared file and region counts with our DR site. Things
> >> > > look OK. Going to run some rowcounters too.
> >> > >
> >> > > Feels like we got off easy.
> >> > >
> >> > > Ron
> >> > >
> >> > > -----Original Message-----
> >> > > From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
> >> > > Sent: Thursday, October 02, 2014 1:27 PM
> >> > > To: hbase-user
> >> > > Subject: Re: Recovering hbase after a failure
> >> > >
> >> > > Hi Ron,
> >> > >
> >> > > Yikes!
> >> > >
> >> > > Do you have any basic metrics regarding the amount of data in the system
> >> > > -- size of store files before the incident, number of records, &c?
> >> > >
> >> > > You could sift through the HDFS audit log and see if any files that were
> >> > > there previously have not been restored.
> >> > >
> >> > > -n
> >> > >
> >> > > On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> >> > >
> >> > > > We just had an event where, on our main hbase instance, the /hbase
> >> > > > directory got moved out from under the running system (human error).
> >> > > >
> >> > > > HBase was really unhappy about that, but we were able to recover it
> >> > > > fairly easily and get back going.
> >> > > >
> >> > > > As far as I can tell, all the data and tables came back correct. But,
> >> > > > I'm pretty concerned that there may be some hidden corruption or data
> >> > > > loss.
> >> > > >
> >> > > > 'hbase hbck' runs clean and there are no new complaints in the logs.
> >> > > >
> >> > > > Can anyone think of anything else we should look at?
> >> >
> >>
> >
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
