hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Recovering hbase after a failure
Date Thu, 02 Oct 2014 18:54:14 GMT
Depending on which version they are using, the RS should be retrying the
operation to HDFS, as we currently do. Eventually clients should be rejected
due to maxing out the call queue. The question is how long we should keep
the RS up until HDFS or the filesystem structure is back. Worst case
scenario, we could provide a last-resort option to drain the memstore or the
WAL before the RS goes down when there is no filesystem available.
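
To sketch that last-resort idea (purely hypothetical: the retry bound, the
local drain path, and the MemStoreSnapshot interface below do not exist in
HBase and are made up for illustration):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /** Hypothetical last-resort flush: retry HDFS, then drain locally. */
    final class LastResortFlush {
      /** Stand-in for a memstore snapshot; not a real HBase type. */
      interface MemStoreSnapshot {
        String name();
        void flushToHdfs() throws IOException;     // the normal flush path
        void writeTo(OutputStream out) throws IOException;
      }

      private static final int MAX_RETRIES = 10;   // how long to keep the RS up
      private static final long RETRY_SLEEP_MS = 30_000L;

      static void flushOrDrain(MemStoreSnapshot snapshot)
          throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
          try {
            snapshot.flushToHdfs();
            return;                                // HDFS came back; done
          } catch (IOException e) {
            Thread.sleep(RETRY_SLEEP_MS);          // filesystem may still recover
          }
        }
        // Last resort: persist the snapshot to local disk so the edits
        // survive the abort and can be replayed once /hbase is restored.
        Path local = Paths.get("/var/lib/hbase/drain", snapshot.name());
        Files.createDirectories(local.getParent());
        try (OutputStream out = Files.newOutputStream(local)) {
          snapshot.writeTo(out);
        }
      }
    }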

esteban.

--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 11:39 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> In this case, didn't the RSes creating the directories and flushing the
> files prevent data loss? Had the flush aborted due to the lack of
> directories, that flush data would have been lost entirely.
>
> On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <apurtell@apache.org>
> wrote:
>
> > On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> >
> > > Also, once the original /hbase got mv'd, a few of the region servers
> > > did some flushes before they aborted. Those RSes actually created a
> > > new /hbase, with new table directories, but only containing the data
> > > from the flush.
> >
> >
> > Sounds like we should be creating flush files with createNonRecursive
> > (even though it's deprecated).
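> >
> > Roughly the difference, as a sketch (the path and Configuration here
> > are illustrative, not from this incident):
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.Path;
> >
> >     public class NonRecursiveCreateDemo {
> >       public static void main(String[] args) throws IOException {
> >         FileSystem fs = FileSystem.get(new Configuration());
> >         Path hfile = new Path("/hbase/data/default/t1/r1/cf/f1");  // illustrative
> >         // create() silently mkdirs any missing parents -- which is how
> >         // the aborting RSes manufactured a brand-new /hbase tree:
> >         fs.create(hfile).close();
> >         // createNonRecursive() (deprecated) throws FileNotFoundException
> >         // if the parent directory is gone, so the flush would fail fast
> >         // instead of writing into a phantom /hbase:
> >         fs.createNonRecursive(hfile, true, 4096, (short) 3,
> >             64L * 1024 * 1024, null).close();
> >       }
> >     }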
> >
> >
> > On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> >
> > > FWIW, in case something like this happens to someone else.
> > >
> > > To recover this, the first thing I tried was to just mv the /hbase
> > > directory back. That doesn’t work.
> > >
> > > To get back going, we had to completely shut down and restart.
> > >
> > > Also, once the original /hbase got mv'd, a few of the region servers
> > > did some flushes before they aborted. Those RSes actually created a
> > > new /hbase, with new table directories, but only containing the data
> > > from the flush.
> > >
> > >
> > > -----Original Message-----
> > > From: Buckley,Ron
> > > Sent: Thursday, October 02, 2014 2:09 PM
> > > To: hbase-user
> > > Subject: RE: Recovering hbase after a failure
> > >
> > > Nick,
> > >
> > > Good ideas. Compared file and region counts with our DR site. Things
> > > look OK. Going to run some rowcounters too.
> > >
> > > Feels like we got off easy.
> > >
> > > Ron
> > >
> > > -----Original Message-----
> > > From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
> > > Sent: Thursday, October 02, 2014 1:27 PM
> > > To: hbase-user
> > > Subject: Re: Recovering hbase after a failure
> > >
> > > Hi Ron,
> > >
> > > Yikes!
> > >
> > > Do you have any basic metrics regarding the amount of data in the
> > > system -- size of store files before the incident, number of records,
> > > &c?
> > >
> > > You could sift through the HDFS audit log and see if any files that
> > > were there previously have not been restored.
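> > >
> > > For instance, something like this could flag created-but-now-missing
> > > paths (a rough sketch; the audit log location and line format vary by
> > > install, and transient files like tmp flushes will show up as noise):
> > >
> > >     import java.io.IOException;
> > >     import java.nio.file.Files;
> > >     import java.nio.file.Paths;
> > >     import java.util.regex.Matcher;
> > >     import java.util.regex.Pattern;
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.fs.FileSystem;
> > >     import org.apache.hadoop.fs.Path;
> > >
> > >     public class AuditSift {
> > >       // HDFS audit lines carry "cmd=create src=<path>" fields.
> > >       private static final Pattern CREATE =
> > >           Pattern.compile("cmd=create\\s+src=(/hbase/\\S+)");
> > >
> > >       public static void main(String[] args) throws IOException {
> > >         FileSystem fs = FileSystem.get(new Configuration());
> > >         for (String line : Files.readAllLines(
> > >             Paths.get("/var/log/hadoop-hdfs/hdfs-audit.log"))) {
> > >           Matcher m = CREATE.matcher(line);
> > >           if (m.find() && !fs.exists(new Path(m.group(1)))) {
> > >             System.out.println("missing: " + m.group(1));
> > >           }
> > >         }
> > >       }
> > >     }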
> > >
> > > -n
> > >
> > > On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
> > >
> > > > We just had an event where, on our main HBase instance, the /hbase
> > > > directory got moved out from under the running system (human error).
> > > >
> > > > HBase was really unhappy about that, but we were able to recover it
> > > > fairly easily and get back going.
> > > >
> > > > As far as I can tell, all the data and tables came back correct. But
> > > > I'm pretty concerned that there may be some hidden corruption or
> > > > data loss.
> > > >
> > > > 'hbase hbck' runs clean and there are no new complaints in the logs.
> > > >
> > > > Can anyone think of anything else we should look at?
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
