hbase-user mailing list archives

From Andrew Purtell <andrew.purt...@gmail.com>
Subject Re: Recovering hbase after a failure
Date Thu, 02 Oct 2014 19:11:16 GMT
Is there not the WAL to handle a failed flush?



> On Oct 2, 2014, at 11:39 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> 
> In this case, didn't the RS creating the directories and flushing the files
> prevent data loss? Had the flush aborted due to lack of directories, that
> flush data would have been lost entirely.
> 
>> On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <apurtell@apache.org> wrote:
>> 
>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
>> 
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flushes before they aborted. Those RSs actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>> 
>> 
>> Sounds like we should be creating flush files with createNonRecursive (even
>> though it's deprecated)
>> 
>> 
>>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
>>> 
>>> FWIW, in case something like this happens to someone else.
>>> 
>>> To recover this, the first thing I tried was to just mv the /hbase
>>> directory back.   That doesn’t work.
>>> 
>>> To get going again, we had to completely shut down and restart.
>>> 
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flushes before they aborted. Those RSs actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>>> 
>>> 
>>> -----Original Message-----
>>> From: Buckley,Ron
>>> Sent: Thursday, October 02, 2014 2:09 PM
>>> To: hbase-user
>>> Subject: RE: Recovering hbase after a failure
>>> 
>>> Nick,
>>> 
>>> Good ideas. Compared file and region counts with our DR site. Things
>>> look OK. Going to run some rowcounters too.
>>> 
>>> Feels like we got off easy.
>>> 
>>> Ron
>>> 
>>> -----Original Message-----
>>> From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
>>> Sent: Thursday, October 02, 2014 1:27 PM
>>> To: hbase-user
>>> Subject: Re: Recovering hbase after a failure
>>> 
>>> Hi Ron,
>>> 
>>> Yikes!
>>> 
>>> Do you have any basic metrics regarding the amount of data in the system
>>> -- size of store files before the incident, number of records, &c?
>>> 
>>> You could sift through the HDFS audit log and see if any files that were
>>> there previously have not been restored.
>>> 
>>> -n
>>> 
>>>> On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <buckleyr@oclc.org> wrote:
>>>> 
>>>> We just had an event where, on our main hbase instance, the /hbase
>>>> directory got moved out from under the running system (Human error).
>>>> 
>>>> HBase was really unhappy about that, but we were able to recover it
>>>> fairly easily and get back going.
>>>> 
>>>> As far as I can tell, all the data and tables came back correct. But,
>>>> I'm pretty concerned that there may be some hidden corruption or data
>>>> loss.
>>>> 
>>>> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>>>> 
>>>> Can anyone think of anything else we should look at?
>> 
>> 
>> 
>> --
>> Best regards,
>> 
>>   - Andy
>> 
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>> 

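Editorial sketch of the createNonRecursive point raised in the thread. This is a local-filesystem analogy using only java.nio, not HBase or HDFS code; the Hadoop method named above is not called here. The class name, method names, and paths are all hypothetical. It only illustrates the difference in behavior: a create that first rebuilds missing parents will quietly produce a fresh directory tree, while a create that refuses to do so fails once the parent has been moved away.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushCreateSketch {

    // Recursive style: recreates any missing parent directories first.
    // This mirrors how a new /hbase tree could appear after the original was moved.
    static Path createRecursive(Path file) throws IOException {
        Files.createDirectories(file.getParent());
        return Files.createFile(file);
    }

    // Non-recursive style: Files.createFile does not create parents, so it
    // throws an IOException (typically NoSuchFileException) if the parent is gone.
    static Path createNonRecursive(Path file) throws IOException {
        return Files.createFile(file);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("flush-sketch");
        // "hbase" is never created under root: it stands in for the moved-away directory.
        Path flushFile = root.resolve("hbase/table/region/cf/flushfile");

        try {
            createNonRecursive(flushFile);
            System.out.println("non-recursive create succeeded");
        } catch (IOException e) {
            System.out.println("non-recursive create failed: " + e.getClass().getSimpleName());
        }

        createRecursive(flushFile);
        System.out.println("recursive create made a new tree: " + Files.exists(flushFile));
    }
}
```

Whether failing the flush is the safer outcome depends on the WAL question asked at the top of this message; the sketch does not model the WAL.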