Date: Wed, 19 Sep 2012 22:27:04 -0700
Subject: Re: IOException: Cannot append; log is closed -- data lost?
From: Stack <stack@duboce.net>
To: user@hbase.apache.org
Content-Type: text/plain; charset=ISO-8859-1

On Tue, Sep 18, 2012 at 11:37 AM, Bryan Beaudreault
<bbeaudreault@hubspot.com> wrote:
> We are running cdh3u2 on a 150 node cluster, where 50 are HBase and 100 are
> map reduce.  The underlying hdfs spans all nodes.
>

So this is HBase 0.90.4 plus some patches, Bryan?

What was the issue serving data that you refer to?  What did it look like?


...
>> 12/09/18 12:34:00 INFO regionserver.HRegionServer: Waiting on 223 regions
>> to close


Looks like the HRegionServer is just waiting at the tail of its exit
until all regions are closed.


>> 12/09/18 12:34:01 ERROR regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>>         at


This is odd for sure; as though the WAL is closed but we are still
trying to take on edits.

>
> We saw this for a bunch of regions.  This may or may not be related (that's
> what I'm trying to figure out), but now we are starting to see evidence
> that some data may not have been persisted.


Any edits that came in and got 'IOException: Cannot append; log is
closed' were not persisted.


> For instance we use the result
> of an increment value as the id for another record, and we are starting to
> see the increment return values that we have already used as ids.
>

Do you know which region holds the increment?  Can you trace this
region in the master log and see where it was deployed across the
restart?  Can you paste that regionserver's log from across the restart?


> Does this exception mean that we lost data?  Is there any way to recover it
> if so?  I don't see any hlogs in the .corrupt folder and not sure how to
> proceed.
>

If you got an IOException on an Increment because the WAL could not be
written, that data did not make it into HBase, for sure.
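
To illustrate, here is a minimal sketch in plain Java of the "log first,
then apply" rule.  It is a toy model, not HBase code: SketchWal,
SketchCounter and WalSketchDemo are made-up names, and the real
regionserver write path is more involved.  It only shows why an increment
that fails at the WAL append should leave the stored counter where it was.

```java
import java.io.IOException;

/** Toy stand-in for a write-ahead log; hypothetical, NOT HBase's HLog. */
final class SketchWal {
    private boolean closed = false;

    void close() {
        closed = true;
    }

    /** Appends an edit, or throws if the log has already been closed. */
    void append(String edit) throws IOException {
        if (closed) {
            throw new IOException("Cannot append; log is closed");
        }
        // A real WAL would write and sync the edit to durable storage here.
    }
}

/** Toy counter that logs each increment before applying it. */
final class SketchCounter {
    private final SketchWal wal;
    private long value = 0;

    SketchCounter(SketchWal wal) {
        this.wal = wal;
    }

    long increment() throws IOException {
        long next = value + 1;
        wal.append("counter=" + next); // throws before any state changes
        value = next;                  // only reached if the append succeeded
        return next;
    }

    long current() {
        return value;
    }
}

final class WalSketchDemo {
    public static void main(String[] args) {
        SketchWal wal = new SketchWal();
        SketchCounter counter = new SketchCounter(wal);
        try {
            System.out.println("id = " + counter.increment());
            wal.close(); // e.g. the server has started shutting down
            System.out.println("id = " + counter.increment());
        } catch (IOException e) {
            // The failed increment was never applied to the counter.
            System.out.println("failed: " + e.getMessage()
                    + "; counter is still " + counter.current());
        }
    }
}
```

In this model a client that gets the exception has no new id to use, so
it should retry the increment rather than assume it went through.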

> One important thing to note is that we were decommissioning 50 of the 150
> datanodes at the time, using the dfs.hosts.exclude setting in the namenode.
>  I thought that was supposed to be a safe operation, so hopefully it didn't
> cause this.
>

Should be fine if done incrementally, with time in between for
replication to catch up on missing replicas.

St.Ack