Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTi=Ozy+hBshqDGZC1nvL-B7hUAhMGOhWaGytofSt@mail.gmail.com>
References: <AANLkTi=Ozy+hBshqDGZC1nvL-B7hUAhMGOhWaGytofSt@mail.gmail.com>
From: Todd Lipcon <todd@cloudera.com>
Date: Sun, 29 Aug 2010 13:29:19 -0700
Message-ID: <AANLkTi=zeYsC6w5BQHSPpGQyWbX3WOYo_PdhFh3xHBy7@mail.gmail.com>
Subject: Re: HBase Regionserver Behavior on Failing Hardware
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=0016361e8838050900048efc3519

--0016361e8838050900048efc3519
Content-Type: text/plain; charset=ISO-8859-1

Hey Nathan,

I just filed a JIRA to attack this general problem:
https://issues.apache.org/jira/browse/HBASE-2940

I think we'll see issues like this more and more as people start to run
HBase on larger and larger clusters.
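
For anyone who needs a stopgap in the meantime, here is a minimal sketch of the
general idea: treat a node as unhealthy when a real request does not complete
within a deadline, rather than trusting ping or a ZooKeeper session alone. It
is plain Java and touches no HBase API. The class name HungNodeProbe, the
method isResponsive, and the timeout values are made up for illustration, and
the probe you pass in (for example, a small read against a region on the
suspect server) is an assumption you would have to supply yourself. It is not
what the JIRA proposes, just one way to detect this kind of half-dead node
from the outside.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HBase or Hadoop.
public class HungNodeProbe {

    /**
     * Runs the probe on a separate thread and returns false if the probe
     * throws, returns false, or does not finish within timeoutMillis.
     */
    public static boolean isResponsive(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hung-node-probe");
            t.setDaemon(true); // a stuck probe must not keep the JVM alive
            return t;
        });
        Future<Boolean> result = executor.submit(probe);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            // Too slow or failed outright: either way the node is not serving.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            result.cancel(true);    // interrupt the probe if it is still running
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for real probes: one answers at once, one hangs.
        boolean fast = isResponsive(() -> true, 500);
        boolean hung = isResponsive(() -> { Thread.sleep(5_000); return true; }, 200);
        System.out.println("fast probe responsive: " + fast);
        System.out.println("hung probe responsive: " + hung);
    }
}
```

The point of the timeout is that a node in this state tends to accept
connections and then never answer, so a check that only looks for a refused
connection or a missing heartbeat will keep reporting it as alive.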

Thanks
-Todd

On Sun, Aug 29, 2010 at 12:37 PM, Nathan Harkenrider <
nathan.harkenrider@gmail.com> wrote:

> Hello,
>
> I've run into an interesting HBase failover scenario recently and am
> seeking some advice on how to work around the problem.
>
> First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a
> 70-node cluster. One of the nodes in the cluster appears to have a bad disk
> or disk controller. Hadoop identified the failing node and marked it as
> dead in the HDFS admin page as well as the jobtracker. The node has not
> completely failed, since I can ping it, but ssh connections are failing.
> The regionserver process on this same node has apparently not completely
> failed either. The HBase master still thinks it is alive, and the node is
> registered in ZooKeeper. Clients hitting regions hosted on this particular
> regionserver are hanging or timing out, which is less than ideal. Any
> thoughts on how to configure HBase to be more sensitive to this type of
> error? Also, is there any way, short of restarting HBase, that I can force
> these regions to be reassigned to another regionserver if I don't have
> physical access (or remote console) to stop the regionserver process on the
> failing node?
>
> The master did not report any errors in its log related to the failing
> node. I'm currently waiting on operations to get me the regionserver logs,
> if they can be recovered.
>
> Regards,
>
> Nathan Harkenrider
>


-- 
Todd Lipcon
Software Engineer, Cloudera

--0016361e8838050900048efc3519--