From: Todd Lipcon
Date: Sun, 29 Aug 2010 13:29:19 -0700
Subject: Re: HBase Regionserver Behavior on Failing Hardware
To: user@hbase.apache.org

Hey Nathan,

I just filed a JIRA to attack this general problem:
https://issues.apache.org/jira/browse/HBASE-2940

I think we'll see issues like this more
and more as people start to run HBase on larger and larger clusters.

Thanks
-Todd

On Sun, Aug 29, 2010 at 12:37 PM, Nathan Harkenrider <nathan.harkenrider@gmail.com> wrote:

> Hello,
>
> I've run into an interesting HBase failover scenario recently and am
> seeking some advice on how to work around the problem.
>
> First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a 70
> node cluster. One of the nodes in the cluster appears to have a bad disk
> or disk controller. Hadoop identified the failing node and marked it as
> dead in the HDFS admin page as well as the jobtracker. The node has not
> completely failed since I can ping it, but ssh connections are failing.
> The regionserver process on this same node has apparently not completely
> failed either. The HBase master still thinks it is alive, and the node is
> registered in Zookeeper. Clients hitting regions hosted on this particular
> region server are hanging/timing out, which is less than ideal. Any
> thoughts on how to configure HBase to be more sensitive to this type of
> error? Also, is there any way short of restarting HBase that I can force
> these regions to be reassigned to another regionserver if I don't have
> physical access (or remote console) to stop the regionserver process on
> the failing node?
>
> The master did not report any errors in its log related to the failing
> node. I'm currently waiting on operations to get me the regionserver logs
> if they can be recovered.
>
> Regards,
>
> Nathan Harkenrider

-- 
Todd Lipcon
Software Engineer, Cloudera