Message-ID: <28623732.62141283129753530.JavaMail.jira@thor>
Date: Sun, 29 Aug 2010 20:55:53 -0400 (EDT)
From: "Ted Yu (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] Commented: (HBASE-2940) Improve behavior under partial failure of region servers
In-Reply-To: <4242416.60941283113794553.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HBASE-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904058#action_12904058 ]

Ted Yu commented on HBASE-2940:
-------------------------------

Since hbase.rootdir points to
the Hadoop NameNode, the HBase Master can poll Hadoop for the live DataNodes. If a DataNode stays down for longer than a specified duration and an RS happens to be on the same server, the Master can blacklist that RS (assuming there is also a problem with heartbeats from that RS in the same time period).

> Improve behavior under partial failure of region servers
> --------------------------------------------------------
>
>                 Key: HBASE-2940
>                 URL: https://issues.apache.org/jira/browse/HBASE-2940
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>            Reporter: Todd Lipcon
>
> On larger clusters, we often see failure cases where a server is "up" (i.e. heartbeating) but unable to actually service requests properly (or at a reasonable speed). This can happen for any number of reasons, including:
> - failing disks or disk controllers respond, but do so very slowly
> - the machine is swapping, so everything is still running but much more slowly than expected
> - HBase or the DN on the machine has been misconfigured (e.g. missing lzo libs), so it fails to correctly open regions, perform flushes, etc.
> Here are a few proposed features that are worth considering:
> 1) Add a "blacklist" or "remote shutdown" functionality to the master. This is useful if the region server is up but for some reason the admin can't ssh in to shut it down (e.g. the root disk has failed). This feature would allow the admin to issue a command that will shut down any given RS.
> 2) Periodically run a "health check" script on the region server node. If the script returns an error code, the RS could shut itself down gracefully and report an error message on the master console.
> 3) Allow clients to report back RS-specific errors to the master. This would be useful for monitoring, and we could add heuristics to automatically shut down region servers if they have an elevated error count over some period of time.

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
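
For illustration only, the blacklist rule suggested in the comment above could be sketched roughly as follows. This is not HBase code: the HostStatus record, the function names, and both thresholds are hypothetical, and the timestamps are assumed to come from whatever the Master already tracks (DataNode liveness as reported by the NameNode, and the last heartbeat from each RS).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HostStatus:
    """Hypothetical per-host view the Master would keep (not an HBase type)."""
    datanode_down_since: Optional[float]  # when the DataNode was first seen dead; None if live
    last_rs_heartbeat: float              # when the RS on this host last heartbeated


def should_blacklist(status: HostStatus, now: float,
                     max_dn_downtime: float, max_heartbeat_gap: float) -> bool:
    """Blacklist only if the DataNode has been down too long AND the
    co-located RS heartbeat is also stale, as the comment proposes."""
    if status.datanode_down_since is None:
        return False  # DataNode is live, nothing to do
    dn_down_too_long = (now - status.datanode_down_since) > max_dn_downtime
    heartbeat_stale = (now - status.last_rs_heartbeat) > max_heartbeat_gap
    return dn_down_too_long and heartbeat_stale


def hosts_to_blacklist(statuses: Dict[str, HostStatus], now: float,
                       max_dn_downtime: float = 300.0,   # assumed threshold, seconds
                       max_heartbeat_gap: float = 30.0   # assumed threshold, seconds
                       ) -> List[str]:
    """Return the sorted host names whose RS the Master would blacklist."""
    return sorted(host for host, status in statuses.items()
                  if should_blacklist(status, now, max_dn_downtime, max_heartbeat_gap))
```

Requiring both conditions keeps a healthy RS from being blacklisted merely because the DataNode on its host is down, which matches the caveat in the comment about heartbeat problems in the same time period.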