Date: Fri, 2 Jan 2015 23:36:34 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

    [ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263316#comment-14263316 ]

Hudson commented on HBASE-12028:
--------------------------------

FAILURE: Integrated in HBase-1.1 #45 (See [https://builds.apache.org/job/HBase-1.1/45/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying Shu) (enis: rev ecbdc45d3d68d83ee001a56b2735b5f5dc63b3e2)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java


> Abort the RegionServer, when it's handler threads die
> -----------------------------------------------------
>
>                 Key: HBASE-12028
>                 URL: https://issues.apache.org/jira/browse/HBASE-12028
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Sudarshan Kadambi
>            Assignee: Alicia Ying Shu
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>         Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, hbase-12028-v4.patch, hbase-12028-v5-branch-1.patch, hbase-12028-v5-master.patch, hbase-12028-v5.patch
>
>
> Over in HBASE-11813, a user identified an issue in which all the RPC handler threads would exit with StackOverflow errors due to an unchecked recursion-terminating condition. Our clusters demonstrated the same trace. While the patch posted for HBASE-11813 got our clusters to be merry again, the breakdown surfaced some larger issues.
> When the RegionServer had all of its RPC handler threads dead, it continued to have regions assigned to it. Clearly, it wouldn't be able to serve reads and writes on those regions. A second issue was that when a user tried to disable or drop a table, the master would try to communicate with the RegionServer for region unassignment. Since the same handler threads seem to be used for master <-> RS communication as well, the master ended up hanging on the RS indefinitely. Eventually, the master stopped responding to all table meta-operations.
> A handler thread should never exit, and if it does, the more prudent thing would be for the RS to abort. That way, recovery can at least be undertaken and the regions can be reassigned elsewhere. I also think that the master <-> RS communication should get its own exclusive thread pool, but I'll wait until this issue has been sufficiently discussed before opening a ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
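
The proposal in the description (abort the server once handler threads start dying, rather than keep holding regions it cannot serve) can be illustrated with a small standalone sketch. This is not the code from the attached patches: the class HandlerAbortSketch, the Abortable interface, the startHandler method and the maxFailed threshold are all hypothetical names made up for illustration, and the real RPC executor may decide when to abort quite differently.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class HandlerAbortSketch {

    /** Hypothetical stand-in for whatever hook lets the server abort itself. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Records the abort request instead of stopping a real process. */
    static final class RecordingAbortable implements Abortable {
        private volatile String reason;

        @Override
        public void abort(String why, Throwable cause) {
            reason = why;
        }

        boolean isAborted() {
            return reason != null;
        }

        String reason() {
            return reason;
        }
    }

    /**
     * Starts one handler thread that drains the queue. If a task kills the
     * handler with any Throwable (for example a StackOverflowError), the death
     * is counted, and the server is asked to abort once maxFailed handlers
     * have died. maxFailed is an assumed knob, not a real HBase setting.
     */
    static Thread startHandler(String name, BlockingQueue<Runnable> queue,
                               AtomicInteger failedHandlers, int maxFailed,
                               Abortable server) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run();
                }
            } catch (InterruptedException e) {
                // Treated as an orderly shutdown request, not a handler failure.
                Thread.currentThread().interrupt();
            } catch (Throwable e) {
                int failed = failedHandlers.incrementAndGet();
                if (failed >= maxFailed) {
                    server.abort(failed + " handler thread(s) died; last was " + name, e);
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        RecordingAbortable server = new RecordingAbortable();
        AtomicInteger failedHandlers = new AtomicInteger();

        // Simulate a request that blows the handler's stack.
        queue.add(() -> { throw new StackOverflowError("simulated"); });

        Thread handler = startHandler("handler-0", queue, failedHandlers, 1, server);
        handler.join(5000);

        System.out.println("aborted=" + server.isAborted() + " reason=" + server.reason());
    }
}
```

The point of the sketch is only that the handler's death is observed and escalated instead of passing silently. A real server would then stop itself so that its regions can be reassigned elsewhere.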