Date: Sat, 12 Jul 2014 03:05:05 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-11488) cancelTasks in SubprocedurePool can hang during task error

    [ https://issues.apache.org/jira/browse/HBASE-11488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059611#comment-14059611 ]

Hudson commented on HBASE-11488:
--------------------------------

SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #368 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/368/])
HBASE-11488 cancelTasks in SubprocedurePool can hang during task error (Jerry He) (apurtell: rev 35745021924b7c9c050a57f8b6723759c4aedd79)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/snapshot/RegionServerSnapshotManager.java

> cancelTasks in SubprocedurePool can hang during task error
> ----------------------------------------------------------
>
>                 Key: HBASE-11488
>                 URL: https://issues.apache.org/jira/browse/HBASE-11488
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.1, 0.99.0, 0.98.3
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Minor
>             Fix For: 0.99.0, 0.98.4, 2.0.0
>
>         Attachments: HBASE-11488-0.98.patch, HBASE-11488-master.patch
>
>
> During a snapshot on the region server side, if one RegionSnapshotTask throws an exception, we cancel the other tasks.
> In RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
> {code}
>       LOG.debug("Waiting for local region snapshots to finish.");
>       int sz = futures.size();
>       try {
>         // Using the completion service to process the futures that finish first first.
>         for (int i = 0; i < sz; i++) {
>           Future f = taskPool.take();
>           f.get();
>           if (!futures.remove(f)) {
>             LOG.warn("unexpected future" + f);
>           }
>           LOG.debug("Completed " + (i+1) + "/" + sz + " local region snapshots.");
>         }
>         LOG.debug("Completed " + sz + " local region snapshots.");
>         return true;
>       } catch (InterruptedException e) {
>         LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
>         if (!stopped) {
>           Thread.currentThread().interrupt();
>           throw new ForeignException("SnapshotSubprocedurePool", e);
>         }
>         // we are stopped so we can just exit.
>       } catch (ExecutionException e) {
>         if (e.getCause() instanceof ForeignException) {
>           LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
>           throw (ForeignException)e.getCause();
>         }
>         LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
>         throw new ForeignException(name, e.getCause());
>       } finally {
>         cancelTasks();
>       }
> {code}
> If f.get() throws an ExecutionException (for example, caused by a NotServingRegionException), we call cancelTasks().
> In cancelTasks():
> {code}
> ...
>       // evict remaining tasks and futures from taskPool.
>       while (!futures.isEmpty()) {
>         // block to remove cancelled futures;
>         LOG.warn("Removing cancelled elements from taskPool");
>         futures.remove(taskPool.take());
>       }
> {code}
> For example, suppose we have 3 tasks and the first one fails, so we get an exception when we do:
> {code}
>           Future f = taskPool.take();
>           f.get();
> {code}
> We have not yet removed 'f' from the 'futures' list, but we have already taken one future from taskPool.
> As a result, there are 3 entries in the 'futures' list, but only 2 remain in taskPool.
> We will block on the third taskPool.take() in the cancelTasks() code above.
> The end result is that the procedure always fails with a timeout exception.
> We could have bailed out earlier with the real cause.
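
The mismatch described in the issue can be reproduced outside HBase with plain java.util.concurrent classes. The sketch below is illustrative only and is not the committed patch: the class name CancelTasksSketch, the method countsAfterFirstFailure(), and the simulated tasks are hypothetical. It assumes that 'taskPool' behaves like an ExecutorCompletionService. It submits 3 tasks, lets the first one fail, skips the futures.remove(f) call just as the quoted code does when f.get() throws, and then drains the pool with the non-blocking poll() instead of take(), so that the demonstration itself cannot hang.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CancelTasksSketch {

  /**
   * Returns {number of futures still tracked, number of completions that
   * could still be drained from the pool} after the first task has failed.
   */
  static int[] countsAfterFirstFailure() {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Hypothetical stand-ins for RegionSnapshotTask: one fails at once,
      // the other two stay busy until the latch is released.
      Callable<Void> failing = () -> {
        throw new IllegalStateException("simulated region failure");
      };
      Callable<Void> slow = () -> {
        release.await();
        return null;
      };
      futures.add(taskPool.submit(failing));
      futures.add(taskPool.submit(slow));
      futures.add(taskPool.submit(slow));

      // Same order of operations as the quoted loop: take, then get.
      Future<Void> f = taskPool.take();
      try {
        f.get();
        futures.remove(f); // not reached: get() throws first
      } catch (ExecutionException e) {
        // 'f' has left taskPool but is still in 'futures'.
      }

      // Cancel everything, as cancelTasks() does before evicting.
      for (Future<Void> task : futures) {
        task.cancel(false);
      }

      // Drain without blocking. A take() per tracked future would need
      // three completions here, but only two are left to hand out.
      int drained = 0;
      while (taskPool.poll() != null) {
        drained++;
      }
      return new int[] { futures.size(), drained };
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    int[] counts = countsAfterFirstFailure();
    System.out.println("tracked futures: " + counts[0]
        + ", completions left in pool: " + counts[1]);
  }
}
```

One way to avoid the hang, under the same assumptions, is to stop driving the eviction loop from futures.isEmpty(): clear the list and drain with poll() until it returns null, so that the eviction never waits for a completion that has already been consumed.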