Date: Thu, 10 Jul 2014 01:55:04 +0000 (UTC)
From: "Jerry He (JIRA)"
To: dev@hbase.apache.org
Subject: [jira] [Created] (HBASE-11488) cancelTasks in SubprocedurePool can hang during task error

Jerry He created HBASE-11488:
--------------------------------

             Summary: cancelTasks in SubprocedurePool can hang during task error
                 Key: HBASE-11488
                 URL: https://issues.apache.org/jira/browse/HBASE-11488
             Project: HBase
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 0.98.3, 0.96.1, 0.99.0
            Reporter: Jerry He
            Assignee: Jerry He
            Priority: Minor

During a snapshot on the region server side, if one RegionSnapshotTask throws an exception, we cancel the other tasks.

In RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
{code}
      LOG.debug("Waiting for local region snapshots to finish.");

      int sz = futures.size();
      try {
        // Using the completion service to process the futures that finish first first.
        for (int i = 0; i < sz; i++) {
          Future f = taskPool.take();
          f.get();
          if (!futures.remove(f)) {
            LOG.warn("unexpected future" + f);
          }
          LOG.debug("Completed " + (i+1) + "/" + sz + " local region snapshots.");
        }
        LOG.debug("Completed " + sz + " local region snapshots.");
        return true;
      } catch (InterruptedException e) {
        LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
        if (!stopped) {
          Thread.currentThread().interrupt();
          throw new ForeignException("SnapshotSubprocedurePool", e);
        }
        // we are stopped so we can just exit.
      } catch (ExecutionException e) {
        if (e.getCause() instanceof ForeignException) {
          LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
          throw (ForeignException)e.getCause();
        }
        LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
        throw new ForeignException(name, e.getCause());
      } finally {
        cancelTasks();
      }
{code}

If f.get() throws an ExecutionException (for example, one caused by a NotServingRegionException), the finally block calls cancelTasks().

In cancelTasks():
{code}
...
      // evict remaining tasks and futures from taskPool.
      while (!futures.isEmpty()) {
        // block to remove cancelled futures;
        LOG.warn("Removing cancelled elements from taskPool");
        futures.remove(taskPool.take());
      }
{code}

For example, suppose we have 3 tasks, and the first one fails, so we get an exception when we do:
{code}
          Future f = taskPool.take();
          f.get();
{code}

At that point 'f' has not yet been removed from the 'futures' list, but it has already been taken from taskPool.
As a result, there are 3 entries in the 'futures' list, but only 2 remain in taskPool.
After the two remaining futures have been taken and removed, 'futures' still holds the failed one, so the loop in cancelTasks() above blocks on taskPool.take() indefinitely.

The end result is that the procedure always fails with a timeout exception.
We could have bailed out earlier with the real cause.
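
The mismatch between the 'futures' list and taskPool can be reproduced outside HBase with a plain java.util.concurrent.ExecutorCompletionService. The following is only a minimal sketch and is not the HBase code or the eventual patch: the class CancelTasksSketch, the method strandedFutures and the flag removeBeforeGet are hypothetical names made up for illustration. It also assumes a timed poll() in place of the blocking take(), so that the demo terminates instead of hanging. One possible remedy, shown by the removeBeforeGet=true path, is to remove 'f' from 'futures' before calling f.get().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from HBase.
public class CancelTasksSketch {

    /**
     * Runs three tasks, the first of which fails, then cancels and drains the rest.
     * Returns how many entries are left in 'futures' once taskPool has nothing more
     * to hand out. A non-zero result marks the point where a blocking take() would hang.
     */
    static int strandedFutures(boolean removeBeforeGet) {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
        List<Future<Void>> futures = new ArrayList<>();
        try {
            // One task fails immediately; the other two sleep until they are cancelled.
            futures.add(taskPool.submit(() -> {
                throw new IllegalStateException("simulated region failure");
            }));
            for (int i = 0; i < 2; i++) {
                futures.add(taskPool.submit(() -> {
                    Thread.sleep(60_000);
                    return null;
                }));
            }
            try {
                Future<Void> f = taskPool.take(); // the failed task completes first
                if (removeBeforeGet) {
                    futures.remove(f);            // list stays in step with taskPool
                    f.get();
                } else {
                    f.get();                      // throws, so the remove below is skipped
                    futures.remove(f);
                }
            } catch (ExecutionException e) {
                // Expected: the first task failed. Fall through to cancel the others.
            }
            for (Future<Void> f : futures) {
                f.cancel(true);
            }
            // Same drain loop as cancelTasks(), but with a timed poll so the demo ends.
            while (!futures.isEmpty()) {
                Future<Void> done = taskPool.poll(1, TimeUnit.SECONDS);
                if (done == null) {
                    break; // taskPool is empty; take() would block here indefinitely
                }
                futures.remove(done);
            }
            return futures.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("get() before remove: stranded = " + strandedFutures(false));
        System.out.println("remove before get(): stranded = " + strandedFutures(true));
    }
}
```

With get() before remove, the failed future should stay in the list after taskPool has been drained, which is the situation described above. With the order swapped, the list should empty out and the original exception can be reported right away.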