hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11488) cancelTasks in SubprocedurePool can hang during task error
Date Sat, 12 Jul 2014 01:47:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059586#comment-14059586 ]

Hudson commented on HBASE-11488:
--------------------------------

FAILURE: Integrated in HBase-TRUNK #5289 (See [https://builds.apache.org/job/HBase-TRUNK/5289/])
HBASE-11488 cancelTasks in SubprocedurePool can hang during task error (Jerry He) (enis: rev c6ddc0336e1fe8c2ebe89db81fe7c8549de7d597)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/snapshot/RegionServerSnapshotManager.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/procedure/flush/RegionServerFlushTableProcedureManager.java


> cancelTasks in SubprocedurePool can hang during task error
> ----------------------------------------------------------
>
>                 Key: HBASE-11488
>                 URL: https://issues.apache.org/jira/browse/HBASE-11488
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.1, 0.99.0, 0.98.3
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Minor
>             Fix For: 0.99.0, 0.98.4, 2.0.0
>
>         Attachments: HBASE-11488-0.98.patch, HBASE-11488-master.patch
>
>
> During a snapshot on the region server side, if one RegionSnapshotTask throws an exception, we cancel the other tasks.
> In RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
> {code}
>       LOG.debug("Waiting for local region snapshots to finish.");
>       int sz = futures.size();
>       try {
>         // Using the completion service to process the futures that finish first first.
>         for (int i = 0; i < sz; i++) {
>           Future<Void> f = taskPool.take();
>           f.get();
>           if (!futures.remove(f)) {
>             LOG.warn("unexpected future" + f);
>           }
>           LOG.debug("Completed " + (i+1) + "/" + sz +  " local region snapshots.");
>         }
>         LOG.debug("Completed " + sz +  " local region snapshots.");
>         return true;
>       } catch (InterruptedException e) {
>         LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
>         if (!stopped) {
>           Thread.currentThread().interrupt();
>           throw new ForeignException("SnapshotSubprocedurePool", e);
>         }
>         // we are stopped so we can just exit.
>       } catch (ExecutionException e) {
>         if (e.getCause() instanceof ForeignException) {
>           LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
>           throw (ForeignException)e.getCause();
>         }
>         LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
>         throw new ForeignException(name, e.getCause());
>       } finally {
>         cancelTasks();
>       }
> {code}
> If f.get() throws an ExecutionException (for example, one caused by a NotServingRegionException), the finally block calls cancelTasks().
> In cancelTasks():
> {code}
>      ...
>      // evict remaining tasks and futures from taskPool.
>      while (!futures.isEmpty()) {
>         // block to remove cancelled futures;
>         LOG.warn("Removing cancelled elements from taskPool");
>         futures.remove(taskPool.take());
>       }
> {code}
> For example, suppose we have 3 tasks, and the first one fails so that we get an exception when we do:
> {code}
>           Future<Void> f = taskPool.take();
>           f.get();
> {code}
> At that point 'f' has already been taken from taskPool but has not yet been removed from the 'futures' list.
> As a result, there are still 3 entries in the 'futures' list, but only 2 remain in taskPool.
> cancelTasks() therefore blocks on taskPool.take() in the loop above.
> The end result is that the procedure always fails with a timeout exception.
> We could have bailed out earlier with the real cause.
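
The counting mismatch described above can be reproduced outside HBase with a plain java.util.concurrent.ExecutorCompletionService. The following is a minimal sketch, not the attached patch: the class name CancelTasksSketch, the method countAfterFailure(), and the simulated failure are all hypothetical, and the non-blocking drain via a timed poll() is only one possible way to avoid the hang, assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the subprocedure pool; not HBase code.
class CancelTasksSketch {

  /**
   * Returns {futures still tracked after the failure,
   *          completions that could still be retrieved from taskPool}.
   */
  static int[] countAfterFailure() throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch neverReleased = new CountDownLatch(1);
    try {
      // The first task fails right away; the other two block until cancelled.
      futures.add(taskPool.submit(() -> {
        throw new IllegalStateException("simulated region failure");
      }));
      for (int i = 0; i < 2; i++) {
        futures.add(taskPool.submit(() -> {
          neverReleased.await();
          return null;
        }));
      }

      // Same order as in waitForOutstandingTasks(): take(), then get().
      Future<Void> f = taskPool.take();
      try {
        f.get();
        futures.remove(f); // not reached for the failed task
      } catch (ExecutionException e) {
        // 'f' has left taskPool but is still in 'futures'.
      }
      int tracked = futures.size();

      for (Future<Void> task : futures) {
        task.cancel(true);
      }

      // Assumed alternative to a blocking take(): drain with a timed poll()
      // so a missing completion cannot block the caller indefinitely.
      int drained = 0;
      while (taskPool.poll(1, TimeUnit.SECONDS) != null) {
        drained++;
      }
      futures.clear();
      return new int[] {tracked, drained};
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    int[] counts = countAfterFailure();
    System.out.println("tracked=" + counts[0] + " drained=" + counts[1]);
  }
}
```

In this sketch three futures remain tracked while only two completions can be retrieved, so a loop of the form `while (!futures.isEmpty()) futures.remove(taskPool.take())` would block on its third take().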



--
This message was sent by Atlassian JIRA
(v6.2#6252)
