reef-dev mailing list archives

From "Pei Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1949) Closing ThreadPoolStage before tasks are finished
Date Thu, 14 Dec 2017 01:34:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290203#comment-16290203 ]

Pei Jiang commented on REEF-1949:
---------------------------------

To follow up, if the driver spends non-negligible time doing cleanup, 
{code}
        public void OnNext(ICompletedTask completedTask)
        {
            Logger.Log(Level.Info, "Received ICompletedTask {0}.", completedTask.Id);
            DisposeActivieContext();
            Cleanup();
        }

        private void DisposeActivieContext()
        {
            lock (_lock)
            {
                _context.Dispose();
            }
        }

        private void Cleanup()
        {
            Logger.Log(Level.Info, "Shutting down driver...");
            lock (_lock)
            {
                Thread.Sleep(5000);
                if (_disposed)
                {
                    return;
                }
                _disposed = true;
            }
        }
{code}
the EvaluatorCompleted event will not be handled in time, because the single-threaded
executor is still busy processing the cleanup. Once the 1-second timeout is reached, the
executor is shut down forcibly and the queued EvaluatorCompleted event task is dropped.
Any ideas on how to fix this?
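
The behaviour described above can be reproduced outside REEF with a plain
java.util.concurrent executor. This is only a minimal sketch under stated
assumptions: it does not use ThreadPoolStage or any REEF class, the class name
ShutdownRaceSketch and the method closeWithTimeout are hypothetical, and the two
submitted tasks are stand-ins for the slow cleanup handler and the queued
EvaluatorCompleted dispatch. It is not a proposed fix.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-alone sketch; not REEF code.
final class ShutdownRaceSketch {

    /**
     * Queues a slow "cleanup" task and a second task behind it on a
     * single-threaded executor, then closes the executor with a bounded wait.
     * Returns how many queued tasks never started.
     */
    static int closeWithTimeout(long cleanupMs, long timeoutMs, AtomicBoolean secondTaskRan) {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Stand-in for the handler that spends a long time in cleanup.
        executor.submit(() -> {
            try {
                Thread.sleep(cleanupMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stand-in for the EvaluatorCompleted dispatch waiting in the queue.
        executor.submit(() -> secondTaskRan.set(true));

        // Orderly shutdown: already-queued tasks may still run, new ones are rejected.
        executor.shutdown();
        try {
            if (executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                return 0; // everything finished within the timeout
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Forced shutdown: interrupts the running task and hands back
        // the tasks that were still waiting in the queue.
        List<Runnable> dropped = executor.shutdownNow();
        return dropped.size();
    }

    public static void main(String[] args) {
        AtomicBoolean ranShort = new AtomicBoolean(false);
        int droppedShort = closeWithTimeout(500, 100, ranShort);
        System.out.println("timeout < cleanup: dropped=" + droppedShort + ", ran=" + ranShort.get());

        AtomicBoolean ranLong = new AtomicBoolean(false);
        int droppedLong = closeWithTimeout(100, 2000, ranLong);
        System.out.println("timeout > cleanup: dropped=" + droppedLong + ", ran=" + ranLong.get());
    }
}
```

In this sketch, when the wait is shorter than the cleanup, the second task is
returned by shutdownNow() without ever running. When the wait is longer than
the cleanup, both tasks complete and nothing is dropped.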

> Closing ThreadPoolStage before tasks are finished
> -------------------------------------------------
>
>                 Key: REEF-1949
>                 URL: https://issues.apache.org/jira/browse/REEF-1949
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Driver
>    Affects Versions: 0.17
>            Reporter: Pei Jiang
>         Attachments: ReefDriverDebug.zip
>
>
> In EvaluatorManager.onEvaluatorDone(),
> {code}
> // This relies on the dispatcher to call the CompletedEvaluator handler.
> this.messageDispatcher.onEvaluatorCompleted(new CompletedEvaluatorImpl(this.evaluatorId));

> // This will close the dispatcher, which in turns shut down the executor in ThreadPoolStage.
> this.close(); 
> {code}
> Since in onEvaluatorCompleted the message sending task is submitted to an executor, there
> is no guarantee that the CompletedEvaluator message will be sent before the termination of
> the executor in this.close() call. When this happens, the CompletedEvaluator handler will
> not be triggered so the driver will think that some evaluators are alive and hence hang.
> Relevant logs:
> {code}
> Nov 01, 2017 11:05:57 PM org.apache.reef.wake.impl.ThreadPoolStage close
> SEVERE: Closing ThreadPoolStage EvaluatorMessageDispatcher:container_1508975419755_0006_01_000004: Executor did not terminate in 1,000 ms. Dropping 2 tasks
> Nov 01, 2017 11:05:57 PM org.apache.reef.wake.impl.ThreadPoolStage close
> SEVERE: Closing ThreadPoolStage EvaluatorMessageDispatcher:container_1508975419755_0006_01_000004: Executor failed to terminate.
> End of LogType:driver.stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
