reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1244) Group Communication does not close down properly at the end of REEF job
Date Fri, 11 Mar 2016 02:05:40 GMT

    [ https://issues.apache.org/jira/browse/REEF-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190354#comment-15190354 ]

Markus Weimer commented on REEF-1244:
-------------------------------------

Maybe we need to orchestrate the shutdown from the Driver? As in: the Driver decides in which
order to shut down network connections, and we make all of that explicit. The course of events
would be:

  # The Driver signals the Evaluator it wants to shut down, as well as that Evaluator's neighbors
in the topology.
  # The Evaluators drain their network connections to one another and send a final {{TERMINATED}}
message or such.
  # Then, they all shut down their network connections, which should yield only expected exceptions, if any.

We can then identify failure scenarios by the absence of the {{TERMINATED}} message.
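
As a rough illustration of the three steps above, here is a minimal sketch in Python (the issue itself concerns the C# implementation, so this is a language-neutral simulation, not REEF code). Everything in it is hypothetical: the {{Evaluator}} class, {{orchestrate_shutdown}}, and the in-memory queues standing in for network streams are made up for the example. Only the {{TERMINATED}} marker comes from the proposal. It assumes the Driver can run the phases in lockstep, which a real implementation would have to achieve with messages and timeouts.

```python
from collections import deque

TERMINATED = "TERMINATED"  # final marker named in the proposal above


class Evaluator:
    """Toy stand-in for an Evaluator; links are in-memory queues, not sockets."""

    def __init__(self, name):
        self.name = name
        self.peers = {}               # peer name -> Evaluator
        self.inbox = {}               # peer name -> messages received from that peer
        self.open_links = set()       # peers we still hold a connection to
        self.terminated_from = set()  # peers whose TERMINATED marker we have seen
        self.processed = []           # (peer, message) pairs drained before closing

    def connect(self, other):
        """Create a bidirectional link between self and other."""
        for a, b in ((self, other), (other, self)):
            a.peers[b.name] = b
            a.inbox[b.name] = deque()
            a.open_links.add(b.name)

    def send(self, peer_name, message):
        if peer_name not in self.open_links:
            raise ConnectionError(f"{self.name}: link to {peer_name} is closed")
        self.peers[peer_name].inbox[self.name].append(message)

    def announce_termination(self):
        """Step 2, first half: tell every neighbor that nothing more will follow."""
        for peer_name in sorted(self.open_links):
            self.send(peer_name, TERMINATED)

    def drain(self):
        """Step 2, second half: consume all pending messages, noting TERMINATED markers."""
        for peer_name, queue in self.inbox.items():
            while queue:
                message = queue.popleft()
                if message == TERMINATED:
                    self.terminated_from.add(peer_name)
                else:
                    self.processed.append((peer_name, message))

    def close(self):
        """Step 3: close all links and return the peers that never sent TERMINATED.

        Calling close() a second time is harmless: no links are left to close.
        """
        missing = self.open_links - self.terminated_from
        self.open_links.clear()
        return missing


def orchestrate_shutdown(evaluators, crashed=()):
    """Driver side (step 1 is implicit): run the phases in order on all live Evaluators.

    `crashed` names Evaluators that never respond, to simulate a failure.
    Returns, per live Evaluator, the neighbors it did not hear TERMINATED from.
    """
    live = [e for e in evaluators if e.name not in crashed]
    for e in live:
        e.announce_termination()
    for e in live:
        e.drain()
    return {e.name: sorted(e.close()) for e in live}
```

With a chain e1, e2, e3 and one data message in flight from e1 to e2, {{orchestrate_shutdown}} should report no missing peers and e2 should still have processed the message. If e2 is listed in {{crashed}}, its neighbors report it as missing, which is the failure signal described above.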

I'm making this up as I go, maybe [~bgchun] can help, given that he knows infinitely more
about network protocols than me?

> Group Communication does not close down properly at the end of REEF job
> -----------------------------------------------------------------------
>
>                 Key: REEF-1244
>                 URL: https://issues.apache.org/jira/browse/REEF-1244
>             Project: REEF
>          Issue Type: Bug
>          Components: GroupCommunications
>    Affects Versions: 0.13
>         Environment: C#
>            Reporter: Dhruv Mahajan
>            Assignee: Dhruv Mahajan
>             Fix For: 0.13
>
>
> Currently, when we want to shut down an evaluator, the Dispose() function of group communications
is called. However, a race condition can occur. For example, suppose evaluator
e1 calls Dispose() and closes the stream it shares with evaluator e2. If e2 is in the ReadAsync() function
of that stream at the time, the read fails, because Dispose() has not yet been called in e2. Moreover,
the Dispose() function in e2 will then try to close the already closed stream again.
> Some of these scenarios are handled by catching exceptions and ignoring them, but others
are not caught and throw errors, which leads to the driver and the REEF job failing.
> The aim of this JIRA is to identify all these closing scenarios and handle them appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
