reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (REEF-1679) Evaluator shouldn't go to recovery mode if there is no reconnect logic provided
Date Wed, 30 Nov 2016 01:56:58 GMT

     [ https://issues.apache.org/jira/browse/REEF-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julia reassigned REEF-1679:
---------------------------

    Assignee: Julia

> Evaluator shouldn't go to recovery mode if there is no reconnect logic provided
> -------------------------------------------------------------------------------
>
>                 Key: REEF-1679
>                 URL: https://issues.apache.org/jira/browse/REEF-1679
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF-Common, REEF.NET Evaluator
>            Reporter: Mariia Mykhailova
>            Assignee: Julia
>
> Current behavior of the .NET Evaluator is as follows: if the evaluator fails to send a heartbeat
> to the driver 3 times in a row (which takes about 8 seconds), it considers the driver dead or
> unreachable and enters recovery mode. However, if the code doesn't provide logic for handling
> reconnects, {{IDriverConnection}} uses the default implementation {{MissingDriverConnection}},
> which promptly throws {{NotImplementedException}}. The evaluator keeps trying to send heartbeats,
> which (already in recovery mode) keep throwing the exception, so the evaluator loses any chance
> to reconnect to the driver and hangs indefinitely.
> We should fix this by checking whether a non-default implementation is bound for
> {{IDriverConnection}}. If there is one, we should enter recovery mode as before. If there
> is none, there's no point going into recovery; instead we should try to talk to
> the driver some more, and then fail the evaluator to avoid wasting resources.
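
The proposed check could be sketched roughly as follows. This is a language-neutral illustration in Python, not the REEF.NET code: HeartbeatManager, EvaluatorState, ReconnectingDriverConnection, and both thresholds are hypothetical names and values chosen for the example. Only the default-versus-custom connection distinction and the "3 failed heartbeats" figure come from the issue text.

```python
from enum import Enum


class EvaluatorState(Enum):
    RUNNING = "running"
    RECOVERY = "recovery"
    FAILED = "failed"


class MissingDriverConnection:
    """Stand-in for the default binding: it has no reconnect logic."""

    def get_driver_information(self):
        raise NotImplementedError("no reconnect logic provided")


class ReconnectingDriverConnection:
    """Hypothetical user-supplied binding that knows how to find the driver."""

    def get_driver_information(self):
        return "driver-host:port"  # placeholder value for the sketch


class HeartbeatManager:
    # The first limit mirrors the "3 times in a row" from the issue.
    # The second is an assumption: how many extra attempts to make
    # before giving up when no reconnect logic is bound.
    MAX_FAILURES_BEFORE_RECOVERY = 3
    EXTRA_ATTEMPTS_WITHOUT_RECONNECT = 3

    def __init__(self, driver_connection):
        self.driver_connection = driver_connection
        self.failed_heartbeats = 0
        self.state = EvaluatorState.RUNNING

    def can_reconnect(self):
        # Recovery only makes sense with a non-default implementation.
        return not isinstance(self.driver_connection, MissingDriverConnection)

    def on_heartbeat_success(self):
        self.failed_heartbeats = 0

    def on_heartbeat_failure(self):
        self.failed_heartbeats += 1
        if self.state is not EvaluatorState.RUNNING:
            return self.state
        if self.failed_heartbeats < self.MAX_FAILURES_BEFORE_RECOVERY:
            return self.state
        if self.can_reconnect():
            # Reconnect logic exists: enter recovery mode as before.
            self.state = EvaluatorState.RECOVERY
        elif self.failed_heartbeats >= (
            self.MAX_FAILURES_BEFORE_RECOVERY
            + self.EXTRA_ATTEMPTS_WITHOUT_RECONNECT
        ):
            # No reconnect logic: stop retrying and fail the evaluator
            # instead of hanging in recovery.
            self.state = EvaluatorState.FAILED
        return self.state
```

With the default connection, the sketch keeps retrying past the third failure and then fails the evaluator; with a custom connection, the third failure moves it into recovery mode, matching the current behavior.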



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
