hadoop-yarn-issues mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
Date Tue, 16 Oct 2018 18:59:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652248#comment-16652248 ]

Eric Yang commented on YARN-8489:

[~leftnoteasy] {quote}master and ps are not depends on each other for launch time{quote}

The statement is correct for launch time, but it does not hold for Tensorflow run time.  For master
(jupyter notebook) to send any workload to a parameter server, the parameter server must be running.
There is an implicit dependency here: master can be defined as depending on ps to improve usability.
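
A minimal sketch of how that dependency could be declared, assuming the `dependencies` field of a component in the YARN service spec. The service name, container counts, resource sizes, and launch commands are placeholders, and `launch_order` is a hypothetical helper that only illustrates the resulting start order. It is not part of YARN.

```python
import json


def build_service_spec():
    """Build an illustrative service spec in which master depends on ps."""
    master = {
        "name": "master",
        "number_of_containers": 1,
        "launch_command": "start-notebook.sh",  # placeholder command
        "resource": {"cpus": 1, "memory": "2048"},
        "dependencies": ["ps"],  # master starts only after ps is ready
    }
    ps = {
        "name": "ps",
        "number_of_containers": 2,
        "launch_command": "start-ps.sh",  # placeholder command
        "resource": {"cpus": 1, "memory": "2048"},
    }
    return {"name": "tf-job", "version": "1.0.0", "components": [master, ps]}


def launch_order(spec):
    """Hypothetical helper: order components so dependencies start first."""
    pending = {
        c["name"]: set(c.get("dependencies", [])) for c in spec["components"]
    }
    order = []
    while pending:
        # A component is ready once everything it depends on has been ordered.
        ready = sorted(n for n, deps in pending.items() if deps <= set(order))
        if not ready:
            raise ValueError("dependency cycle between components")
        for name in ready:
            order.append(name)
            del pending[name]
    return order


if __name__ == "__main__":
    spec = build_service_spec()
    print(json.dumps(spec, indent=2))
    print(launch_order(spec))
```

With this declaration master is listed first in the spec but is ordered after ps, which matches the run-time requirement described above.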

{quote}And once ps failed, we should mark job is failed as well.{quote}

The parameter server is on the critical path, but it is not entirely true that we want to abort
the service when one ps fails.  The running job needs to be terminated, but mapping a Tensorflow
task to a YARN container is a problematic design.  This is my main concern with the submarine
implementation of Tensorflow.  In particular, people sitting in front of the jupyter notebook can
observe that a parameter server has failed, switch to other parameter servers, and continue to
work.  It would be a bad user experience if the jupyter notebook and all work suddenly disappeared
when one ps failed.  It may be nice to have a method to clean up the service when
the single critical component has failed.  With yarn app -destroy, cleanup can happen at
the time the user is ready to make a change, instead of all state being lost right away to keep
the system clean.  Neither the dominant component logic nor the plugin approach is the right way
to address the design problem in the submarine working model: the AM state machine is currently
incomplete, so any plugin that overrides the AM state machine seems like pouring gas on flames.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
> Existing YARN service supports termination policies for different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better names. Put simply, it means a dominant component whose final state determines the job's final state regardless of other components.
> Use cases:
> 1) A Tensorflow job has master/worker/services/tensorboard. Once master goes to a final state, no matter whether it succeeded or failed, we should terminate ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service which has multiple components, some of which are not restartable. For such services, if a component fails, we should mark the whole service as failed.
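
The "dominant" rule proposed in the issue description above can be sketched roughly as follows. This is only an illustration of the proposal as worded there, not YARN code: the `State` enum, `service_final_state`, and `components_to_terminate` are hypothetical names introduced for the example.

```python
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


FINAL_STATES = {State.SUCCEEDED, State.FAILED}


def service_final_state(states: Dict[str, State], dominant: str) -> Optional[State]:
    """Return the dominant component's final state, or None while it still runs.

    Other components' states are deliberately ignored, as in the proposal.
    """
    dominant_state = states[dominant]
    return dominant_state if dominant_state in FINAL_STATES else None


def components_to_terminate(states: Dict[str, State], dominant: str) -> List[str]:
    """List the still-running components to stop once the dominant one is final."""
    if states[dominant] not in FINAL_STATES:
        return []
    return sorted(
        name
        for name, state in states.items()
        if name != dominant and state is State.RUNNING
    )


if __name__ == "__main__":
    states = {
        "master": State.SUCCEEDED,
        "ps": State.RUNNING,
        "worker": State.FAILED,
        "tensorboard": State.RUNNING,
    }
    print(service_final_state(states, "master"))
    print(components_to_terminate(states, "master"))
```

Under this rule a failed ps does not by itself end the service while master is still running. Only master reaching a final state does.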

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
