hadoop-yarn-issues mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
Date Tue, 16 Oct 2018 22:25:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652547#comment-16652547 ]

Eric Yang commented on YARN-8489:

[~leftnoteasy] The notebook can communicate with the PS or the workers over gRPC in the same way.
The example was showing gRPC access to a worker rather than assuming that the notebook is the PS.
The PS helps build the tasks that the workers are going to execute, so that they run more efficiently.
The data scientist specifies the cluster spec in the notebook, and the parameter server partitions
the model and tasks to increase the workers' effectiveness.
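
For illustration, a TensorFlow-style cluster spec is a mapping from job name to a list of
host:port addresses. The following is a minimal sketch in plain Python that does not call
TensorFlow itself; the host names and the grpc_target helper are hypothetical and are not
taken from this discussion.

```python
# Hypothetical cluster spec: job name -> list of "host:port" task addresses.
# The host names are placeholders, not values from the JIRA discussion.
cluster_spec = {
    "ps": ["ps-0.example.com:2222"],
    "worker": ["worker-0.example.com:2222", "worker-1.example.com:2222"],
}


def grpc_target(spec, job, task_index):
    """Return the grpc:// target for one task in the spec (hypothetical helper)."""
    return "grpc://" + spec[job][task_index]


# A notebook can address a worker task or a ps task in the same way.
worker_target = grpc_target(cluster_spec, "worker", 0)
ps_target = grpc_target(cluster_spec, "ps", 0)
```

In this sketch the notebook is just another gRPC client, so it makes no assumption about
which task plays the PS role.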

We have digressed from the original goal of this JIRA.  My point is that dependency expressions and
a refined YARN service state machine can achieve what you are proposing to do with an additional
switch.  An additional switch may have unforeseen consequences for existing operations.  For example,
what happens if the dominant component is offline during an upgrade?  Should the service terminate
and clean up?  What about flexing the dominant component down to fewer nodes?  In what order are the
dominant component and component dependencies evaluated?  How should the restart policy be handled
in place of the dominant component?  It would be helpful to draw a state diagram that explains the
proposal, to see whether this idea is worth pursuing.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
> The existing YARN service supports a termination policy for different restart policies. For
example, ALWAYS means the service will not be terminated, and NEVER means that if all components
have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better name. Put
simply, it means a component whose final state determines the job's final state
regardless of the other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a final state,
whether succeeded or failed, we should terminate ps/tensorboard/workers and then
mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple components, some
of which are not restartable. For such services, if a component fails, we should mark
the whole service as failed. 
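
The rule in the description can be sketched as follows. This is a minimal Python model, not
YARN service code: the State enum, the function names, and the idea of naming one component
as "dominant" in a dict of states are all assumptions made for illustration. It leaves open
the questions about upgrade, flex, dependency ordering, and restart policy raised in the
comment.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


FINAL_STATES = {State.SUCCEEDED, State.FAILED}


def service_final_state(component_states, dominant):
    """Return the service's final state, or None while the service keeps running.

    component_states: dict of component name -> State
    dominant: name of the dominant component (hypothetical setting)
    """
    state = component_states[dominant]
    # Only the dominant component's state matters; other components are ignored.
    return state if state in FINAL_STATES else None


def components_to_terminate(component_states, dominant):
    """List non-dominant components still running once the dominant one is final."""
    if service_final_state(component_states, dominant) is None:
        return []
    return sorted(
        name
        for name, state in component_states.items()
        if name != dominant and state not in FINAL_STATES
    )


# Use case 1: the master has finished, so the remaining components are stopped.
states = {
    "master": State.SUCCEEDED,
    "ps": State.RUNNING,
    "worker": State.FAILED,
    "tensorboard": State.RUNNING,
}
final = service_final_state(states, "master")
to_stop = components_to_terminate(states, "master")
```

Here the failed worker does not change the outcome: the job is marked with the master's
final state, and only ps and tensorboard still need to be terminated.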
