hadoop-yarn-issues mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
Date Tue, 16 Oct 2018 20:50:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652420#comment-16652420
] 

Eric Yang commented on YARN-8489:
---------------------------------

[~leftnoteasy] {quote}We will not support notebook and distributed TF job running in the service.
I don't hear open source community like jupyter has support of this (connecting to a running
distributed TF job and use it as executor). And I didn't see TF claims to support this or
plan to support.{quote}

Jupyter notebook is part of the official TensorFlow Docker image, and this is [explained|https://www.tensorflow.org/extend/architecture]
in the official [distributed TensorFlow|https://www.tensorflow.org/deploy/distributed] documentation.


Here is an example of how to run distributed TensorFlow with a Jupyter notebook as a YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/_HOST@EXAMPLE.COM",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}
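Assuming the spec above is saved as tensorflow-service.json (file name illustrative), the service could be launched and inspected with the YARN service CLI:

{code}
# Launch the service from the JSON spec above (file name illustrative)
yarn app -launch tensorflow-service tensorflow-service.json

# Check the state of the service and its component instances
yarn app -status tensorflow-service

# Tear the service down when finished
yarn app -destroy tensorflow-service
{code}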

ps.py (imports and cluster definition added for completeness; hostnames are illustrative):
{code}
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "ps", "One of 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within the job")
FLAGS = flags.FLAGS

# Hostnames are illustrative; in a YARN service they would be the
# component instances' DNS names.
cluster = tf.train.ClusterSpec({"ps": ["ps-0.example.com:2222"],
                                "worker": ["worker7.example.com:2222"]})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}

In the Jupyter notebook, the user can write code on the fly:
{code}
import tensorflow as tf

# train_op is the user's training operation, defined earlier in the notebook
with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook, without going through the ps/worker setup on
every iteration?  The only thing the user needs to write is worker.py, which is use-case driven.
 Am I missing something?

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> Existing YARN service supports termination policies for the different restart policies. For
example, ALWAYS means the service will not be terminated, and NEVER means the service will be
terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better name. But
put simply, it means a dominant component whose final state determines the job's final state,
regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a final
state, whether it succeeded or failed, we should terminate ps/tensorboard/workers and mark the
job succeeded/failed accordingly. 
> 2) Not sure if it is a real-world use case: a service which has multiple components, some of
which are not restartable. For such services, if such a component fails, we should mark
the whole service as failed. 
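The "dominant component" semantics described in the use cases above amount to: the service's final state simply follows the dominant component's final state, regardless of the others. A minimal Python sketch (function and state names are hypothetical, not part of any YARN API):

{code}
# Hypothetical sketch of the "dominant component" termination rule;
# names are illustrative, not part of the YARN service API.
def service_final_state(component_states, dominant):
    """Once the dominant component reaches a final state, the whole
    service takes that state, regardless of the other components."""
    state = component_states[dominant]
    if state in ("SUCCEEDED", "FAILED"):
        return state   # terminate remaining components, mark service with this state
    return None        # dominant still running: service keeps running

states = {"master": "SUCCEEDED", "ps": "RUNNING", "tensorboard": "RUNNING"}
print(service_final_state(states, "master"))  # -> SUCCEEDED
{code}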



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

