hadoop-yarn-issues mailing list archives

From "Gour Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8080) YARN native service should support component restart policy
Date Thu, 29 Mar 2018 23:59:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419984#comment-16419984 ]

Gour Saha commented on YARN-8080:
---------------------------------

[~leftnoteasy], a few comments:
h5. ServiceMaster.java

Remove unnecessary imports.
h5. api.records.Component.java
{code:java}
  @SerializedName("restartPolicy")
  private RestartPolicyEnum restartPolicy = RestartPolicyEnum.ALWAYS;
{code}
Change the above to:
{code:java}
  @JsonProperty("restart_policy")
  @XmlElement(name = "restart_policy")
  private RestartPolicyEnum restartPolicy = RestartPolicyEnum.ALWAYS;
{code}
This is probably why [~eyang] is not able to test your patch.
h5. ComponentInstance.java

1.
Do you want to move ProcessTerminationHandler to ServiceUtils? It is a helpful handler which
other classes might be able to use.

2.
Instead of using the keyword "finalized" we should use "finished". Finalized is being used
in the context of upgrade. So, instead of terminateServiceIfAllComponentsFinalized we can
say terminateServiceIfAllComponentsFinished.

3.
{code:java}
    if (event.getStatus() != null
        && event.getStatus().getExitStatus() == ContainerExitStatus.SUCCESS) {
      succeeded = true;
    }
{code}
I am not sure if this will cover docker containers since the exit status of a docker container
might not be the same as the exit status of the actual application running inside the container.
I haven’t tested this though. Did you get a chance to test this patch with DOCKER artifacts?

4.
{code:java}
      if (nSucceeded + nFailed < comp.getComponentSpec()
          .getNumberOfContainers()) {
        shouldTerminate = false;
        break;
      }
{code}
Will this cover the flex scenario? A component's number of containers might be flexed down
after the service has been running for some time. If the container count is flexed down, then
succeeded+failed could reach or exceed comp.getComponentSpec().getNumberOfContainers()
after the flex call, and the service would get terminated although some containers are still
healthy and running. So, will the service.component.Component succeededInstances and failedInstances
fields need to be updated on a flex event?
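
To make the flex concern concrete, here is a minimal, self-contained sketch. It is not the patch's code: FlexTerminationSketch, its State enum, and looksFinished() are hypothetical stand-ins that only mirror the counter comparison quoted above. It also assumes that a flex-down leaves the two still-running containers alone, which I have not verified against the real flex handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a component; NOT the real service.component.Component.
final class FlexTerminationSketch {
  enum State { RUNNING, SUCCEEDED, FAILED }

  private int desiredContainers;
  private final List<State> instances = new ArrayList<>();

  FlexTerminationSketch(int desiredContainers) {
    this.desiredContainers = desiredContainers;
  }

  void addInstance(State state) {
    instances.add(state);
  }

  // Models a flex call by changing only the desired container count.
  // Assumption: no running instance is released by the flex-down.
  void flex(int newDesiredContainers) {
    this.desiredContainers = newDesiredContainers;
  }

  long count(State state) {
    return instances.stream().filter(s -> s == state).count();
  }

  // Mirrors the quoted check: the component counts as finished once
  // succeeded + failed is no longer below the desired container count.
  boolean looksFinished() {
    return count(State.SUCCEEDED) + count(State.FAILED) >= desiredContainers;
  }

  public static void main(String[] args) {
    FlexTerminationSketch comp = new FlexTerminationSketch(5);
    comp.addInstance(State.SUCCEEDED);
    comp.addInstance(State.SUCCEEDED);
    comp.addInstance(State.FAILED);
    comp.addInstance(State.RUNNING);
    comp.addInstance(State.RUNNING);

    System.out.println("finished before flex: " + comp.looksFinished());
    comp.flex(3); // flex down from 5 to 3
    System.out.println("finished after flex:  " + comp.looksFinished());
    System.out.println("still running:        " + comp.count(State.RUNNING));
  }
}
```

In this sketch the counters alone report the component as finished after the flex-down while two instances are still running, which is the case I am asking about.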

5.
{code:java}
      if (shouldFailService) {
        terminationHandler.terminate(-1);
      }
      // According to component restart policy, handle container restart
      // or finish the service (if all components finalized)
      handleComponentInstanceRelaunch(compInstance, event);
{code}
Shouldn’t handleComponentInstanceRelaunch get called before the “if (shouldFailService)”
block?

> YARN native service should support component restart policy
> -----------------------------------------------------------
>
>                 Key: YARN-8080
>                 URL: https://issues.apache.org/jira/browse/YARN-8080
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-8080.001.patch, YARN-8080.002.patch, YARN-8080.003.patch, YARN-8080.005.patch
>
>
> Existing native service assumes the service is long running and never finishes. Containers
> will be restarted even if exit code == 0.
> To support broader use cases, we need to allow the restart policy of a component to be specified
> by users. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of container exit status.
> This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes. This supports
> job-like workloads (for example, a Tensorflow training job). If a task exits with code == 0, we
> should not restart the task. This can be used by services which are not restartable or recoverable.
> 3) On-failure: similar to the above, but only restart a task with exit code != 0.
> Behaviors after a component *instance* finalizes (Succeeded or Failed when restart_policy
> != ALWAYS):
> 1) For single component, single instance: complete the service.
> 2) For single component, multiple instances: other running instances from the same component
> won't be affected by the finalized component instance. The service will be terminated once all
> instances are finalized.
> 3) For multiple components: the service will be terminated once all components are finalized.
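
As a reading aid, the three policies in the description above can be sketched as a small enum. This is only an illustration under assumptions: RestartPolicySketch and shouldRestart() are hypothetical names, not the patch's RestartPolicyEnum, and an exit code of 0 is taken to mean success, as the description does.

```java
// Hypothetical enum for illustration only; the patch's own type is RestartPolicyEnum.
enum RestartPolicySketch {
  ALWAYS, NEVER, ON_FAILURE;

  // Decides whether a finished container should be restarted,
  // treating exit code 0 as success (assumption taken from the description).
  boolean shouldRestart(int exitCode) {
    switch (this) {
      case ALWAYS:
        return true;            // restart regardless of exit status
      case NEVER:
        return false;           // never restart once the container finishes
      case ON_FAILURE:
        return exitCode != 0;   // restart only after a non-zero exit
      default:
        throw new IllegalStateException("unknown policy: " + this);
    }
  }

  public static void main(String[] args) {
    for (RestartPolicySketch policy : values()) {
      System.out.println(policy + ": exit 0 -> " + policy.shouldRestart(0)
          + ", exit 1 -> " + policy.shouldRestart(1));
    }
  }
}
```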



