hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
Date Thu, 16 Jul 2015 07:17:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348 ]

Sunil G commented on YARN-3535:

Hi [~rohithsharma] and [~peng.zhang]
After seeing this patch, I feel there may be a synchronization problem. Please correct me if
I am wrong.
In the ContainerRescheduledTransition code, it is used like this:
+      container.eventHandler.handle(new ContainerRescheduledEvent(container));
+      new FinishedTransition().transition(container, event);
Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will process {{recoverResourceRequestForContainer}}
in a separate thread. Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked
and will proceed to close out this container. If the Scheduler dispatcher is slow
in processing because of a long pending event queue, there is a chance that recoverResourceRequest
runs only after the container has been closed out, so the recovered request may not be correct.
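
To make the ordering concern above concrete, here is a minimal, deterministic sketch. It does not use any real YARN classes: {{Container}}, the in-memory queue standing in for the Scheduler dispatcher, and the log strings are all hypothetical names chosen for illustration. It only shows that a handler which is enqueued before an inline transition can still observe the state left behind by that transition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class RescheduleRaceSketch {
    /** Hypothetical stand-in for a container; not the real RMContainerImpl. */
    static final class Container {
        boolean finished = false;
    }

    /** Replays the ordering described above and returns what happened, in order. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Hypothetical stand-in for the scheduler's asynchronous event queue.
        Queue<Runnable> schedulerQueue = new ArrayDeque<>();
        Container container = new Container();

        // Step 1: the reschedule event is fired. The handler is only enqueued here;
        // it has not run yet.
        schedulerQueue.add(() -> log.add(container.finished
                ? "recover: container already finished"
                : "recover: container still live"));

        // Step 2: the finished transition runs inline, right after the event is fired.
        container.finished = true;
        log.add("finished transition done");

        // Step 3: the dispatcher gets around to the queued event later.
        while (!schedulerQueue.isEmpty()) {
            schedulerQueue.poll().run();
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the recovery handler always sees the container as already finished. In the real code the interleaving would depend on how busy the dispatcher is, which is exactly why the result may vary.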

I feel we can introduce a new event in {{RMContainerImpl}} that moves it from ALLOCATED to a WAIT_FOR_REQUEST_RECOVERY
state. The scheduler can then fire back an event to {{RMContainerImpl}} indicating that recovery of the resource
request is complete. This can move the state forward to KILLED in {{RMContainerImpl}}.
Please share your thoughts.
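
A rough sketch of the transitions I have in mind, as a standalone toy state machine rather than a change to the real {{RMContainerImpl}}. The state names ALLOCATED, WAIT_FOR_REQUEST_RECOVERY and KILLED are the ones mentioned above; the event names KILL and REQUEST_RECOVERED and the {{next}} helper are hypothetical and only for illustration.

```java
public class RecoveryStateSketch {
    enum State { ALLOCATED, WAIT_FOR_REQUEST_RECOVERY, KILLED }

    /** Hypothetical event names; the real patch may name them differently. */
    enum Event { KILL, REQUEST_RECOVERED }

    /** Returns the next state, or throws if the event is not valid in the current state. */
    static State next(State current, Event event) {
        switch (current) {
            case ALLOCATED:
                // Killing at ALLOCATED does not finish the container right away;
                // it waits for the scheduler to restore the resource request.
                if (event == Event.KILL) {
                    return State.WAIT_FOR_REQUEST_RECOVERY;
                }
                break;
            case WAIT_FOR_REQUEST_RECOVERY:
                // Only the scheduler's acknowledgement moves the container to KILLED.
                if (event == Event.REQUEST_RECOVERED) {
                    return State.KILLED;
                }
                break;
            default:
                break;
        }
        throw new IllegalStateException("Invalid event " + event + " at " + current);
    }

    public static void main(String[] args) {
        State s = State.ALLOCATED;
        s = next(s, Event.KILL);
        System.out.println(s);
        s = next(s, Event.REQUEST_RECOVERED);
        System.out.println(s);
    }
}
```

With this shape the container cannot reach KILLED until the scheduler has acknowledged the recovery, so the two steps are ordered by the state machine instead of by dispatcher timing.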

>  ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch,
YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
> During a rolling update of the NM, the AM failed to start a container on the NM.
> The job then hung.
> AM logs are attached.
