hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Thu, 12 Sep 2013 17:53:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696
] 

Xuan Gong commented on YARN-867:
--------------------------------

The new patch adds more transitions in ContainerState.EXITED_WITH_FAILURE and ContainerState.DONE.
This patch still handles AuxServicesEventType.APPLICATION_INIT and handles exceptions
at the container level.
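
As a rough illustration of handling the exception at the container level, the sketch below wraps a call into an aux service and turns any thrown exception into a failure event for a single container, so that it does not propagate to the dispatcher thread. Every name in it (AuxServiceIsolationSketch, AuxService, deliverAppInit, the event string format) is hypothetical and is not taken from the patch or from the YARN code base.

```java
import java.util.Optional;

public class AuxServiceIsolationSketch {
    /** Hypothetical stand-in for an auxiliary service callback. */
    public interface AuxService {
        void initializeApplication(String appId) throws Exception;
    }

    /**
     * Calls the service and isolates its failures.
     * Returns an empty Optional on success, or a description of the
     * failure event that would be sent to the affected container.
     */
    public static Optional<String> deliverAppInit(
            AuxService service, String appId, String containerId) {
        try {
            service.initializeApplication(appId);
            return Optional.empty();
        } catch (Exception e) {
            // Assumption: the failure is reported per container instead of
            // being rethrown into the caller (the dispatcher in the NM).
            return Optional.of(
                "CONTAINER_EXITED_WITH_FAILURE for " + containerId + ": " + e);
        }
    }

    public static void main(String[] args) {
        AuxService good = appId -> { };
        AuxService bad = appId -> { throw new IllegalStateException("bad data"); };

        System.out.println(deliverAppInit(good, "app_1", "container_1"));
        System.out.println(deliverAppInit(bad, "app_1", "container_2"));
    }
}
```

The point of the sketch is only that a non-IOException from the service ends up as an event for one container; how the real patch routes that event is not shown here.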

I thought about moving AuxServicesEventType.APPLICATION_INIT into the application, but I do not
think we would get any benefit from it. The reasons are:
1. There are two new events: AuxServicesEvent.CONTAINER_INIT and AuxServicesEvent.CONTAINER_STOP.
We need to handle them at the container level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application, we have two
options:
   a. We do not start any containers until all the AuxServices finish their APPLICATION_INIT.
This would definitely simplify the problem: when any AuxService throws an exception during
APPLICATION_INIT, we simply kill the application. But does it make sense
to block all the containers?
   b. We let the AuxServices do APPLICATION_INIT while the containers start at the same time. In
that case we end up with the same process as now, because when the container receives
the CONTAINER_EXITED_WITH_FAILURE event, we cannot guarantee which state the container is in.
It may be in the KILLING state, the LOCALIZED state, etc. Any state is possible.
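
The consequence of option b, that the failure event must be accepted in every container state, can be sketched as a transition table with an entry for each state. The state list below is a simplified subset, and the target state chosen for each entry is an illustrative assumption, not what the patch does. The class and method names are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

public class ContainerFailureTransitionSketch {
    /** Simplified, hypothetical subset of container states. */
    public enum State {
        NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING, EXITED_WITH_FAILURE, DONE
    }

    /** Target state for the failure event, keyed by the current state. */
    private static final Map<State, State> ON_FAILURE = new EnumMap<>(State.class);

    static {
        // Default: every state accepts the event and moves to EXITED_WITH_FAILURE.
        for (State s : State.values()) {
            ON_FAILURE.put(s, State.EXITED_WITH_FAILURE);
        }
        // Assumption: a container that is already being killed or is finished
        // ignores the event and stays where it is.
        ON_FAILURE.put(State.KILLING, State.KILLING);
        ON_FAILURE.put(State.DONE, State.DONE);
    }

    /** Applies the failure event; no state is allowed to reject it. */
    public static State onFailure(State current) {
        State next = ON_FAILURE.get(current);
        if (next == null) {
            throw new IllegalStateException("No failure transition from " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + onFailure(s));
        }
    }
}
```

Because the table is filled for every enum value, a failure event arriving in an unexpected state cannot trigger an invalid-transition error, which is the situation the comment describes.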

                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch,
YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
