hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Thu, 12 Sep 2013 17:53:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696 ]

Xuan Gong commented on YARN-867:

The new patch adds more transitions in ContainerState.EXITED_WITH_FAILURE and ContainerState.DONE.
This patch still handles AuxServicesEventType.APPLICATION_INIT, and handles exceptions,
at the container level.

I thought about moving AuxServicesEventType.APPLICATION_INIT into the application, but I do not
think we would get any benefit from it. The reasons are:
1. There are two new events: AuxServicesEventType.CONTAINER_INIT and AuxServicesEventType.CONTAINER_STOP.
We need to handle them at the container level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application, we will have two
options:
   a. We do not start any containers until all the AuxServices finish their APPLICATION_INIT.
If we choose this, it definitely simplifies the problem: when any AuxService throws an exception
during APPLICATION_INIT, we simply kill the application. But does it make sense
to block all the containers?
   b. We let the AuxServices do APPLICATION_INIT while the containers start at the same time. In
this case, we end up with the same process as now, because when the container receives
the CONTAINER_EXITED_WITH_FAILURE event, we cannot guarantee which state the container is in.
It may be in the KILLING state, the LOCALIZED state, etc. Any state is possible.
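
To illustrate option b, here is a minimal sketch of why the failure event needs a transition from every container state, including EXITED_WITH_FAILURE and DONE. This is not the patch and not the NodeManager's state machine: the class name, the reduced state list, and the chosen target states are all assumptions made up for the example.

```java
import java.util.EnumMap;
import java.util.Map;

public class ContainerFailureSketch {
    // Hypothetical, reduced set of container states (not the full YARN enum).
    enum State { NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING, EXITED_WITH_FAILURE, DONE }

    // Assumed target state for each source state when an aux-service
    // failure event arrives. The targets are illustrative only.
    static final Map<State, State> ON_AUX_FAILURE = new EnumMap<>(State.class);
    static {
        ON_AUX_FAILURE.put(State.NEW, State.EXITED_WITH_FAILURE);
        ON_AUX_FAILURE.put(State.LOCALIZING, State.EXITED_WITH_FAILURE);
        ON_AUX_FAILURE.put(State.LOCALIZED, State.EXITED_WITH_FAILURE);
        ON_AUX_FAILURE.put(State.RUNNING, State.EXITED_WITH_FAILURE);
        // The container is already being torn down: stay where we are.
        ON_AUX_FAILURE.put(State.KILLING, State.KILLING);
        // Late events must be tolerated in the terminal states too,
        // otherwise the event arrives in a state with no transition.
        ON_AUX_FAILURE.put(State.EXITED_WITH_FAILURE, State.EXITED_WITH_FAILURE);
        ON_AUX_FAILURE.put(State.DONE, State.DONE);
    }

    // Returns the next state, or fails loudly if a state was left uncovered.
    static State onAuxServiceFailure(State current) {
        State next = ON_AUX_FAILURE.get(current);
        if (next == null) {
            throw new IllegalStateException("No transition from " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + onAuxServiceFailure(s));
        }
    }
}
```

Because the container and the AuxServices run concurrently, the event can arrive at any point, so a missing entry in the table would surface as an invalid transition at runtime.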

> Isolation of failures in aux services 
> --------------------------------------
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch,
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.
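
The failure mode described here can be sketched roughly as follows. The AuxService interface, the dispatchInitApp helper, and the service names are hypothetical stand-ins, not the YARN API. The sketch only shows the general idea of catching every exception per service, so that one misbehaving service is reported instead of taking down the dispatching thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AuxServiceIsolationSketch {
    /** Hypothetical stand-in for an auxiliary service; not the YARN interface. */
    interface AuxService {
        void initApp(String appId, byte[] data) throws Exception;
    }

    /**
     * Delivers the init event to each service and returns the names of the
     * services that failed. Catching Exception (not just IOException) keeps
     * a RuntimeException from propagating out of the dispatch loop.
     */
    static List<String> dispatchInitApp(Map<String, AuxService> services,
                                        String appId, byte[] data) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, AuxService> entry : services.entrySet()) {
            try {
                entry.getValue().initApp(appId, data);
            } catch (Exception e) {
                // Record the failure so the caller can fail the affected
                // containers or application; the loop itself keeps going.
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, AuxService> services = new LinkedHashMap<>();
        // A service that rejects bad data with an unchecked exception.
        services.put("bad_service", (appId, data) -> {
            throw new IllegalArgumentException("malformed payload");
        });
        // A well-behaved service that must still receive the event.
        services.put("good_service", (appId, data) -> { });

        List<String> failed = dispatchInitApp(services, "app_1", new byte[0]);
        System.out.println("Failed services: " + failed);
    }
}
```

The caller decides what to do with the returned names, for example failing only the containers of the offending application.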
