hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Thu, 12 Sep 2013 20:57:54 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765884#comment-13765884 ]

Zhijie Shen commented on YARN-867:
----------------------------------

Sorry for posting the broken comment earlier.

Thinking about the problem again: essentially, the problem is that an AuxiliaryService
implementation may throw a RuntimeException (or another Throwable) and kill the NM dispatcher
thread. Wrapping the calling statements in try/catch should be enough to keep the NM from failing.
The next task is to handle the Throwable from the AuxiliaryService. In the previous thread, the
plan was to fail the container directly and let the AM know that the container failed
due to AUXSERVICE_FAILED. For MR that may be okay, because MR jobs cannot run properly
without the ShuffleHandler. However, should the NM always make the decision to fail the
container? I'm concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or how important it is.
2. The NM doesn't know how critical the exception is, or whether it is transient or reproducible.
So what if the application can tolerate the AuxiliaryService failure? For example, if the
AuxiliaryService just does some node-local monitoring work, the application can complete even
when the AuxiliaryService is not working. Therefore, I'm wondering whether we should leave the
decision to the AM. The application knows best how to handle the exception. The NM just needs
to expose the AuxiliaryService failure to the application in some way. Thoughts?
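
To make the idea concrete, here is a rough, self-contained sketch of what I mean. It is not
the NM code and not a patch: AuxService, AuxFailure and initAppOnAll are hypothetical names
made up for illustration, and the real AuxiliaryService interface looks different. Each call
into a service is wrapped in try/catch so that a Throwable cannot escape into the calling
(dispatcher) thread, and the failures are collected so they could be exposed to the AM
instead of the NM deciding on its own to fail the container.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an aux service callback; NOT the real YARN interface. */
    interface AuxService {
        void initApp(String appId);
    }

    /** Hypothetical failure record that the NM could expose to the AM. */
    static final class AuxFailure {
        final String serviceName;
        final String appId;
        final Throwable cause;

        AuxFailure(String serviceName, String appId, Throwable cause) {
            this.serviceName = serviceName;
            this.appId = appId;
            this.cause = cause;
        }
    }

    /**
     * Calls initApp on every service. Each call is isolated, so one failing
     * service neither stops the remaining services nor kills the calling thread.
     * Catching Throwable is deliberately broad and also swallows Errors; real
     * code may want to treat those differently.
     */
    static List<AuxFailure> initAppOnAll(Map<String, AuxService> services, String appId) {
        List<AuxFailure> failures = new ArrayList<>();
        for (Map.Entry<String, AuxService> entry : services.entrySet()) {
            try {
                entry.getValue().initApp(appId);
            } catch (Throwable t) {
                // Record and continue; the decision on what to do is left to the caller.
                failures.add(new AuxFailure(entry.getKey(), appId, t));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, AuxService> services = new LinkedHashMap<>();
        // A service that blows up on bad input with a non-IOException.
        services.put("monitor", appId -> { throw new IllegalStateException("bad input"); });
        // A well-behaved service that must still be initialized.
        services.put("shuffle", appId -> { });

        for (AuxFailure f : initAppOnAll(services, "app_1")) {
            System.out.println(f.serviceName + " failed for " + f.appId + ": " + f.cause);
        }
    }
}
```

How the collected failures would actually reach the AM (container status, a new event, or
something else) is exactly the open question above; the sketch only shows the isolation part.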
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch,
YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
