hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Thu, 12 Sep 2013 20:49:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765878#comment-13765878 ]

Zhijie Shen commented on YARN-867:

Thinking about the problem again: essentially, an implementation of AuxiliaryService
may throw a RuntimeException (or another Throwable) and kill the NM dispatcher thread. Wrapping
the calls into the service with try/catch is basically enough to prevent the NM failure.
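
As a rough illustration of the try/catch wrapping described above, here is a minimal,
self-contained sketch. It does not use the real YARN classes: AuxService, AuxFailure,
safeInit and dispatchAll are hypothetical names invented for this example, and the loop
only stands in for the NM dispatcher thread. The point is that a Throwable from one call
is captured at the call site, so later events are still handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: every type here is a hypothetical stand-in, not a real YARN class.
public class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an auxiliary service plugged into the NM. */
    interface AuxService {
        void initializeApplication(String appId);
    }

    /** Hypothetical record of a failure caught at the call site. */
    static final class AuxFailure {
        final String appId;
        final Throwable cause;

        AuxFailure(String appId, Throwable cause) {
            this.appId = appId;
            this.cause = cause;
        }
    }

    /**
     * Calls into the service and catches any Throwable so the calling thread
     * survives. Returns null on success, or the captured failure otherwise.
     * Note that catching Throwable also swallows Errors such as
     * OutOfMemoryError; whether that is desirable is a separate decision.
     */
    static AuxFailure safeInit(AuxService service, String appId) {
        try {
            service.initializeApplication(appId);
            return null;
        } catch (Throwable t) {
            // What to do next (fail the container, tell the AM, ...) is left
            // to the caller; this sketch only isolates the failure.
            return new AuxFailure(appId, t);
        }
    }

    /** Stand-in for the dispatcher loop: handles events in order. */
    static List<AuxFailure> dispatchAll(AuxService service, List<String> appIds) {
        List<AuxFailure> failures = new ArrayList<>();
        for (String appId : appIds) {
            AuxFailure failure = safeInit(service, appId);
            if (failure != null) {
                failures.add(failure);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // A service that throws a RuntimeException for "bad" input.
        AuxService flaky = appId -> {
            if (appId.startsWith("bad")) {
                throw new IllegalStateException("bad data for " + appId);
            }
        };
        List<AuxFailure> failures =
            dispatchAll(flaky, Arrays.asList("app_1", "bad_2", "app_3"));
        for (AuxFailure f : failures) {
            System.out.println(f.appId + " failed: " + f.cause);
        }
    }
}
```

In this sketch the loop keeps going after bad_2 and still handles app_3. The wrapper
deliberately returns the failure instead of acting on it, since how the NM should react
is the open question discussed next.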

The next task is to handle the Throwable from the AuxiliaryService. In the previous thread, the
plan was to fail the container directly and let the AM know that the container failed
due to AUXSERVICE_FAILED. For MR this may be okay, because MR jobs cannot run properly
without ShuffleHandler. However, should the NM always make the decision to fail the container? I'm
concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or how important it is.
2. The NM doesn't know how critical the exception is, or whether it is transient or reproducible.
Therefore, if the application can toleran
> Isolation of failures in aux services 
> --------------------------------------
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch,
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.

