hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Wed, 14 Aug 2013 18:27:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740009#comment-13740009 ]

Xuan Gong commented on YARN-867:
--------------------------------

My proposal:
When any auxiliary service fails, instead of simply throwing the exception to the
dispatcher, we catch it and inform the AM.

Here is how it works:

We will use ContainerManagementProtocol. The AM periodically sends an AuxiliaryServiceCheckRequest,
with the ApplicationId as its parameter, to every ContainerManager it knows about (the period
could be set to 3s or 5s). Each ContainerManager responds with whether any AuxiliaryService
has failed for that appId, along with the related diagnostics.
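
The AM-side polling described above could look roughly like the sketch below. It is only
an illustration, not the attached patch: AuxCheckResponse, AuxCheckClient and
AuxServicePoller are hypothetical names, the client interface stands in for the proposed
call on ContainerManagementProtocol, and a plain String stands in for the ApplicationId.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical response: whether any aux service failed for the app, plus diagnostics.
final class AuxCheckResponse {
    final boolean failed;
    final String diagnostics;

    AuxCheckResponse(boolean failed, String diagnostics) {
        this.failed = failed;
        this.diagnostics = diagnostics;
    }
}

// Stand-in for the proposed check call to one ContainerManager (not a real YARN API).
interface AuxCheckClient {
    AuxCheckResponse checkAuxServices(String appId);
}

final class AuxServicePoller {
    private final String appId;
    private final Map<String, AuxCheckClient> nodes; // node address -> client
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    AuxServicePoller(String appId, Map<String, AuxCheckClient> nodes) {
        this.appId = appId;
        this.nodes = new LinkedHashMap<>(nodes);
    }

    // One polling round: ask every known ContainerManager and collect reported failures.
    List<String> pollOnce() {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, AuxCheckClient> e : nodes.entrySet()) {
            AuxCheckResponse r = e.getValue().checkAuxServices(appId);
            if (r.failed) {
                failures.add(e.getKey() + ": " + r.diagnostics);
            }
        }
        return failures;
    }

    // Repeat the round at a fixed period (for example 3 or 5 seconds).
    void start(long periodSeconds, Consumer<List<String>> onFailures) {
        timer.scheduleAtFixedRate(() -> {
            try {
                List<String> failures = pollOnce();
                if (!failures.isEmpty()) {
                    onFailures.accept(failures);
                }
            } catch (RuntimeException ex) {
                // Swallow so that one failed round does not cancel the schedule.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

An AM would call start(3, handler) once it knows its ContainerManagers and stop() on
shutdown. How the AM should react to a reported failure is left open here.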

On the ContainerManagerImpl side, if any registered AuxService fails, instead of simply
throwing the exception to the dispatcher, we catch it and save it, with the appId and the
exception message, into an AuxServiceFailureMap. When a ContainerManager then receives an
AuxiliaryServiceCheckRequest, it looks up the appId in the AuxServiceFailureMap and
responds with whether any AuxService has failed for that appId.
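
A minimal sketch of the NM-side bookkeeping described above, under stated assumptions:
AuxServiceFailureMap is modelled as a concurrent map from appId (a plain String here) to
diagnostic messages, and AuxServiceCall and IsolatedAuxInvoker are hypothetical helpers
standing in for the point where ContainerManagerImpl calls into an aux service. None of
this is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical per-application record of aux service failures.
final class AuxServiceFailureMap {
    private final Map<String, List<String>> failuresByApp = new ConcurrentHashMap<>();

    // Save the failing service name and the exception message under the appId.
    void record(String appId, String serviceName, Exception cause) {
        failuresByApp
            .computeIfAbsent(appId, k -> new CopyOnWriteArrayList<>())
            .add(serviceName + ": " + cause.getMessage());
    }

    boolean hasFailure(String appId) {
        return failuresByApp.containsKey(appId);
    }

    // Returns a copy so callers cannot modify the stored diagnostics.
    List<String> diagnosticsFor(String appId) {
        List<String> d = failuresByApp.get(appId);
        return d == null ? Collections.<String>emptyList() : new ArrayList<>(d);
    }
}

// Stand-in for a call into an aux service, such as application initialization.
interface AuxServiceCall {
    void run() throws Exception;
}

final class IsolatedAuxInvoker {
    private final AuxServiceFailureMap failures;

    IsolatedAuxInvoker(AuxServiceFailureMap failures) {
        this.failures = failures;
    }

    // Catch whatever the service throws so that nothing reaches the dispatcher.
    void invoke(String appId, String serviceName, AuxServiceCall call) {
        try {
            call.run();
        } catch (Exception e) {
            failures.record(appId, serviceName, e);
        }
    }
}
```

On receiving a check request for an appId, the handler would answer from
hasFailure(appId) and diagnosticsFor(appId). When entries should be removed, for example
when the application finishes, is not covered by this sketch.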

Sample code for this proposal is attached.
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
