hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-867) Isolation of failures in aux services
Date Wed, 02 Oct 2013 21:57:43 GMT

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784503#comment-13784503 ]

Bikas Saha commented on YARN-867:
---------------------------------

Probably we can ignore the error here since the container has already failed.
{code}
     // From LOCALIZATION_FAILED State
     .addTransition(ContainerState.LOCALIZATION_FAILED,
@@ -180,6 +184,9 @@ public ContainerImpl(Configuration conf, Dispatcher dispatcher,
     .addTransition(ContainerState.LOCALIZATION_FAILED,
         ContainerState.LOCALIZATION_FAILED,
         ContainerEventType.RESOURCE_FAILED)
+    .addTransition(ContainerState.LOCALIZATION_FAILED, ContainerState.EXITED_WITH_FAILURE,
+        ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+        new ExitedWithFailureTransition(false))
{code}
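
A minimal sketch of what ignoring the event could look like. It uses a hypothetical table-driven state machine, not the NodeManager's actual state machine classes; the class name IgnoreAfterFailureSketch and its methods are assumptions made for illustration. The event is registered as a self-transition, so a container that has already failed localization stays in that state.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: a tiny table-driven state machine, not Hadoop's real one.
final class IgnoreAfterFailureSketch {

    enum State { LOCALIZATION_FAILED, EXITED_WITH_FAILURE }

    enum Event { RESOURCE_FAILED, CONTAINER_EXITED_WITH_FAILURE }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
        new EnumMap<State, Map<Event, State>>(State.class);

    void addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
    }

    // Returns the next state, or throws if the event is not registered
    // for the current state.
    State next(State current, Event event) {
        Map<Event, State> row = table.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + current);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        IgnoreAfterFailureSketch sm = new IgnoreAfterFailureSketch();
        sm.addTransition(State.LOCALIZATION_FAILED,
            State.LOCALIZATION_FAILED, Event.RESOURCE_FAILED);
        // Ignore the late failure event: the container has already failed,
        // so it stays in LOCALIZATION_FAILED instead of moving on.
        sm.addTransition(State.LOCALIZATION_FAILED,
            State.LOCALIZATION_FAILED, Event.CONTAINER_EXITED_WITH_FAILURE);

        System.out.println(sm.next(State.LOCALIZATION_FAILED,
            Event.CONTAINER_EXITED_WITH_FAILURE));
    }
}
```

Registering the event at all matters here: in this sketch an unregistered event raises an exception, so the self-transition is what makes the event safe to receive.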

We should probably have one try/catch block instead of multiple.

Can we rename AUXSERVICE_FAIL to AUXSERVICE_ERROR, since the service probably hasn't failed?

TestAuxService needs an additional test for the new code.

TestContainer - the new test can be made simpler by not mocking AuxServiceHandler and instead
sending the failed event directly, as is done for the other tests there.

In AuxService.handle(APPLICATION_INIT) and other places like that, we should also fail when
the service does not exist.

Zhijie, we should err on the side of caution here and fail the container. If we see real use
cases where failure can be ignored then we can make that improvement.
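
The points above about a single try/catch, failing on a missing service, and failing the container can be sketched together as follows. This is only an illustration under stated assumptions: the class AuxServiceIsolationSketch, its nested AuxService interface, the Outcome enum, and the handleAppInit method are all hypothetical and are not the NodeManager's real types or the contents of the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types are the real NodeManager classes.
final class AuxServiceIsolationSketch {

    /** Stand-in for an auxiliary service such as a shuffle handler. */
    interface AuxService {
        void initApp(String appId, byte[] data);
    }

    /** Result reported back for the container that requested the service. */
    enum Outcome { OK, AUXSERVICE_ERROR }

    private final Map<String, AuxService> services = new HashMap<>();

    void register(String name, AuxService service) {
        services.put(name, service);
    }

    /**
     * Handles an application-init request for one service. A single
     * try/catch covers both the lookup and the service call, so no
     * runtime exception escapes to the caller (the dispatcher thread
     * in the scenario described in the issue).
     */
    Outcome handleAppInit(String serviceName, String appId, byte[] data) {
        try {
            AuxService service = services.get(serviceName);
            if (service == null) {
                // A missing service is treated as an error as well.
                throw new IllegalStateException(
                    "Unknown aux service: " + serviceName);
            }
            service.initApp(appId, data);
            return Outcome.OK;
        } catch (RuntimeException e) {
            // Err on the side of caution: report an error so that the
            // caller can fail the container.
            return Outcome.AUXSERVICE_ERROR;
        }
    }

    public static void main(String[] args) {
        AuxServiceIsolationSketch handler = new AuxServiceIsolationSketch();
        handler.register("good", (appId, data) -> { });
        handler.register("bad", (appId, data) -> {
            throw new IllegalArgumentException("bad data");
        });

        System.out.println(handler.handleAppInit("good", "app_1", new byte[0]));
        System.out.println(handler.handleAppInit("bad", "app_1", new byte[0]));
        System.out.println(handler.handleAppInit("missing", "app_1", new byte[0]));
    }
}
```

Returning an error value rather than rethrowing keeps the failure scoped to the one container, which is the isolation the issue asks for.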

> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch,
YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a service.
For example, sending data to the ShuffleService such that it results in any non-IOException will
cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.




--
This message was sent by Atlassian JIRA
(v6.1#6144)
