hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4354) Public resource localization fails with NPE
Date Mon, 16 Nov 2015 16:32:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006883#comment-15006883

Junping Du commented on YARN-4354:

bq. I don't think there's anything magical about localization vs. the other things the NM
is doing. The async dispatcher will only exit if an exception leaks up to the top, and when
it does that's a programming error since it doesn't handle an exception properly.
I agree there is not much difference overall. However, back to this case: from a user's
perspective, an occasional NPE during localization of a resource that is being cancelled would
be better ignored (but logged) than allowed to crash the NM. The price of ignoring the
exception here is a potentially leaked half-localized file (which could be removed later), but
the gain is that the NM survives and keeps working. We should at least offer this trade-off
to the user as a configurable choice, shouldn't we?
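
To make the trade-off concrete, here is a minimal, self-contained sketch of a dispatcher whose reaction to an unexpected exception is configurable. It is not Hadoop's AsyncDispatcher and not the patch attached to this issue: the class name TolerantDispatcher, the exitOnError flag and the in-memory error log are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical sketch only: a single-threaded dispatcher with a configurable
 * policy for exceptions that leak out of an event handler.
 */
final class TolerantDispatcher<E> {
    private final Consumer<E> handler;
    // Assumed config switch: true = stop on any leaked exception,
    // false = log the exception and keep dispatching.
    private final boolean exitOnError;
    private final List<String> errorLog = new ArrayList<>();
    private boolean stopped = false;

    TolerantDispatcher(Consumer<E> handler, boolean exitOnError) {
        this.handler = handler;
        this.exitOnError = exitOnError;
    }

    /** Dispatches one event and returns true if the dispatcher is still running. */
    boolean dispatch(E event) {
        if (stopped) {
            return false;
        }
        try {
            handler.accept(event);
        } catch (RuntimeException e) {
            // The exception is always recorded, whichever policy is active.
            errorLog.add(e.getClass().getSimpleName() + " while handling " + event);
            if (exitOnError) {
                stopped = true;
            }
        }
        return !stopped;
    }

    boolean isStopped() {
        return stopped;
    }

    List<String> getErrorLog() {
        return errorLog;
    }
}
```

With exitOnError set to false, a handler that throws an NPE on one event leaves the dispatcher running for the next event. With it set to true, the first leaked exception stops the dispatcher, which is the behavior being debated here.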

bq.  If we're willing for NPEs in localization to not take down the NM, why are we willing
to do the same if it happens in another NM subsystem that also uses the AsyncDispatcher? IMHO
we should be consistent about the unexpected exception handling.
I am not against keeping localization event handling consistent with the other subsystems,
but I am not sure whether ignoring other exceptional events could leave the NM in a
bad state. I think that is the motivation for separating SchedulerEventDispatcher from the RM
dispatcher for general events, with different settings/behavior. No?

> Public resource localization fails with NPE
> -------------------------------------------
>                 Key: YARN-4354
>                 URL: https://issues.apache.org/jira/browse/YARN-4354
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>             Fix For: 2.7.2
>         Attachments: YARN-4354-branch-2.7.002.patch, YARN-4354-unittest.patch, YARN-4354.001.patch,
> I saw public localization on nodemanagers get stuck because it was constantly rejecting
requests to the thread pool executor.

This message was sent by Atlassian JIRA
