hadoop-yarn-issues mailing list archives

From "Naganarasimha G R (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (YARN-3260) NPE if AM attempts to register before RM processes launch event
Date Thu, 13 Jul 2017 05:18:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naganarasimha G R updated YARN-3260:
------------------------------------
    Comment: was deleted

(was: Hi [~jlowe],
I had a look at the code, and the approaches I can think of are:
* Call ApplicationMasterService.registerAppAttempt(ApplicationAttemptId) in RMAppAttemptImpl.AMLaunchedTransition
 instead of RMAppAttemptImpl.AttemptStartedTransition, ensuring that ClientToAMToken creation and
registration with ApplicationMasterService happen in the same block. By doing this we can throw
InvalidApplicationMasterRequestException if the AM tries to register with the AMS before RMAppAttemptImpl
processes the RMAppAttempt LAUNCHED event.
* I was thinking of having a MultiThreadedDispatcher for processing App and AppAttempt events,
 similar to SystemMetricsPublisher.MultiThreadedDispatcher, with the additional modification
that instead of {{ "(event.hashCode() & Integer.MAX_VALUE) % dispatchers.size();"}}
we select the dispatcher based on applicationId. This can speed up the processing of App events
...

 I was not able to see any other cleaner, more direct fix for this issue, so I was wondering whether
we need to start looking at the reason the RM "was running behind on processing AsyncDispatcher
events". Were these events getting delayed for any particular reason? )
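
The second approach in the deleted comment, selecting a dispatcher by applicationId rather than by event hash, could be sketched roughly as below. This is only an illustration: the class name AppKeyedDispatcherSelector and the use of a plain String for the application id are assumptions, not Hadoop code. Keying on the application id means every event for one application lands on the same dispatcher index, which preserves per-application ordering.

```java
import java.util.Objects;

/**
 * Hypothetical sketch (not Hadoop code): choose a dispatcher index from the
 * application id instead of from event.hashCode().
 */
final class AppKeyedDispatcherSelector {
    private final int dispatcherCount;

    AppKeyedDispatcherSelector(int dispatcherCount) {
        if (dispatcherCount <= 0) {
            throw new IllegalArgumentException("dispatcherCount must be positive");
        }
        this.dispatcherCount = dispatcherCount;
    }

    /** All events carrying the same application id map to the same index. */
    int indexFor(String applicationId) {
        Objects.requireNonNull(applicationId, "applicationId");
        // Mask the sign bit so the remainder is never negative.
        return (applicationId.hashCode() & Integer.MAX_VALUE) % dispatcherCount;
    }
}
```

With this selection rule, events for different applications can be processed in parallel while events for a single application stay on one dispatcher.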

> NPE if AM attempts to register before RM processes launch event
> ---------------------------------------------------------------
>
>                 Key: YARN-3260
>                 URL: https://issues.apache.org/jira/browse/YARN-3260
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-3260.001.patch
>
>
> The RM on one of our clusters was running behind on processing AsyncDispatcher events,
> and this caused AMs to fail to register due to an NPE.  The AM was launched and attempting
> to register before the RMAppAttemptImpl had processed the LAUNCHED event, and the client to
> AM token had not been generated yet.  The NPE occurred because the ApplicationMasterService
> tried to encode the missing token.
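
The race described above, and the kind of guard that turns the NPE into an explicit rejection, can be illustrated with a simplified model. Everything here is an assumption made for illustration: RegistrationGuardSketch, its method names, and the use of IllegalStateException as a stand-in for a request exception are hypothetical and do not reflect the actual ResourceManager code or the attached patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the RM-side registration path; not Hadoop code. */
final class RegistrationGuardSketch {
    // attemptId -> client-to-AM token bytes. An entry exists only after the
    // LAUNCHED event for that attempt has been processed.
    private final Map<String, byte[]> clientToAmTokens = new ConcurrentHashMap<>();

    /** Models the RM finishing its handling of the LAUNCHED event. */
    void onLaunchedEventProcessed(String attemptId, byte[] token) {
        clientToAmTokens.put(attemptId, token.clone());
    }

    /** Models an AM registering; rejects the call instead of encoding a null token. */
    byte[] register(String attemptId) {
        byte[] token = clientToAmTokens.get(attemptId);
        if (token == null) {
            // Stand-in for an explicit "invalid request" error in place of an NPE.
            throw new IllegalStateException(
                "Attempt " + attemptId + " has not been launched yet; retry registration later");
        }
        return token.clone();
    }
}
```

In this model an early register call fails with a descriptive error that the caller could retry on, rather than failing while encoding a token that does not exist yet.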



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

