hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-958) NM may miss a heartbeat response from RM resulting into missed finished applications information.
Date Tue, 23 Jul 2013 20:50:47 GMT

     [ https://issues.apache.org/jira/browse/YARN-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Vinit Joshi updated YARN-958:
-----------------------------------

    Summary: NM may miss a heartbeat response from RM resulting into missed finished applications
information.  (was: NM may miss a heartbeat from RM resulting into missed finished applications
information.)
    
> NM may miss a heartbeat response from RM resulting into missed finished applications information.
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-958
>                 URL: https://issues.apache.org/jira/browse/YARN-958
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
>
> Today, whenever the RM receives a heartbeat from an NM, it computes a new heartbeat response
> and sends it back to the NM. Internally, this response is also sent to RMNodeImpl as an
> RMNodeEvent via the dispatcher queue. If for some reason the NM did not get the previous
> heartbeat response, it will heartbeat again. The RM will in turn compute another response
> (if it has not already handled the event from the queue) and add this duplicate response to
> the dispatcher queue. Today we remove completed applications from RMNodeImpl while computing
> the response, so the recomputed response may no longer list them. If the NM gets a response
> without the finished applications, it will never realize that those applications finished.
> Solution:
> * We should synchronously update the newly computed response.
> * lastResponse should be moved out of RMNodeImpl and stored in ResourceTrackerService
>   itself, just like in ApplicationMasterService.
> * Like YARN-744, we should introduce locking while computing the response.
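
The three points above can be illustrated with a minimal sketch. It is a simplified model, not the actual ResourceTrackerService code: the class HeartbeatTracker, its Response and NodeState types, and the methods appFinished and heartbeat are all hypothetical names. It assumes the NM echoes the id of the last response it received, so a retried heartbeat is answered with the stored response instead of a freshly computed one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the proposed fix (not real YARN classes).
class HeartbeatTracker {
    /** Immutable heartbeat response: its id plus the applications to clean up. */
    static final class Response {
        final int responseId;
        final List<String> finishedApps;

        Response(int responseId, List<String> finishedApps) {
            this.responseId = responseId;
            this.finishedApps =
                Collections.unmodifiableList(new ArrayList<>(finishedApps));
        }
    }

    /** Per-node state kept in the tracker itself, not in the node object. */
    private static final class NodeState {
        Response lastResponse = new Response(0, Collections.<String>emptyList());
        final List<String> pendingFinishedApps = new ArrayList<>();
    }

    private final Map<String, NodeState> nodes = new HashMap<>();

    private synchronized NodeState stateFor(String nodeId) {
        return nodes.computeIfAbsent(nodeId, k -> new NodeState());
    }

    /** Records that an application finished and the node must be told. */
    void appFinished(String nodeId, String appId) {
        NodeState state = stateFor(nodeId);
        synchronized (state) {
            state.pendingFinishedApps.add(appId);
        }
    }

    /**
     * Handles one heartbeat. lastSeenResponseId is the id of the last
     * response the node actually received (an assumption of this sketch).
     */
    Response heartbeat(String nodeId, int lastSeenResponseId) {
        NodeState state = stateFor(nodeId);
        // Lock per node while computing, so a retry cannot interleave.
        synchronized (state) {
            int lastSentId = state.lastResponse.responseId;
            if (lastSeenResponseId + 1 == lastSentId) {
                // The node missed our last response: resend it unchanged,
                // including its finished applications.
                return state.lastResponse;
            }
            if (lastSeenResponseId != lastSentId) {
                throw new IllegalStateException(
                    "Unexpected response id " + lastSeenResponseId);
            }
            // Compute the next response and store it synchronously, before
            // the pending list is cleared and the lock is released.
            Response next =
                new Response(lastSentId + 1, state.pendingFinishedApps);
            state.pendingFinishedApps.clear();
            state.lastResponse = next;
            return next;
        }
    }
}
```

In this model, a node that calls heartbeat("node1", 0) twice because the first reply was lost receives the same response both times, so an application reported as finished in the lost reply is still delivered on the retry.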

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
