hadoop-yarn-issues mailing list archives

From "Eric Badger (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
Date Mon, 21 Mar 2016 22:52:25 GMT

     [ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-4756:
------------------------------
    Attachment: YARN-4756.003.patch

[~kasha], I wasn't clear in my original text. The patches in [YARN-4686] do not break any
extra tests. However, while exploring the fixes for those failures, I came across an unnecessary
wait in the NodeStatusUpdater thread (NodeStatusUpdaterImpl:850). When a reboot happens, the
isStopped variable is set to true, but the thread is still waiting for the next heartbeat. That
heartbeat never comes, so the thread waits out the full heartbeat timeout. Rather than waste
that time, I added a notify that wakes the thread up so that it continues in the loop, where
it finds that isStopped is set to true.
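
To illustrate the idea, here is a minimal sketch of the wait/notify pattern described above. It
is not the code from the patch: the class HeartbeatLoopSketch, the heartbeatMonitor object and
the helper measureShutdownMillis are hypothetical names used only for this example, and the
real NodeStatusUpdaterImpl loop does considerably more work per iteration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a heartbeat loop; not the actual NodeStatusUpdaterImpl.
final class HeartbeatLoopSketch {
    private final Object heartbeatMonitor = new Object(); // assumed monitor object
    private volatile boolean isStopped = false;
    private final long heartbeatIntervalMs;

    HeartbeatLoopSketch(long heartbeatIntervalMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
    }

    // Loops until isStopped is true, pausing between (imaginary) heartbeats.
    void runLoop() throws InterruptedException {
        while (!isStopped) {
            // A real updater would send a heartbeat here.
            synchronized (heartbeatMonitor) {
                // Re-check under the lock so a stop() that already ran is not missed.
                if (!isStopped) {
                    heartbeatMonitor.wait(heartbeatIntervalMs);
                }
            }
        }
    }

    // Sets the flag and wakes the loop so it does not sit out the full interval.
    void stop() {
        isStopped = true;
        synchronized (heartbeatMonitor) {
            heartbeatMonitor.notifyAll();
        }
    }

    // Starts the loop, stops it, and returns how long the thread took to exit.
    static long measureShutdownMillis(long intervalMs) {
        HeartbeatLoopSketch sketch = new HeartbeatLoopSketch(intervalMs);
        Thread loop = new Thread(() -> {
            try {
                sketch.runLoop();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            loop.start();
            Thread.sleep(100); // give the loop time to reach wait()
            long start = System.nanoTime();
            sketch.stop();
            loop.join();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

Without the notifyAll() call in stop(), the loop thread in this sketch would only notice the
flag after the wait interval expired, which is the delay this issue is about.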

Adding this optimization uncovered a race condition in the TestNodeManagerResync test.
The test doesn't wait for the NM to finish rebooting before it checks for its updated capabilities.
It only worked before because the unnecessary wait in the NodeStatusUpdater acted as a sleep
that masked the race condition.
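
As a rough illustration of how a test can avoid depending on such an incidental sleep, the
sketch below polls for a condition with an explicit timeout. This is an assumption about the
general technique, not a description of how the attached patch fixes TestNodeManagerResync;
WaitForSketch, waitFor and demo are hypothetical names, and the "rebooted" flag merely stands
in for whatever state the real test needs to observe.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not taken from the Hadoop test utilities.
final class WaitForSketch {

    // Polls condition every pollMs until it holds or timeoutMs has elapsed.
    // Returns true if the condition became true, false on timeout or interrupt.
    static boolean waitFor(BooleanSupplier condition, long pollMs, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Simulates a component that finishes "rebooting" after a short delay.
    static boolean demo() {
        AtomicBoolean rebooted = new AtomicBoolean(false);
        Thread reboot = new Thread(() -> {
            try {
                Thread.sleep(200); // simulated reboot time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            rebooted.set(true);
        });
        reboot.start();
        // The check proceeds only once the reboot has actually completed.
        return waitFor(rebooted::get, 10, 5_000);
    }
}
```

With an explicit wait like this, the test's assertions no longer depend on how long some
unrelated thread happens to block.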

I'm uploading a patch that removes the unnecessary wait in the NodeStatusUpdater thread and
also fixes the race condition in TestNodeManagerResync that it uncovers. 

> Unnecessary wait in Node Status Updater during reboot
> -----------------------------------------------------
>
>                 Key: YARN-4756
>                 URL: https://issues.apache.org/jira/browse/YARN-4756
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: YARN-4756.001.patch, YARN-4756.002.patch, YARN-4756.003.patch
>
>
> The startStatusUpdater thread waits for the isStopped variable to be set to true, but
> it is waiting for the next heartbeat. During a reboot, the next heartbeat will not come and
> so the thread waits for a timeout. Instead, we should notify the thread to continue so that
> it can check the isStopped variable and exit without having to wait for a timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
