hadoop-yarn-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
Date Fri, 22 Mar 2013 15:31:15 GMT

     [ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:

    Attachment: YARN-493.1.patch

This patch addresses the bugs that I found.  I've verified that the tests pass on Mac (does
not have setsid), Ubuntu (does have setsid), and Windows.  Here is an explanation of the changes:

# Discussion on YARN-359 concluded that we should refactor {{getCheckProcessIsAliveCommand}}
and {{getSignalKillCommand}} from {{ContainerExecutor}} back to {{Shell}}.  I'm taking the
opportunity to do it now while we're working on this code.  {{isSetsidSupported}} used to
return true for Windows, with the rationale being that this flag really means "are process
groups supported".  This didn't work out in practice, because there is too much logic that
is very specific to using setsid.  This had been causing the calls to winutils to prepend
a '-' character to the job ID, which is incorrect.
# "winutils task kill" had been terminating the job with exit code 1, but some of the YARN
code depends on seeing a Unix-style exit code from signalled child processes, which is 128
+ signal.  (See {{ContainerLaunch#call}}.)  The Windows {{TerminateJobObject}} API is most
analogous to a kill signal, so I've changed task.c to use 128 + 9 = 137.
# {{TestNodeManagerShutdown}}, {{TestContainerManager}}, and {{TestContainerLaunch}} were
using bash scripts and signals for testing.  I wrote alternatives for Windows that use cmd
and winutils.  Note that there is no equivalent to bash's ability to trap a signal, so on
Windows, the assertions need to check for process existence instead.
# Some test working directories have been shortened by switching from {{Class#getName}} to
{{Class#getSimpleName}}, similar to several prior patches.
# {{TestContainerManager}} had been requesting memory in bytes, but the API actually uses
megabytes.  I'm guessing that the API changed from bytes to MB at some point, but we forgot
to update this test.  This caused a very interesting problem.  {{ContainerImpl#LaunchTransition}}
would apply a conversion from MB to bytes, which would cause an overflow to exactly 0.  Then,
{{ContainersMonitorImpl#isProcessTreeOverLimit}} would see that the container uses > 0
MB and decide to kill it.  This is a race condition that would cause the test to fail unpredictably
on Windows.  I hadn't seen the problem on Mac or Ubuntu, where it seems we were just getting
lucky.  I've changed the test code to use MB.
# {{TestContainerLaunch#setNewEnvironmentHack}} uses reflection to modify the environment
during the test.  I needed to update this code to handle different internal JDK class structure
when running on Windows.
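
The Unix-style exit code convention mentioned in item 2 can be sketched as follows. This is a minimal illustration only: the class {{SignalExitCode}} and its members are hypothetical names, not part of Hadoop or winutils, and the only facts taken from the message are the "128 + signal" rule and the value 128 + 9 = 137.

```java
// Sketch only: SignalExitCode is a hypothetical helper, not a Hadoop API.
final class SignalExitCode {
    static final int SIGKILL = 9;

    // Exit code conventionally reported for a child process killed by a signal.
    static int forSignal(int signal) {
        return 128 + signal;
    }

    // Recovers the signal number from an exit code, or -1 if the code
    // does not follow the 128 + signal convention.
    static int signalFrom(int exitCode) {
        return exitCode > 128 ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        int code = forSignal(SIGKILL);
        System.out.println(code);             // prints 137
        System.out.println(signalFrom(code)); // prints 9
        System.out.println(signalFrom(1));    // prints -1 (the old winutils exit code)
    }
}
```

With the old exit code of 1, a caller applying this convention has no way to tell that the job was killed rather than failing on its own.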

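The overflow described for {{TestContainerManager}} can be reproduced in isolation. The sketch below is an assumption about the mechanism, not the actual Hadoop code: it supposes the memory value is an int that is multiplied by 1024 * 1024 in int arithmetic, and the class name, method names, and the 100 MB request are all hypothetical. Under that assumption, any whole number of megabytes expressed in bytes is a multiple of 2^20, so multiplying it by 2^20 again gives a multiple of 2^40, which wraps to exactly 0 in 32 bits.

```java
// Sketch only: hypothetical names; assumes the conversion uses int arithmetic.
final class MemoryUnitOverflow {
    // Multiplies in int, then widens to long: overflows for large inputs.
    static long mbToBytesIntArithmetic(int mb) {
        return mb * 1024 * 1024;
    }

    // Widens to long before multiplying: no overflow for int inputs.
    static long mbToBytesLongArithmetic(int mb) {
        return mb * 1024L * 1024L;
    }

    // A limit of 0 bytes makes any running process look over the limit.
    static boolean isOverLimit(long usedBytes, long limitBytes) {
        return usedBytes > limitBytes;
    }

    public static void main(String[] args) {
        // The test passed a byte count where the API expected megabytes.
        int requested = 100 * 1024 * 1024; // hypothetical: 100 MB written in bytes

        long limit = mbToBytesIntArithmetic(requested);
        System.out.println(limit);                              // prints 0
        System.out.println(mbToBytesLongArithmetic(requested)); // prints 109951162777600
        System.out.println(isOverLimit(1L, limit));             // prints true
    }
}
```

Passing the request in MB, as the patch does, keeps the product well inside the int range for realistic container sizes.
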
> NodeManager job control logic flaws on Windows
> ----------------------------------------------
>                 Key: YARN-493
>                 URL: https://issues.apache.org/jira/browse/YARN-493
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 3.0.0
>         Attachments: YARN-493.1.patch
> Both product and test code contain some platform-specific assumptions, such as availability
> of bash for executing a command in a container and signals to check existence of a process
> and terminate it.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
