hadoop-common-issues mailing list archives

From "Ivan Mitic (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8732) Address intermittent test failures on Windows
Date Mon, 27 Aug 2012 17:42:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442548#comment-13442548 ]

Ivan Mitic commented on HADOOP-8732:

Root cause:
When you create a child process with the CreateProcess function in a multithreaded
environment, the child may inherit handles that were not intended to be inherited (there is
a race condition here). In our case, Hadoop repeatedly calls CreateProcess on winutils.exe,
and as part of the preparation for CreateProcess, read/write handles are created on the pipes
used to redirect stdout/stderr. In a scenario where we create, for example, one ShortLived and
one LongLived child process, the LongLived process can end up inheriting handles of the
ShortLived process. This in turn causes ReadFile on the ShortLived process' stdout/stderr not
to return until the LongLived process terminates, which is the behavior we observed.

Before Windows Vista, the only way to mitigate the problem was to serialize all calls to CreateProcess.
On Vista and later, there is a way to specify the list of handles that should be inherited.

This KB article nicely explains the issue:
http://support.microsoft.com/kb/315939 - PRB: Child Inherits Unintended Handles During CreateProcess

I looked over the OpenJDK implementation of ProcessBuilder#start(), and this is exactly what
is going on. Since we can reproduce the problem on the Oracle JDK, it should be safe to assume
that it has the same issue.

The suggested workaround is to serialize all calls to CreateProcess. In the Java world, this
boils down to synchronizing calls to ProcessBuilder#start(), as this call just delegates to
CreateProcess. I tested this out and it worked fine.
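
As a rough illustration of the workaround, here is a minimal sketch that funnels every process
launch through one JVM-wide lock. The class name SerializedProcessLauncher and the START_LOCK
field are hypothetical names made up for this example; this is not the actual Hadoop patch.
It only helps if all code paths that spawn children go through the helper.

```java
import java.io.IOException;

/**
 * Hypothetical helper (not part of Hadoop or the JDK) that serializes
 * process creation so that no two launches overlap within this JVM.
 */
final class SerializedProcessLauncher {
    // Single lock shared by every caller in this JVM (assumed name).
    private static final Object START_LOCK = new Object();

    private SerializedProcessLauncher() {
        // Static utility class; no instances.
    }

    /**
     * Starts the process described by the builder while holding the lock.
     * The lock is released when start() returns or throws, so a failed
     * launch does not block later launches.
     */
    static Process start(ProcessBuilder builder) throws IOException {
        synchronized (START_LOCK) {
            return builder.start();
        }
    }
}
```

Callers would then use SerializedProcessLauncher.start(builder) wherever they previously called
builder.start() directly. Any code that still calls start() directly bypasses the lock, so the
race can still occur between such a call and a serialized one.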

> Address intermittent test failures on Windows
> ---------------------------------------------
>                 Key: HADOOP-8732
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8732
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
> There are a few tests that fail intermittently on Windows with a timeout error. This
> means that the test was actually killed from the outside, and it would continue to run otherwise.

> The following are examples of such tests (there might be others):
>  - TestJobInProgress (this issue reproduces pretty consistently in Eclipse on this one)
>  - TestControlledMapReduceJob
>  - TestServiceLevelAuthorization

