hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase
Date Thu, 12 Jan 2012 22:49:41 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185313#comment-13185313 ]

Siddharth Seth commented on MAPREDUCE-3596:
-------------------------------------------

The NPE could still happen if the startContainer request to the NM is delayed. The patch fixes
the case where a newly launched container is reported in an NM heartbeat while the RM is still
aware that this container needs to be cleaned up.
If the container is launched late (because of a delayed startContainer) and is reported to the
RM in a subsequent heartbeat, after the RM has told the NM to clean up the container and has
cleared it from its own list of containersToCleanup, we end up with the same NPE, and the
container runs to completion.

One possible option would be to have the NM keep track of containers it has been asked to
clean up, even if it isn't aware of the container yet.
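
A minimal sketch of that option follows. It is illustrative only and is not the NodeManager's actual code: the class name PendingCleanupTracker, its method names, and the use of plain strings as container IDs are all assumptions made for this example. The idea is that a cleanup request for an unknown container is remembered, so that a delayed startContainer for the same ID is refused instead of launching.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: an NM-side tracker that remembers cleanup requests
 * for containers the NM has not seen yet. Not the real NodeManager API.
 */
class PendingCleanupTracker {
    // Containers the NM has launched and not yet stopped.
    private final Set<String> running = new HashSet<>();
    // Containers the RM asked us to clean up before we ever saw them.
    private final Set<String> pendingCleanup = new HashSet<>();

    /**
     * Called when the RM's heartbeat response asks for a container cleanup.
     * Returns true if a known, running container was stopped right away.
     */
    synchronized boolean onCleanupRequest(String containerId) {
        if (running.remove(containerId)) {
            return true; // known container: stop it now
        }
        pendingCleanup.add(containerId); // unknown so far: remember the request
        return false;
    }

    /**
     * Called when a (possibly delayed) startContainer request arrives.
     * Returns true if the container is allowed to launch.
     */
    synchronized boolean onStartContainer(String containerId) {
        if (pendingCleanup.remove(containerId)) {
            return false; // cleanup was already requested: refuse to launch
        }
        running.add(containerId);
        return true;
    }

    /** True if the container was launched and has not been cleaned up. */
    synchronized boolean isRunning(String containerId) {
        return running.contains(containerId);
    }
}
```

In this sketch the set of remembered IDs grows without bound; a real implementation would presumably need to expire entries, for example when the application finishes.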
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran the sort benchmark a couple of times, and every time the job hung after completing 99%
> of the map phase. Some map tasks failed, and some of the pending map tasks were not
> scheduled.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        
