hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
Date Thu, 26 Jan 2012 22:59:43 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jason Lowe updated MAPREDUCE-3738:

    Attachment: livehistdump.txt

Attaching hist:live dump from one of the nodemanagers that had hit the OOM error multiple
times in the log aggregation threads before eventually trying to shut down.  Unfortunately
I don't have a full map dump or stack dump from that process.
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: livehistdump.txt
> If an AppLogAggregator thread dies unexpectedly (e.g., from an uncaught exception such as the
> OutOfMemoryError in the case I saw), this leads to a hang during nodemanager shutdown.  The NM
> calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed,
> and that method internally waits for an atomic boolean to be set by the log aggregation thread
> to indicate it has finished.  Since the thread was killed off earlier by the uncaught exception,
> the boolean will never be set, and the NM hangs during shutdown, repeating something like this
> every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806
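
The hang described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the Hadoop code: SketchAggregator, doAggregation, joinBounded and runAndJoin are hypothetical names, the failure is simulated with an IllegalStateException rather than a real OutOfMemoryError, and the wait is bounded so the demo terminates (the join() described in the report waits indefinitely). It also shows one possible mitigation, setting the flag in a finally block, which is not necessarily how the issue was resolved.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AggregatorJoinSketch {

    /** Hypothetical stand-in for a log aggregator; NOT the Hadoop class. */
    static class SketchAggregator implements Runnable {
        private final AtomicBoolean finished = new AtomicBoolean(false);
        private final boolean setFlagInFinally;

        SketchAggregator(boolean setFlagInFinally) {
            this.setFlagInFinally = setFlagInFinally;
        }

        @Override
        public void run() {
            if (setFlagInFinally) {
                try {
                    doAggregation();
                } finally {
                    // Runs even when doAggregation() throws.
                    finished.set(true);
                }
            } else {
                doAggregation();
                // Never reached when doAggregation() throws.
                finished.set(true);
            }
        }

        /** Simulates the aggregation thread dying from an uncaught exception. */
        private void doAggregation() {
            throw new IllegalStateException("simulated failure in aggregation");
        }

        /**
         * Polls the flag a bounded number of times (assumption: bounded only
         * so this demo ends). Returns true if the flag was seen set.
         */
        boolean joinBounded(int maxPolls, long pollMillis) {
            try {
                for (int i = 0; i < maxPolls; i++) {
                    if (finished.get()) {
                        return true;
                    }
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return finished.get();
        }
    }

    /** Runs the worker to completion (or death), then waits on the flag. */
    static boolean runAndJoin(boolean setFlagInFinally) {
        SketchAggregator agg = new SketchAggregator(setFlagInFinally);
        Thread worker = new Thread(agg);
        // Swallow the simulated exception so no stack trace is printed.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return agg.joinBounded(3, 10L);
    }

    public static void main(String[] args) {
        System.out.println("flag set without finally: " + runAndJoin(false));
        System.out.println("flag set with finally:    " + runAndJoin(true));
    }
}
```

Without the finally block the flag is never set once the worker dies, so an unbounded wait on it would spin forever. With the finally block the waiter is released even though aggregation failed.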
