hadoop-common-dev mailing list archives

From "Andy Konwinski (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5182) Task processes not exiting due to ackQueue bug in DFSClient
Date Fri, 06 Feb 2009 02:27:59 GMT
Task processes not exiting due to ackQueue bug in DFSClient
-----------------------------------------------------------

                 Key: HADOOP-5182
                 URL: https://issues.apache.org/jira/browse/HADOOP-5182
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.20.0, 0.21.0
         Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision 734915
            Reporter: Andy Konwinski


I was running some gridmix tests on a 10 node cluster on EC2 and ran into an issue with unmodified
Hadoop trunk (SVN revision 734915). After running gridmix multiple times, I noticed
several mapreduce jobs stuck in the running state. They have been hung for several days
and are still in that state, while other gridmix runs of the same size finished in
approximately 8 hours.

I saw that the slave nodes had a bunch of hung task processes running, for tasks that the
JobTracker log said were completed. These were hanging because the SIGTERM handler was waiting
on DFSClient to close existing streams, but this was never finishing because DFSOutputStream
waits on an ackQueue from the datanodes that was apparently getting no acks. The tasks did
finish their work, but the processes hung around.

I'll attach a sample jstack trace - note how the SIGTERM handler is blocked on "thread-5",
which is waiting for a monitor on the DFSOutputStream, but this stream's monitor is held by
main, which is trying to flush the stream (last trace in the file).
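
As a rough illustration of the locking pattern described above, here is a minimal standalone
sketch. It is not the actual DFSClient code: SketchStream, its ackQueue field, and both
thread names are hypothetical stand-ins. One thread holds the stream's monitor while it
waits for an ack queue that never drains, and a second thread (playing the role of the
SIGTERM handler) blocks trying to enter close() on the same stream.

```java
import java.util.LinkedList;

public class AckQueueHangSketch {

    /** Hypothetical stand-in for DFSOutputStream; NOT the real Hadoop class. */
    static class SketchStream {
        // One outstanding packet whose ack never arrives (assumption for the demo).
        private final LinkedList<Integer> ackQueue = new LinkedList<>();

        SketchStream() {
            ackQueue.add(1);
        }

        // Holds the stream monitor for the whole call while waiting for acks.
        synchronized void flush() throws InterruptedException {
            synchronized (ackQueue) {
                while (!ackQueue.isEmpty()) {
                    // wait() releases ackQueue's monitor, but not the stream's.
                    ackQueue.wait();
                }
            }
        }

        // Needs the stream monitor, so it cannot start while flush() is stuck.
        synchronized void close() {
        }
    }

    /** Starts both threads and returns {writer state, closer state}. */
    static Thread.State[] observe() throws InterruptedException {
        SketchStream stream = new SketchStream();

        Thread writer = new Thread(() -> {
            try {
                stream.flush();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "main-like-writer");
        Thread closer = new Thread(stream::close, "sigterm-like-closer");

        // Daemon threads so this demo JVM can still exit despite the hang.
        writer.setDaemon(true);
        closer.setDaemon(true);

        writer.start();
        waitForState(writer, Thread.State.WAITING);
        closer.start();
        waitForState(closer, Thread.State.BLOCKED);

        return new Thread.State[] { writer.getState(), closer.getState() };
    }

    // Polls until the thread reaches the given state or 5 seconds pass.
    static void waitForState(Thread t, Thread.State wanted) throws InterruptedException {
        long deadline = System.currentTimeMillis() + 5000;
        while (t.getState() != wanted && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread.State[] states = observe();
        System.out.println("writer=" + states[0] + " closer=" + states[1]);
    }
}
```

Running this should report the writer thread as WAITING and the closer thread as BLOCKED,
which mirrors the shape of the jstack trace: the flushing thread never gets its acks, so
the thread that wants to close the stream never gets the monitor. In a real task process
the threads are not daemons, so the JVM would not exit.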

Has anyone seen this issue before?

Another thing I saw was cleanup tasks never running (they are stuck in the initializing state
on the web UI and can't be seen as running processes on the nodes). Not sure if that is actually
related.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

