hadoop-common-dev mailing list archives

From "Xing Shi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5182) Task processes not exiting due to ackQueue bug in DFSClient
Date Thu, 07 May 2009 07:22:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706746#action_12706746 ]

Xing Shi commented on HADOOP-5182:
----------------------------------

From the Child java process's jstack output, the "SIGTERM handler" thread calls org.apache.hadoop.fs.FileSystem$ClientFinalizer,
but it is blocked waiting on the thread named "Thread-5".
"Thread-5" in turn is blocked trying to enter org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal,
whose monitor is held by the "main" thread.
The "main" thread locked the org.apache.hadoop.hdfs.DFSClient$DFSOutputStream in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(),
and is hanging on the dataQueue, waiting for some thread to notify it.

In the normal case, closing the DFSOutputStream calls flushInternal(), and the dataQueue
is notified by the DataStreamer when the DataStreamer takes the next packet from the dataQueue.
The DFSOutputStream is then unlocked.
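
To make the handshake concrete, here is a minimal, self-contained sketch of
the pattern described above. The class is hypothetical, not the actual
DFSClient source; only the dataQueue/ackQueue/closed/clientRunning names
come from the code under discussion:

    import java.util.LinkedList;

    // Hypothetical sketch of the wait/notify handshake described above.
    // Queue and flag names follow the issue; everything else is invented.
    class StreamSketch {
        final LinkedList<byte[]> dataQueue = new LinkedList<byte[]>();
        final LinkedList<byte[]> ackQueue  = new LinkedList<byte[]>();
        volatile boolean closed = false;
        volatile boolean clientRunning = true;

        // Writer side: like flushInternal(), block until both queues drain.
        void flushInternal() throws InterruptedException {
            synchronized (dataQueue) {
                while (!closed && dataQueue.size() + ackQueue.size() > 0) {
                    dataQueue.wait();   // woken by the streamer/ack side
                }
            }
        }

        // DataStreamer side: take the next packet, wake any flushing writer.
        void streamOnePacket() {
            synchronized (dataQueue) {
                if (!dataQueue.isEmpty()) {
                    ackQueue.addLast(dataQueue.removeFirst());
                }
                dataQueue.notifyAll();
            }
        }

        // Ack side (elided in the issue text): an ack drains the ackQueue.
        void ackOnePacket() {
            synchronized (dataQueue) {
                if (!ackQueue.isEmpty()) {
                    ackQueue.removeFirst();
                }
                dataQueue.notifyAll();
            }
        }
    }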

Sometimes, when the Child java process gets the SIGTERM, it calls the ClientFinalizer:
FileSystem.closeAll() => CACHE.closeAll() => DistributedFileSystem.close() => clientRunning = false.
The DataStreamer thread then exits, because clientRunning is false. Meanwhile the DFSOutputStream
also goes into closeInternal() => flushInternal() and waits for the dataQueue to be notified.
But the DataStreamer has already exited, and the closed flag is still false (the closed flag
is only set to true after flushInternal() returns in closeInternal()), so flushInternal()
hangs forever.
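
Using the hypothetical sketch above, the bad interleaving can be shown in
miniature (illustrative only; running it hangs in flushInternal(), just like
the stuck task processes):

    // Illustrative driver for the StreamSketch above: clientRunning flips
    // to false first, the streamer exits without draining or notifying,
    // and flushInternal() then waits forever on a non-empty queue.
    public class HangDemo {
        public static void main(String[] args) throws Exception {
            final StreamSketch s = new StreamSketch();
            s.dataQueue.add(new byte[] { 42 });   // one unflushed packet

            Thread streamer = new Thread(new Runnable() {
                public void run() {
                    // Mirrors the DataStreamer loop condition in the issue.
                    while (s.clientRunning && !s.closed) {
                        s.streamOnePacket();
                    }
                    // Exits WITHOUT setting `closed` or notifying dataQueue.
                }
            });

            s.clientRunning = false;              // the SIGTERM path ran first
            streamer.start();
            streamer.join();

            s.flushInternal();                    // hangs: queues stay
                                                  // non-empty, nobody notifies
        }
    }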

Indeed, we should add a check of clientRunning to the flushInternal() function.
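
In terms of the sketch above, the guard would look something like this (the
idea only, not a reviewed patch against DFSClient):

    // Sketch of the suggested fix: flushInternal() also tests clientRunning,
    // so a shutdown that has already stopped the DataStreamer fails fast
    // instead of waiting for a notify that will never come.
    void flushInternal() throws java.io.IOException, InterruptedException {
        synchronized (dataQueue) {
            while (clientRunning && !closed
                    && dataQueue.size() + ackQueue.size() > 0) {
                dataQueue.wait();
            }
            if (!clientRunning) {
                throw new java.io.IOException("Filesystem closed");
            }
        }
    }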

> Task processes not exiting due to ackQueue bug in DFSClient
> -----------------------------------------------------------
>
>                 Key: HADOOP-5182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5182
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0, 0.21.0
>         Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision 734915
>            Reporter: Andy Konwinski
>         Attachments: jstack_ackqueue_bug
>
>
> I was running some gridmix tests on a 10-node cluster on EC2 and ran into an issue with
> unmodified Hadoop trunk (SVN revision 734915). After running gridmix multiple times, I
> noticed several mapreduce jobs stuck in the running state. They have remained in that hung
> state for several days, and are still in that state, while other gridmixes of that size
> finished in approximately 8 hours.
> I saw that the slave nodes had a bunch of hung task processes running, for tasks that
> the JobTracker log said were completed. These were hanging because the SIGTERM handler was
> waiting on DFSClient to close existing streams, but this never finished because
> DFSOutputStream waits on its ackQueue, which was apparently getting no acks from the
> datanodes. The tasks did finish their work, but the processes hung around.
> I've attached a sample jstack trace - note how the SIGTERM handler is blocked on "Thread-5",
> which is waiting for a monitor on the DFSOutputStream, but this stream's monitor is held by
> main, which is trying to flush the stream (last trace in the file).
> Has anyone seen this issue before?
> Another thing I saw was cleanup tasks never running (they are stuck in the initializing
> state on the web UI and can't be seen as running processes on the nodes). Not sure if that
> is actually related.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

