hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate
Date Fri, 02 Nov 2007 16:15:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539627 ]

Christian Kunz commented on HADOOP-2076:
----------------------------------------

Even with 30 Jetty threads (instead of the default 40) this still happens frequently,
generating a very long tail for the job. I noticed that during that time the nodes
running map tasks are disk bound.
I will now try with 20 Jetty threads, but the protocol between Task and TaskTracker
should be fixed to use retries.
I am making this a blocker because it slows us down tremendously.
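The fix asked for above could look roughly like the sketch below: wrap the RPC call in a retry loop with exponential backoff instead of failing on the first SocketTimeoutException. The `RpcCall` interface and `callWithRetries` names here are hypothetical illustrations, not Hadoop API:

```java
import java.io.IOException;

// Illustrative sketch only: a generic retry wrapper of the kind the
// Task -> TaskTracker statusUpdate() protocol could use. Names are made up.
public class RetryingUpdate {

    interface RpcCall<T> {
        T call() throws IOException;
    }

    // Retries the call up to maxRetries times after the first failure,
    // sleeping backoffMs, 2*backoffMs, 4*backoffMs, ... between attempts.
    static <T> T callWithRetries(RpcCall<T> rpc, int maxRetries, long backoffMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return rpc.call();
            } catch (IOException e) {          // e.g. SocketTimeoutException
                last = e;
                Thread.sleep(backoffMs << attempt);  // exponential backoff
            }
        }
        throw last;                            // all attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        // Usage: a call that fails twice before succeeding still completes.
        final int[] calls = {0};
        boolean ok = callWithRetries(() -> {
            if (++calls[0] < 3) throw new IOException("timed out");
            return true;
        }, 5, 1L);
        System.out.println("succeeded after " + calls[0] + " attempts: " + ok);
    }
}
```

With such a wrapper a transient timeout would cost a short delay instead of killing the attempt outright, which is what turns one slow TaskTracker into hundreds of failed maps here.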

> extensive map tasks failures because of SocketTimeoutException during statusUpdate
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
>            Reporter: Christian Kunz
>
> A job with 3600 tasks on a cluster of 1350 nodes (up to 3 tasks per node) shows extensive map task failures because of connection timeouts at the end of the task (the c++ application using the pipes interface completed successfully).
> More than 600 tasks failed, slowing down the job because of retries. Only a portion of the tasks fail because of the timeout issue, but they spawn other failures, because retries and speculatively executed tasks cannot even get a connection and fail after just a few seconds.
> The JobTracker is running with 60 handlers. We allow up to 10 attempts for maps.
> I attach the log of a task failing because of timeout (which includes a thread dump), and the log of one task which could not start.
> 2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication exception
> 8 active threads
> Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 4128
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
>     org.apache.hadoop.mapred.Task$1.run(Task.java:315)
>     java.lang.Thread.run(Thread.java:619)
> Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@141b571):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6403
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
>     java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
>   State: RUNNABLE
>   Blocked count: 21
>   Waited count: 2063
>   Stack:
>     java.net.SocketInputStream.socketRead0(Native Method)
>     java.net.SocketInputStream.read(SocketInputStream.java:129)
>     java.io.FilterInputStream.read(FilterInputStream.java:116)
>     org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.DataInputStream.readInt(DataInputStream.java:370)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6402
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 398
>   Waited count: 2270
>   Waiting on java.lang.ref.ReferenceQueue$Lock@c278b5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 257
>   Waited count: 2269
>   Waiting on java.lang.ref.Reference$Lock@1c66ec7
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 1
>   Waited count: 10
>   Stack:
>     java.io.FileInputStream.readBytes(Native Method)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.DataInputStream.read(DataInputStream.java:132)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
>     org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
>     org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
>     org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
>     org.apache.hadoop.io.SequenceFile$Reader.&lt;init&gt;(SequenceFile.java:1217)
>     org.apache.hadoop.io.SequenceFile$Reader.&lt;init&gt;(SequenceFile.java:1142)
> 2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 10 time(s).
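For reference, the knobs discussed in this issue (Jetty threads, JobTracker handlers, map attempts) map to configuration properties along these lines; the property names follow Hadoop releases of this era, and the values shown merely mirror the numbers mentioned above, as an illustration rather than a recommendation:

```xml
<!-- hadoop-site.xml fragment (illustrative values only) -->
<property>
  <name>tasktracker.http.threads</name>          <!-- Jetty worker threads; default 40 -->
  <value>20</value>
</property>
<property>
  <name>mapred.job.tracker.handler.count</name>  <!-- JobTracker RPC handler threads -->
  <value>60</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>           <!-- attempts allowed per map task -->
  <value>10</value>
</property>
```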

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

