hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate
Date Thu, 18 Oct 2007 18:33:51 GMT
extensive map tasks failures because of SocketTimeoutException during statusUpdate
----------------------------------------------------------------------------------

                 Key: HADOOP-2076
                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.16.0
         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
            Reporter: Christian Kunz


A job with 3600 tasks on a cluster of 1350 nodes (up 3 tasks per node) shows extensive map
tasks failures because of connection timeouts at the end of the task (c++ application using
pipes interface completed successfully)
More than 600 tasks failed, slowing down the job because of retries. Only a portion of the
tasks fail because of the timeout issue, but they spawn other failures because retries and
speculatively executed tasks cannot even get  a connection and fail just after a few seconds.

JobTracker is running with 60 handlers. We allow up to 10 attempts for maps.

I attach the log of a task failing because of timeout (which includes a thread dump), and
the log of one task which could not start.

2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics
with processName=MAP, sessionId=
2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop
library
2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication
exception
8 active threads
Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 4128
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
    org.apache.hadoop.mapred.Task$1.run(Task.java:315)
    java.lang.Thread.run(Thread.java:619)
Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@141b571):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 6403
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
    java.lang.Thread.run(Thread.java:619)
Thread 9 (IPC Client connection to /127.0.0.1:49458):
  State: RUNNABLE
  Blocked count: 21
  Waited count: 2063
  Stack:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.read(SocketInputStream.java:129)
    java.io.FilterInputStream.read(FilterInputStream.java:116)
    org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.DataInputStream.readInt(DataInputStream.java:370)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 6402
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 398
  Waited count: 2270
  Waiting on java.lang.ref.ReferenceQueue$Lock@c278b5
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 257
  Waited count: 2269
  Waiting on java.lang.ref.Reference$Lock@1c66ec7
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 1
  Waited count: 10
  Stack:
    java.io.FileInputStream.readBytes(Native Method)
    java.io.FileInputStream.read(FileInputStream.java:199)
    org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    java.io.DataInputStream.read(DataInputStream.java:132)
    org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
    org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
    org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
    org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
    org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
    org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
    org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
    org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
    org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
    org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1217)
    org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1142)

2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0


Log of task that could not start:
2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 1 time(s).
2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 2 time(s).
2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 3 time(s).
2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 4 time(s).
2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 5 time(s).
2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 6 time(s).
2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 7 time(s).
2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 8 time(s).
2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 9 time(s).
2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972.
Already tried 10 time(s).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message