hadoop-common-user mailing list archives

From "Zhou, Yunqing" <azure...@gmail.com>
Subject Task Random Fail
Date Wed, 22 Oct 2008 14:05:40 GMT
Recently, tasks on our cluster have been failing at random (both map tasks
and reduce tasks). When we rerun them, they all succeed.
The whole job is I/O-bound (250G input, 500G map output, and 10G final
output).
In the JobTracker, the failed task shows:
   task_200810220830_0004_m_000653_0
 tip_200810220830_0004_m_000653<http://hadoop5:50030/taskdetails.jsp?jobid=job_200810220830_0004&tipid=tip_200810220830_0004_m_000653>
 vidi-005 <http://vidi-005:50060/>
 FAILED
 java.io.IOException: Task process exit with nonzero status of 65.
 	at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
 Last 4KB<http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-4097>
Last 8KB<http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-8193>
All <http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0>
The task log (reached through the links in the right-most column) says:

 Task Logs: 'task_200810220830_0004_m_000653_0'

*stdout logs*

------------------------------


*stderr logs*

------------------------------


*syslog logs*

2008-10-22 19:59:51,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2008-10-22 19:59:59,507 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 26
2008-10-22 20:12:25,968 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:559)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
	at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:13:29,015 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:559)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
	at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:14:32,030 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:559)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
	at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:14:32,781 INFO org.apache.hadoop.mapred.TaskRunner:
Process Thread Dump: Communication exception
9 active threads
Thread 13 (Comm thread for task_200810220830_0004_m_000653_0):
  State: RUNNABLE
  Blocked count: 2
  Waited count: 430
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:168)
    org.apache.hadoop.mapred.Task$1.run(Task.java:338)
    java.lang.Thread.run(Thread.java:619)
Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@16b8f8eb):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 872
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:763)
    java.lang.Thread.run(Thread.java:619)
Thread 11 (IPC Client connection to hadoop5/192.168.4.105:9000):
  State: WAITING
  Blocked count: 0
  Waited count: 2
  Waiting on org.apache.hadoop.ipc.Client$Connection@a2bccb2
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:247)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:286)
Thread 9 (IPC Client connection to /127.0.0.1:49078):
  State: RUNNABLE
  Blocked count: 5
  Waited count: 214
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:237)
    org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
    org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
    org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
    java.io.FilterInputStream.read(FilterInputStream.java:116)
    org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.DataInputStream.readInt(DataInputStream.java:370)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 890
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:435)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 6
  Waited count: 101
  Waiting on java.lang.ref.ReferenceQueue$Lock@750e687b
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 1
  Waited count: 104
  Waiting on java.lang.ref.Reference$Lock@c73f0d8
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 4
  Waited count: 137
  Stack:
    java.io.DataInputStream.readInt(DataInputStream.java:372)
    org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1973)
    org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
    org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2760)
    org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2625)
    org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2859)
    org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
    org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
    org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
    org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

2008-10-22 20:14:32,782 WARN org.apache.hadoop.mapred.TaskRunner: Last
retry, killing task_200810220830_0004_m_000653_0

------------------------------

Has anyone seen such a failure?

System Settings:
RHEL 5.1 x64, 8G RAM, Athlon 64 X2 4400+
13 machines
Hadoop 0.17.1
java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b19, mixed mode)

Thanks
