hadoop-common-user mailing list archives

From Mice <mice1...@gmail.com>
Subject Re: Task Random Fail
Date Fri, 24 Oct 2008 16:10:13 GMT
What maximum number of mappers and reducers did you configure?
It looks like your TaskRunner is timing out while waiting for an RPC response.
You may want to try increasing "mapred.job.tracker.handler.count".
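
For reference, a minimal sketch of how that override might look in conf/hadoop-site.xml on the JobTracker node. The value 20 is only an illustrative assumption, not a tuned recommendation (the shipped default should be 10), and the JobTracker has to be restarted to pick it up:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Number of server threads the JobTracker uses to answer RPC calls. -->
  <!-- 20 is a hypothetical starting point; adjust for your cluster size. -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>20</value>
  </property>
</configuration>
```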

2008/10/22, Zhou, Yunqing <azurezyq@gmail.com>:
> Recently, tasks on our cluster have been failing randomly (both map tasks
> and reduce tasks). When rerun, they all succeed.
> The whole job is I/O-bound (250G input, 500G map output, and
> 10G final output).
> From the jobtracker, I can see that the failed task reports:
>    task_200810220830_0004_m_000653_0
> tip_200810220830_0004_m_000653<http://hadoop5:50030/taskdetails.jsp?jobid=job_200810220830_0004&tipid=tip_200810220830_0004_m_000653>
>  vidi-005 <http://vidi-005:50060/>
>  FAILED
>  java.io.IOException: Task process exit with nonzero status of 65.
>    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
>    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
>  Last 4KB <http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-4097>
>  Last 8KB <http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-8193>
>  All <http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0>
> and the log says (follow the link in the right-most column):
>
>  Task Logs: 'task_200810220830_0004_m_000653_0'
>
> *stdout logs*
>
> ------------------------------
>
>
> *stderr logs*
>
> ------------------------------
>
>
> *syslog logs*
>
> 2008-10-22 19:59:51,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2008-10-22 19:59:59,507 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 26
> 2008-10-22 20:12:25,968 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:559)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
> 	at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:13:29,015 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:559)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
> 	at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:14:32,030 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:559)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
> 	at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:14:32,781 INFO org.apache.hadoop.mapred.TaskRunner:
> Process Thread Dump: Communication exception
> 9 active threads
> Thread 13 (Comm thread for task_200810220830_0004_m_000653_0):
>   State: RUNNABLE
>   Blocked count: 2
>   Waited count: 430
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:168)
>     org.apache.hadoop.mapred.Task$1.run(Task.java:338)
>     java.lang.Thread.run(Thread.java:619)
> Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@16b8f8eb):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 872
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:763)
>     java.lang.Thread.run(Thread.java:619)
> Thread 11 (IPC Client connection to hadoop5/192.168.4.105:9000):
>   State: WAITING
>   Blocked count: 0
>   Waited count: 2
>   Waiting on org.apache.hadoop.ipc.Client$Connection@a2bccb2
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:247)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:286)
> Thread 9 (IPC Client connection to /127.0.0.1:49078):
>   State: RUNNABLE
>   Blocked count: 5
>   Waited count: 214
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:237)
>     org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
>     org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
>     org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
>     java.io.FilterInputStream.read(FilterInputStream.java:116)
>     org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.DataInputStream.readInt(DataInputStream.java:370)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
> Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
>   State: TIMED_WAITING
>   Blocked count: 1
>   Waited count: 890
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:435)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 6
>   Waited count: 101
>   Waiting on java.lang.ref.ReferenceQueue$Lock@750e687b
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 104
>   Waiting on java.lang.ref.Reference$Lock@c73f0d8
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 4
>   Waited count: 137
>   Stack:
>     java.io.DataInputStream.readInt(DataInputStream.java:372)
>     org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1973)
>     org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
>     org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2760)
>     org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2625)
>     org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2859)
>     org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>     org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>     org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>     org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>     org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>
> 2008-10-22 20:14:32,782 WARN org.apache.hadoop.mapred.TaskRunner: Last
> retry, killing task_200810220830_0004_m_000653_0
>
> ------------------------------
>
> Has anyone seen such a failure?
>
> System Settings:
> RHEL 5.1 x64, 8G RAM, Athlon 64 X2 4400+
> 13 machines
> hadoop 0.17.1
> java version "1.6.0_05"
> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 10.0-b19, mixed mode)
>
> Thanks
>
