hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Smith" <mike.smith....@gmail.com>
Subject reducers stock and hang during copying map output for ever
Date Sat, 02 Jun 2007 17:56:16 GMT
Hi,



I am using hadoop 0.12.3 release. Sometimes the reducer stocks in the middle
of copying map output and hangs there for ever in an infinite loop. The task
will not fail because jobtracker receives heartbeat, but the copying will
not progress. In that case I need to manually kill the task in order to make
the job progress.  Here is the tasktracker log:





2007-06-02 13:33:13,335 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Got 3 known map output location(s); scheduling...

2007-06-02 13:33:13,335 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup
hosts)

2007-06-02 13:33:14,034 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:15,044 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:16,054 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:17,064 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:18,074 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:18,343 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Need 3 map output(s)

2007-06-02 13:33:18,344 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Need 3 map output location(s)

2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Got 0 new map outputs from jobtracker and 0 map outputs
from previous failures

2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Got 3 known map output location(s); scheduling...

2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup
hosts)

2007-06-02 13:33:19,084 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:20,094 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:21,104 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:22,114 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:23,124 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:23,353 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Need 3 map output(s)

2007-06-02 13:33:23,354 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Need 3 map output location(s)

2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Got 0 new map outputs from jobtracker and 0 map outputs
from previous failures

2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Got 3 known map output location(s); scheduling...

2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner:
task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup
hosts)

2007-06-02 13:33:24,134 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >

2007-06-02 13:33:25,144 INFO org.apache.hadoop.mapred.TaskTracker:
task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) >









I also dumped the core, here is the core dump when the reducer stock:



Full thread dump Java HotSpot(TM) Client VM (1.5.0_11-b03 mixed mode):



"IPC Client connection to jobtracker/jobtracker:50002" daemon prio=1
tid=0x096f29d8 nid=0x76cc in Object.wait() [0x4cfdc000..0x4cfdcf40]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java
:213)

        - locked <0x4f13b5c0> (a org.apache.hadoop.ipc.Client$Connection)

        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)



"SocketListener0-19" prio=1 tid=0x0a61ccf8 nid=0x49ed in Object.wait()
[0x4b08a000..0x4b08af40]

        at java.lang.Object.wait(Native Method)

        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)

        - locked <0x567ba450> (a org.mortbay.util.ThreadPool$PoolThread)



"Map output copy reporter for task task_0023_r_000085_0" daemon prio=1
tid=0x0a3c9ca8 nid=0x49c2 waiting on condition [0x4b28e000..0x4b28f040]

        at java.lang.Thread.sleep(Native Method)

        at org.apache.hadoop.mapred.ReduceTaskRunner$PingTimer.run(
ReduceTaskRunner.java:189)



"MapOutputCopier task_0023_r_000085_0.294" prio=1 tid=0x0971a398 nid=0x49c0
in Object.wait() [0x4af07000..0x4af07f40]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(
ReduceTaskRunner.java:267)

        - locked <0x567557b8> (a java.util.ArrayList)



"MapOutputCopier task_0023_r_000085_0.292" prio=1 tid=0x096ff088 nid=0x49be
in Object.wait() [0x4af88000..0x4af88e40]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(
ReduceTaskRunner.java:267)

        - locked <0x567557b8> (a java.util.ArrayList)



"MapOutputCopier task_0023_r_000085_0.290" prio=1 tid=0x09bd99f8 nid=0x49bc
in Object.wait() [0x4c1ad000..0x4c1ad140]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(
ReduceTaskRunner.java:267)

        - locked <0x567557b8> (a java.util.ArrayList)



"MapOutputCopier task_0023_r_000085_0.288" prio=1 tid=0x0a314ce0 nid=0x49ba
in Object.wait() [0x4bca2000..0x4bca3040]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(
ReduceTaskRunner.java:267)

        - locked <0x567557b8> (a java.util.ArrayList)



"MapOutputCopier task_0023_r_000085_0.286" prio=1 tid=0x093d2c18 nid=0x49b8
in Object.wait() [0x4d15f000..0x4d15ff40]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(
ReduceTaskRunner.java:267)

        - locked <0x567557b8> (a java.util.ArrayList)



"Thread-27723" prio=1 tid=0x093d29a8 nid=0x49b7 waiting on condition
[0x4b99c000..0x4b99cec0]

        at java.lang.Thread.sleep(Native Method)

        at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(
ReduceTaskRunner.java:572)

        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:143)



"MultiThreadedHttpConnectionManager cleanup" daemon prio=1 tid=0x0860d368
nid=0x7e66 in Object.wait() [0x4d1e0000..0x4d1e1040]

        at java.lang.Object.wait(Native Method)

        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)

        - locked <0x56418ca8> (a java.lang.ref.ReferenceQueue$Lock)

        at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run
(MultiThreadedHttpConnectionManager.java:1082)



"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1
tid=0x085c4fb0 nid=0x7e64 waiting on condition [0x4d2e2000..0x4d2e2f40]

        at java.lang.Thread.sleep(Native Method)

        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java
:401)



"IPC Server handler 1 on 50050" daemon prio=1 tid=0x085bbaf0 nid=0x7e63 in
Object.wait() [0x4d363000..0x4d363ec0]

        at java.lang.Object.wait(Native Method)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:533)

        - locked <0x5640dcd8> (a java.util.LinkedList)



"IPC Server handler 0 on 50050" daemon prio=1 tid=0x085c3fc8 nid=0x7e62 in
Object.wait() [0x4d3e4000..0x4d3e4e40]

        at java.lang.Object.wait(Native Method)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:533)

        - locked <0x5640dcd8> (a java.util.LinkedList)



"IPC Server listener on 50050" daemon prio=1 tid=0x085c3ab0 nid=0x7e61
runnable [0x4d465000..0x4d4661c0]

        at sun.nio.ch.PollArrayWrapper.poll0(Native Method)

        at sun.nio.ch.PollArrayWrapper.poll(PollArrayWrapper.java:100)

        at sun.nio.ch.PollSelectorImpl.doSelect(PollSelectorImpl.java:56)

        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)

        - locked <0x5640de48> (a sun.nio.ch.Util$1)

        - locked <0x5640de58> (a java.util.Collections$UnmodifiableSet)

        - locked <0x5640ddf8> (a sun.nio.ch.PollSelectorImpl)

        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)

        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:253)





"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]" prio=1
tid=0x0859f278 nid=0x7e5f runnable [0x4d56f000..0x4d5700c0]

        at java.net.PlainSocketImpl.socketAccept(Native Method)

        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)

        - locked <0x564028b8> (a java.net.SocksSocketImpl)

        at java.net.ServerSocket.implAccept(ServerSocket.java:450)

        at java.net.ServerSocket.accept(ServerSocket.java:421)

        at org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java
:432)

        at org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java
:631)



"SessionScavenger" daemon prio=1 tid=0x08599fc8 nid=0x7e5e waiting on
condition [0x4d5f0000..0x4d5f1040]

        at java.lang.Thread.sleep(Native Method)

        at
org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(
AbstractSessionManager.java:587)



"taskCleanup" daemon prio=1 tid=0x082f4f60 nid=0x7e5b waiting on condition
[0x4d701000..0x4d701fc0]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:118)

        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(
AbstractQueuedSynchronizer.java:1767)

        at java.util.concurrent.LinkedBlockingQueue.take(
LinkedBlockingQueue.java:359)

        at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:181)

        at java.lang.Thread.run(Thread.java:595)



"Low Memory Detector" daemon prio=1 tid=0x080c99b8 nid=0x7e59 runnable
[0x00000000..0x00000000]



"CompilerThread0" daemon prio=1 tid=0x080c8408 nid=0x7e58 waiting on
condition [0x00000000..0x4e76b828]



"Signal Dispatcher" daemon prio=1 tid=0x080c7580 nid=0x7e57 waiting on
condition [0x00000000..0x00000000]



"Finalizer" daemon prio=1 tid=0x080bcf00 nid=0x7e53 in Object.wait()
[0x4e86e000..0x4e86e140]

        at java.lang.Object.wait(Native Method)

        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)

        - locked <0x562d2798> (a java.lang.ref.ReferenceQueue$Lock)

        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)

        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)



"Reference Handler" daemon prio=1 tid=0x080bc238 nid=0x7e52 in Object.wait()
[0x4e8ee000..0x4e8ef0c0]

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:474)

        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

        - locked <0x562d2818> (a java.lang.ref.Reference$Lock)



"main" prio=1 tid=0x0805dc50 nid=0x7e3b in Object.wait()
[0xbfc72000..0xbfc725c8]

        at java.lang.Object.wait(Native Method)

        at org.apache.hadoop.mapred.TaskTracker.offerService(
TaskTracker.java:509)

        - locked <0x56311618> (a [I)

        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:899)

        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1589)



"VM Thread" prio=1 tid=0x080b9540 nid=0x7e51 runnable



"VM Periodic Task Thread" prio=1 tid=0x080cae98 nid=0x7e5a waiting on
condition

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message