Return-Path: Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org Received: (qmail 46851 invoked from network); 2 Jun 2007 17:56:40 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 2 Jun 2007 17:56:40 -0000 Received: (qmail 50773 invoked by uid 500); 2 Jun 2007 17:56:44 -0000 Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org Received: (qmail 50740 invoked by uid 500); 2 Jun 2007 17:56:44 -0000 Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-dev@lucene.apache.org Received: (qmail 50729 invoked by uid 99); 2 Jun 2007 17:56:44 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 02 Jun 2007 10:56:44 -0700 X-ASF-Spam-Status: No, hits=2.0 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: domain of mike.smith.dev@gmail.com designates 66.249.92.168 as permitted sender) Received: from [66.249.92.168] (HELO ug-out-1314.google.com) (66.249.92.168) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 02 Jun 2007 10:56:38 -0700 Received: by ug-out-1314.google.com with SMTP id m2so488139uge for ; Sat, 02 Jun 2007 10:56:16 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:mime-version:content-type; b=XDVqhO8wWxo8pfZFV/Yav5YmetvSWog3q/CzUaedwbhaWxtsBKnqDdOoHslxkoJnzRIpWO3oVq8fX1ZksVI9yez0ORilbfwiAEX9F7ldbR6DBjR8Ev0A9iey5g7QRljUDSbCByfhaPw5Pm751yMMRKMZSF5FD8PlnR8l1nyiJLM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:mime-version:content-type; b=KWQOosjzIFFD9NCOl2s3BTwRQb7R8c+S+0U5jyUEah9BDm0NZmUGvgvD7G1rwDMB1LqolC6E7ivFRy7/fOeWqv9nnqTu1IhVUG4i1tuLv3J8PiTNmyseakLvWQ55AwTa2MBKquvhqG6TLHOi5GTqtmcw/xj9GqIVIuCD4fU/pHI= Received: by 10.78.204.1 with SMTP id b1mr1555161hug.1180806976086; Sat, 02 Jun 2007 10:56:16 -0700 (PDT) Received: by 10.78.180.5 with HTTP; Sat, 2 Jun 2007 10:56:16 -0700 (PDT) Message-ID: <93a31b230706021056l4d1d415cj746353d231730e2a@mail.gmail.com> Date: Sat, 2 Jun 2007 10:56:16 -0700 From: "Mike Smith" To: hadoop-dev@lucene.apache.org Subject: reducers stock and hang during copying map output for ever MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_7968_2565654.1180806976055" X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_7968_2565654.1180806976055 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline Hi, I am using hadoop 0.12.3 release. Sometimes the reducer stocks in the middle of copying map output and hangs there for ever in an infinite loop. The task will not fail because jobtracker receives heartbeat, but the copying will not progress. In that case I need to manually kill the task in order to make the job progress. Here is the tasktracker log: 2007-06-02 13:33:13,335 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Got 3 known map output location(s); scheduling... 2007-06-02 13:33:13,335 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup hosts) 2007-06-02 13:33:14,034 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:15,044 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:16,054 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:17,064 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:18,074 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:18,343 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Need 3 map output(s) 2007-06-02 13:33:18,344 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Need 3 map output location(s) 2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures 2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Got 3 known map output location(s); scheduling... 2007-06-02 13:33:18,345 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup hosts) 2007-06-02 13:33:19,084 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:20,094 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:21,104 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:22,114 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:23,124 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:23,353 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Need 3 map output(s) 2007-06-02 13:33:23,354 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Need 3 map output location(s) 2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures 2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Got 3 known map output location(s); scheduling... 2007-06-02 13:33:23,355 INFO org.apache.hadoop.mapred.TaskRunner: task_0023_r_000085_0 Scheduled 0 of 3 known outputs (3 slow hosts and 0 dup hosts) 2007-06-02 13:33:24,134 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > 2007-06-02 13:33:25,144 INFO org.apache.hadoop.mapred.TaskTracker: task_0023_r_000085_0 0.33295453% reduce > copy (2637 of 2640 at 0.00 MB/s) > I also dumped the core, here is the core dump when the reducer stock: Full thread dump Java HotSpot(TM) Client VM (1.5.0_11-b03 mixed mode): "IPC Client connection to jobtracker/jobtracker:50002" daemon prio=1 tid=0x096f29d8 nid=0x76cc in Object.wait() [0x4cfdc000..0x4cfdcf40] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java :213) - locked <0x4f13b5c0> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:252) "SocketListener0-19" prio=1 tid=0x0a61ccf8 nid=0x49ed in Object.wait() [0x4b08a000..0x4b08af40] at java.lang.Object.wait(Native Method) at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522) - locked <0x567ba450> (a org.mortbay.util.ThreadPool$PoolThread) "Map output copy reporter for task task_0023_r_000085_0" daemon prio=1 tid=0x0a3c9ca8 nid=0x49c2 waiting on condition [0x4b28e000..0x4b28f040] at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.ReduceTaskRunner$PingTimer.run( ReduceTaskRunner.java:189) "MapOutputCopier task_0023_r_000085_0.294" prio=1 tid=0x0971a398 nid=0x49c0 in Object.wait() [0x4af07000..0x4af07f40] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run( ReduceTaskRunner.java:267) - locked <0x567557b8> (a java.util.ArrayList) "MapOutputCopier task_0023_r_000085_0.292" prio=1 tid=0x096ff088 nid=0x49be in Object.wait() [0x4af88000..0x4af88e40] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run( ReduceTaskRunner.java:267) - locked <0x567557b8> (a java.util.ArrayList) "MapOutputCopier task_0023_r_000085_0.290" prio=1 tid=0x09bd99f8 nid=0x49bc in Object.wait() [0x4c1ad000..0x4c1ad140] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run( ReduceTaskRunner.java:267) - locked <0x567557b8> (a java.util.ArrayList) "MapOutputCopier task_0023_r_000085_0.288" prio=1 tid=0x0a314ce0 nid=0x49ba in Object.wait() [0x4bca2000..0x4bca3040] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run( ReduceTaskRunner.java:267) - locked <0x567557b8> (a java.util.ArrayList) "MapOutputCopier task_0023_r_000085_0.286" prio=1 tid=0x093d2c18 nid=0x49b8 in Object.wait() [0x4d15f000..0x4d15ff40] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run( ReduceTaskRunner.java:267) - locked <0x567557b8> (a java.util.ArrayList) "Thread-27723" prio=1 tid=0x093d29a8 nid=0x49b7 waiting on condition [0x4b99c000..0x4b99cec0] at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.ReduceTaskRunner.prepare( ReduceTaskRunner.java:572) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:143) "MultiThreadedHttpConnectionManager cleanup" daemon prio=1 tid=0x0860d368 nid=0x7e66 in Object.wait() [0x4d1e0000..0x4d1e1040] at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked <0x56418ca8> (a java.lang.ref.ReferenceQueue$Lock) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run (MultiThreadedHttpConnectionManager.java:1082) "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1 tid=0x085c4fb0 nid=0x7e64 waiting on condition [0x4d2e2000..0x4d2e2f40] at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java :401) "IPC Server handler 1 on 50050" daemon prio=1 tid=0x085bbaf0 nid=0x7e63 in Object.wait() [0x4d363000..0x4d363ec0] at java.lang.Object.wait(Native Method) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:533) - locked <0x5640dcd8> (a java.util.LinkedList) "IPC Server handler 0 on 50050" daemon prio=1 tid=0x085c3fc8 nid=0x7e62 in Object.wait() [0x4d3e4000..0x4d3e4e40] at java.lang.Object.wait(Native Method) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:533) - locked <0x5640dcd8> (a java.util.LinkedList) "IPC Server listener on 50050" daemon prio=1 tid=0x085c3ab0 nid=0x7e61 runnable [0x4d465000..0x4d4661c0] at sun.nio.ch.PollArrayWrapper.poll0(Native Method) at sun.nio.ch.PollArrayWrapper.poll(PollArrayWrapper.java:100) at sun.nio.ch.PollSelectorImpl.doSelect(PollSelectorImpl.java:56) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked <0x5640de48> (a sun.nio.ch.Util$1) - locked <0x5640de58> (a java.util.Collections$UnmodifiableSet) - locked <0x5640ddf8> (a sun.nio.ch.PollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:253) "Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]" prio=1 tid=0x0859f278 nid=0x7e5f runnable [0x4d56f000..0x4d5700c0] at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384) - locked <0x564028b8> (a java.net.SocksSocketImpl) at java.net.ServerSocket.implAccept(ServerSocket.java:450) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java :432) at org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java :631) "SessionScavenger" daemon prio=1 tid=0x08599fc8 nid=0x7e5e waiting on condition [0x4d5f0000..0x4d5f1040] at java.lang.Thread.sleep(Native Method) at org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run( AbstractSessionManager.java:587) "taskCleanup" daemon prio=1 tid=0x082f4f60 nid=0x7e5b waiting on condition [0x4d701000..0x4d701fc0] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:118) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await( AbstractQueuedSynchronizer.java:1767) at java.util.concurrent.LinkedBlockingQueue.take( LinkedBlockingQueue.java:359) at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:181) at java.lang.Thread.run(Thread.java:595) "Low Memory Detector" daemon prio=1 tid=0x080c99b8 nid=0x7e59 runnable [0x00000000..0x00000000] "CompilerThread0" daemon prio=1 tid=0x080c8408 nid=0x7e58 waiting on condition [0x00000000..0x4e76b828] "Signal Dispatcher" daemon prio=1 tid=0x080c7580 nid=0x7e57 waiting on condition [0x00000000..0x00000000] "Finalizer" daemon prio=1 tid=0x080bcf00 nid=0x7e53 in Object.wait() [0x4e86e000..0x4e86e140] at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked <0x562d2798> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=1 tid=0x080bc238 nid=0x7e52 in Object.wait() [0x4e8ee000..0x4e8ef0c0] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x562d2818> (a java.lang.ref.Reference$Lock) "main" prio=1 tid=0x0805dc50 nid=0x7e3b in Object.wait() [0xbfc72000..0xbfc725c8] at java.lang.Object.wait(Native Method) at org.apache.hadoop.mapred.TaskTracker.offerService( TaskTracker.java:509) - locked <0x56311618> (a [I) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:899) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1589) "VM Thread" prio=1 tid=0x080b9540 nid=0x7e51 runnable "VM Periodic Task Thread" prio=1 tid=0x080cae98 nid=0x7e5a waiting on condition ------=_Part_7968_2565654.1180806976055--