hadoop-common-user mailing list archives

From Yunhong Gu1 <y...@cs.uic.edu>
Subject Re: reduce > copy ?? (was Reduce hangs)
Date Wed, 23 Jan 2008 05:05:59 GMT


This is the Java stack when the problem happens. Not sure if it is a 
deadlock. (Hadoop version 0.15.2)

Thread-25[1] threads
Group system:
   (java.lang.ref.Reference$ReferenceHandler)0x77b   Reference Handler   cond. waiting
   (java.lang.ref.Finalizer$FinalizerThread)0x77a    Finalizer           cond. waiting
   (java.lang.Thread)0x779                           Signal Dispatcher   running
Group main:
   (java.lang.Thread)0x1                             main                cond. waiting
   (java.lang.Thread)0x778                           taskCleanup         cond. waiting
   (org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger)0x777   SessionScavenger   running
   (org.mortbay.util.ThreadedServer$Acceptor)0x776   Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]   running
   (org.mortbay.util.ThreadPool$PoolThread)0x775     SocketListener0-0   cond. waiting
   (org.apache.hadoop.ipc.Server$Listener)0x774      IPC Server listener on 34516    running
   (org.apache.hadoop.ipc.Server$Handler)0x773       IPC Server handler 0 on 34516   cond. waiting
   (org.apache.hadoop.ipc.Server$Handler)0x772       IPC Server handler 1 on 34516   cond. waiting
   (org.apache.hadoop.ipc.Server$Handler)0x771       IPC Server handler 2 on 34516   cond. waiting
   (org.apache.hadoop.ipc.Server$Handler)0x770       IPC Server handler 3 on 34516   cond. waiting
   (org.apache.hadoop.ipc.Client$ConnectionCuller)0x76f   org.apache.hadoop.io.ObjectWritable Connection Culler   running
   (org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread)0x76e   Map-events fetcher for all reduce tasks on tracker_ncdm15:/10.0.0.2:34516   running
   (org.apache.hadoop.util.Daemon)0x76d              org.apache.hadoop.dfs.DFSClient$LeaseChecker@e0f0ad   running
   (org.apache.hadoop.mapred.ReduceTaskRunner)0x76c  Thread-25           cond. waiting
   (java.lang.UNIXProcess$1$1)0x76b                  process reaper      running
   (org.apache.hadoop.mapred.ReduceTaskRunner)0x7d0  Thread-110          cond. waiting
   (java.lang.UNIXProcess$1$1)0x7d1                  process reaper      running
   (org.apache.hadoop.ipc.Client$Connection)0x7f6    IPC Client connection to /10.0.0.1:60001   cond. waiting
Thread-25[1] thread 0x7d0
Thread-110[1] where
   [1] java.lang.Object.wait (native method)
   [2] java.lang.Object.wait (Object.java:485)
   [3] java.lang.UNIXProcess.waitFor (UNIXProcess.java:165)
   [4] org.apache.hadoop.mapred.TaskRunner.runChild (TaskRunner.java:477)
   [5] org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:343)
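
The dump above is jdb output. For anyone who wants to take a similar dump, here is a
rough sketch of the steps (the JDWP options and the port 8000 are only example values):

   # start the TaskTracker JVM with remote debugging enabled,
   # e.g. via HADOOP_OPTS in conf/hadoop-env.sh
   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000

   # attach and inspect
   jdb -attach <tasktracker-host>:8000
   threads          # list all threads (output as above)
   thread 0x7d0     # select the stuck ReduceTaskRunner thread
   where            # print its stack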

On Tue, 22 Jan 2008, Yunhong Gu1 wrote:

>
> Hi, All
>
> I have tried many possible configurations, and this is the deepest cause I
> can dig out so far. The whole "Reduce hangs" issue happens because one task
> tracker does not progress at all: it stays in "reduce > copy" forever, as
> shown below.
>
> This is very easy to reproduce on my machines (dual dual-core AMD Opteron at
> 3.0 GHz, Debian Linux, kernel 2.6, running in 32-bit mode).
>
> gu@ncdm-8:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
> 2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:03,627 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:09,644 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:15,661 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:18,670 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
> 2008-01-22 18:35:24,687 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 
> MB/s) >
>
>
> On Tue, 22 Jan 2008, Yunhong Gu1 wrote:
>
>> 
>> Thanks, I tried that, but it is probably not the reason. I checked the
>> network connection using "netstat", and the client is actually connected to
>> the correct server address. In addition, "mrbench" works sometimes; if it
>> were a network problem, nothing would work at all.
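>>
>> (Roughly the kind of check I mean, just as an illustration; 50060 is the
>> default TaskTracker HTTP port that serves map output, and the flags are one
>> common way to list the matching TCP connections:
>>
>>    netstat -tnp | grep 50060
>> )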
>> 
>> I let the "sort" program run longer and got some interesting output: the
>> reduce progress can actually decrease and oscillate (see the bottom part of
>> the output below).
>> 
>> The error information:
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>> 
>> What can cause "Error reading task outputncdm15"?
>> 
>> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-examples.jar sort 
>> rand randsort
>> Running on 2 nodes to sort from /user/gu/rand into /user/gu/randsort with 8 
>> reduces.
>> Job started: Tue Jan 22 11:56:39 CST 2008
>> 08/01/22 11:56:39 INFO mapred.FileInputFormat: Total input paths to 
>> process: 20
>> 08/01/22 11:56:40 INFO mapred.JobClient: Running job: job_200801221154_0005
>> 08/01/22 11:56:41 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/01/22 11:56:54 INFO mapred.JobClient:  map 1% reduce 0%
>> 08/01/22 11:56:58 INFO mapred.JobClient:  map 2% reduce 0%
>> 08/01/22 11:57:04 INFO mapred.JobClient:  map 3% reduce 0%
>> 08/01/22 11:57:08 INFO mapred.JobClient:  map 5% reduce 0%
>> 08/01/22 11:57:15 INFO mapred.JobClient:  map 7% reduce 0%
>> 08/01/22 11:57:23 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/01/22 11:57:25 INFO mapred.JobClient:  map 9% reduce 0%
>> 08/01/22 11:57:27 INFO mapred.JobClient:  map 10% reduce 0%
>> 08/01/22 11:57:34 INFO mapred.JobClient:  map 11% reduce 0%
>> 08/01/22 11:57:35 INFO mapred.JobClient:  map 12% reduce 0%
>> 08/01/22 11:57:42 INFO mapred.JobClient:  map 13% reduce 0%
>> 08/01/22 11:57:48 INFO mapred.JobClient:  map 14% reduce 0%
>> 08/01/22 11:57:49 INFO mapred.JobClient:  map 15% reduce 0%
>> 08/01/22 11:57:56 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/01/22 11:58:05 INFO mapred.JobClient:  map 19% reduce 0%
>> 08/01/22 11:58:12 INFO mapred.JobClient:  map 20% reduce 0%
>> 08/01/22 11:58:13 INFO mapred.JobClient:  map 20% reduce 1%
>> 08/01/22 11:58:14 INFO mapred.JobClient:  map 21% reduce 1%
>> 08/01/22 11:58:19 INFO mapred.JobClient:  map 22% reduce 1%
>> 08/01/22 11:58:26 INFO mapred.JobClient:  map 23% reduce 1%
>> 08/01/22 11:58:30 INFO mapred.JobClient:  map 24% reduce 1%
>> 08/01/22 11:58:32 INFO mapred.JobClient:  map 25% reduce 1%
>> 08/01/22 11:58:37 INFO mapred.JobClient:  map 26% reduce 1%
>> 08/01/22 11:58:42 INFO mapred.JobClient:  map 27% reduce 1%
>> 08/01/22 11:58:44 INFO mapred.JobClient:  map 28% reduce 1%
>> 08/01/22 11:58:49 INFO mapred.JobClient:  map 29% reduce 1%
>> 08/01/22 11:58:55 INFO mapred.JobClient:  map 30% reduce 1%
>> 08/01/22 11:58:57 INFO mapred.JobClient:  map 31% reduce 1%
>> 08/01/22 11:58:59 INFO mapred.JobClient:  map 32% reduce 1%
>> 08/01/22 11:59:07 INFO mapred.JobClient:  map 33% reduce 1%
>> 08/01/22 11:59:08 INFO mapred.JobClient:  map 34% reduce 1%
>> 08/01/22 11:59:15 INFO mapred.JobClient:  map 35% reduce 1%
>> 08/01/22 11:59:17 INFO mapred.JobClient:  map 36% reduce 2%
>> 08/01/22 11:59:20 INFO mapred.JobClient:  map 37% reduce 2%
>> 08/01/22 11:59:27 INFO mapred.JobClient:  map 38% reduce 2%
>> 08/01/22 11:59:29 INFO mapred.JobClient:  map 39% reduce 2%
>> 08/01/22 11:59:34 INFO mapred.JobClient:  map 40% reduce 2%
>> 08/01/22 11:59:38 INFO mapred.JobClient:  map 41% reduce 2%
>> 08/01/22 11:59:41 INFO mapred.JobClient:  map 42% reduce 2%
>> 08/01/22 11:59:42 INFO mapred.JobClient:  map 43% reduce 2%
>> 08/01/22 11:59:51 INFO mapred.JobClient:  map 44% reduce 2%
>> 08/01/22 11:59:57 INFO mapred.JobClient:  map 46% reduce 2%
>> 08/01/22 12:00:01 INFO mapred.JobClient:  map 47% reduce 2%
>> 08/01/22 12:00:06 INFO mapred.JobClient:  map 48% reduce 2%
>> 08/01/22 12:00:08 INFO mapred.JobClient:  map 49% reduce 2%
>> 08/01/22 12:00:14 INFO mapred.JobClient:  map 50% reduce 2%
>> 08/01/22 12:00:16 INFO mapred.JobClient:  map 51% reduce 2%
>> 08/01/22 12:00:22 INFO mapred.JobClient:  map 52% reduce 2%
>> 08/01/22 12:00:27 INFO mapred.JobClient:  map 53% reduce 3%
>> 08/01/22 12:00:30 INFO mapred.JobClient:  map 54% reduce 3%
>> 08/01/22 12:00:32 INFO mapred.JobClient:  map 55% reduce 3%
>> 08/01/22 12:00:37 INFO mapred.JobClient:  map 56% reduce 3%
>> 08/01/22 12:00:38 INFO mapred.JobClient:  map 57% reduce 3%
>> 08/01/22 12:00:44 INFO mapred.JobClient:  map 58% reduce 3%
>> 08/01/22 12:00:49 INFO mapred.JobClient:  map 59% reduce 3%
>> 08/01/22 12:00:52 INFO mapred.JobClient:  map 60% reduce 3%
>> 08/01/22 12:00:58 INFO mapred.JobClient:  map 61% reduce 3%
>> 08/01/22 12:00:59 INFO mapred.JobClient:  map 62% reduce 3%
>> 08/01/22 12:01:05 INFO mapred.JobClient:  map 63% reduce 3%
>> 08/01/22 12:01:08 INFO mapred.JobClient:  map 64% reduce 3%
>> 08/01/22 12:01:13 INFO mapred.JobClient:  map 65% reduce 3%
>> 08/01/22 12:01:14 INFO mapred.JobClient:  map 66% reduce 3%
>> 08/01/22 12:01:20 INFO mapred.JobClient:  map 67% reduce 3%
>> 08/01/22 12:01:25 INFO mapred.JobClient:  map 68% reduce 3%
>> 08/01/22 12:01:28 INFO mapred.JobClient:  map 69% reduce 3%
>> 08/01/22 12:01:35 INFO mapred.JobClient:  map 71% reduce 4%
>> 08/01/22 12:01:39 INFO mapred.JobClient:  map 72% reduce 4%
>> 08/01/22 12:01:46 INFO mapred.JobClient:  map 74% reduce 4%
>> 08/01/22 12:01:51 INFO mapred.JobClient:  map 75% reduce 4%
>> 08/01/22 12:01:54 INFO mapred.JobClient:  map 76% reduce 4%
>> 08/01/22 12:02:01 INFO mapred.JobClient:  map 77% reduce 4%
>> 08/01/22 12:02:04 INFO mapred.JobClient:  map 78% reduce 4%
>> 08/01/22 12:02:10 INFO mapred.JobClient:  map 79% reduce 4%
>> 08/01/22 12:02:13 INFO mapred.JobClient:  map 80% reduce 4%
>> 08/01/22 12:02:14 INFO mapred.JobClient:  map 81% reduce 4%
>> 08/01/22 12:02:17 INFO mapred.JobClient:  map 82% reduce 4%
>> 08/01/22 12:02:21 INFO mapred.JobClient:  map 83% reduce 4%
>> 08/01/22 12:02:29 INFO mapred.JobClient:  map 84% reduce 4%
>> 08/01/22 12:02:31 INFO mapred.JobClient:  map 85% reduce 4%
>> 08/01/22 12:02:36 INFO mapred.JobClient:  map 86% reduce 4%
>> 08/01/22 12:02:39 INFO mapred.JobClient:  map 87% reduce 4%
>> 08/01/22 12:02:43 INFO mapred.JobClient:  map 88% reduce 4%
>> 08/01/22 12:02:46 INFO mapred.JobClient:  map 89% reduce 4%
>> 08/01/22 12:02:50 INFO mapred.JobClient:  map 90% reduce 4%
>> 08/01/22 12:02:56 INFO mapred.JobClient:  map 91% reduce 5%
>> 08/01/22 12:02:58 INFO mapred.JobClient:  map 92% reduce 5%
>> 08/01/22 12:03:05 INFO mapred.JobClient:  map 93% reduce 5%
>> 08/01/22 12:03:07 INFO mapred.JobClient:  map 94% reduce 5%
>> 08/01/22 12:03:12 INFO mapred.JobClient:  map 96% reduce 5%
>> 08/01/22 12:03:20 INFO mapred.JobClient:  map 97% reduce 5%
>> 08/01/22 12:03:24 INFO mapred.JobClient:  map 98% reduce 5%
>> 08/01/22 12:03:28 INFO mapred.JobClient:  map 99% reduce 5%
>> 08/01/22 12:03:35 INFO mapred.JobClient:  map 100% reduce 5%
>> 08/01/22 12:03:45 INFO mapred.JobClient:  map 100% reduce 6%
>> 08/01/22 12:04:05 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/01/22 12:04:25 INFO mapred.JobClient:  map 100% reduce 8%
>> 08/01/22 12:04:52 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/01/22 12:05:15 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/01/22 12:05:43 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/01/22 12:06:13 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/01/22 12:06:32 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/01/22 12:07:02 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/01/22 12:07:25 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/01/22 12:07:52 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/01/22 13:56:21 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/01/22 13:56:21 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000007_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:08:11 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/01/22 14:08:11 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000005_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:20:09 INFO mapred.JobClient:  map 100% reduce 8%
>> 08/01/22 14:20:09 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000004_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:23:49 INFO mapred.JobClient:  map 100% reduce 5%
>> 08/01/22 14:23:49 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000006_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 15:10:19 INFO mapred.JobClient:  map 100% reduce 4%
>> 08/01/22 15:10:19 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:10:42 INFO mapred.JobClient:  map 100% reduce 5%
>> 08/01/22 15:10:59 INFO mapred.JobClient:  map 100% reduce 6%
>> 08/01/22 15:11:22 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/01/22 15:11:55 INFO mapred.JobClient:  map 100% reduce 8%
>> 08/01/22 15:13:39 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/01/22 15:13:39 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:14:09 INFO mapred.JobClient:  map 100% reduce 8%
>> 08/01/22 15:14:29 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/01/22 15:14:52 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/01/22 15:15:33 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/01/22 15:37:01 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/01/22 15:37:01 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:37:13 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/01/22 15:37:33 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/01/22 15:37:53 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/01/22 15:38:23 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/01/22 15:45:51 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/01/22 15:45:51 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:46:11 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/01/22 15:46:31 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/01/22 15:46:51 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/01/22 15:47:24 INFO mapred.JobClient:  map 100% reduce 16%
>> 
>> 
>> On Tue, 22 Jan 2008, Taeho Kang wrote:
>> 
>>> You said you had two network interfaces... and it might be the source of
>>> your problem.
>>> 
>>> Try disabling one of your network interfaces,
>>> or set "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in
>>> your config file so that the datanode knows which network interface to pick
>>> up.
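>>>
>>> For example (just a sketch; "eth1" is a placeholder for whichever interface
>>> carries your 10.0.0.x network, and "default" keeps the stock nameserver
>>> behaviour):
>>>
>>> <property>
>>>   <name>dfs.datanode.dns.interface</name>
>>>   <value>eth1</value>
>>> </property>
>>>
>>> <property>
>>>   <name>dfs.datanode.dns.nameserver</name>
>>>   <value>default</value>
>>> </property>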
>>> 
>>> /Taeho
>>> 
>>> On Jan 22, 2008 7:34 AM, Yunhong Gu1 <ygu1@cs.uic.edu> wrote:
>>> 
>>>> 
>>>> Hi, all
>>>> 
>>>> Just to keep this topic updated :) I am still trying to figure out what
>>>> happened.
>>>> 
>>>> I found that in my 2-node configuration (namenode and jobtracker on
>>>> node-1, while both nodes are datanodes and tasktrackers), the reduce task
>>>> may sometimes (but rarely) complete for programs that need a small amount
>>>> of CPU time (e.g., mrbench), but for programs with heavy computation it
>>>> never finishes. When the reduce blocks, it always stalls at 16%.
>>>> 
>>>> Eventually I will get this error information:
>>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>>> outputncdm-IPxxxxxxxxx
>>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>>> outputncdm-IPxxxxxxxxx
>>>> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
>>>> task_200801181145_0005_r_000000_1, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
>>>> outputConnection timed out
>>>> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
>>>> outputConnection timed out
>>>> 08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
>>>> java.io.IOException: Job failed!
>>>> 
>>>> I found that "IPxxxxxxxx" is not the correct network address
>>>> that Hadoop should read results from. The servers I use have 2 network
>>>> interfaces, and I am using the other one. I explicitly filled in the
>>>> 10.0.0.x IP addresses in all the configuration files.
>>>> 
>>>> Might this be the reason for the reduce failure? The map phase does work, though.
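>>>>
>>>> (A possible workaround, if a hostname lookup is what selects the wrong
>>>> interface, would be to pin the node names to the 10.0.0.x addresses in
>>>> /etc/hosts on both machines. The mapping below is only a sketch inferred
>>>> from addresses appearing elsewhere in this thread:
>>>>
>>>>    10.0.0.1    ncdm-8
>>>>    10.0.0.2    ncdm-15 ncdm15
>>>> )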
>>>> 
>>>> Thanks
>>>> Yunhong
>>>> 
>>>> 
>>>> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>>>> 
>>>>> 
>>>>> Oh, so it is the task running on the other node (ncdm-15) that fails, and
>>>>> Hadoop re-runs the task on the local node (ncdm-8). (I only have two nodes,
>>>>> ncdm-8 and ncdm-15. Both the namenode and the jobtracker are running on
>>>>> ncdm-8. The program is also started on ncdm-8.)
>>>>> 
>>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many fetch-failures
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
>>>>> 
>>>>> Any ideas why the task would fail? And why does it take so long for Hadoop
>>>>> to detect the failure?
>>>>> 
>>>>> Thanks
>>>>> Yunhong
>>>>> 
>>>>> On Sat, 19 Jan 2008, Devaraj Das wrote:
>>>>> 
>>>>>> Hi Yunhong,
>>>>>> As per the output it seems the job ran to successful completion (albeit
>>>>>> with some failures)...
>>>>>> Devaraj
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Yunhong Gu1 [mailto:ygu1@cs.uic.edu]
>>>>>>> Sent: Saturday, January 19, 2008 8:56 AM
>>>>>>> To: hadoop-user@lucene.apache.org
>>>>>>> Subject: Re: Reduce hangs
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Yes, it looks like HADOOP-1374
>>>>>>> 
>>>>>>> The program actually failed after a while:
>>>>>>> 
>>>>>>> 
>>>>>>> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar
>>>>>>> hadoop-0.15.2-test.jar mrbench
>>>>>>> MRBenchmark.0.0.2
>>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
>>>>>>> 1 numLines, ASCENDING sortOrder
>>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
>>>>>>> /benchmarks/MRBench/mr_input/input_-450753747.txt
>>>>>>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
>>>>>>> input=/benchmarks/MRBench/mr_input
>>>>>>> output=/benchmarks/MRBench/mr_output/output_1843693325
>>>>>>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
>>>>>>> paths to process : 1
>>>>>>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
>>>>>>> job_200801181852_0001
>>>>>>> 08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>>> 08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>> 08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
>>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
>>>>>>> fetch-failures
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>>> outputncdm15
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>>> outputncdm15
>>>>>>> 08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
>>>>>>> job_200801181852_0001
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1
>>>>>>> DataLines       Maps    Reduces AvgTime (milliseconds)
>>>>>>> 1               2       1       926333
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 18 Jan 2008, Konstantin Shvachko wrote:
>>>>>>> 
>>>>>>>> Looks like we still have this unsolved mysterious problem:
>>>>>>>> 
>>>>>>>> http://issues.apache.org/jira/browse/HADOOP-1374
>>>>>>>> 
>>>>>>>> Could it be related to HADOOP-1246? Arun?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> --Konstantin
>>>>>>>> 
>>>>>>>> Yunhong Gu1 wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> If someone knows how to fix the problem described below, please help
>>>>>>>>> me out. Thanks!
>>>>>>>>> 
>>>>>>>>> I am testing Hadoop on a 2-node cluster and the "reduce" always hangs
>>>>>>>>> at some stage, even if I use different clusters. My OS is Debian
>>>>>>>>> Linux, kernel 2.6 (AMD Opteron w/ 4GB Mem). The Hadoop version is 0.15.2.
>>>>>>>>> Java version is 1.5.0_01-b08.
>>>>>>>>> 
>>>>>>>>> I simply tried "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench", and
>>>>>>>>> when the map stage finishes, the reduce stage hangs somewhere in the
>>>>>>>>> middle, sometimes at 0%. I also tried every other mapreduce program I
>>>>>>>>> could find in the example jar package, but they all hang.
>>>>>>>>> 
>>>>>>>>> The log file simply prints
>>>>>>>>> 2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>> 2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>> 2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>> 
>>>>>>>>> forever.
>>>>>>>>> 
>>>>>>>>> The program does work if I start Hadoop on a single node.
>>>>>>>>> 
>>>>>>>>> Below is my hadoop-site.xml configuration:
>>>>>>>>> 
>>>>>>>>> <configuration>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>    <name> fs.default.name</name>
>>>>>>>>>    <value>10.0.0.1:60000</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>    <name>mapred.job.tracker</name>
>>>>>>>>>    <value>10.0.0.1:60001</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>    <name>dfs.data.dir</name>
>>>>>>>>>    <value>/raid/hadoop/data</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>    <name>mapred.local.dir</name>
>>>>>>>>>    <value>/raid/hadoop/mapred</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>hadoop.tmp.dir</name>
>>>>>>>>>   <value>/raid/hadoop/tmp</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>mapred.child.java.opts</name>
>>>>>>>>>   <value>-Xmx1024m</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>mapred.tasktracker.tasks.maximum</name>
>>>>>>>>>   <value>4</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <!--
>>>>>>>>> <property>
>>>>>>>>>   <name>mapred.map.tasks</name>
>>>>>>>>>   <value>7</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>mapred.reduce.tasks</name>
>>>>>>>>>   <value>3</value>
>>>>>>>>> </property>
>>>>>>>>> -->
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.inmemory.size.mb</name>
>>>>>>>>>   <value>200</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>dfs.block.size</name>
>>>>>>>>>   <value>134217728</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>io.sort.factor</name>
>>>>>>>>>   <value>100</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>io.sort.mb</name>
>>>>>>>>>   <value>200</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> <property>
>>>>>>>>>   <name>io.file.buffer.size</name>
>>>>>>>>>   <value>131072</value>
>>>>>>>>> </property>
>>>>>>>>> 
>>>>>>>>> </configuration>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Taeho Kang [tkang.blogspot.com]
>>> Software Engineer, NHN Corporation, Korea
>>> 
>> 
>
