hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amar Kamat <ama...@yahoo-inc.com>
Subject Re: hadoop 0.15.3 r612257 freezes on reduce task
Date Tue, 29 Jan 2008 12:55:06 GMT
It seems that the reducers were not able to copy the output from the 
mappers (reduce % > 33 means that the copy of map output is over, i.e 
the shuffle phase), waited long for the mappers to recover and then 
finally after waiting for long (this is expected) the Jobtracker killed 
the map and re-executed on the local-machine and hence the job got 
completed. It takes 15 min on an average for a map to get killed by one 
mapper. It seems to be a disk problem on the machine where 
task_200801281756_0002_m_000006_0, task_200801281756_0002_m_000007_0 and 
task_200801281756_0002_m_000008_0 got scheduled. Can you check if the 
disk space/health of these machines?
Amar
Florian Leibert wrote:
> i just saw that the job finally completed. however it took one hour 
> and 45 minutes - for a small job that runs in about 1-2 minutes on a 
> single node (outside hadoop framework). The reduce part took extremely 
> long - the output of the tasktracker shows about 5 of the sample 
> sections for each of the copies (11 copies). so clearly something is 
> limiting the reduce...
>
> Any clues?
>
> Thanks
>
> Florian
>
>
> Jobtracker log:
> ...
> 2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: 
> Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
> 2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: 
> Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
> 2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: 
> Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
> Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
> Too many fetch-failures for output of task: 
> task_200801281756_0001_m_000007_0 ... killing it
> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from task_200801281756_0001_m_000007_0: Too many fetch-failures
> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: 
> Choosing normal task tip_200801281756_0001_m_000007
> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: 
> Adding task 'task_200801281756_0001_m_000007_1' to tip 
> tip_200801281756_0001_m_000007, for tracker 
> 'tracker_localhost:/127.0.0.1:43625'
> 2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000007_0' from 
> 'tracker_localhost.localdomain:/127.0.0.1:37469'
> 2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: 
> Task 'task_200801281756_0001_m_000007_1' has completed 
> tip_200801281756_0001_m_000007 successfully.
> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
> Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
> Too many fetch-failures for output of task: 
> task_200801281756_0001_m_000006_0 ... killing it
> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from task_200801281756_0001_m_000006_0: Too many fetch-failures
> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
> Choosing normal task tip_200801281756_0001_m_000006
> 2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker: 
> Adding task 'task_200801281756_0001_m_000006_1' to tip 
> tip_200801281756_0001_m_000006, for tracker 
> 'tracker_localhost:/127.0.0.1:43625'
> 2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000006_0' from 
> 'tracker_localhost.localdomain:/127.0.0.1:37469'
> 2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress: 
> Task 'task_200801281756_0001_m_000006_1' has completed 
> tip_200801281756_0001_m_000006 successfully.
> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner: 
> Saved output of task 'task_200801281756_0001_r_000000_0' to 
> hdfs://humanufactory.com:9000/output
> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress: 
> Task 'task_200801281756_0001_r_000000_0' has completed 
> tip_200801281756_0001_r_000000 successfully.
> 2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress: 
> Job job_200801281756_0001 has completed successfully.
> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000006_0' from 
> 'tracker_localhost.localdomain:/127.0.0.1:37469'
> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000007_0' from 
> 'tracker_localhost.localdomain:/127.0.0.1:37469'
> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000002_0' from 
> 'tracker_localhost:/127.0.0.1:43625'
> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'task_200801281756_0001_m_000004_0' from 
> 'tracker_localhost:/127.0.0.1:43625'
> ...
>
>
> Tasktracker log:
> 2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: 
> getMapOutput(task_200801281756_0001_m_000006_0,0) failed :
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
> task_200801281756_0001_m_000006_0/file.out.index in any of the 
> configured local directories
>     at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)

>
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)

>
>     at 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1967) 
>
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>     at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
>     at 
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)

>
>     at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
>     at 
> org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)

>
>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
>     at org.mortbay.http.HttpServer.service(HttpServer.java:954)
>     at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
>     at 
> org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
>     at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
>     at 
> org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
>     at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
>     at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>
> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: 
> Unknown child with bad map output: task_200801281756_0001_m_000006_0. 
> Ignored.
> 2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
> 2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker: 
> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 
> 0.00 MB/s) >
>
>
>
>
>
>
>
>
>
> On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:
>
>> Hi,
>> Can you plz attach the Jobtracker and any of the stuck reducer's log? 
>> Is there any way I can reproduce this error?
>> Amar
>> Florian Leibert wrote:
>>> hi,
>>> i setup a cluster with 3 slaves and 1 master. when running the 
>>> wordcount example, hadoop execution halts after the mapping phase. 
>>> It has started the reduce phase and generally gets to 10-20% but 
>>> halts there. I left it running for 30 minutes, no change. Log files 
>>> don't show any errors...
>>> I reformatted the Hadoop FS that i created, access to the files from 
>>> all nodes seems fine.
>>> Has anyone seen a similar behavior?
>>>
>>> Thanks,
>>> Florian
>>>
>>>
>>
>
>


Mime
View raw message