hadoop-common-user mailing list archives

From: Florian Leibert <...@headcaselabs.com>
Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task
Date: Tue, 29 Jan 2008 17:41:33 GMT
Hi,
I got some more logs (from the other nodes) - maybe these lead to some conclusions:

###Node 1
2008-01-28 18:07:08,017 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_6263175697396671978 is valid, and cannot be written to.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
    at java.lang.Thread.run(Unknown Source)

###Node 2
2008-01-28 18:07:08,109 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_6263175697396671978 to 10.253.14.144:50010 got java.io.IOException: operation failed at /10.253.14.144
    at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
    at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
    at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
    at java.lang.Thread.run(Thread.java:619)

###Node 3
2008-01-28 17:57:45,120 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-3751486067814847527 to 10.253.19.0:50010 got java.io.IOException: operation failed at /10.253.19.0
    at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
    at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
    at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
    at java.lang.Thread.run(Thread.java:619)

2008-01-28 17:57:45,372 INFO org.apache.hadoop.dfs.DataNode: Served block blk_8941456674455415759 to /10.253.34.241
2008-01-28 17:57:45,383 INFO org.apache.hadoop.dfs.DataNode: Served block blk_-3751486067814847527 to /10.253.34.241
2008-01-28 18:07:02,026 INFO org.apache.hadoop.dfs.DataNode: Received block blk_-2349720010162881555 from /10.253.14.144
...
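
To follow up on Amar's disk suggestion (quoted below), this is roughly what I plan to run on each slave to rule the disks in or out. The paths are only examples; substitute whatever mapred.local.dir and dfs.data.dir point to in your hadoop-site.xml:

  # free space on the partition backing the Hadoop dirs
  df -h /mnt/hadoop
  # confirm the task-local dir is writable by the hadoop user
  touch /mnt/hadoop/mapred/local/.write_test && \
      rm /mnt/hadoop/mapred/local/.write_test
  # kernel-level disk errors usually surface in dmesg
  dmesg | grep -i -e "i/o error" -e sector
  # and from the master: per-datanode capacity plus HDFS block health
  bin/hadoop dfsadmin -report
  bin/hadoop fsck /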

Thanks,
Florian
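
PS: Re the reduce hanging at "copy (9 of 11)" below: since fetch failures on a small cluster are often plain hostname/connectivity trouble (note that the trackers register as both 'tracker_localhost' and 'tracker_localhost.localdomain' in the JobTracker log quoted below), a cheap sanity check is to verify that every slave can reach every other slave's TaskTracker HTTP port, 50060 by default, which is where reducers fetch map output. The hostnames here are placeholders for your conf/slaves entries:

  for h in node1 node2 node3; do
      # prints an HTTP status per host; 000 means the port was unreachable
      printf "%s: " "$h"
      curl -s -o /dev/null -w "%{http_code}\n" "http://$h:50060/"
  done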

On Jan 29, 2008, at 4:55 AM, Amar Kamat wrote:

> It seems the reducers were not able to copy the map output from the
> mappers (reduce progress > 33% would mean the copy of map output,
> i.e. the shuffle phase, is over). They waited a long time for the
> mappers to recover, and after that long wait (which is expected) the
> JobTracker killed the maps and re-executed them on another tracker,
> and hence the job eventually completed. On average it takes about
> 15 minutes for a map to be killed this way. It looks like a disk
> problem on the machine(s) where task_200801281756_0002_m_000006_0,
> task_200801281756_0002_m_000007_0 and
> task_200801281756_0002_m_000008_0 were scheduled. Can you check the
> disk space/health of those machines?
> Amar
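
A quick way to see which maps are racking up these fetch failures is to count the notifications in the JobTracker log. The path below is just where my logs live; the file follows the usual hadoop-<user>-jobtracker-<host>.log naming:

  grep "Failed fetch notification" logs/hadoop-*-jobtracker-*.log \
      | sed 's/.*for task //' | sort | uniq -c | sort -rn

In my case that points straight at task_200801281756_0001_m_000006_0 and task_200801281756_0001_m_000007_0, matching the JobTracker excerpt quoted below.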
> Florian Leibert wrote:
>> I just saw that the job finally completed. However, it took one hour
>> and 45 minutes, for a small job that runs in about 1-2 minutes on a
>> single node (outside the Hadoop framework). The reduce part took
>> extremely long: the TaskTracker output shows about 5 sections like
>> the sample below for each of the 11 copies. So clearly something is
>> limiting the reduce...
>>
>> Any clues?
>>
>> Thanks
>>
>> Florian
>>
>>
>> Jobtracker log:
>> ...
>> 2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
>> 2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
>> 2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
>> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
>> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000007_0 ... killing it
>> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000007_0: Too many fetch-failures
>> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000007
>> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000007_1' to tip tip_200801281756_0001_m_000007, for tracker 'tracker_localhost:/127.0.0.1:43625'
>> 2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
>> 2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000007_1' has completed tip_200801281756_0001_m_000007 successfully.
>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000006_0 ... killing it
>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000006_0: Too many fetch-failures
>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000006
>> 2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000006_1' to tip tip_200801281756_0001_m_000006, for tracker 'tracker_localhost:/127.0.0.1:43625'
>> 2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
>> 2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000006_1' has completed tip_200801281756_0001_m_000006 successfully.
>> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200801281756_0001_r_000000_0' to hdfs://humanufactory.com:9000/output
>> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_r_000000_0' has completed tip_200801281756_0001_r_000000 successfully.
>> 2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200801281756_0001 has completed successfully.
>> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
>> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
>> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000002_0' from 'tracker_localhost:/127.0.0.1:43625'
>> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000004_0' from 'tracker_localhost:/127.0.0.1:43625'
>> ...
>>
>>
>> Tasktracker log:
>> 2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200801281756_0001_m_000006_0,0) failed :
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200801281756_0001_m_000006_0/file.out.index in any of the configured local directories
>>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
>>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
>>     at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1967)
>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
>>     at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
>>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
>>     at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
>>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
>>     at org.mortbay.http.HttpServer.service(HttpServer.java:954)
>>     at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
>>     at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
>>     at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
>>     at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
>>     at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
>>     at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>>
>> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: task_200801281756_0001_m_000006_0. Ignored.
>> 2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>> 2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
>>
>> On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:
>>
>>> Hi,
>>> Can you please attach the JobTracker log and the log of one of the
>>> stuck reducers? Is there any way I can reproduce this error?
>>> Amar
>>> Florian Leibert wrote:
>>>> Hi,
>>>> I set up a cluster with 3 slaves and 1 master. When running the
>>>> wordcount example, Hadoop halts after the map phase: the reduce
>>>> phase starts and generally gets to 10-20%, but it stalls there. I
>>>> left it running for 30 minutes with no change. The log files don't
>>>> show any errors...
>>>> I reformatted the Hadoop FS that I created, and access to the
>>>> files from all nodes seems fine.
>>>> Has anyone seen similar behavior?
>>>>
>>>> Thanks,
>>>> Florian

