hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: hadoop 0.15.3 r612257 freezes on reduce task
Date Tue, 29 Jan 2008 17:44:46 GMT
That was the error that we were seeing in our hung reduce tasks. It went 
away for us, and we never figured out why. A number of things happened 
in our environment around the time it went away: we shifted to 0.15.2, 
our cluster moved to a separate switched VLAN from our main network, 
and we started using different machines for our cluster.

Florian Leibert wrote:
> Hi,
> I got some more logs (from the other nodes) - maybe this leads to some 
> conclusions:
>
> ###Node 1
> 2008-01-28 18:07:08,017 ERROR org.apache.hadoop.dfs.DataNode: 
> DataXceiver: java.io.IOException: Block blk_6263175697396671978 is 
> valid, and cannot be written to.
> at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
> at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
> at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
> at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
> at java.lang.Thread.run(Unknown Source)
>
> ###Node 2
> 2008-01-28 18:07:08,109 WARN org.apache.hadoop.dfs.DataNode: Failed to 
> transfer blk_6263175697396671978 to 10.253.14.144:50010 got 
> java.io.IOException: operation failed at /10.253.14.144
> at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
> at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
> at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
> at java.lang.Thread.run(Thread.java:619)
>
> ###Node 3
> 2008-01-28 17:57:45,120 WARN org.apache.hadoop.dfs.DataNode: Failed to 
> transfer blk_-3751486067814847527 to 10.253.19.0:50010 got 
> java.io.IOException: operation failed at /10.253.19.0
> at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
> at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
> at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
> at java.lang.Thread.run(Thread.java:619)
>
> 2008-01-28 17:57:45,372 INFO org.apache.hadoop.dfs.DataNode: Served 
> block blk_8941456674455415759 to /10.253.34.241
> 2008-01-28 17:57:45,383 INFO org.apache.hadoop.dfs.DataNode: Served 
> block blk_-3751486067814847527 to /10.253.34.241
> 2008-01-28 18:07:02,026 INFO org.apache.hadoop.dfs.DataNode: Received 
> block blk_-2349720010162881555 from /10.253.14.144
> ...
>
> Thanks,
> Florian
>
> On Jan 29, 2008, at 4:55 AM, Amar Kamat wrote:
>
>> It seems the reducers were not able to copy the output from the 
>> mappers (reduce progress > 33% would mean the copy of map output, 
>> i.e. the shuffle phase, is over). They waited a long time for the 
>> mappers to recover, and finally, after waiting long enough (this is 
>> expected), the JobTracker killed those maps and re-executed them on 
>> the local machine, and hence the job completed. On average it takes 
>> about 15 minutes for a map to get killed this way. It looks like a 
>> disk problem on the machines where task_200801281756_0002_m_000006_0, 
>> task_200801281756_0002_m_000007_0 and 
>> task_200801281756_0002_m_000008_0 got scheduled. Can you check the 
>> disk space/health of those machines?
>> Amar
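>>
>> As a rough sketch of that kind of check, here is a small standalone 
>> program that reuses Hadoop's own DiskChecker (the class behind the 
>> DiskErrorException in the tasktracker log below). The LocalDirCheck 
>> class name is made up for illustration, and it assumes the 
>> directories listed in mapred.local.dir are passed as arguments on 
>> each node:
>>
>> import java.io.File;
>> import org.apache.hadoop.util.DiskChecker;
>> import org.apache.hadoop.util.DiskChecker.DiskErrorException;
>>
>> // Hypothetical helper: run on each tasktracker node with the
>> // directories listed in mapred.local.dir as arguments.
>> public class LocalDirCheck {
>>   public static void main(String[] args) {
>>     for (String dir : args) {
>>       File d = new File(dir);
>>       try {
>>         // The check Hadoop itself applies to local dirs: the directory
>>         // exists (or can be created) and is readable and writable.
>>         DiskChecker.checkDir(d);
>>         long freeMb = d.getUsableSpace() / (1024 * 1024); // Java 6+
>>         System.out.println("OK:     " + dir + " (" + freeMb + " MB free)");
>>       } catch (DiskErrorException e) {
>>         System.out.println("FAILED: " + dir + " - " + e.getMessage());
>>       }
>>     }
>>   }
>> }
>>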
>> Florian Leibert wrote:
>>> I just saw that the job finally completed. However, it took one 
>>> hour and 45 minutes for a small job that runs in about 1-2 minutes 
>>> on a single node (outside the Hadoop framework). The reduce part 
>>> took extremely long: the tasktracker output shows about 5 sections 
>>> like the sample below for each of the copies (11 copies). So 
>>> clearly something is limiting the reduce...
>>>
>>> Any clues?
>>>
>>> Thanks
>>>
>>> Florian
>>>
>>>
>>> Jobtracker log:
>>> ...
>>> 2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
>>> 2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
>>> 2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
>>> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
>>> 2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Too many fetch-failures for output of task: 
>>> task_200801281756_0001_m_000007_0 ... killing it
>>> 2008-01-28 19:42:12,823 INFO 
>>> org.apache.hadoop.mapred.TaskInProgress: Error from 
>>> task_200801281756_0001_m_000007_0: Too many fetch-failures
>>> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Choosing normal task tip_200801281756_0001_m_000007
>>> 2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Adding task 'task_200801281756_0001_m_000007_1' to tip 
>>> tip_200801281756_0001_m_000007, for tracker 
>>> 'tracker_localhost:/127.0.0.1:43625'
>>> 2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000007_0' from 
>>> 'tracker_localhost.localdomain:/127.0.0.1:37469'
>>> 2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Task 'task_200801281756_0001_m_000007_1' has completed 
>>> tip_200801281756_0001_m_000007 successfully.
>>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
>>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Too many fetch-failures for output of task: 
>>> task_200801281756_0001_m_000006_0 ... killing it
>>> 2008-01-28 19:51:44,625 INFO 
>>> org.apache.hadoop.mapred.TaskInProgress: Error from 
>>> task_200801281756_0001_m_000006_0: Too many fetch-failures
>>> 2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Choosing normal task tip_200801281756_0001_m_000006
>>> 2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Adding task 'task_200801281756_0001_m_000006_1' to tip 
>>> tip_200801281756_0001_m_000006, for tracker 
>>> 'tracker_localhost:/127.0.0.1:43625'
>>> 2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000006_0' from 
>>> 'tracker_localhost.localdomain:/127.0.0.1:37469'
>>> 2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Task 'task_200801281756_0001_m_000006_1' has completed 
>>> tip_200801281756_0001_m_000006 successfully.
>>> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner: 
>>> Saved output of task 'task_200801281756_0001_r_000000_0' to 
>>> hdfs://humanufactory.com:9000/output
>>> 2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Task 'task_200801281756_0001_r_000000_0' has completed 
>>> tip_200801281756_0001_r_000000 successfully.
>>> 2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Job job_200801281756_0001 has completed successfully.
>>> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000006_0' from 
>>> 'tracker_localhost.localdomain:/127.0.0.1:37469'
>>> 2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000007_0' from 
>>> 'tracker_localhost.localdomain:/127.0.0.1:37469'
>>> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000002_0' from 
>>> 'tracker_localhost:/127.0.0.1:43625'
>>> 2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'task_200801281756_0001_m_000004_0' from 
>>> 'tracker_localhost:/127.0.0.1:43625'
>>> ...
>>>
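>>> The sequence above reads as simple bookkeeping on the JobTracker 
>>> side: each reducer that cannot fetch a map output sends a failed 
>>> fetch notification, and at the third one the completed map is 
>>> declared lost and re-run as a new attempt (the ..._1 tasks). A 
>>> simplified sketch of that logic follows; it is not the actual 
>>> JobInProgress code, and the threshold of 3 is taken from the 
>>> notifications counted in the log:
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>>
>>> // Simplified model of the fetch-failure handling suggested by the
>>> // JobTracker log: count notifications per map attempt and give up
>>> // on the attempt once the count reaches 3.
>>> public class FetchFailureTracker {
>>>   private static final int MAX_FETCH_FAILURES = 3;
>>>   private final Map<String, Integer> failures = new HashMap<String, Integer>();
>>>
>>>   /** Returns true if the map attempt should be killed and re-run. */
>>>   public boolean noteFetchFailure(String mapAttemptId) {
>>>     Integer prev = failures.get(mapAttemptId);
>>>     int count = (prev == null ? 0 : prev) + 1;
>>>     failures.put(mapAttemptId, count);
>>>     System.out.println("Failed fetch notification #" + count
>>>         + " for task " + mapAttemptId);
>>>     if (count >= MAX_FETCH_FAILURES) {
>>>       System.out.println("Too many fetch-failures for output of task: "
>>>           + mapAttemptId + " ... killing it");
>>>       return true; // caller would schedule a fresh attempt of this map
>>>     }
>>>     return false;
>>>   }
>>> }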
>>>
>>> Tasktracker log:
>>> 2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: 
>>> getMapOutput(task_200801281756_0001_m_000006_0,0) failed :
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
>>> find task_200801281756_0001_m_000006_0/file.out.index in any of the 
>>> configured local directories
>>>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
>>>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
>>>     at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1967)
>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>>>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
>>>     at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
>>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
>>>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
>>>     at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
>>>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
>>>     at org.mortbay.http.HttpServer.service(HttpServer.java:954)
>>>     at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
>>>     at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
>>>     at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
>>>     at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
>>>     at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
>>>     at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>>>
>>> 2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: 
>>> Unknown child with bad map output: 
>>> task_200801281756_0001_m_000006_0. Ignored.
>>> 2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>> 2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker: 
>>> task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 
>>> at 0.00 MB/s) >
>>>
>>> On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:
>>>
>>>> Hi,
>>>> Can you please attach the JobTracker log and the log of any of the 
>>>> stuck reducers? Is there any way I can reproduce this error?
>>>> Amar
>>>> Florian Leibert wrote:
>>>>> Hi,
>>>>> I set up a cluster with 3 slaves and 1 master. When running the 
>>>>> wordcount example, Hadoop execution halts after the mapping 
>>>>> phase: it starts the reduce phase and generally gets to 10-20%, 
>>>>> but halts there. I left it running for 30 minutes with no change, 
>>>>> and the log files don't show any errors...
>>>>> I reformatted the Hadoop FS that I created, and access to the 
>>>>> files from all nodes seems fine.
>>>>> Has anyone seen similar behavior?
>>>>>
>>>>> Thanks,
>>>>> Florian
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
