hadoop-common-dev mailing list archives

From "Vinod K V (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5241) Reduce tasks get stuck because of over-estimated task size (regression from 0.18)
Date Fri, 13 Feb 2009 04:13:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673137#action_12673137 ]

Vinod K V commented on HADOOP-5241:
-----------------------------------

The patch that takes disk space into account for scheduling was committed to 0.19 and later
(HADOOP-657), so the difference between the behaviour of 0.18 and that of 0.19 is expected.
But it is definitely weird for the estimate to shoot up as high as 400 GB. Can you attach the
whole JT log from the run where this problem occurred, so that we can see how the estimates
were calculated over time and when they started going wayward?
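
For reference, the gate that produces the warnings quoted below was added by HADOOP-657:
before handing a reduce to a tracker, the JobTracker compares the tracker's reported free
disk space against the estimated reduce input size and skips the tracker if there is not
enough. Here is a minimal, self-contained sketch of that check; the class and field names
are hypothetical stand-ins, not the verbatim 0.19 JobInProgress code.

{code}
// Minimal sketch of the HADOOP-657 disk-space gate. All names here
// (TrackerStatus, hasRoomForReduce, ...) are hypothetical stand-ins for
// the 0.19 JobInProgress internals, not the verbatim source.
public class SpaceGateSketch {

    static final class TrackerStatus {
        final String name;
        final long availableBytes;
        TrackerStatus(String name, long availableBytes) {
            this.name = name;
            this.availableBytes = availableBytes;
        }
    }

    /** Returns true if a reduce may be scheduled on this tracker. */
    static boolean hasRoomForReduce(TrackerStatus tracker, long estimatedReduceInputBytes) {
        if (tracker.availableBytes < estimatedReduceInputBytes) {
            // Same shape as the message flooding the JT log in this report.
            System.out.println("WARN No room for reduce task. Node " + tracker.name
                + " has " + tracker.availableBytes
                + " bytes free; but we expect reduce input to take "
                + estimatedReduceInputBytes);
            return false; // the reduce stays pending
        }
        return true;
    }

    public static void main(String[] args) {
        // Numbers taken from the first log line quoted below.
        TrackerStatus node = new TrackerStatus(
            "tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227", 110125027328L);
        hasRoomForReduce(node, 399642198235L); // prints the warning
    }
}
{code}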

> Reduce tasks get stuck because of over-estimated task size (regression from 0.18)
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-5241
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5241
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>         Environment: Red Hat Enterprise Linux Server release 5.2
> JDK 1.6.0_11
> Hadoop 0.19.0
>            Reporter: Andy Pavlo
>            Priority: Blocker
>
> I have a simple MR benchmark job that computes PageRank on about 600 GB of HTML files using a 100-node cluster. For some reason, my reduce tasks get caught in a pending state. The JobTracker's log gets filled with the following messages:
> 2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 has 110125027328 bytes free; but we expect reduce input to take 399642198235
> 2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 has 107537776640 bytes free; but we expect reduce input to take 399642198235
> 2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 has 113631690752 bytes free; but we expect reduce input to take 399642198235
> <SNIP>
> The weird thing is that about 70 reduce tasks complete before it hangs. If I reduce the amount of input data on the 100 nodes down to 200 GB, then it seems to work. As I scale the amount of input to the number of nodes, I can get it to work some of the time on 50 nodes, and it runs without any problems on 25 nodes or fewer.
> Note that it worked without any problems on Hadoop 0.18 late last year, without changing any of the input data or the actual MR code.
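
To see how a ~400 GB estimate could arise from 600 GB of total input, here is a hedged
sketch of ratio-based extrapolation of the sort the resource estimator performs: scale the
output/input ratio observed on the maps completed so far to the whole job. Every number
below (the sampled ratio, the safety factor, the reduce count) is invented for illustration
and is not taken from the 0.19 source; the point is only that a skewed early sample can
inflate the projection past any tracker's free space.

{code}
// Hypothetical illustration of ratio-based reduce-input estimation. None
// of these numbers or names come from the Hadoop source; they only show
// how a skewed sample of completed maps can inflate the projection.
public class EstimateSketch {
    public static void main(String[] args) {
        long totalInputBytes = 600L << 30;   // ~600 GB of HTML input (from the report)

        // Suppose the first few completed maps emitted more than they read.
        long sampledMapInput  = 10L << 30;   // bytes read by completed maps (assumed)
        long sampledMapOutput = 13L << 30;   // bytes they emitted (assumed skew)
        double ratio = (double) sampledMapOutput / sampledMapInput; // 1.3

        double safetyFactor = 2.0;           // assumed padding on the projection
        int reduces = 4;                     // assumed number of reduces sharing the input

        long perReduceEstimate =
            (long) (totalInputBytes * ratio * safetyFactor / reduces);

        // ~4.2e11 bytes per reduce: the same ballpark as the 399642198235
        // in the JT log, and roughly 4x the ~1.1e11 bytes free per tracker.
        System.out.println("Estimated reduce input: " + perReduceEstimate + " bytes");
    }
}
{code}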

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

