hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-657) Free temporary space should be modelled better
Date Thu, 29 May 2008 06:34:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600694#action_12600694 ]

Devaraj Das commented on HADOOP-657:

bq. 4) Create a new ResourceConsumptionEstimator class, and have an instance of that type
for each JobInProgress. This will have, at a minimum, reportCompletedMapTask(MapTaskStatus
t) and estimateSpaceForMapTask(MapTask mt) The implementation would probably be a thread that
processes asynchronously, and updates an atomic value that'll be either the estimated space
requirement, or else the estimated ratio between input size and output size. Until sufficiently
many maps have completed (10%, say) the size estimate would just be the size of each map's
input. Afterwards, we'll take the 75th percentile of the measured blowup in task size.
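The quoted scheme could be sketched roughly as below. Only the class name and the two method responsibilities come from the quote; the field names, the synchronous form (the quote suggests an asynchronous updater thread), and the exact 10%/75th-percentile mechanics are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hedged sketch of the quoted ResourceConsumptionEstimator idea.
// One instance per job; maps report their measured sizes as they
// finish, and estimates switch from "input size" to "75th-percentile
// blowup" once enough samples (10% of maps) have arrived.
class ResourceConsumptionEstimator {
    private final int totalMaps;
    // outputSize / inputSize ratio for each completed map
    private final List<Double> blowups = new ArrayList<>();

    ResourceConsumptionEstimator(int totalMaps) {
        this.totalMaps = totalMaps;
    }

    // Record a completed map's measured input and output sizes.
    void reportCompletedMapTask(long inputBytes, long outputBytes) {
        blowups.add((double) outputBytes / inputBytes);
    }

    // Estimate scratch space for a map over inputBytes of input.
    long estimateSpaceForMapTask(long inputBytes) {
        if (blowups.size() < Math.max(1, totalMaps / 10)) {
            // Too few samples yet: assume output is as large as input.
            return inputBytes;
        }
        List<Double> sorted = new ArrayList<>(blowups);
        Collections.sort(sorted);
        // 75th percentile of the measured blowup ratios.
        double p75 = sorted.get((int) Math.ceil(0.75 * sorted.size()) - 1);
        return (long) (inputBytes * p75);
    }
}
```

For example, with 10 maps the threshold is one sample, so after one map reports 100 bytes in / 300 bytes out, a 200-byte input would be estimated at 600 bytes rather than 200.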

Ari, I haven't looked at the patch yet, but it'd help if you could give an example for this one with some numbers.

> Free temporary space should be modelled better
> ----------------------------------------------
>                 Key: HADOOP-657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-657
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Ari Rabkin
>         Attachments: diskspaceest.patch
> Currently, there is a configurable size that must be free for a task tracker to accept
a new task. However, that isn't a very good model of what the task is likely to take. I'd
like to propose:
> Map tasks:  totalInputSize * conf.getFloat("map.output.growth.factor", 1.0) / numMaps
> Reduce tasks: totalInputSize * 2 * conf.getFloat("map.output.growth.factor", 1.0) / numReduces
> where totalInputSize is the size of all the maps inputs for the given job.
> To start a new task, 
>   newTaskAllocation + (sum over running tasks of (1.0 - done) * allocation) <=
>        free disk * conf.getFloat("mapred.max.scratch.allocation", 0.90);
> So in English, we will model the expected sizes of tasks and only start tasks that should
leave us a 10% margin. With:
> map.output.growth.factor -- the size of a map's transient data relative to its input
> mapred.max.scratch.allocation -- the maximum fraction of the disk we want to allocate to task scratch space
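As one worked example of the proposed model with invented numbers (the variable names below stand in for the two config values; nothing here is from the actual patch):

```java
// Worked example of the proposed per-task space model and admission
// check. All numbers are invented for illustration; growthFactor and
// maxScratch stand in for map.output.growth.factor and
// mapred.max.scratch.allocation.
public class ScratchSpaceExample {
    public static void main(String[] args) {
        double totalInputSize = 10e9;   // 10 GB of map input for the job
        int numMaps = 100, numReduces = 5;
        double growthFactor = 1.0;      // map.output.growth.factor default
        double maxScratch = 0.90;       // mapred.max.scratch.allocation default

        // Per-task estimates from the formulas in the issue description.
        double mapAlloc = totalInputSize * growthFactor / numMaps;           // 100 MB
        double reduceAlloc = totalInputSize * 2 * growthFactor / numReduces; // 4 GB

        // Admission check: the new task's allocation plus the remaining
        // need of running tasks must fit within 90% of free disk.
        double freeDisk = 50e9;                      // 50 GB free on the tracker
        double runningRemaining = 0.5 * reduceAlloc; // one running reduce, 50% done
        boolean admit = reduceAlloc + runningRemaining <= freeDisk * maxScratch;

        System.out.println(mapAlloc);    // 100 MB per map
        System.out.println(reduceAlloc); // 4 GB per reduce
        System.out.println(admit);       // 6 GB needed fits in the 45 GB budget
    }
}
```

So a second reduce would be admitted here (4 GB + 2 GB remaining = 6 GB, against a 45 GB scratch budget), while a hypothetical tracker with only 6 GB free (5.4 GB budget) would reject it.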

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
