hadoop-common-dev mailing list archives

From "Ari Rabkin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-657) Free temporary space should be modelled better
Date Thu, 17 Jul 2008 16:15:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614408#action_12614408 ]

Ari Rabkin commented on HADOOP-657:
-----------------------------------

I don't have strong feelings about whether to do the space-consumed measurement in the TaskTracker
or the Task.  I figured it made more sense to fill out the whole TaskStatus in one place.  Otherwise
it becomes unclear in the TaskTracker code whether or not the space-consumed field has been filled
in yet.  I'm open to doing this the other way 'round, and having the TaskTracker responsible
for it.  Certainly, if there were other similar resource counters being filled in in the TaskTracker,
this one ought to be as well.
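
To make the ambiguity concrete, here is a minimal sketch of the TaskTracker-fills-it-in case; the
field name spaceConsumed and the -1 sentinel are invented for illustration, not what the patch
actually does:

    // Illustrative sketch only: if the Task fills in its own figure, the
    // TaskStatus arriving at the TaskTracker is always complete. If the
    // TaskTracker fills it in later, every reader has to check first.
    class TaskStatus {
      private long spaceConsumed = -1;   // -1 means "not measured yet"

      void setSpaceConsumed(long bytes) { spaceConsumed = bytes; }

      boolean isSpaceConsumedSet() { return spaceConsumed >= 0; }

      long getSpaceConsumed() {
        if (!isSpaceConsumedSet()) {
          throw new IllegalStateException("space consumed not filled in yet");
        }
        return spaceConsumed;
      }
    }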

I was tempted to use Metrics for this, and looked at piggybacking this sort of thing more
generally on heartbeats.  I was promptly shot down.  There was a strong sentiment, notably
from Owen and Arun, that Hadoop's core functionality shouldn't depend on Metrics, and that
Metrics should be used only for analytics.

> Free temporary space should be modelled better
> ----------------------------------------------
>
>                 Key: HADOOP-657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-657
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Ari Rabkin
>             Fix For: 0.19.0
>
>         Attachments: clean_spaceest.patch, diskspaceest.patch, diskspaceest_v2.patch,
>                      diskspaceest_v3.patch, diskspaceest_v4.patch
>
>
> Currently, there is a configurable size that must be free for a task tracker to accept
> a new task. However, that isn't a very good model of how much space the task is likely
> to take. I'd like to propose:
> Map tasks:  totalInputSize * conf.getFloat("map.output.growth.factor", 1.0) / numMaps
> Reduce tasks: totalInputSize * 2 * conf.getFloat("map.output.growth.factor", 1.0) / numReduces
> where totalInputSize is the size of all the maps inputs for the given job.
> To start a new task,
>   newTaskAllocation + (sum over running tasks of (1.0 - done) * allocation) <=
>        free disk * conf.getFloat("mapred.max.scratch.allocation", 0.90);
> So in English, we will model the expected sizes of tasks and only start tasks that should
> leave us a 10% margin. With:
> map.output.growth.factor -- the size of the transient data relative to the map inputs
> mapred.max.scratch.allocation -- the maximum fraction of our disk we want to allocate
> to tasks.
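
For concreteness, here is a rough sketch of the proposed model as described above.  The class and
method names (SpaceEstimator, RunningTask, canStart) are invented for illustration and are not from
any attached patch; only Configuration.getFloat is the real Hadoop API.  The admission check uses
the <= direction, matching the "leave us a 10% margin" reading:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;

    // Hypothetical view of a running task; not from the actual patch.
    interface RunningTask {
      double progress();   // fraction of the task that is done, 0.0 to 1.0
      long allocation();   // bytes originally estimated for this task
    }

    class SpaceEstimator {
      private final Configuration conf;

      SpaceEstimator(Configuration conf) { this.conf = conf; }

      // Expected scratch space for one map task: the job's total input,
      // scaled by the growth factor, split evenly across the maps.
      long mapTaskEstimate(long totalInputSize, int numMaps) {
        float growth = conf.getFloat("map.output.growth.factor", 1.0f);
        return (long) (totalInputSize * growth / numMaps);
      }

      // Reduce tasks get twice the scaled input, split across the reduces.
      long reduceTaskEstimate(long totalInputSize, int numReduces) {
        float growth = conf.getFloat("map.output.growth.factor", 1.0f);
        return (long) (totalInputSize * 2 * growth / numReduces);
      }

      // Admission check: the new task plus the unfinished share of each
      // running task must fit in the configured fraction of free disk.
      boolean canStart(long newTaskAllocation, List<RunningTask> running,
                       long freeDisk) {
        long committed = newTaskAllocation;
        for (RunningTask t : running) {
          committed += (long) ((1.0 - t.progress()) * t.allocation());
        }
        float cap = conf.getFloat("mapred.max.scratch.allocation", 0.90f);
        return committed <= (long) (freeDisk * cap);
      }
    }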

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

