hadoop-common-dev mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-657) Free temporary space should be modelled better
Date Thu, 17 Jul 2008 07:35:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614266#action_12614266 ]

Vinod Kumar Vavilapalli commented on HADOOP-657:

HADOOP-3581 tries to manage memory used by tasks. I am trying to follow the approach of this
JIRA, and have a couple of comments.
 - I see that the free-space computation is done inside the task. Instead, why can't we do
it in the tasktracker itself? In this JIRA we only care about mapOutputFiles, and to watch
them we just need the JOB ID and TIP ID. Memory tracking HAS to be done in the TT and not
in the task, to shield the tracking itself from any rogue tasks. It would be good if we could
manage both these resources in the TT, ultimately moving all of them into a single
resource-management class in the TT. Unless I am missing something here. Thoughts?
 - I also see in this patch that availableSpace is sent to the JT via TaskTrackerStatus. What
happened to Doug's idea of "using a general mechanism to route metrics to the jobtracker through
heartbeats, rather than hack things in one-by-one"? A general mechanism like the one Arun
proposed (MetricsContext) would also help HADOOP-3759 (which intends to use freeMemory information
for scheduling decisions).
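To illustrate the "general mechanism" idea: rather than adding one field per resource to TaskTrackerStatus, the tracker could ship a generic name-to-value map with each heartbeat, and any scheduler could look up the metrics it cares about. This is only a sketch of the concept; the class and metric names below are hypothetical and not actual Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a generic resource report carried in heartbeats,
// instead of hard-coding availableSpace (or freeMemory) into TaskTrackerStatus.
public class ResourceReport {
    private final Map<String, Long> metrics = new HashMap<>();

    // The tasktracker records whatever metrics it tracks.
    public void set(String name, long value) {
        metrics.put(name, value);
    }

    // The jobtracker/scheduler reads metrics by name, so adding a new
    // resource (disk for this JIRA, memory for HADOOP-3759) needs no
    // change to the heartbeat structure itself.
    public long get(String name, long defaultValue) {
        Long v = metrics.get(name);
        return v == null ? defaultValue : v;
    }
}
```

A scheduler could then ask for, say, a hypothetical "disk.free" or "mem.free" key and fall back to a default when a tracker does not report it.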

> Free temporary space should be modelled better
> ----------------------------------------------
>                 Key: HADOOP-657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-657
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Ari Rabkin
>             Fix For: 0.19.0
>         Attachments: clean_spaceest.patch, diskspaceest.patch, diskspaceest_v2.patch, diskspaceest_v3.patch, diskspaceest_v4.patch
> Currently, there is a configurable size that must be free for a task tracker to accept
a new task. However, that isn't a very good model of how much space the task is likely to
take. I'd like to propose:
> Map tasks:  totalInputSize * conf.getFloat("map.output.growth.factor", 1.0) / numMaps
> Reduce tasks: totalInputSize * 2 * conf.getFloat("map.output.growth.factor", 1.0) / numReduces
> where totalInputSize is the size of all the maps inputs for the given job.
> To start a new task, 
>   newTaskAllocation + (sum over running tasks of (1.0 - done) * allocation) <= 
>        free disk * conf.getFloat("mapred.max.scratch.allocation", 0.90);
> So in English, we will model the expected sizes of tasks and only start tasks that should
leave us a 10% margin. With:
> map.output.growth.factor -- the size of the transient data relative to the map input
> mapred.max.scratch.allocation -- the maximum fraction of our free disk we want to allocate to tasks
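The proposed model can be sketched as follows. The formulas and config keys come from the issue description above; the class and method names are hypothetical, and the admission check assumes the total allocation must fit within the configured fraction of free disk (the "10% margin").

```java
// Hypothetical sketch of the disk-space model from HADOOP-657.
// Only the formulas and config key semantics come from the description.
public class TaskSpaceEstimator {

    // Map task estimate: totalInputSize * map.output.growth.factor / numMaps
    public static long estimateMapTask(long totalInputSize,
                                       double growthFactor,
                                       int numMaps) {
        return (long) (totalInputSize * growthFactor / numMaps);
    }

    // Reduce task estimate: totalInputSize * 2 * map.output.growth.factor / numReduces
    public static long estimateReduceTask(long totalInputSize,
                                          double growthFactor,
                                          int numReduces) {
        return (long) (totalInputSize * 2 * growthFactor / numReduces);
    }

    // Admission check: the new task's allocation plus the remaining
    // allocation of running tasks (sum of (1.0 - done) * allocation)
    // must fit within maxScratchFraction of the free disk.
    public static boolean canStartTask(long newTaskAllocation,
                                       long remainingRunningAllocation,
                                       long freeDisk,
                                       double maxScratchFraction) {
        return newTaskAllocation + remainingRunningAllocation
                <= freeDisk * maxScratchFraction;
    }
}
```

For example, a job with 1000 bytes of total input, a growth factor of 1.0, and 10 maps would be modelled at 100 bytes of scratch per map task, and a tracker with 1000 bytes free would admit new work only while total allocations stay under 900 bytes.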

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
