hadoop-mapreduce-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4305) Implement delay scheduling in capacity scheduler for improving data locality
Date Fri, 08 Jun 2012 00:21:23 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291444#comment-13291444 ]

Konstantin Shvachko commented on MAPREDUCE-4305:

Task locality is important. It is interesting that the Capacity Scheduler only needs to be
hooked up to the logic that already exists in JobInProgress etc. I went over the general logic
of the patch and it looks good, but I have several formatting and code organization comments.
# Append _PROPERTY to the new config key constants, e.g. NODE_LOCALITY_DELAY_PROPERTY. The
other constants in CapacitySchedulerConf appear to follow that convention.
# Wrap long lines.
# In CapacitySchedulerConf convert the comments describing variables to JavaDoc.
# In initializeDefaults() you should use {{capacity-scheduler}} not {{fairscheduler}} config
variables. Also since you introduced constants for the keys, use them rather than the raw
# JobInfo is confusing because there is already a class with that name. Call it something
like JobLocality. I'd rather move it into JobQueuesManager, because the latter maintains the
map of those
# Correct the indentation in CapacityTaskScheduler; in particular, replace all tabs with
spaces.
# Add spaces between arguments, operators, and in some LOG messages.
# Add empty lines between new methods.
# updateLocalityWaitTimes() and updateLastMapLocalityLevel() should belong to JobQueuesManager,
# JobQueuesManager.infos is a map keyed by JobInProgress. Wouldn't it be better to use JobID
as the key?
# In TaskSchedulingMgr you need only one version of obtainNewTask to be abstract, the one
with cachelevel parameter. The other one should not be abstract and just call the abstract
obtainNewTask() with cachelevel set to any.
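
The last point can be sketched as follows. This is only a minimal illustration, not the code
from the patch: the class names, the String stand-ins for the tracker and job arguments, and
the ANY_CACHE_LEVEL sentinel are all hypothetical, and the real signatures in
CapacityTaskScheduler will differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TaskSchedulingMgr; Strings replace the real Hadoop types.
abstract class TaskSchedulingMgrSketch {
    // Hypothetical sentinel meaning "any locality level is acceptable".
    static final int ANY_CACHE_LEVEL = Integer.MAX_VALUE;

    // The only abstract variant: subclasses pick a task for a given cache level.
    abstract String obtainNewTask(String tracker, String job, int cacheLevel);

    // Not abstract: simply delegates with the cache level set to "any".
    String obtainNewTask(String tracker, String job) {
        return obtainNewTask(tracker, job, ANY_CACHE_LEVEL);
    }
}

// Hypothetical subclass that records which cache levels it was asked for.
class MapSchedulingMgrSketch extends TaskSchedulingMgrSketch {
    final List<Integer> seenLevels = new ArrayList<>();

    @Override
    String obtainNewTask(String tracker, String job, int cacheLevel) {
        seenLevels.add(cacheLevel);
        return job + "@" + tracker;
    }
}

class ObtainNewTaskDemo {
    public static void main(String[] args) {
        MapSchedulingMgrSketch mgr = new MapSchedulingMgrSketch();
        mgr.obtainNewTask("tracker_1", "job_1");     // goes through the delegating overload
        mgr.obtainNewTask("tracker_1", "job_1", 1);  // explicit cache level
        System.out.println(mgr.seenLevels);
    }
}
```

With this shape, each subclass implements a single method, and callers that do not care about
locality keep using the two-argument form.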

> Implement delay scheduling in capacity scheduler for improving data locality
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-4305
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4305
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: MAPREDUCE-4305, MAPREDUCE-4305-1.patch
> Capacity Scheduler data local tasks are about 40%-50% which is not good.
> In my tests on a 70-node cluster, I consistently get data locality of around 40-50% on
a free cluster.
> I think we need to implement something like delay scheduling in the capacity scheduler
for improving the data locality.
> http://radlab.cs.berkeley.edu/publication/308
> After implementing delay scheduling on Hadoop 22, I am getting 100% data locality
on a free cluster and around 90% data locality on a busy cluster.
> Thanks,
> Mayank
