hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
Date Mon, 10 Nov 2008 11:02:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646223#action_12646223

Amar Kamat commented on HADOOP-4558:

Here J1 is still using 12 extra map slots and 1 extra reduce slot.
It took nearly two more minutes before J1 and J2 both started using MR slots equal to their
The reason is as follows:
When job2 is added, a {{ReclaimedResource}} object is added to the reclaim queue. After
_whenToKill_ units of time, tasks from job1 are killed. But at this point job2 has not
finished setup and hence cannot schedule tasks, so job1 is again selected for scheduling
tasks. Once job2 finishes setup, a reclaim request is added for the (extra) scheduled
tasks. Hence the observation that there are some extra kills and that the guaranteed
capacity is allocated only after a few minutes.
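
A minimal sketch of the sequence above, written as a toy second-by-second simulation in Python rather than taken from the scheduler's code. The function name, the instantaneous kills, and the sample values (a _whenToKill_ of 100 seconds and a job2 setup time of 110 seconds) are assumptions for illustration only; the slot counts come from the environment in the issue.

```python
def simulate_reclaim(total_slots, guaranteed, when_to_kill, setup_done, horizon):
    """Toy timeline of the reclaim sequence (illustrative, not the real scheduler).

    Returns (tasks_killed, time_guarantee_met, events).
    """
    job1, job2 = total_slots, 0        # job1 starts out holding the whole cluster
    kill_at = when_to_kill             # reclaim request queued when job2 is submitted (t=0)
    tasks_killed, met_at, events = 0, None, []

    for t in range(horizon + 1):
        job2_ready = t >= setup_done   # job2 can only take slots once setup is done

        if kill_at == t:
            # Kill enough job1 tasks to cover job2's current shortfall.
            shortfall = guaranteed - job2
            job1 -= shortfall
            tasks_killed += shortfall
            kill_at = None
            events.append((t, f"killed {shortfall} job1 tasks"))

        free = total_slots - job1 - job2
        if free > 0:
            if job2_ready:
                job2 += free
                events.append((t, f"job2 took {free} slots"))
            else:
                job1 += free           # job2 not ready, so job1 is scheduled again
                events.append((t, f"job1 re-took {free} slots"))

        if job2 >= guaranteed and met_at is None:
            met_at = t
        if job2_ready and job2 < guaranteed and kill_at is None:
            kill_at = t + when_to_kill # fresh reclaim request for the extra tasks
            events.append((t, "new reclaim request queued"))

    return tasks_killed, met_at, events


# Setup finishes after the first kill: the kill is wasted and repeated later.
killed, met_at, events = simulate_reclaim(210, 126, when_to_kill=100,
                                          setup_done=110, horizon=300)
print(killed, met_at)  # -> 252 210
```

In this toy model, a setup that finishes before the first kill (for example setup_done=50) results in a single round of 126 kills, and the guarantee is met at t=100.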

I think the issue is more involved. Here are the choices:
1) Let it be: since the setup task took time to schedule and finish, it's OK to keep it as
it is. What we guarantee here is that the slots will be allocated to the queue as soon as
a request is made.
2) Delay: one way to avoid the _thrashing_ is to delay the reclaim until the job/queue that
wants it actually needs it. The obvious problem with this is that it will take some time to
kill the tasks, and hence there will be a little delay in the reclaim. Also the _sla_ needs to be

Note that this issue also depends on how set-up tasks are handled in the future and when the job
actually becomes _RUNNING_.
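
Choice 2 above could be sketched roughly as follows. This is only an illustration of the trade-off, not a proposed patch: the names `ReclaimOutcome`, `delayed_reclaim` and `kill_latency` are hypothetical, and the model assumes the kill is issued no earlier than the moment the requesting job has finished setup.

```python
from dataclasses import dataclass


@dataclass
class ReclaimOutcome:
    kill_issued_at: int      # when tasks of the over-capacity job are killed
    slots_available_at: int  # when the freed slots can actually be used
    tasks_killed: int


def delayed_reclaim(shortfall, when_to_kill, setup_done, kill_latency):
    """Delay the kill until the requesting job can actually use the slots.

    kill_latency is a hypothetical stand-in for the time it takes to kill
    the tasks, which is the "little delay" mentioned in choice 2.
    """
    kill_issued_at = max(when_to_kill, setup_done)
    return ReclaimOutcome(
        kill_issued_at=kill_issued_at,
        slots_available_at=kill_issued_at + kill_latency,
        tasks_killed=shortfall,  # a single round of kills, no thrashing
    )


# Assumed sample values: 126 slots short, whenToKill=100s, setup done at 110s.
outcome = delayed_reclaim(shortfall=126, when_to_kill=100,
                          setup_done=110, kill_latency=5)
print(outcome)
```

Under these assumptions only one round of kills happens, at the cost of the freed slots arriving kill_latency after the job becomes ready.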

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
> --------------------------------------------------------------------------------------
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues: 
> Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins
>            Reporter: Karam Singh
>         Attachments: 4558.1.patch
> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other.
> First job submitted with tasks equal to cluster's M/R Capacity
> Second is submitted to a different queue when all tasks of the first job are running; the scheduler
fails to reclaim capacity for the second job.
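
For reference, the guaranteed capacities (GC) quoted in the environment above follow directly from the queue percentages and the 210-slot cluster capacity. A small check in Python, where the helper name `guaranteed_slots` is hypothetical:

```python
def guaranteed_slots(cluster_slots, gc_percent):
    """Slots guaranteed to a queue, given its GC percentage of the cluster."""
    return cluster_slots * gc_percent // 100


CLUSTER_SLOTS = 210  # maps and reduces each, per the issue's environment
queues = {"default": 40, "test_q1": 60}  # GC (%) per queue

gc = {name: guaranteed_slots(CLUSTER_SLOTS, pct) for name, pct in queues.items()}
print(gc)  # -> {'default': 84, 'test_q1': 126}
```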

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
