hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4513) Capacity scheduler should initialize tasks asynchronously
Date Wed, 29 Oct 2008 10:59:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643464#action_12643464 ]

Hemanth Yamijala commented on HADOOP-4513:
------------------------------------------

Some more information on the proposal above, based on my discussion with Vivek.

bq. The limits on the initialized jobs are for waiting jobs only.

This means that we do *not* count jobs that are already running (and therefore, init'ed) in
applying the limits. In that sense, it is easier for me to think about the limit as analogous
to a cache pre-fetch limit, rather than a cap on the number of init'ed jobs. Maybe we should
call this something like {{mapred.capacity-scheduler.queue.queue-name.max-waiting-jobs-inited-per-user}}.
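
A minimal Java sketch of the "pre-fetch" reading of this limit, offered as an illustration only and not as the scheduler's actual code. The class {{WaitingJobInitLimit}}, the {{State}} enum, and the method {{selectJobsToInit}} are all hypothetical names. It assumes jobs are considered in queue order and that only jobs which are inited but still waiting count against the per-user limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: none of these names come from the Hadoop code base.
final class WaitingJobInitLimit {
    enum State { WAITING, INITED_WAITING, RUNNING }

    static final class Job {
        final String user;
        State state;
        Job(String user, State state) { this.user = user; this.state = state; }
    }

    /**
     * Returns the waiting jobs to initialize next, in queue order.
     * Running jobs are deliberately ignored when applying the limit.
     */
    static List<Job> selectJobsToInit(List<Job> queue, int maxWaitingInitedPerUser) {
        // Count only jobs that are initialized but not yet running.
        Map<String, Integer> initedWaiting = new HashMap<>();
        for (Job j : queue) {
            if (j.state == State.INITED_WAITING) {
                initedWaiting.merge(j.user, 1, Integer::sum);
            }
        }
        List<Job> toInit = new ArrayList<>();
        for (Job j : queue) {
            if (j.state != State.WAITING) {
                continue;
            }
            int count = initedWaiting.getOrDefault(j.user, 0);
            if (count < maxWaitingInitedPerUser) {
                toInit.add(j);
                initedWaiting.put(j.user, count + 1);
            }
        }
        return toInit;
    }

    public static void main(String[] args) {
        List<Job> queue = new ArrayList<>();
        queue.add(new Job("u1", State.RUNNING));
        queue.add(new Job("u1", State.RUNNING));
        queue.add(new Job("u1", State.WAITING));
        queue.add(new Job("u1", State.WAITING));
        queue.add(new Job("u1", State.WAITING));
        queue.add(new Job("u2", State.WAITING));
        // u1's two running jobs do not use up u1's allowance of 2.
        System.out.println(selectJobsToInit(queue, 2).size());
    }
}
```

In this sketch, a user with two running jobs can still have two further waiting jobs initialized, which is the sense in which the value behaves like a pre-fetch depth rather than a cap on init'ed jobs.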

bq. So it doesn't make sense to have a per-queue limit on the total number of initialized
jobs. Having such a limit can actually cause incorrect behavior, as this pre-configured limit
may be small enough to prevent initialization of jobs from one or more users.

To illustrate this point, suppose we had such a limit of 5 jobs in the example above. We would
then never end up initializing any job from the 4th user. So although user limits would have
allowed his job to run, it does not, because it is not inited until one of the other jobs
starts running. Even worse, if there are more jobs from the first three users ahead of his in
the queue, he would have to wait until all of them are running before his job can run. This
could take quite a while.
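
The starvation effect can be reproduced with a small Java sketch. The workload here is an assumption made for illustration (four users, u1 to u4, each with two jobs queued in that order), since the example referred to above is not part of this message. The class and method names are hypothetical, and initialization is modeled simply as walking the queue in order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of the two limit styles; not scheduler code.
final class QueueCapStarvation {
    /** Jobs initialized per user when the queue has one total cap. */
    static Map<String, Integer> initWithQueueCap(List<String> jobOwners, int queueCap) {
        Map<String, Integer> inited = new LinkedHashMap<>();
        int total = 0;
        for (String user : jobOwners) {
            inited.putIfAbsent(user, 0);
            if (total < queueCap) {
                inited.merge(user, 1, Integer::sum);
                total++;
            }
        }
        return inited;
    }

    /** Jobs initialized per user when each user has an individual cap. */
    static Map<String, Integer> initWithPerUserCap(List<String> jobOwners, int perUserCap) {
        Map<String, Integer> inited = new LinkedHashMap<>();
        for (String user : jobOwners) {
            int count = inited.getOrDefault(user, 0);
            inited.put(user, Math.min(count + 1, perUserCap));
        }
        return inited;
    }

    public static void main(String[] args) {
        // Assumed queue order: two jobs each from u1, u2, u3, u4.
        List<String> owners = Arrays.asList("u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4");
        System.out.println(initWithQueueCap(owners, 5));   // {u1=2, u2=2, u3=1, u4=0}
        System.out.println(initWithPerUserCap(owners, 2)); // {u1=2, u2=2, u3=2, u4=2}
    }
}
```

Under the total cap of 5, the fourth user ends up with no initialized job even though nothing in the user limits forbids it. The per-user cap gives every user something ready to run.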

bq. Ideally, the thread would un-initialize one of the 2 previously jobs. This is a nice optimization,
but we probably don't need it right away.

Reversing the initialization of a job looks like a good option to think about.

> Capacity scheduler should initialize tasks asynchronously
> ---------------------------------------------------------
>
>                 Key: HADOOP-4513
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4513
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>            Reporter: Hemanth Yamijala
>            Assignee: Sreekanth Ramakrishnan
>
> Currently, the capacity scheduler initializes tasks on demand, as opposed to the eager
initialization technique used by the default scheduler. This is done in order to reduce the JT
memory footprint. However, the initialization is done in the {{assignTasks}} API, which is not a
good idea as task initialization could be a time-consuming operation. This JIRA is to move the
initialization out of the {{assignTasks}} API and do it asynchronously.
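
One possible shape for such a change, sketched in Java under stated assumptions. This is not the patch attached to the issue: {{AsyncJobInitializer}}, {{jobAdded}}, {{canAssign}} and {{shutdownAndWait}} are hypothetical names, and the expensive initialization is replaced by a stand-in that only records the job id. The point is that the assignment path only reads state, while a background thread does the initialization.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and structure are assumptions, not Hadoop APIs.
final class AsyncJobInitializer {
    private final Set<String> initialized = ConcurrentHashMap.newKeySet();
    private final ExecutorService initThread = Executors.newSingleThreadExecutor();

    /** Queues a job for background initialization and returns immediately. */
    void jobAdded(String jobId) {
        initThread.submit(() -> {
            // Stand-in for the potentially slow per-job initialization.
            initialized.add(jobId);
        });
    }

    /** Called on the assignment path: checks state, never initializes. */
    boolean canAssign(String jobId) {
        return initialized.contains(jobId);
    }

    /** Stops accepting new jobs and waits for queued initializations. */
    boolean shutdownAndWait(long timeoutMillis) {
        initThread.shutdown();
        try {
            return initThread.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncJobInitializer init = new AsyncJobInitializer();
        init.jobAdded("job_1");
        init.jobAdded("job_2");
        init.shutdownAndWait(5000);
        System.out.println(init.canAssign("job_1") && init.canAssign("job_2"));
        System.out.println(init.canAssign("job_3"));
    }
}
```

A single background thread keeps initialization in submission order here. Whether the real scheduler should use one thread or several is a design question this sketch does not try to answer.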

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

