hadoop-common-dev mailing list archives

From "Sreekanth Ramakrishnan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4513) Capacity scheduler should initialize tasks asynchronously
Date Wed, 12 Nov 2008 11:37:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646873#action_12646873 ]

Sreekanth Ramakrishnan commented on HADOOP-4513:
------------------------------------------------

After an offline discussion with Hemanth and Vivek, the following is the proposal for implementing
asynchronous initialization of jobs by the capacity scheduler:

- Modify _CapacityTaskScheduler_ to look only at the Run-queue maintained by _JobQueueManager_.
This queue contains all initialized jobs.
- Modify _JobQueueManager_ to change the semantics of the waiting job queue to a list of jobs
which are waiting to be scheduled. Please note that while a job is waiting to be scheduled,
it is possible for a job J1 to be in both the running queue and the job queue at the
same time. When the first map or reduce of the job is scheduled, the job would be removed
from the job queue which _JobQueueManager_ maintains.
- Introduce a new poller class, which looks at _JobQueueManager.getJobs(queue)_ and picks
up jobs to initialize for that queue.
- The following parameters would be used for selecting jobs for eager initialization:
-- Maximum number of jobs which can be initialized per user. This would be a new configuration
parameter introduced in _capacity_scheduler.xml_.
-- Number of concurrent users supported by the queue, so that the initialization poller would
initialize jobs for ((userlimits/100) + 2 ) users.
- The selected jobs would be passed on to worker threads, each of which can be assigned the duty
of initializing jobs from one or more queues.
- Each worker thread maintains separate lists for jobs from different queues, sorted by priority
in the same way as in _JobQueueManager_.
- The worker thread then initializes the jobs from queues in a round robin fashion amongst
the job queues assigned to it, i.e. it initializes first job from q1 and then first job from
q2.
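
As a rough sketch of the round-robin step described in the last bullet: the class and method names below are hypothetical and are not part of the capacity scheduler code. Jobs are represented by plain strings, each per-queue list is assumed to be already sorted by priority, and "initializing" a job is reduced to recording its name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the CapacityTaskScheduler implementation.
class RoundRobinInitSketch {

    /**
     * Returns the order in which a single worker would initialize jobs when it
     * takes one job from each assigned queue in turn: first job of q1, first job
     * of q2, then second job of q1, and so on. Exhausted queues are skipped.
     */
    static List<String> initializationOrder(Map<String, List<String>> jobsByQueue) {
        // Copy into deques so that the caller's lists are not modified.
        List<Deque<String>> pending = new ArrayList<>();
        for (List<String> jobs : jobsByQueue.values()) {
            pending.add(new ArrayDeque<>(jobs));
        }
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Deque<String> queue : pending) {
                String job = queue.pollFirst(); // null when this queue is empty
                if (job != null) {
                    order.add(job); // a real worker would initialize the job here
                    progressed = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the queues in the order they were assigned.
        Map<String, List<String>> jobsByQueue = new LinkedHashMap<>();
        jobsByQueue.put("q1", Arrays.asList("q1-J1", "q1-J2", "q1-J3"));
        jobsByQueue.put("q2", Arrays.asList("q2-J1"));
        System.out.println(initializationOrder(jobsByQueue));
        // prints [q1-J1, q2-J1, q1-J2, q1-J3]
    }
}
```

Taking one job per queue per pass keeps a queue with a long backlog from delaying the first job of the other queues assigned to the same worker.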

Illustration:

Consider a job queue q which can support one concurrent user (i.e. userlimits = 100%).
Three users U1, U2 and U3 are submitting jobs in the following distribution:

Maximum number of jobs to be initialized per user : 2


J1U1,J2U1,J3U1,J4U1,J1U2,J2U2,J3U2,J4U2,J1U3,J2U3,J3U3,J4U3.

Jobs initialized by the Initialization threads would be:

J1U1,J2U1,J1U2,J2U2,J1U3,J2U3.
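
The first pass of this illustration can be reproduced with a small sketch. It rests on several assumptions that the proposal does not spell out: the poller walks the queue in order, admits users in the order their first job appears, stops admitting new users once ((userlimits/100) + 2) users are tracked, and caps each user at the configured maximum. Each of the three users is taken to have submitted four jobs. The class and method names are hypothetical, and the user is derived from the job label purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; the selection rules are assumptions, not scheduler code.
class InitSelectionSketch {

    /** Extracts the user part of an illustrative label, e.g. "J1U2" gives "U2". */
    static String userOf(String jobLabel) {
        return jobLabel.substring(jobLabel.indexOf('U'));
    }

    /**
     * Walks the queue in order and selects jobs for initialization, taking at most
     * maxJobsPerUser jobs per user and jobs from at most maxUsers distinct users.
     */
    static List<String> selectForInit(List<String> queue, int maxJobsPerUser, int maxUsers) {
        Map<String, Integer> selectedPerUser = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String job : queue) {
            String user = userOf(job);
            Integer count = selectedPerUser.get(user);
            if (count == null) {
                if (selectedPerUser.size() >= maxUsers) {
                    continue; // user limit reached, do not admit another user
                }
                count = 0;
            }
            if (count < maxJobsPerUser) {
                selectedPerUser.put(user, count + 1);
                selected.add(job);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Four jobs each from U1, U2 and U3, in submission order.
        List<String> queue = new ArrayList<>();
        for (int u = 1; u <= 3; u++) {
            for (int j = 1; j <= 4; j++) {
                queue.add("J" + j + "U" + u);
            }
        }
        int userLimitPercent = 100;
        int maxUsers = (userLimitPercent / 100) + 2; // 3 users, per the formula in the proposal
        System.out.println(selectForInit(queue, 2, maxUsers));
        // prints [J1U1, J2U1, J1U2, J2U2, J1U3, J2U3]
    }
}
```

This models a single pass over a fresh queue only; how jobs that were initialized in an earlier pass count against the limits in later passes is not covered by the sketch.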

Suppose all of these jobs have been initialized but not yet scheduled, and a user U4 then submits
a very high priority job and a normal priority job. Our job queue at instant t+1 would look like:

J1U4,J1U1,J2U1,J3U1,J4U1,J1U2,J2U2,J3U2,J4U2,J1U3,J2U3,J3U3,J4U3,J2U4.

So after its next iteration the poller would have initialized the following:

J1U4,J1U1,J2U1,J1U2,J2U2,J1U3,J2U3. 

Please note that U4's second job would not be initialized.

If U1 had submitted the very high priority job instead, U1 would exceed the maximum number
of jobs which are allowed to be initialized per user.


In the above example, if J1U1 is a job which takes a long time to initialize, the next job to be
initialized would be the next highest priority job, or the highest priority job if such a job
is submitted later (as in the above example).


Any thoughts on the above approach?




> Capacity scheduler should initialize tasks asynchronously
> ---------------------------------------------------------
>
>                 Key: HADOOP-4513
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4513
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>            Reporter: Hemanth Yamijala
>            Assignee: Sreekanth Ramakrishnan
>
> Currently, the capacity scheduler initializes tasks on demand, as opposed to the eager
> initialization technique used by the default scheduler. This is done in order to reduce the JT
> memory footprint. However, the initialization is done in the {{assignTasks}} API, which is not
> a good idea as task initialization could be a time-consuming operation. This JIRA is to move
> the initialization out of the {{assignTasks}} API and do it asynchronously.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

