hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5964) Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
Date Fri, 19 Jun 2009 14:16:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721792#action_12721792

Hemanth Yamijala commented on HADOOP-5964:

I've looked at most of the code changes (excluding tests and examples). Here are a few more comments:

 - In getTaskFromQueue, I would request a comment on why we are not reserving tasktrackers
in the second pass. (The reason, as we discussed offline, is that we don't think we need
to give users more leeway by reserving slots when they are already over their user limit.)

 - hostnameToTrackerName seems to be the wrong name; it should be hostnameToTracker.
 - The comment on trackerExpiryQueue refers to a TreeSet of status objects.
 - In recovery, there is currently an 'interesting' behavior: a job can be initialized
by either the RecoveryManager or a job initialization thread such as EagerTaskInitializer or JobInitializationPoller.
This means that relying on preInitializeJob to set the right number of slots may be broken.
 - Since we are not storing information about reservations across restarts, the counter
information about how long reservations were made for a job on a tracker could be lost.
This may not be a big issue because the reservations themselves are lost on restart, but
I wanted to check what you thought.

 - I wonder whether it would be good to make unreserveSlots re-entrant. I struggled a bit to
confirm that it is never called twice in any scenario, which seems to be the case now.
But if we make it re-entrant by simply ignoring the operation when the reserved job is null,
it might save us from some corner-case bugs. Note that we currently throw a runtime exception.

 - We are not handling the case where memory-based scheduling is disabled but the jobconf has
a non-default value for the job size (say, because of user misconfiguration). computeNumSlotsPerMap
should probably check whether it is disabled and return 1 in that case. Otherwise the slot count
could be set to a negative value.

 - The computation of committed memory included tasks that were in the commit pending state
for a reason. We'll need to check this with someone from the M/R team.
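
A minimal sketch of two of the suggestions above: a re-entrant unreserveSlots and a guarded computeNumSlotsPerMap. This is not the patch code. The class name, fields, method signatures, the -1 "disabled" sentinel, and the round-up rule are all assumptions made for illustration.

```java
// Hypothetical sketch; names and signatures are illustrative, not the HADOOP-5964 patch.
class TrackerSketch {
    // Assumed sentinel meaning "memory-based scheduling is disabled".
    static final long DISABLED_MEMORY_LIMIT = -1L;

    private String reservedJob;  // job currently holding a reservation, or null
    private int reservedSlots;   // number of slots held for reservedJob

    void reserveSlots(String job, int slots) {
        reservedJob = job;
        reservedSlots = slots;
    }

    // Re-entrant variant: if nothing is reserved, the call is ignored
    // instead of throwing a runtime exception. Returns the slots released.
    int unreserveSlots() {
        if (reservedJob == null) {
            return 0;
        }
        int released = reservedSlots;
        reservedJob = null;
        reservedSlots = 0;
        return released;
    }

    // Returns 1 when memory-based scheduling is disabled or the values are
    // unusable, so a misconfigured job size can never yield a negative count.
    // Otherwise rounds the job size up to whole slots (assumed rule).
    static int computeNumSlotsPerMap(long jobMemoryMb, long slotMemoryMb) {
        if (slotMemoryMb == DISABLED_MEMORY_LIMIT || slotMemoryMb <= 0 || jobMemoryMb <= 0) {
            return 1;
        }
        return (int) ((jobMemoryMb + slotMemoryMb - 1) / slotMemoryMb);
    }
}
```

With this shape, a second unreserveSlots call on the same tracker is harmless, and a job that sets a memory size while memory-based scheduling is disabled is simply treated as a one-slot job.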

> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>         Attachments: HADOOP-5964_0_20090602.patch, HADOOP-5964_1_20090608.patch, HADOOP-5964_2_20090609.patch,
HADOOP-5964_4_20090615.patch, HADOOP-5964_6_20090617.patch, HADOOP-5964_7_20090618.patch,
> When a HighRAMJob turns up at the head of the queue, the current implementation of support
> for HighRAMJobs in the Capacity Scheduler has a problem: the scheduler stops assigning
> tasks to all TaskTrackers in the cluster until the HighRAMJob finds suitable TaskTrackers
> for all its tasks.
> This causes a severe utilization problem, since effectively no new tasks are allowed to
> run until the HighRAMJob (at the head of the queue) gets slots.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
