hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5003) When computing absolute guaranteed capacity (GC) from a percent value, Capacity Scheduler should round up floats, rather than truncate them.
Date Wed, 21 Jan 2009 06:28:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665722#action_12665722 ]

Vivek Ratan commented on HADOOP-5003:
-------------------------------------

People are not going to want SLAs that are 10 mins or higher. I think they'll be OK waiting
a few mins, maybe 5. It should be OK if we let them set whatever time they want for the queue,
but print a suitable warning message indicating that if the capacity is not reclaimed, it's
likely because of failed TTs. Small values for the SLA seem perfectly reasonable, especially
when all TTs are running. 

bq. From Q2's point of view, he doesn't see that TT1 is down. He sees that he is allocated
50% and that he isn't getting the 5 slots he should. He gets mad that no timer is running
to get him his slot back.

Well, even if you start a timer the moment the job is submitted, there is no task to kill
because nobody is running over capacity. So the timer is wasted. This is, of course, with
the assumption that the reclaim time is less than the TT failure detection time (which, as
I wrote earlier, should be allowed). 

I still don't agree with #2 and #3 of your 'Take home messages'. As per existing requirements,
the SLA is in force if there are resources to be claimed. We can change the requirements,
but it's not clear to me that we should. I'd prefer to modify the documentation to let users
know that if they don't see a timer being started, it's because some TTs are down, instead
of asking them to set SLAs higher than 10 mins.

I understand that you want it to be clear to the user why they're not getting all their slots.
But your only choices seem to be to force SLA times to be very high, or to provide an explanation
somewhere (in documentation, UI, or whatever). We should do the latter, but realize that if
we're accepting smaller SLA times, setting timers early will not help - users will still not
get their slots back if TTs are down. 

> When computing absolute guaranteed capacity (GC) from a percent value, Capacity Scheduler
> should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its percent
> of the total cluster capacity (which is a float, since the configured GC% is a float) and
> casting it to an int. Casting a float to an int always rounds down. For very small clusters,
> this can result in the GC of a queue being one lower than what it should be. For example,
> if Q1 has a GC of 50%, Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity
> is 4 (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's to 0 with
> today's code. Q2's capacity should really be 2, as 40% of 4, rounded up, should be 2.
> Simple fix is to use Math.round() rather than cast to an int.
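
The arithmetic in the description above can be checked with a small sketch. This is not the scheduler's actual code: the class GcRounding and the helpers truncatedGc and roundedGc are hypothetical names used only for illustration, and the expression gcPercent * clusterCapacity / 100 is an assumption about how the percent is applied. It only contrasts an int cast with Math.round() for the 50% / 40% / 10% example on a cluster of capacity 4.

```java
class GcRounding {
    // Hypothetical helper: the cast to int truncates, as the issue describes.
    static int truncatedGc(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Hypothetical helper: the proposed fix, rounding to the nearest int.
    static int roundedGc(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        int clusterCapacity = 4;              // cluster size from the example
        float[] gcPercents = {50f, 40f, 10f}; // Q1, Q2, Q3
        for (int i = 0; i < gcPercents.length; i++) {
            System.out.println("Q" + (i + 1)
                + ": truncated=" + truncatedGc(gcPercents[i], clusterCapacity)
                + " rounded=" + roundedGc(gcPercents[i], clusterCapacity));
        }
        // Prints:
        // Q1: truncated=2 rounded=2
        // Q2: truncated=1 rounded=2
        // Q3: truncated=0 rounded=0
    }
}
```

Note that Math.round() rounds to the nearest integer rather than always up, so Q3 (0.4 of a slot) still gets 0; in this example the rounded values sum to the cluster capacity of 4, while the truncated ones sum to 3.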

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.