hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5003) When computing absolute guaranteed capacity (GC) from a percent value, Capacity Scheduler should round up floats, rather than truncate them.
Date Mon, 19 Jan 2009 10:23:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665089#action_12665089 ]

Vivek Ratan commented on HADOOP-5003:

Well, I thought I was clear about why this isn't a bug, but let me give you a detailed example.
 My argument is simple: in theory, you do not want the sum of GCs to be larger than the cluster
size (and we ensure that when we start up and read the config file), but in practice, given
that this is a distributed system, there will be situations when the sum of GCs is greater
than the 'actual' cluster size at a given moment. This happens when TTs fail. Consider the
situation when you have two queues: Q1 and Q2. Assume their GCs are 5 slots (map or reduce,
doesn't matter) each, i.e., there are 10 slots in the system. For simplification, assume there
are 10 TTs and 1 slot per TT. Now suppose that Q1 is running at capacity, and Q2 is only using
4 out of 5 slots, because it doesn't have any more tasks to run. So, 1 TT is free. Also assume
that the tasks are long running, i.e. they take minutes to complete. This is time T0. Now,
suppose a user submits a job with lots of tasks to Q2, at time T0+1second. Also suppose that
at around the same time, i.e., at T0+1, the idle TT dies (it doesn't have to be at the same
time, just anytime before it sends a heartbeat). Further, suppose that the reclaim capacity thread
runs at time T0+3 seconds (it runs every 5 seconds by default). What should it do? The actual
cluster capacity is 9, but the JT and scheduler do not know that yet. Remember it takes over
10 minutes for the JT to detect that the TT is down, and to update the cluster status. So,
what does the Scheduler do? 
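To make the arithmetic of this scenario concrete, here is a small sketch using the numbers from the example above (the class and variable names are illustrative, not actual scheduler code):

```java
public class StaleCapacityExample {
    public static void main(String[] args) {
        int knownClusterSlots = 10;  // what the JT still believes at T0+3s
        int actualClusterSlots = 9;  // one TT has silently died

        // Both queues are configured at 50% of the cluster.
        int q1Gc = Math.round(0.5f * knownClusterSlots);  // 5
        int q2Gc = Math.round(0.5f * knownClusterSlots);  // 5

        // The sum of the GCs (10) now exceeds the real cluster size (9),
        // and stays that way until the JT detects the lost TT.
        System.out.println((q1Gc + q2Gc) + " > " + actualClusterSlots);  // prints "10 > 9"
    }
}
```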

Your suggestion is that a timer is started for Q2, since it's below capacity and has pending
tasks. So, at time T0+3, a timer gets started. Assuming that the reclaim time for the queue
is 3 minutes, this timer will go off at T0+183 (in seconds). When the timer goes off, what
happens? We still haven't detected the lost TT (that will happen at T0+600 at the earliest,
I believe). The timer has gone off, and we need to kill. Well, do we kill from Q1? If you
say no (Q1 is, after all, running at capacity only), the timer is wasted. If you say yes,
it's unfair to Q1. 

I am arguing that the timer shouldn't have been set in the first place. The SLA is valid,
as I see it, only IF there is capacity to reclaim. If nobody has taken my capacity, there
is no SLA. Had we known instantly about the TT going down, you would recompute capacities,
and Q2's GC would be 4 instead of 5, and it would be at capacity. Q2's demand is not being
satisfied because there is an incorrect view of what Q2's capacity is. The SLA should not
apply here. 

If you look at the way we've worded the requirement for reclaiming of capacity in HADOOP-3421,
it reads "...the system will guarantee that excess resources taken from an Org will be restored
to it within N minutes of its need for them". The key phrase is 'resources taken from an Org'.
If no queue is running over capacity, no resources have been taken from a queue, and hence
the SLA is not in force. 
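The condition being argued for can be sketched as a simple guard; this is a hypothetical illustration (the method name and slot arrays are made up for this example, not the actual CapacityTaskScheduler API):

```java
public class ReclaimGuard {
    // Returns true only if some queue is running over its guaranteed
    // capacity, i.e., resources have actually been "taken from an Org".
    // Only then should a reclaim timer be started and the SLA apply.
    static boolean capacityWasTaken(int[] usedSlots, int[] guaranteedSlots) {
        for (int i = 0; i < usedSlots.length; i++) {
            if (usedSlots[i] > guaranteedSlots[i]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The scenario above: Q1 at capacity (5/5), Q2 under it (4/5)
        // because a TT died. Nobody is over capacity, so no timer, no SLA.
        System.out.println(capacityWasTaken(new int[] {5, 4}, new int[] {5, 5}));  // prints "false"
    }
}
```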

> When computing absolute guaranteed capacity (GC) from a percent value, Capacity Scheduler
should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
> The Capacity Scheduler calculates a queue's absolute GC value by taking its percent
of the total cluster capacity (which is a float, since the configured GC% is a float) and
casting it to an int. Casting a float to an int truncates the fractional part, i.e., rounds
down for positive values. For very small clusters, this can leave a queue's GC one lower
than it should be. For example, if Q1 has a GC of 50%, Q2 has a GC of 40%, and Q3 has a GC
of 10%, and the cluster capacity is 4 (as we have in our test cases), today's code gives Q1
a GC of 2, Q2 a GC of 1, and Q3 a GC of 0. Q2's capacity should really be 2, since 40% of 4
is 1.6, which rounds to 2. 
> Simple fix is to use Math.round() rather than cast to an int. 
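A minimal illustration of the difference, using the 40%-of-4 case from the description:

```java
public class RoundingExample {
    public static void main(String[] args) {
        float gcPercent = 0.4f;   // Q2's configured 40%
        int clusterCapacity = 4;  // very small test cluster

        // Today's code: the cast truncates 1.6 down to 1.
        int truncated = (int) (gcPercent * clusterCapacity);

        // Proposed fix: Math.round() rounds 1.6 to the nearest int, 2.
        int rounded = Math.round(gcPercent * clusterCapacity);

        System.out.println(truncated + " vs " + rounded);  // prints "1 vs 2"
    }
}
```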

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
