Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <1284792906.1232482742101.JavaMail.jira@brutus>
Date: Tue, 20 Jan 2009 12:19:02 -0800 (PST)
From: "Joydeep Sen Sarma (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-5075) Potential infinite loop in
 updateMinSlots
In-Reply-To: <337180766.1232156639566.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665540#action_12665540 ] 

Joydeep Sen Sarma commented on HADOOP-5075:
-------------------------------------------

Question regarding the 'break' in the slotsLeft == oldSlots case:

This doesn't look correct to me. It seems that there is no guarantee that all available slots are distributed in one round, which is why we earlier had a for loop over the slots. But now we are claiming that by going over the jobs one last time we will be able to distribute all the slots?

The basic problem seems to be:

    int share = (int) Math.ceil(oldSlots * weight / totalWeight);
    slotsLeft = giveMinSlots(job, type, slotsLeft, share);

I believe the computed share is quite likely to be less than the maximum number of slots that the job can consume. So going from 'floor' to 'ceil' may not be enough to guarantee that slots get consumed (and certainly not enough to guarantee that *all* the remaining slots get consumed).
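
To make the concern concrete, here is a minimal sketch of a single pass over the jobs. It is not the scheduler's code: the class OneRoundDemo, the give() helper, and the demand/granted arrays are hypothetical stand-ins, and give() only assumes that giveMinSlots never hands a job more than its share, the slots left, or what the job can still use.

```java
import java.util.Arrays;

final class OneRoundDemo {
    /** Hypothetical stand-in for giveMinSlots: grant at most share, slotsLeft, and unmet demand. */
    static int give(int[] granted, int[] demand, int job, int slotsLeft, int share) {
        int n = Math.min(share, Math.min(slotsLeft, demand[job] - granted[job]));
        n = Math.max(n, 0); // never take slots back
        granted[job] += n;
        return slotsLeft - n;
    }

    /** One pass over the jobs, using ceil(oldSlots * weight / totalWeight) as each job's share. */
    static int oneRound(int[] granted, int[] demand, double[] weight, int slotsLeft) {
        int oldSlots = slotsLeft;
        double totalWeight = Arrays.stream(weight).sum();
        for (int j = 0; j < demand.length; j++) {
            int share = (int) Math.ceil(oldSlots * weight[j] / totalWeight);
            slotsLeft = give(granted, demand, j, slotsLeft, share);
        }
        return slotsLeft;
    }

    public static void main(String[] args) {
        int[] demand = {2, 20};      // job 0 can use 2 slots, job 1 can use 20
        int[] granted = new int[2];
        int left = oneRound(granted, demand, new double[] {1.0, 1.0}, 10);
        // prints: granted=[2, 5] slotsLeft=3
        System.out.println("granted=" + Arrays.toString(granted) + " slotsLeft=" + left);
    }
}
```

With 10 slots and equal weights, each share is 5. The first job only takes 2, the second is capped at its share of 5, and 3 slots remain after the pass even though the second job could still use them.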

My gut feeling is that the correct solution (when oldSlots == slotsLeft) should take into account the max tasks that a job can consume (as opposed to its weighted share only).
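
Roughly what I have in mind, as a sketch only and not a patch: when a round makes no progress, stop using the weighted share and cap each job by what it can still consume. The class DemandCappedFallback, the method assignByDemand, and the demand/granted arrays are made up for illustration and do not correspond to anything in the scheduler.

```java
import java.util.Arrays;

final class DemandCappedFallback {
    /**
     * Hypothetical fallback for the slotsLeft == oldSlots case: ignore the weighted
     * share and give each job as many of the remaining slots as it can still use.
     * Makes a single pass, so it always terminates. Returns the slots still unassigned.
     */
    static int assignByDemand(int[] granted, int[] demand, int slotsLeft) {
        for (int j = 0; j < demand.length && slotsLeft > 0; j++) {
            int canUse = Math.max(demand[j] - granted[j], 0); // unmet demand of job j
            int n = Math.min(canUse, slotsLeft);
            granted[j] += n;
            slotsLeft -= n;
        }
        return slotsLeft;
    }

    public static void main(String[] args) {
        int[] demand = {2, 20};
        int[] granted = {2, 5};   // state after some earlier weighted rounds
        int left = assignByDemand(granted, demand, 3);
        // prints: granted=[2, 8] slotsLeft=0
        System.out.println("granted=" + Arrays.toString(granted) + " slotsLeft=" + left);
    }
}
```

This hands out the leftover slots in job order and ignores weights, which may not be the fairness we want. The point is only that capping by demand either consumes the remaining slots or shows that no job can use them, so the outer loop can safely exit.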


> Potential infinite loop in updateMinSlots
> -----------------------------------------
>
>                 Key: HADOOP-5075
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5075
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Priority: Blocker
>             Fix For: 0.19.1, 0.20.0, 0.21.0
>
>         Attachments: hadoop-5075-v2.patch, hadoop-5075-v3.patch, hadoop-5075.patch
>
>
> We ran into a problem at Facebook where the updateMinSlots loop in the scheduler was repeating infinitely. This might happen if, due to rounding, we are unable to assign the last few slots in a pool. This patch adds a break statement to ensure that the loop exits if it hasn't managed to assign any slots.
