From: "Joydeep Sen Sarma (JIRA)"
To: core-dev@hadoop.apache.org
Date: Tue, 20 Jan 2009 12:19:02 -0800 (PST)
Subject: [jira] Commented: (HADOOP-5075) Potential infinite loop in updateMinSlots
Message-ID: <1284792906.1232482742101.JavaMail.jira@brutus>
In-Reply-To: <337180766.1232156639566.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665540#action_12665540 ]

Joydeep Sen Sarma commented on HADOOP-5075:
-------------------------------------------

A question regarding the 'break' in the slotsLeft == oldSlots case:

This doesn't look correct to me. There seems to be no guarantee that all available slots are distributed in one round, which is why we previously had a for loop over the slots. Are we now claiming that by going over the jobs one last time we will be able to distribute all the slots?

The basic problem seems to be:

    int share = (int) Math.ceil(oldSlots * weight / totalWeight);
    slotsLeft = giveMinSlots(job, type, slotsLeft, share);

I believe the share computed here is quite likely to be less than the maximum number of slots that the job can consume. So going from 'floor' to 'ceil' may not be enough to guarantee that slots get consumed (and certainly not enough to guarantee that *all* the remaining slots get consumed).

My gut feeling is that the correct solution (when oldSlots == slotsLeft) should take into account the maximum number of tasks that a job can consume, as opposed to its weighted share only.

> Potential infinite loop in updateMinSlots
> -----------------------------------------
>
>                 Key: HADOOP-5075
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5075
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Priority: Blocker
>             Fix For: 0.19.1, 0.20.0, 0.21.0
>
>         Attachments: hadoop-5075-v2.patch, hadoop-5075-v3.patch, hadoop-5075.patch
>
>
> We ran into a problem at Facebook where the updateMinSlots loop in the scheduler was repeating infinitely. This might happen if, due to rounding, we are unable to assign the last few slots in a pool. This patch adds a break statement to ensure that the loop exits if it hasn't managed to assign any slots.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
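
To make the concern in the comment above concrete, here is a minimal, self-contained sketch of a weighted-share round. It is not the fair scheduler's code: the class MinSlotsSketch, the methods oneRound and distribute, and the demand/given arrays are hypothetical names for illustration, and the capping inside oneRound is only an assumption about what giveMinSlots might do. It shows that a single pass with a ceil'd share can leave slots unassigned while a job could still use them, and that repeating the pass until no progress is made hands them out.

```java
public class MinSlotsSketch {
    /**
     * One pass over the jobs (hypothetical model). Each job that still needs
     * slots is offered ceil(oldSlots * weight / totalWeight), capped by what
     * it can still consume and by what is left. Returns the slots left over.
     */
    public static int oneRound(double[] weights, int[] demand, int[] given, int slotsLeft) {
        int oldSlots = slotsLeft;
        double totalWeight = 0;
        for (int i = 0; i < weights.length; i++) {
            if (given[i] < demand[i]) totalWeight += weights[i];
        }
        for (int i = 0; i < weights.length && slotsLeft > 0; i++) {
            if (given[i] >= demand[i]) continue; // job is already satisfied
            int share = (int) Math.ceil(oldSlots * weights[i] / totalWeight);
            int take = Math.min(share, Math.min(slotsLeft, demand[i] - given[i]));
            given[i] += take;
            slotsLeft -= take;
        }
        return slotsLeft;
    }

    /** Repeat rounds until all slots are assigned or a round makes no progress. */
    public static int distribute(double[] weights, int[] demand, int[] given, int slots) {
        int slotsLeft = slots;
        while (slotsLeft > 0) {
            int oldSlots = slotsLeft;
            slotsLeft = oneRound(weights, demand, given, slotsLeft);
            if (slotsLeft == oldSlots) break; // nobody could take a slot
        }
        return slotsLeft;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0}; // equal weights
        int[] demand = {1, 9};         // max slots each job can consume
        int[] given = {0, 0};
        int left = oneRound(weights, demand, given, 10);
        System.out.println("slots left after one round: " + left);
        left = distribute(weights, demand, given, left);
        System.out.println("slots left after repeating: " + left);
    }
}
```

In this model, two equally weighted jobs that can consume 1 and 9 slots share 10 slots. Each is offered 5 in the first round, so the first job takes 1, the second takes 5, and 4 slots remain even though the second job could still use them. A further round assigns those 4.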