Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <577600423.1234933322123.JavaMail.jira@brutus>
Date: Tue, 17 Feb 2009 21:02:02 -0800 (PST)
From: "Andy Pavlo (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-5241) Reduce tasks get stuck because of over-estimated task size (regression from 0.18)
In-Reply-To: <60673373.1234476179610.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ https://issues.apache.org/jira/browse/HADOOP-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Pavlo updated HADOOP-5241:
-------------------------------

Attachment: hadoop-patched-jobtracker.log.gz

JobTracker logfile running same benchmarks as before without any errors.

> Reduce tasks get stuck because of over-estimated task size (regression from 0.18)
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-5241
> URL: https://issues.apache.org/jira/browse/HADOOP-5241
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.19.0
> Environment: Red Hat Enterprise Linux Server release 5.2
> JDK 1.6.0_11
> Hadoop 0.19.0
> Reporter: Andy Pavlo
> Assignee: Sharad Agarwal
> Priority: Blocker
> Fix For: 0.19.1
>
> Attachments: 5241_v1.patch, hadoop-jobtracker.log.gz, hadoop-patched-jobtracker.log.gz, hadoop_task_screenshot.png
>
>
> I have a simple MR benchmark job that computes PageRank on about 600 GB of HTML files using a 100 node cluster. For some reason, my reduce tasks get caught in a pending state. The JobTracker's log gets filled with the following messages:
> 2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 has 110125027328 bytes free; but we expect reduce input to take 399642198235
> 2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 has 107537776640 bytes free; but we expect reduce input to take 399642198235
> 2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 has 113631690752 bytes free; but we expect reduce input to take 399642198235
>
> The weird thing is that I get through about 70 reduce tasks completing before it hangs. If I reduce the amount of the input data on 100 nodes down to 200GB, then it seems to work.
> As I scale the amount of input to the number of nodes, I can get it to work some of the time on 50 nodes, and without any problems on 25 nodes or fewer.
> Note that it worked without any problems on Hadoop 0.18 late last year, with no changes to the input data or the actual MR code.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
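
For anyone who wants to check the quoted "No room for reduce task" warnings mechanically, here is a minimal Python sketch that pulls the node name, free bytes, and expected reduce input out of each line and prints their ratio. The regular expression is inferred only from the three log lines quoted in the report, so it is an assumption about the message format rather than anything taken from the Hadoop source, and `parse_no_room` is a hypothetical helper name.

```python
import re

# Pattern inferred from the three WARN lines quoted in the report (an
# assumption); other Hadoop versions may word this message differently.
NO_ROOM = re.compile(
    r"No room for reduce task\. Node (?P<node>\S+) has (?P<free>\d+) bytes free; "
    r"but we expect reduce input to take (?P<expected>\d+)"
)


def parse_no_room(line):
    """Return (node, free_bytes, expected_bytes), or None if the line does not match."""
    m = NO_ROOM.search(line)
    if m is None:
        return None
    return m.group("node"), int(m.group("free")), int(m.group("expected"))


# The three JobTracker log lines quoted in the issue description.
SAMPLE = [
    "2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: "
    "No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 "
    "has 110125027328 bytes free; but we expect reduce input to take 399642198235",
    "2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: "
    "No room for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 "
    "has 107537776640 bytes free; but we expect reduce input to take 399642198235",
    "2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: "
    "No room for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 "
    "has 113631690752 bytes free; but we expect reduce input to take 399642198235",
]

for line in SAMPLE:
    node, free, expected = parse_no_room(line)
    # Report sizes in decimal gigabytes and how far the estimate exceeds free space.
    print(f"{node}: free={free / 1e9:.1f} GB, "
          f"expected={expected / 1e9:.1f} GB, ratio={expected / free:.2f}")
```

On the three quoted lines, the expected reduce input is between about 3.5 and 3.7 times the free space reported for each node, so no node can pass the check. The same loop could be pointed at the attached hadoop-jobtracker.log.gz (after decompressing it) to see whether the estimate is identical for every pending reduce task.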