From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Date: Wed, 13 Aug 2008 19:17:44 -0700 (PDT)
Subject: [jira] Updated: (HADOOP-3446) The reduce task should not flush the in memory file system before starting the reducer
Message-ID: <252568237.1218680264403.JavaMail.jira@brutus>
In-Reply-To: <1115045238.1211824682958.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-0.patch

I tested this on a 100-node cluster (98 tasktrackers) using sort. Given 300 MB of data per node, a sufficiently large io.sort.mb and fs.inmemory.size.mb, io.sort.spill.percent=1.0, fs.inmemory.merge.threshold=0, and mapred.inmem.usage=1.0, each reduce took an average of 121 seconds when reading from disk vs. 79 seconds when merging and reducing from memory. While the sort with the patch finished the job in 8 minutes instead of 9, both runs had slow tasktrackers that skewed the running times. The patch also includes similar changes to MapTask, letting the soft limits on the record and serialization buffers be configured separately.

> The reduce task should not flush the in memory file system before starting the reducer
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>         Attachments: 3446-0.patch
>
>
> In the case where the entire reduce input fits in RAM, we currently force the input to disk and re-read it before giving it to the reducer. It would be much better to merge from the ramfs and any spills to feed the reducer its input.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
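
[Editor's note: for readers who want to try the settings named in the comment, they map onto an ordinary JobConf. The sketch below is only illustrative: the property names are copied verbatim from the comment, while the class name and the concrete values (300 MB buffers) are assumptions for illustration, not the exact configuration used in the reported test or anything shipped in the attached patch.]

    // Sketch only: applies the tuning knobs named in the comment above to a
    // JobConf before job submission. Property names come from the comment;
    // the values here are illustrative placeholders.
    import org.apache.hadoop.mapred.JobConf;

    public class InMemReduceTuning {
        public static JobConf tune(JobConf conf) {
            conf.set("io.sort.mb", "300");                // map-side sort buffer, in MB (assumed size)
            conf.set("fs.inmemory.size.mb", "300");       // reduce-side in-memory fs size, in MB (assumed size)
            conf.set("io.sort.spill.percent", "1.0");     // spill only when the sort buffer is completely full
            conf.set("fs.inmemory.merge.threshold", "0"); // don't trigger merges on segment count
            conf.set("mapred.inmem.usage", "1.0");        // keep reduce inputs in memory as long as they fit
            return conf;
        }
    }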