From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Tue, 17 Oct 2006 11:49:40 -0700 (PDT)
Message-ID: <9382488.1161110980489.JavaMail.jira@brutus>
In-Reply-To: <25178997.1151527889918.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index

    [ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12443034 ]

Doug Cutting commented on HADOOP-331:
-------------------------------------

Owen:

> the partition is implicit in which list the KeyByteOffset is on

Ah, I somehow missed the lists. Sorry! You're right, no part is needed in KeyByteOffset.

Sameer:

> only include the partition# in the first key in each partition

But then we cannot use SequenceFile's merger, since keys wouldn't be independently comparable, right?

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
>
> The current strategy of writing a file per target map is consuming a lot of unused buffer space (causing out of memory crashes) and puts a lot of burden on the FS (many opens, inodes used, etc).
> I propose that we write a single file containing all output and also write an index file IDing which byte range in the file goes to each reduce. This will remove the issue of buffer waste, address scaling issues with number of open files and generally set us up better for scaling. It will also have advantages with very small inputs, since the buffer cache will reduce the number of seeks needed and the data serving node can open a single file and just keep it open rather than needing to do directory and open ops on every request.
> The only issue I see is that in cases where the task output is substantially larger than its input, we may need to spill multiple times. In this case, we can do a merge after all spills are complete (or during the final spill).
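
For illustration, the single-file-plus-index layout proposed in the issue description might look roughly like the sketch below. This is a minimal Python sketch under stated assumptions, not Hadoop's implementation: the record framing (two 4-byte big-endian lengths followed by key and value bytes), the function names write_partitioned and read_partition, and the in-memory index of (offset, length) pairs are all hypothetical choices made for the example.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

Record = Tuple[int, bytes, bytes]   # (partition, key, value); hypothetical shape
IndexEntry = Tuple[int, int]        # (byte offset, byte length) of one partition


def write_partitioned(out: BinaryIO, records: List[Record],
                      num_partitions: int) -> List[IndexEntry]:
    """Write all records to one stream, grouped by partition and sorted by
    key within each partition. Returns one index entry per partition."""
    buckets: List[List[Tuple[bytes, bytes]]] = [[] for _ in range(num_partitions)]
    for part, key, value in records:
        buckets[part].append((key, value))

    index: List[IndexEntry] = []
    for bucket in buckets:
        start = out.tell()
        for key, value in sorted(bucket):
            # Assumed framing: key length, value length, then the raw bytes.
            out.write(struct.pack(">II", len(key), len(value)))
            out.write(key)
            out.write(value)
        index.append((start, out.tell() - start))
    return index


def read_partition(f: BinaryIO, index: List[IndexEntry],
                   partition: int) -> List[Tuple[bytes, bytes]]:
    """Seek to one partition's byte range and decode only its records."""
    offset, length = index[partition]
    f.seek(offset)
    data = f.read(length)
    pos = 0
    result = []
    while pos < len(data):
        klen, vlen = struct.unpack_from(">II", data, pos)
        pos += 8
        key = data[pos:pos + klen]
        pos += klen
        value = data[pos:pos + vlen]
        pos += vlen
        result.append((key, value))
    return result
```

In this sketch the partition number is never stored with a key: it is implied by which index entry covers the record, so every key stays independently comparable. A reader serving one reduce needs only a single open file, one seek, and one contiguous read.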