From: "Sameer Paranjpye (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Tue, 13 Mar 2007 12:57:09 -0700 (PDT)
Message-ID: <32393664.1173815829459.JavaMail.jira@brutus>
In-Reply-To: <1830903.1169189129949.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-910) Reduces can do merges for the on-disk map output files in parallel with their copying

    [ https://issues.apache.org/jira/browse/HADOOP-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480530 ]

Sameer Paranjpye commented on HADOOP-910:
-----------------------------------------

What was your specific observation Runping? We're already doing a lot of in-memory merges in parallel with the shuffle and from all the runs we've seen it looks like the shuffle/merge tracks the maps pretty closely. Can we get some real data here because this feels like premature optimization.

> Reduces can do merges for the on-disk map output files in parallel with their copying
> -------------------------------------------------------------------------------------
>
>                 Key: HADOOP-910
>                 URL: https://issues.apache.org/jira/browse/HADOOP-910
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Gautam Kowshik
>
> Proposal to extend the parallel in-memory-merge/copying, that is being done as part of HADOOP-830, to the on-disk files.
> Today, the Reduces dump the map output files to disk and the final merge happens only after all the map outputs have been collected. It might make sense to parallelize this part. That is, whenever a Reduce has collected io.sort.factor number of segments on disk, it initiates a merge of those and creates one big segment. If the rate of copying is faster than the merge, we can probably have multiple threads doing parallel merges of independent sets of io.sort.factor number of segments. If the rate of copying is not as fast as merge, we stand to gain a lot - at the end of copying of all the map outputs, we will be left with a small number of segments for the final merge (which hopefully will feed the reduce directly (via the RawKeyValueIterator) without having to hit the disk for writing additional output segments).
> If the disk bandwidth is higher than the network bandwidth, we have a good story, I guess, to do such a thing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
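
For illustration only, the merge-as-you-copy idea in the quoted proposal can be sketched as a toy Java model. This is not Hadoop code: sorted in-memory lists stand in for on-disk segments, the class and method names (OnDiskMergeSketch, segmentCopied, segmentsForFinalMerge) are hypothetical, sortFactor stands in for io.sort.factor, and the merge runs inline on a single thread rather than in parallel with the copying.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of the proposal: sorted in-memory lists stand in for on-disk segments. */
class OnDiskMergeSketch {
    private final int sortFactor;                                  // stands in for io.sort.factor
    private final List<List<Integer>> pending = new ArrayList<>(); // copied, not yet merged
    private final List<List<Integer>> merged = new ArrayList<>();  // results of intermediate merges

    OnDiskMergeSketch(int sortFactor) {
        if (sortFactor < 2) {
            throw new IllegalArgumentException("sortFactor must be >= 2");
        }
        this.sortFactor = sortFactor;
    }

    /** Called when one map output has been copied; merges once sortFactor segments are pending. */
    void segmentCopied(List<Integer> sortedSegment) {
        pending.add(sortedSegment);
        if (pending.size() == sortFactor) {
            merged.add(mergeSorted(pending)); // mergeSorted returns a fresh list
            pending.clear();
        }
    }

    /** Segments that remain for the final merge once copying is done. */
    List<List<Integer>> segmentsForFinalMerge() {
        List<List<Integer>> all = new ArrayList<>(merged);
        all.addAll(pending);
        return all;
    }

    /** k-way merge of sorted lists using a heap of {value, segmentIndex, offset} cursors. */
    static List<Integer> mergeSorted(List<List<Integer>> segments) {
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) {
                heap.add(new int[] {segments.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cur = heap.poll();
            out.add(cur[0]);
            List<Integer> seg = segments.get(cur[1]);
            int next = cur[2] + 1;
            if (next < seg.size()) {
                heap.add(new int[] {seg.get(next), cur[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        OnDiskMergeSketch reduce = new OnDiskMergeSketch(3);
        for (int i = 0; i < 7; i++) {
            reduce.segmentCopied(Arrays.asList(i, i + 10, i + 20));
        }
        // Seven copied segments with a factor of 3: two merged segments plus one unmerged.
        System.out.println(reduce.segmentsForFinalMerge().size()
                + " segments left for the final merge");
    }
}
```

With seven copied segments and a factor of 3, the sketch ends with three segments instead of seven for the final merge. Whether that saves wall-clock time in practice depends on the copy rate versus the merge rate, which is the data the comment above asks for.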