Message-ID: <1938604808.1206570564408.JavaMail.jira@brutus>
Date: Wed, 26 Mar 2008 15:29:24 -0700 (PDT)
From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-2919) Create fewer copies of buffer data during sort/spill
In-Reply-To: <1844634893.1204337031074.JavaMail.jira@brutus>
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm


     [ https://issues.apache.org/jira/browse/HADOOP-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-2919:
----------------------------------

    Attachment: 2919-7.patch

This patch is
identical to 2919-6, but the output buffer is released prior to the merge.

> Create fewer copies of buffer data during sort/spill
> ----------------------------------------------------
>
>                 Key: HADOOP-2919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2919
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>            Priority: Blocker
>             Fix For: 0.17.0
>
>         Attachments: 2919-0.patch, 2919-1.patch, 2919-2.patch, 2919-3.patch, 2919-4.patch, 2919-5.patch, 2919-6.patch, 2919-7.patch
>
>
> Currently, the sort/spill works as follows:
> Let r be the number of partitions
> For each call to collect(K,V) from map:
> * If buffers do not exist, allocate a new DataOutputBuffer to collect K,V bytes, allocate r buffers for collecting K,V offsets
> * Write K,V into buffer, noting offsets
> * Register offsets with associated partition buffer, allocating/copying accounting buffers if necessary
> * Calculate the total mem usage for buffer and all partition collectors by iterating over the collectors
> * If total mem usage is greater than half of io.sort.mb, then start a new thread to spill, blocking if another spill is in progress
> For each spill (assuming no combiner):
> * Save references to our K,V byte buffer and accounting data, setting the former to null (will be recreated on the next call to collect(K,V))
> * Open a SequenceFile.Writer for this partition
> * Sort each partition separately (the current version of sort reuses, but still requires wrapping, indices in IntWritable objects)
> * Build a RawKeyValueIterator of sorted data for the partition
> * Deserialize each key and value and call SequenceFile::append(K,V) on the writer for this partition
> There are a number of opportunities for reducing the number of copies, creations, and operations we perform in this stage, particularly since growing many of the buffers involved requires that we copy the existing data to the newly sized allocation.
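
For readers following the collect(K,V) steps quoted above, here is a rough, self-contained sketch of that bookkeeping. It is not the Hadoop MapTask code. The class name CollectSketch, the use of ByteArrayOutputStream in place of DataOutputBuffer, the hash-based partition choice, and the 12-bytes-per-record accounting cost are all assumptions made for illustration. The spill itself is reduced to dropping the buffer references; the spill thread, sort, and SequenceFile output are left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the collect-side accounting; not Hadoop's actual code. */
public class CollectSketch {
    private final int partitions;           // "r" in the description
    private final long spillThreshold;      // half of the sort budget (io.sort.mb in the issue)
    private ByteArrayOutputStream kvBytes;  // stand-in for DataOutputBuffer (assumption)
    private List<List<int[]>> offsets;      // per partition: {keyStart, valStart, end}
    private int spills = 0;

    public CollectSketch(int partitions, long sortBudgetBytes) {
        this.partitions = partitions;
        this.spillThreshold = sortBudgetBytes / 2;
    }

    public void collect(byte[] key, byte[] value) {
        // Allocate the byte buffer and the r offset buffers lazily.
        if (kvBytes == null) {
            kvBytes = new ByteArrayOutputStream();
            offsets = new ArrayList<>();
            for (int i = 0; i < partitions; i++) {
                offsets.add(new ArrayList<>());
            }
        }
        // Write K,V into the buffer, noting offsets.
        int keyStart = kvBytes.size();
        kvBytes.write(key, 0, key.length);
        int valStart = kvBytes.size();
        kvBytes.write(value, 0, value.length);
        int end = kvBytes.size();
        // Register the offsets with the partition's buffer (hash partitioning is assumed).
        int p = (Arrays.hashCode(key) & Integer.MAX_VALUE) % partitions;
        offsets.get(p).add(new int[] {keyStart, valStart, end});
        // Spill once total usage exceeds half of the budget.
        if (memoryUsed() > spillThreshold) {
            spill();
        }
    }

    /** Total usage, computed by iterating over every partition collector. */
    public long memoryUsed() {
        long total = (kvBytes == null) ? 0 : kvBytes.size();
        if (offsets != null) {
            for (List<int[]> partition : offsets) {
                total += 12L * partition.size(); // three ints per record (assumed cost)
            }
        }
        return total;
    }

    // The described code hands the buffers to a spill thread; this sketch only
    // drops the references so the next collect() call recreates them.
    private void spill() {
        kvBytes = null;
        offsets = null;
        spills++;
    }

    public int spillCount() {
        return spills;
    }
}
```

With a 64-byte budget and 4-byte keys and values, the first collect() uses 20 bytes under this accounting and the second reaches 40, which crosses the 32-byte threshold and triggers one spill. The sketch also shows why the per-call iteration over all r collectors is costly when r is large.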
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.