Message-ID: <24422744.1162023318979.JavaMail.root@brutus>
Date: Sat, 28 Oct 2006 01:15:18 -0700 (PDT)
From: "Devaraj Das (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-611) SequenceFile.Sorter should have a merge method that returns an iterator
In-Reply-To: <6429596.1161148355062.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/HADOOP-611?page=all ]

Devaraj Das updated HADOOP-611:
-------------------------------

    Attachment: merge.patch

Apart from the RawKeyValueIterator that the merge APIs return, this patch
implements the following merge optimizations:

1. The optimization described in HADOOP-591 (Reducer sort should even out the pass factors in merging different pass).
2. Temp files are deleted as soon as possible.
3. Added two new APIs, nextRawKey and nextRawValue. The current implementation keeps at least 'factor * (sizeof(RawKey) + sizeof(RawValue))' bytes in memory. This is now optimized to keep at most 'factor * (sizeof(RawKey)) + sizeof(RawValue)' bytes in memory.
4. Merge temp files are spread out over multiple disks if mapred.local.dir is configured with multiple directories.

The RawKeyValueIterator looks like:

  public static interface RawKeyValueIterator {
    /** Gets the current raw key
     * @return DataOutputBuffer
     * @throws IOException
     */
    DataOutputBuffer getKey() throws IOException;
    /** Gets the current raw value
     * @return ValueBytes
     * @throws IOException
     */
    ValueBytes getValue() throws IOException;
    /** Sets up the current key and value (for getKey and getValue)
     * @return true if there exists a key/value, false otherwise
     * @throws IOException
     */
    boolean next() throws IOException;
    /** Closes the iterator so that the underlying streams can be closed
     * @throws IOException
     */
    void close() throws IOException;
  }

The public merge methods are:

//Merges the list of segments of type SegmentDescriptor
1. public RawKeyValueIterator merge(List segments) throws IOException;

//Merges the contents of files passed in Path[]
2. public RawKeyValueIterator merge(Path [] inNames, boolean deleteInputs) throws IOException;

//Clones the attributes (like compression) of the input file and
//creates a corresponding Writer
3. public Writer cloneFileAttributes(FileSystem fileSys, Path inputFile, Path outputFile, Progressable prog) throws IOException;

//Writes records from RawKeyValueIterator into a file represented
//by the passed writer
4. public void writeFile(RawKeyValueIterator records, Writer writer) throws IOException;

//Merge the provided files
5.
public void merge(Path[] inFiles, Path outFile) throws IOException;

Feedback welcome.

> SequenceFile.Sorter should have a merge method that returns an iterator
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-611
>                 URL: http://issues.apache.org/jira/browse/HADOOP-611
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.8.0
>
>         Attachments: merge.patch
>
>
> SequenceFile.Sorter should get a new merge method that returns an iterator over the keys/values.
> The current merge method should become a simple method that gets the iterator and writes the records out to a file.
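
For readers who want a concrete picture of the "merge returns an iterator, and the file-writing merge just drains it" idea, here is a minimal, self-contained sketch. It is not the code in merge.patch and does not use any Hadoop classes: MergeSketch, KeyValueIterator, Segment and drain are hypothetical names invented for illustration, keys and values are plain Strings rather than raw bytes, and the segments live in memory instead of in SequenceFiles. It only shows the next()/getKey()/getValue()/close() calling pattern on top of a priority-queue k-way merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: hypothetical names, not the SequenceFile.Sorter API. */
public class MergeSketch {

  /** Simplified analogue of RawKeyValueIterator (Strings instead of raw bytes). */
  public interface KeyValueIterator {
    String getKey();
    String getValue();
    /** Advances to the next record; returns false when the merge is exhausted. */
    boolean next();
    void close();
  }

  /** One already-sorted input: parallel key/value lists plus a read position. */
  public static final class Segment {
    final List<String> keys;
    final List<String> values;
    int pos = 0;

    public Segment(List<String> keys, List<String> values) {
      this.keys = keys;
      this.values = values;
    }

    boolean hasCurrent() { return pos < keys.size(); }
    String currentKey() { return keys.get(pos); }
  }

  /** Returns an iterator over the k-way merge of the sorted segments. */
  public static KeyValueIterator merge(List<Segment> segments) {
    // The queue orders segments by the key at their current read position.
    final PriorityQueue<Segment> queue = new PriorityQueue<>(
        Math.max(1, segments.size()),
        (a, b) -> a.currentKey().compareTo(b.currentKey()));
    for (Segment s : segments) {
      if (s.hasCurrent()) {
        queue.add(s);
      }
    }
    return new KeyValueIterator() {
      private String key;
      private String value;

      @Override public boolean next() {
        Segment s = queue.poll();          // segment holding the smallest key
        if (s == null) {
          return false;
        }
        key = s.currentKey();
        value = s.values.get(s.pos);       // the value is read only when emitted
        s.pos++;
        if (s.hasCurrent()) {
          queue.add(s);                    // re-insert under its next key
        }
        return true;
      }

      @Override public String getKey() { return key; }
      @Override public String getValue() { return value; }
      @Override public void close() { queue.clear(); }
    };
  }

  /** Analogue of writeFile: drains the iterator into a list of "key=value". */
  public static List<String> drain(KeyValueIterator records) {
    List<String> out = new ArrayList<>();
    while (records.next()) {
      out.add(records.getKey() + "=" + records.getValue());
    }
    records.close();
    return out;
  }
}
```

In this sketch a file-to-file merge would simply be drain(merge(segments)) with a real writer in place of the list, which mirrors how the issue description asks for the existing merge method to become a thin wrapper around the iterator.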