Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <24422744.1162023318979.JavaMail.root@brutus>
Date: Sat, 28 Oct 2006 01:15:18 -0700 (PDT)
From: "Devaraj Das (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-611) SequenceFile.Sorter should have a
 merge method that returns an iterator
In-Reply-To: <6429596.1161148355062.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ http://issues.apache.org/jira/browse/HADOOP-611?page=all ]

Devaraj Das updated HADOOP-611:
-------------------------------

    Attachment: merge.patch

Apart from the RawKeyValueIterator that the merge APIs return, this patch implements the following merge optimizations:
1. The optimization described in HADOOP-591 (Reducer sort should even out the pass factors in merging different pass)
2. Temp files are deleted as soon as possible
3. Added two new APIs, nextRawKey and nextRawValue. The current implementation keeps at least
'factor * (sizeof(RawKey) + sizeof(RawValue))' bytes in memory. The patch reduces this to at most
'factor * sizeof(RawKey) + sizeof(RawValue)' bytes in memory.
4. Merge temp files are spread out on multiple disks if mapred.local.dir is configured to have multiple directories.
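
To make the memory bounds in item 3 concrete, here is a small arithmetic sketch. It is not taken from merge.patch: the class name, method names, and the sizes used are all hypothetical, and it assumes every key and every value has the same size, which real records need not have.

```java
// Hypothetical helper for illustration only; not part of merge.patch.
final class MergeMemoryEstimate {
    /** Old bound: each of the `factor` open segments buffers a key and a value. */
    static long keysAndValuesBuffered(int factor, long keyBytes, long valueBytes) {
        return factor * (keyBytes + valueBytes);
    }

    /** New bound: each segment buffers a key, but only one value is held at a time. */
    static long keysPlusOneValueBuffered(int factor, long keyBytes, long valueBytes) {
        return factor * keyBytes + valueBytes;
    }

    public static void main(String[] args) {
        int factor = 10;           // assumed merge factor
        long keyBytes = 100;       // assumed raw key size
        long valueBytes = 10_000;  // assumed raw value size
        System.out.println(keysAndValuesBuffered(factor, keyBytes, valueBytes));    // 101000
        System.out.println(keysPlusOneValueBuffered(factor, keyBytes, valueBytes)); // 11000
    }
}
```

With large values and small keys, as assumed here, almost all of the saving comes from not holding one value per open segment.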

The RawKeyValueIterator looks like:
    
public static interface RawKeyValueIterator {
      /** Gets the current raw key
       * @return DataOutputBuffer
       * @throws IOException
       */
      DataOutputBuffer getKey() throws IOException; 
      /** Gets the current raw value
       * @return ValueBytes 
       * @throws IOException
       */
      ValueBytes getValue() throws IOException; 
      /** Sets up the current key and value (for getKey and getValue)
       * @return true if there exists a key/value, false otherwise 
       * @throws IOException
       */
      boolean next() throws IOException;
      /** Closes the iterator so that the underlying streams can be closed
       * @throws IOException
       */
      void close() throws IOException;
    }    
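
A caller would presumably drive this interface with a next()/getKey()/getValue() loop and close it when done. The sketch below shows that calling pattern only. So that it compiles without Hadoop, it uses a simplified stand-in interface (SimpleRawIterator, with byte[] in place of DataOutputBuffer and ValueBytes) and a hypothetical in-memory implementation; none of these names come from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for RawKeyValueIterator (assumption: byte[] replaces
// DataOutputBuffer and ValueBytes so the sketch has no Hadoop dependency).
interface SimpleRawIterator {
    byte[] getKey() throws IOException;
    byte[] getValue() throws IOException;
    boolean next() throws IOException;
    void close() throws IOException;
}

// Hypothetical in-memory implementation, used only to exercise the contract.
final class ArrayRawIterator implements SimpleRawIterator {
    private final String[] keys;
    private final String[] values;
    private int pos = -1;      // positioned before the first record until next()
    private boolean closed;

    ArrayRawIterator(String[] keys, String[] values) {
        this.keys = keys;
        this.values = values;
    }

    public boolean next() {
        if (closed || pos + 1 >= keys.length) return false;
        pos++;
        return true;
    }
    public byte[] getKey() { return keys[pos].getBytes(StandardCharsets.UTF_8); }
    public byte[] getValue() { return values[pos].getBytes(StandardCharsets.UTF_8); }
    public void close() { closed = true; }
    boolean isClosed() { return closed; }
}

final class IteratorContractDemo {
    /** Reads every record, then closes the iterator even if reading fails. */
    static List<String> drain(SimpleRawIterator it) throws IOException {
        List<String> out = new ArrayList<>();
        try {
            while (it.next()) {   // next() sets up the record for getKey/getValue
                String k = new String(it.getKey(), StandardCharsets.UTF_8);
                String v = new String(it.getValue(), StandardCharsets.UTF_8);
                out.add(k + "=" + v);
            }
        } finally {
            it.close();
        }
        return out;
    }

    /** Convenience wrapper that converts the checked exception. */
    static List<String> demo() {
        try {
            return drain(new ArrayRawIterator(new String[] {"a", "b"}, new String[] {"1", "2"}));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```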

The public merge methods are:
//Merges the list of segments of type SegmentDescriptor
1. public RawKeyValueIterator merge(List<SegmentDescriptor> segments) throws IOException;
//Merges the contents of the files passed in Path[]
2. public RawKeyValueIterator merge(Path[] inNames, boolean deleteInputs) throws IOException;
//Clones the attributes (like compression) of the input file and
//creates a corresponding Writer
3. public Writer cloneFileAttributes(FileSystem fileSys, Path inputFile, Path outputFile, Progressable prog) throws IOException;
//Writes records from RawKeyValueIterator into a file represented
//by the passed writer
4. public void writeFile(RawKeyValueIterator records, Writer writer) throws IOException;
//Merges the provided files
5. public void merge(Path[] inFiles, Path outFile) throws IOException;
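
The methods above suggest a layering in which the file-writing merge is a thin wrapper: obtain the iterator, then write its records out. Below is a minimal sketch of that layering, not the patch's implementation. It is an assumption-laden stand-in: sorted lists of strings replace segment files, a java.util.PriorityQueue does the k-way merge, a List replaces the Writer, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

final class MergeSketch {
    // One open input: its current head record plus the rest of the sorted run.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> rest) {
            this.rest = rest;
            this.head = rest.next();
        }
        boolean advance() {
            if (!rest.hasNext()) return false;
            head = rest.next();
            return true;
        }
        public int compareTo(Cursor other) { return head.compareTo(other.head); }
    }

    /** Lazily merges already-sorted runs; stands in for a merge that returns an iterator. */
    static Iterator<String> merge(List<List<String>> sortedRuns) {
        final PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) heap.add(new Cursor(run.iterator()));
        }
        return new Iterator<String>() {
            public boolean hasNext() { return !heap.isEmpty(); }
            public String next() {
                Cursor c = heap.poll();            // run with the smallest head
                if (c == null) throw new NoSuchElementException();
                String out = c.head;
                if (c.advance()) heap.add(c);      // re-insert if the run has more
                return out;
            }
        };
    }

    /** Stands in for writeFile(records, writer): drains the iterator into a sink. */
    static void writeAll(Iterator<String> records, List<String> sink) {
        while (records.hasNext()) sink.add(records.next());
    }

    /** Stands in for merge(inFiles, outFile): get the iterator, then write it out. */
    static List<String> mergeToList(List<List<String>> sortedRuns) {
        List<String> out = new ArrayList<>();
        writeAll(merge(sortedRuns), out);
        return out;
    }
}
```

Because the merge is lazy, a caller that only needs to stream the records (for example, straight into a reducer) can skip the write step entirely.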

Feedback welcome.

> SequenceFile.Sorter should have a merge method that returns an iterator
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-611
>                 URL: http://issues.apache.org/jira/browse/HADOOP-611
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.8.0
>
>         Attachments: merge.patch
>
>
> SequenceFile.Sorter should get a new merge method that returns an iterator over the keys/values.
> The current merge method should become a simple method that gets the iterator and writes the records out to a file.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira