Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <464A00DF.8050908@apache.org>
Date: Tue, 15 May 2007 11:50:07 -0700
From: Doug Cutting <cutting@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: Re: Merge sequence files
References: <4649F787.8020507@oskarsson.nu>
In-Reply-To: <4649F787.8020507@oskarsson.nu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Johan Oskarsson wrote:
> I'm considering using the sequence file output of hadoop jobs to serve 
> data from as it would mean I could skip the conversion from sequence 
> file -> other file format step.
> 
> To do this efficiently I would need the data to be in one file.

I think it should be more efficient to keep things in separate files. 
If you use MapFileOutputFormat, there are methods to randomly access 
entries from job output:

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html
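To illustrate why separate files need not hurt lookups, here is a rough 
plain-Java sketch of the idea. It is not Hadoop code: the class 
PartitionedLookup and its methods are made up for illustration, and each 
sorted TreeMap stands in for one indexed part file. It assumes the job 
assigned keys to parts by hash modulo the number of parts, so the same 
rule finds the one part that can hold a key. In Hadoop itself the 
corresponding calls should be MapFileOutputFormat.getReaders and 
getEntry; check the javadoc above for the exact signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PartitionedLookup {

    // Hash-partitioning rule (assumed): non-negative hash modulo part count.
    public static int partitionFor(String key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Stand-in for the job output: one sorted map per part file.
    public static List<TreeMap<String, String>> writeParts(String[][] records, int numParts) {
        List<TreeMap<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < numParts; i++) {
            parts.add(new TreeMap<>());
        }
        for (String[] record : records) {
            parts.get(partitionFor(record[0], numParts)).put(record[0], record[1]);
        }
        return parts;
    }

    // Random access: pick the single part that can hold the key, then do a
    // logarithmic-time lookup inside it. Returns null if the key is absent.
    public static String lookup(List<TreeMap<String, String>> parts, String key) {
        return parts.get(partitionFor(key, parts.size())).get(key);
    }

    public static void main(String[] args) {
        String[][] records = {{"apple", "1"}, {"banana", "2"}, {"cherry", "3"}};
        List<TreeMap<String, String>> parts = writeParts(records, 4);
        System.out.println(lookup(parts, "banana")); // prints 2
        System.out.println(lookup(parts, "durian")); // prints null
    }
}
```

Only one part is consulted per lookup, so adding more part files does not 
add work per key, provided the reader uses the same partitioning rule 
and part count as the job that wrote the output.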

SequenceFileOutputFormat will also let you open readers for all of a 
job's output files, but there's no random access, since a SequenceFile 
has no index.

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html

Will these suffice?

Doug