Message-ID: <464A00DF.8050908@apache.org>
Date: Tue, 15 May 2007 11:50:07 -0700
From: Doug Cutting
To: hadoop-dev@lucene.apache.org
Subject: Re: Merge sequence files
In-Reply-To: <4649F787.8020507@oskarsson.nu>

Johan Oskarsson wrote:
> I'm considering using the sequence file output of hadoop jobs to serve
> data from as it would mean I could skip the conversion from sequence
> file -> other file format step.
>
> To do this efficiently I would need the data to be in one file.

I think it should be more efficient to keep things in separate files.
If you use MapFileOutputFormat, there are methods to randomly access
entries from job output:

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html

SequenceFileOutputFormat will also let you open all readers, but there's
no random access, since a SequenceFile has no index.

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html

Will these suffice?

Doug
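
For anyone reading this in the archive, the idea of random access across
separate output files can be modelled roughly as below. This is a toy,
in-memory sketch in plain Java, not Hadoop code: PartitionedLookup,
partitionFor, writeParts and the TreeMap stand-ins for the per-reducer
parts are all made up for illustration. It assumes the job used hash
partitioning, so the key alone tells you which part to open. In Hadoop
itself the corresponding calls should be MapFileOutputFormat.getReaders
and getEntry; check the Javadoc linked above for the real signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of partitioned job output (hypothetical names, no Hadoop dependency).
class PartitionedLookup {

    // Hash partitioning rule: clear the sign bit, then take the remainder.
    static int partitionFor(String key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Stand-in for the reduce phase: each part holds its own keys in sorted order.
    static List<TreeMap<String, String>> writeParts(String[][] records, int numParts) {
        List<TreeMap<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < numParts; i++) {
            parts.add(new TreeMap<>());
        }
        for (String[] record : records) {
            parts.get(partitionFor(record[0], numParts)).put(record[0], record[1]);
        }
        return parts;
    }

    // Lookup: pick the part from the key, then search only that part.
    static String getEntry(List<TreeMap<String, String>> parts, String key) {
        return parts.get(partitionFor(key, parts.size())).get(key);
    }

    public static void main(String[] args) {
        String[][] records = {{"apple", "1"}, {"banana", "2"}, {"cherry", "3"}};
        List<TreeMap<String, String>> parts = writeParts(records, 4);
        System.out.println(getEntry(parts, "banana")); // key that was written
        System.out.println(getEntry(parts, "durian")); // key that was never written
    }
}
```

The point of the sketch is that a lookup never touches more than one
part, which is why keeping the output in separate files need not cost
anything at serving time. It only works if the reader uses the same
partitioning rule and the same number of parts as the job that wrote
the data.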