From: Doug Cutting
Date: Thu, 11 May 2006 11:08:41 -0700
To: hadoop-user@lucene.apache.org
Subject: Re: reading zip files
Message-ID: <44637DA9.7020207@apache.org>
In-Reply-To: <9EAEB8710A398D4D960257830756F57801F271DC@exchg02-bur.search.corpsys.p4pnet.net>

Vijay Murthi wrote:
> I am trying to process several gigs of zipped text files from a directory. If I unzip them, the size increases at least 4 times and I could run out of disk space.
>
> Has anyone tried to read zipped text files directly from the input directory?
>
> Or has anyone tried implementing a zip version of SequenceFileRecordReader.java and FileSplit?

SequenceFile currently supports per-record compression. This is effective when your input records are fairly large (more than a few kB).

What format are your zipped input files in? Are there multiple records per file? If so, how big are the records?

A future goal for SequenceFile is to support compression across multiple records, to make compression effective with small records. Until then, compression of small records is difficult. The best approach currently is to use an InputFormat that does not split files, but instead makes each file a distinct split. Then try to divide your data into approximately equal-sized files, each compressed separately, so that every split carries a similar amount of work.

Doug
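
To illustrate the "one compressed file per split" idea, here is a minimal sketch in plain Java. It is not Hadoop code: the class name WholeZipReader and the method forEachLine are hypothetical, and it assumes each input is a standard .zip archive holding UTF-8 text with one record per line. It treats a whole archive as a single unit of work and streams the records out of it, so nothing is unpacked to disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: one zip archive is processed as one unsplit unit.
final class WholeZipReader {

    /**
     * Passes every line of every file entry in the archive to the handler.
     * Data is decompressed on the fly; no entry is written to disk.
     * Assumption: entries contain UTF-8 text, one record per line.
     */
    static void forEachLine(InputStream rawZip, Consumer<String> handler)
            throws IOException {
        try (ZipInputStream zin = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Deliberately not closed here: closing the reader would
                // close the underlying zip stream and lose later entries.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zin, StandardCharsets.UTF_8));
                String line;
                // ZipInputStream reports end-of-stream at the end of each
                // entry, so this loop stops at the entry boundary.
                while ((line = reader.readLine()) != null) {
                    handler.accept(line);
                }
            }
        }
    }

    // Usage: java WholeZipReader.java input.zip
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WholeZipReader <file.zip>");
            return;
        }
        long[] count = {0};
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            forEachLine(in, line -> count[0]++);
        }
        System.out.println("records read: " + count[0]);
    }
}
```

Because an archive read this way cannot be started in the middle, the work per task is set by the archive size, which is why roughly equal-sized compressed files matter.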