From: Kris Jirapinyo
Date: Tue, 7 Jul 2009 08:34:05 -0700
Subject: Re: zip files as input
To: common-user@hadoop.apache.org

If you can convert them to another format, then I recommend gzip, since
Hadoop will process gzipped files on HDFS automatically. Zip files are a
pain to deal with, and it's better to avoid them if possible (I wasn't
able to).
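A minimal sketch of what "automatically" means here, assuming the old
org.apache.hadoop.mapred API and placeholder paths: TextInputFormat
recognizes the .gz extension and decompresses each file on the fly, with
one whole .gz file going to each map task because gzip is not splittable.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class GzipPassThrough {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(GzipPassThrough.class);
    job.setJobName("gzip-passthrough");

    // Directory of .gz files; TextInputFormat decompresses them on the fly.
    FileInputFormat.setInputPaths(job, new Path("/user/me/input-gz"));  // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));   // placeholder path

    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(IdentityMapper.class);    // stand-in for your real mapper
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(LongWritable.class);   // TextInputFormat keys are byte offsets
    job.setOutputValueClass(Text.class);

    JobClient.runJob(job);
  }
}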
-- Kris.

On Tue, Jul 7, 2009 at 7:34 AM, Mark Kerzner wrote:

> Kris,
> how did you put the zips into SequenceFiles? For me, binary writes to
> SequenceFiles are very slow. It does not have to be zip files: I create
> them myself out of my data, so I could use anything - tar, gzip...
>
> Thank you,
> Mark
>
> On Tue, Jul 7, 2009 at 12:28 AM, Kris Jirapinyo wrote:
>
> > How big are the zip files? I am not sure if this is what you want,
> > but for my scenario, I had a lot of smaller zip files (not gzip) that
> > needed to be processed. I put these into a SequenceFile outside of
> > Hadoop and then uploaded it to HDFS. Once it is in HDFS, I have the
> > mapper read the SequenceFile, with each record being one zip file,
> > read the record in as bytes, decompress those, and then process the
> > content. That way, Hadoop can decide how to break up the work. If
> > your scenario is that each zip file is really huge, then I'm not
> > sure...putting them in a SequenceFile will probably not help you in
> > that case. Perhaps you might want to break them up outside of Hadoop
> > somehow first. Yeah, zip files are a pain to work with in Hadoop (or
> > I haven't found an easy way to do so, especially with large zip
> > files).
> >
> > -- Kris.
> >
> > On Mon, Jul 6, 2009 at 8:28 PM, Mark Kerzner wrote:
> >
> > > Hi,
> > > I have a few zip files as input; they reside in one directory on
> > > HDFS. I want each node to take a zip file and work on it.
> > > Specifically, I want to take the zip files and write the binary
> > > contents of each file contained inside to a SequenceFile.
> > >
> > > Is that a good design? How do I tell Hadoop that this is what I
> > > want?
> > >
> > > Thank you,
> > > Mark
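For reference, a rough sketch of the pack-into-a-SequenceFile approach
quoted above. All class names and paths are made up for illustration, and
it assumes the old org.apache.hadoop.mapred API: the first class runs
outside Hadoop and packs local zip files into one SequenceFile (key = zip
file name, value = raw zip bytes); the mapper then gets one whole zip per
record and unpacks it in memory.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Runs outside Hadoop: packs local zip files into one SequenceFile on HDFS.
// Usage: ZipPacker <hdfs output path> <local zip file>...
public class ZipPacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, BytesWritable.class);
    try {
      for (int i = 1; i < args.length; i++) {
        File zip = new File(args[i]);
        byte[] buf = new byte[(int) zip.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(zip));
        try {
          in.readFully(buf);           // whole zip file as one value
        } finally {
          in.close();
        }
        writer.append(new Text(zip.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

// Map side: each record is one whole zip; unpack it in memory with java.util.zip.
class ZipRecordMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {
  public void map(Text zipName, BytesWritable zipBytes,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // getBytes() returns the backing buffer, so limit the stream to getLength().
    ZipInputStream zin = new ZipInputStream(
        new ByteArrayInputStream(zipBytes.getBytes(), 0, zipBytes.getLength()));
    ZipEntry entry;
    while ((entry = zin.getNextEntry()) != null) {
      // ...read this entry from zin and process it; here we just emit its name.
      out.collect(new Text(zipName + "/" + entry.getName()), new Text(""));
    }
    zin.close();
  }
}

The job driver for the map side would set SequenceFileInputFormat as the
input format so that each map call receives one (Text, BytesWritable)
record.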