hadoop-common-user mailing list archives

From Kris Jirapinyo <kris.jirapi...@biz360.com>
Subject Re: zip files as input
Date Tue, 07 Jul 2009 05:28:12 GMT
How big are the zip files?  I'm not sure if this is what you want, but in
my scenario I had a lot of smaller zip files (not gzip) that needed to be
processed.  I packed them into a SequenceFile outside of Hadoop and then
uploaded it to HDFS.  Once it's in HDFS, my mapper reads the SequenceFile
with each record being one zip file: it reads the record in as bytes,
decompresses them, and then processes the content.  That way, Hadoop can
decide how to break up the work.  If each of your zip files is really
huge, then I'm not sure...putting them in a SequenceFile probably won't
help you in that case.  Perhaps you'll want to break them up outside of
Hadoop somehow first.  Yeah, zip files are a pain to work with in Hadoop
(or at least I haven't found an easy way, especially with large zip files).
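The in-memory decompression step described above can be sketched with plain java.util.zip, no Hadoop classes needed.  This toy example builds a zip archive in memory to stand in for one SequenceFile record's value (the class name, the sample entry, and the assumption that the record value is a raw byte array, e.g. from a BytesWritable, are illustrative, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRecordDemo {
    // Build a tiny zip archive in memory; in the real job, each SequenceFile
    // record's value would hold bytes like these (one whole zip file per record).
    static byte[] makeZip() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    // What the mapper would do with one record: treat the value as a complete
    // zip file and walk its entries entirely in memory, no local temp files.
    static String extractAll(byte[] zipBytes) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                sb.append(entry.getName()).append('=')
                  .append(out.toString("UTF-8")).append(';');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints "a.txt=hello;"
        System.out.println(extractAll(makeZip()));
    }
}
```

Because the whole archive has to fit in memory as a single record, this is exactly why the approach only makes sense for lots of smallish zip files, not one huge one.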

-- Kris.

On Mon, Jul 6, 2009 at 8:28 PM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Hi,
> I have a few zip files as input; they reside in one directory on HDFS. I
> want each node to take a zip file and work on it. Specifically, I want to
> take the zip files and write the binary contents of each file contained
> inside to a SequenceFile.
> Is that a good design? How do I tell hadoop that this is what I want?
> Thank you,
> Mark
