hadoop-common-user mailing list archives

From Kris Jirapinyo <kris.jirapi...@biz360.com>
Subject Re: zip files as input
Date Tue, 07 Jul 2009 15:34:05 GMT
If you can convert to any format, then I recommend gzip, since Hadoop will
decompress gzipped input files stored on HDFS automatically.  Zip files are
a pain to deal with, and it's better to avoid them if possible (I wasn't
able to).

-- Kris.
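
A minimal sketch of the conversion suggested above, assuming the data starts out as zip archives and that one gzip stream per archive entry is acceptable. It uses only the Python standard library rather than any Hadoop API; the names `zip_entries_to_gzip` and `make_sample_zip` are hypothetical helpers introduced for illustration. Keep in mind that a plain .gz file is generally not splittable, so each such file would be handled by a single mapper.

```python
import gzip
import io
import zipfile


def zip_entries_to_gzip(zip_bytes):
    """Return {entry_name + ".gz": gzip-compressed bytes} for each file entry.

    Hypothetical helper: re-compresses every file inside a zip archive as
    its own gzip stream, which could then be written out and uploaded.
    """
    converted = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            converted[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return converted


def make_sample_zip():
    """Build a small in-memory zip archive to exercise the helper above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("part-0.txt", "alpha\nbeta\n")
        archive.writestr("part-1.txt", "gamma\n")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, payload in zip_entries_to_gzip(make_sample_zip()).items():
        print(name, len(payload), "bytes")
```

Each value in the returned dictionary is a complete gzip stream, so writing it to a file named after its key gives a file that ordinary gzip tools can read.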

On Tue, Jul 7, 2009 at 7:34 AM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Kris,
> how did you put the zips into SequenceFiles? For me, binary writes to
> SequenceFiles are very slow. It does not have to be zip files: I create
> them myself out of my data, and I can use anything - tar, gzip...
>
> Thank you,
> Mark
>
> On Tue, Jul 7, 2009 at 12:28 AM, Kris Jirapinyo
> <kris.jirapinyo@biz360.com> wrote:
>
> > How big are the zip files?  I am not sure if this is what you want, but
> > for my scenario, I had a lot of smaller zip files (not gzip) that needed
> > to be processed.  I put these into a SequenceFile outside of Hadoop and
> > then uploaded it to HDFS.  Once in HDFS, I have the mapper read the
> > SequenceFile with each record being a zip file, then read it in as bytes
> > that get decompressed, and then process the content.  That way, Hadoop
> > can decide how to break up the work.  If your scenario is that each zip
> > file is really huge, then I'm not sure... putting them in a SequenceFile
> > will probably not help you in that case.  Perhaps you might want to
> > break them up outside of Hadoop somehow first.  Yeah, zip files are a
> > pain to work with in Hadoop (or I haven't found an easy way to do so,
> > especially with large zip files).
> >
> > -- Kris.
> >
> > On Mon, Jul 6, 2009 at 8:28 PM, Mark Kerzner <markkerzner@gmail.com>
> > wrote:
> >
> > > Hi,
> > > I have a few zip files as input; they reside in one directory on HDFS.
> > > I want each node to take a zip file and work on it. Specifically, I
> > > want to take the zip files and write the binary contents of each file
> > > contained inside to a SequenceFile.
> > >
> > > Is that a good design? How do I tell hadoop that this is what I want?
> > >
> > > Thank you,
> > > Mark
> > >
> >
>
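
The per-record step Kris describes in the quoted reply (each record holds one zip file's raw bytes, which the mapper decompresses in memory before processing) might look roughly like the Python sketch below. This is an illustration under stated assumptions, not the code used in the thread: it relies only on the standard library, it does not read a real SequenceFile, and `iter_zip_entries` and `process_records` are hypothetical names. The "processing" is reduced to measuring each entry's size.

```python
import io
import zipfile


def iter_zip_entries(record_bytes):
    """Yield (entry_name, content_bytes) for each file in one zipped record."""
    # Wrap the raw bytes in a file-like object so no temporary file is needed.
    with zipfile.ZipFile(io.BytesIO(record_bytes)) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                yield info.filename, archive.read(info)


def process_records(records):
    """Stand-in for a map step over (key, zip_bytes) pairs.

    Emits (key, entry_name, uncompressed_size) for every file found inside
    each record; a real job would do its own processing here instead.
    """
    for key, zip_bytes in records:
        for name, content in iter_zip_entries(zip_bytes):
            yield key, name, len(content)
```

Because every record is decompressed entirely in memory, this pattern suits the many-small-archives case from the thread; a very large archive would need a different approach, as the reply itself notes.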
