hadoop-common-dev mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: Serialization with additional schema info
Date Thu, 04 Sep 2008 17:51:55 GMT
On Wed, Sep 3, 2008 at 9:24 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Their answer was that if you care enough, then any compression algorithm
> around will compress away the type information.

I understand the argument, but there are certainly cases where having the
type information once in the header is a big win. If I have a dataset with
say 100 billion rows with 300 columns in each row, having 1k of type
information on each row is pretty much a non-starter. I wish that was a
hypothetical case. *smile*
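
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes "1k" means 1024 bytes of type information per row and ignores any compression; the names are illustrative only.

```python
# Back-of-the-envelope estimate of per-row type information overhead.
# Assumption: "1k" is taken as 1024 bytes; compression is ignored.
ROWS = 100_000_000_000      # 100 billion rows, as in the example above
TYPE_INFO_BYTES = 1024      # type information for one row (assumed size)
TIB = 2 ** 40               # bytes per tebibyte


def type_info_overhead(rows, bytes_per_row):
    """Total bytes spent on type information when it is repeated on every row."""
    return rows * bytes_per_row


per_row_total = type_info_overhead(ROWS, TYPE_INFO_BYTES)
header_total = TYPE_INFO_BYTES  # stored once in the file header instead

print(f"type info on every row: {per_row_total / TIB:.1f} TiB")  # 93.1 TiB
print(f"type info once in header: {header_total} bytes")
```

Even if a codec squeezes most of that back out, the uncompressed bytes still have to be produced and parsed for every row.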

> So if you have a splittable compressed format (bz2 works with hadoop), you
> are set except for the compression cost.  Decompression cost is usually
> compensated for by the I/O advantage.

bz2 is *really* expensive and will almost always substantially slow down
your job. The default codec (gz) is usually a win for compressing outputs,
but is still fairly expensive and is *not* splittable. LZO is great for
speed and is almost always a win for overall job time, even on map outputs.
It is also not splittable. It would be really nice to have a codec that was
similar in compression/cpu cost to gzip that was splittable.
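
For anyone who wants to check the relative CPU cost on their own data, a minimal sketch using the Python standard library follows. It is not Hadoop code: `zlib` stands in for the gzip codec (both use DEFLATE), the synthetic rows are made up for illustration, and LZO is left out because it is not in the standard library. Timings depend heavily on the machine and the data, so no expected output is shown.

```python
import bz2
import time
import zlib

# Synthetic tab-separated rows, purely illustrative; real data will behave differently.
rows = b"".join(b"%d\tuser%d\t%d\n" % (i, i % 1000, i * 7) for i in range(50_000))


def time_codec(compress, data):
    """Return (compressed size in bytes, elapsed seconds) for one compression pass."""
    start = time.perf_counter()
    out = compress(data)
    return len(out), time.perf_counter() - start


results = {}
for name, compress in (("deflate (gzip)", zlib.compress), ("bz2", bz2.compress)):
    results[name] = time_codec(compress, rows)
    size, secs = results[name]
    print(f"{name}: {len(rows)} -> {size} bytes in {secs:.3f}s")
```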

-- Owen
