hadoop-common-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Should splittable Gzip be a "core" hadoop feature?
Date Wed, 29 Feb 2012 21:17:06 GMT

On Wed, Feb 29, 2012 at 19:13, Robert Evans <evans@yahoo-inc.com> wrote:

> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in various different
> conditions and what type of impact does it have on network traffic and
> datanode load.  My gut feeling is that the speedup is going to be
> relatively small except when there is a lot of computation happening in
> the mapper

I agree, I made the same assessment.
In the javadoc I wrote under "When is this useful?":
*"Assume you have a heavy map phase for which the input is a 1GiB Apache
httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*

> and the added load and network traffic outweighs the speedup in most
> cases,

No, the trick to solve that one is to upload the gzipped files with an HDFS
block size equal to (or 1 byte larger than) the file size, so that each file
occupies a single block.
This setting will help in speeding up gzipped input files in any situation
(no more network overhead).
From there, the HDFS replication factor of the file dictates the
optimal number of splits for this codec.
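
As a rough illustration of the block size trick, here is a minimal Python
sketch. Everything in it is an assumption for illustration, not part of the
codec: the helper names are hypothetical, the 512-byte checksum chunk is the
usual Hadoop default (HDFS generally wants the block size to be a multiple of
it, so the sketch rounds up rather than using the exact file size), and the
property is spelled dfs.blocksize here while older releases used
dfs.block.size. Some clusters also enforce a minimum block size, which this
sketch ignores. The split size calculation simply divides the file evenly by
the replication factor, following the rule of thumb above.

```python
# Sketch only: helper names and defaults are assumptions, not codec API.
CHECKSUM_CHUNK = 512  # assumed bytes-per-checksum (common Hadoop default)


def single_block_size(file_size, chunk=CHECKSUM_CHUNK):
    """Smallest multiple of `chunk` that is >= file_size.

    With this block size the whole gzipped file fits in one HDFS block.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    return -(-file_size // chunk) * chunk  # integer ceiling division


def put_command(local_path, remote_path, file_size):
    """Build (but do not run) an upload command with a per-file block size."""
    block_size = single_block_size(file_size)
    return ["hadoop", "fs", "-D", "dfs.blocksize=%d" % block_size,
            "-put", local_path, remote_path]


def split_size_for(file_size, replication):
    """Split size that yields one split per replica of the single block."""
    if replication <= 0:
        raise ValueError("replication must be positive")
    return -(-file_size // replication)


if __name__ == "__main__":
    size = 1024 ** 3  # the 1GiB logfile from the javadoc example
    print(" ".join(put_command("access.log.gz", "/logs/access.log.gz", size)))
    print("split size for replication 3:", split_size_for(size, 3))
```

The idea is that every datanode holding a replica then holds the complete
file, so each split can be read locally.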

> but like all performance on a complex system gut feelings are almost
> worthless and hard numbers are what is needed to make a judgment call.


> Niels, I assume you have tested this on your cluster(s).  Can you share
> with us some of the numbers?

No, I haven't tested it beyond a single multi-core system.
The simple reason for that is that when this was under review last summer
the whole "Yarn" thing happened
and I was unable to run it at all for a long time.
I only got it running again last December when the restructuring of the
source tree was mostly done.

At this moment I'm building an experimentation setup at work that can be
used for various things.
Given the current state of Hadoop 2.0 I think it's time to produce some
actual results.

Best regards / Met vriendelijke groeten,

Niels Basjes
