hadoop-common-dev mailing list archives

From "Bwolen Yang" <wbwo...@gmail.com>
Subject Re: compression and disk-bound application
Date Wed, 27 Jun 2007 21:19:40 GMT
On 6/21/07, Doug Cutting <cutting@apache.org> wrote:
> That's true only if decompression runs faster than disk input.

In my case, compression speed also matters, since each step
decompresses on read and compresses on write.

I ran a test on this.  On 2GHz Opterons, end-to-end time is roughly a
tie between lzo and gzip.  With gzip, the disk-bound map/reduce ran 20%
faster than with lzo, while copying local data into HDFS ran 18% slower
than with lzo.  Since copying takes longer than this particular
map/reduce, the overall time is about the same.
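
To see why these two percentages can roughly cancel, here is a small
back-of-the-envelope sketch.  It assumes "20% faster" means 20% less
wall-clock time and "18% slower" means 18% more wall-clock time, which
is only one reading of the numbers above.  The function names and any
durations passed in are hypothetical, not measurements from the test.

```python
def gzip_over_lzo_total(copy_lzo, mr_lzo, copy_slowdown=0.18, mr_speedup=0.20):
    """Ratio of gzip end-to-end time to lzo end-to-end time.

    copy_lzo and mr_lzo are the lzo durations of the copy step and the
    map/reduce step, in any common unit.  Assumption: gzip makes the
    copy take (1 + copy_slowdown) times as long and the map/reduce take
    (1 - mr_speedup) times as long.
    """
    total_lzo = copy_lzo + mr_lzo
    total_gzip = copy_lzo * (1 + copy_slowdown) + mr_lzo * (1 - mr_speedup)
    return total_gzip / total_lzo


def break_even_copy_to_mr(copy_slowdown=0.18, mr_speedup=0.20):
    """Copy/map-reduce duration ratio (under lzo) at which both codecs tie.

    Setting copy * copy_slowdown == mr * mr_speedup gives
    copy / mr == mr_speedup / copy_slowdown.
    """
    return mr_speedup / copy_slowdown


if __name__ == "__main__":
    ratio = break_even_copy_to_mr()
    print("break-even copy/mr ratio:", round(ratio, 2))
    print("gzip/lzo total at break-even:", gzip_over_lzo_total(ratio, 1.0))
```

Under that reading, the two codecs tie when the copy takes about 1.11
times as long as the map/reduce, which fits a copy that is somewhat
longer than the job.  If the copy were much longer than that, lzo would
come out ahead overall.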

The local->HDFS copy takes gzipped ASCII input files (12GB
uncompressed) on local NFS and writes them out to HDFS as sequence
files, in blocks of lines where each block is at least 96KB.  The
sequence files are BLOCK compressed.
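
The grouping of lines into blocks can be sketched roughly as below.
This is only an illustration of the batching rule, not the actual copy
code: the name blocks_of_lines is made up, it works on in-memory byte
strings rather than HDFS sequence files, and it assumes the final
leftover block is allowed to be smaller than 96KB.

```python
def blocks_of_lines(lines, min_block_bytes=96 * 1024):
    """Group an iterable of byte-string lines into blocks.

    A block is emitted as soon as its accumulated size reaches
    min_block_bytes, so every block except possibly the last one is
    at least that large.  Lines are never split across blocks.
    """
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line)
        if size >= min_block_bytes:
            yield b"".join(block)
            block, size = [], 0
    if block:
        # Leftover lines at end of input form a final, possibly short, block.
        yield b"".join(block)


if __name__ == "__main__":
    # 200 synthetic lines of 1024 bytes each (newline included).
    sample = [b"x" * 1023 + b"\n"] * 200
    for i, blk in enumerate(blocks_of_lines(sample)):
        print(i, len(blk))
```

Each block would then become one record in the sequence file, and
BLOCK compression would apply on top of that.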

The map/reduce takes these sequence files, breaks them up into entries,
and writes them back out as BLOCK compressed sequence files.

I guess the tradeoff comes down to how much disk space is available
and whether the map/reduce apps are disk or cpu bound.

