hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: compression and disk-bound application
Date Thu, 21 Jun 2007 17:31:33 GMT
Bwolen Yang wrote:
> For disk bound map/reduce applications (those that do very little
> computation but mainly collate a large amount of relevant data
> and extract a smaller summary for future computations), I was
> wondering whether it makes sense for mappers to work
> directly on compressed inputs.   i.e., if we can reduce the input size
> by a factor of 4, then these applications will probably run close to
> 4x faster.    (or need 4x fewer disks).

That's true only if decompression keeps up with disk input.  Disk 
transfer rates are nearly 100MB/second, but bzip2 decompression runs at 
around 20MB/second, while lzo can probably run at 100MB/second. 
Obviously these vary with disk and cpu speed, but you get the idea.  If 
lzo compresses just 2:1 then it halves the amount of i/o while keeping 
things i/o bound, which doubles speed.  Bzip2 might have a compression 
ratio of 5:1, but then CPU becomes the bottleneck.  In general, lzo will 
make things run faster, while bzip2 won't help speed much but will save 
more space than lzo.
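
The arithmetic above can be sketched as a small model. This is only an
illustration, not a benchmark: it assumes the codec speeds quoted above
are measured in compressed bytes consumed per second (an interpretation,
chosen because it reproduces the conclusions in this message), and the
class and method names are made up for the example.

```java
// Rough throughput model for reading compressed input from disk.
// Assumption: decompression speed is given in MB/s of *compressed* input.
final class CompressionThroughput {

    /** Uncompressed MB/s delivered to the mapper. */
    static double effectiveMBps(double diskMBps, double decompressInMBps, double ratio) {
        // The slower of disk and decompressor limits the compressed-byte rate.
        double compressedRate = Math.min(diskMBps, decompressInMBps);
        // Each compressed byte expands to `ratio` uncompressed bytes.
        return compressedRate * ratio;
    }

    /** "cpu" if the decompressor is slower than the disk, else "disk". */
    static String bottleneck(double diskMBps, double decompressInMBps) {
        return decompressInMBps < diskMBps ? "cpu" : "disk";
    }

    public static void main(String[] args) {
        double disk = 100.0; // MB/s, figure quoted in the message
        // lzo: ~100 MB/s, 2:1 ratio; bzip2: ~20 MB/s, 5:1 ratio
        System.out.printf("lzo:   %.0f MB/s (%s bound)%n",
                effectiveMBps(disk, 100.0, 2.0), bottleneck(disk, 100.0));
        System.out.printf("bzip2: %.0f MB/s (%s bound)%n",
                effectiveMBps(disk, 20.0, 5.0), bottleneck(disk, 20.0));
    }
}
```

Under these assumptions lzo delivers about 200MB/second of uncompressed 
data (twice the raw disk rate) and stays disk bound, while bzip2 delivers 
about 100MB/second (no faster than reading uncompressed data) and is CPU 
bound.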

This is consistent with Sami Siren's benchmarks:


> Looking at LineRecordReader.java, looks like it ignores "seek" to the
> start of split point when input is compressed.  This is probably
> because TextInputFormat marks anything compressed as not splittable.

SequenceFileInputFormat supports both splitting and compression (zip and 
lzo), but it is a non-standard file-format, not easily accessed by 
non-Java programs.

> However, data that is compressed by bzip2 is splittable because it is
> block based and has block markers (e.g., bzip2recover does partial
> recovery of a corrupted compressed file by looking for block markers and
> decompressing good blocks).   I can write an InputFormatter for my
> particular application to take advantage of this.

Yes, I've heard others mention that bzip2 is splittable.  It would be 
great to have an InputFormat for bzip2 included with Hadoop.
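
As a rough sketch of what finding split points might involve: each bzip2 
block begins with the 48-bit magic 0x314159265359, and blocks are not 
byte-aligned, so a scanner has to test every bit offset. The class and 
method names below are hypothetical, and this is not an InputFormat. It 
only locates candidate block starts in a buffer. The same bit pattern can 
also occur by chance inside compressed data, so a real reader would need 
to validate each candidate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans a buffer for bzip2 block-header magics.
final class Bzip2BlockScanner {
    // 48-bit block-header magic (the BCD digits of pi).
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offsets (from the start of data) where the magic begins. */
    static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;   // sliding window holding the last 48 bits seen
        long bitsSeen = 0;
        for (byte b : data) {
            // bzip2 packs bits most-significant-bit first.
            for (int i = 7; i >= 0; i--) {
                int bit = (b >> i) & 1;
                window = ((window << 1) | bit) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsSeen - 48);
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // "BZh9" stream header followed directly by the first block magic.
        byte[] sample = {'B', 'Z', 'h', '9', 0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
        System.out.println(findBlockStarts(sample));
    }
}
```

A split-aware reader could seek to its split's byte offset, scan forward 
with something like this to the next block start, and begin decompressing 
there.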

