hadoop-common-dev mailing list archives

From "Bwolen Yang" <wbwo...@gmail.com>
Subject Re: compression and disk-bound application
Date Thu, 21 Jun 2007 21:03:58 GMT
> That's true only if decompression runs faster than disk input.  Disk
> transfer rates are nearly 100MB/second, but bzip2 decompression is
> around 20MB/second, while lzo can probably run at 100MB/second.
> Obviously these vary with disk and cpu speed, but you get the idea.  If
> lzo compresses just 2:1 then it halves the amount of i/o and keeps
> things i/o bound, so doubling speed.  Bzip2 might have a compression
> ratio of 5:1, but CPU becomes the bottleneck.  In general, lzo will make
> things run faster, while bzip2 won't help speed much but will save more
> space than lzo.
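To make the arithmetic in the quote concrete, here is one way to model it (a sketch, and it assumes the quoted decompression rates mean compressed bytes consumed per second — under that reading, lzo at 2:1 doubles effective throughput while bzip2 at 5:1 stays CPU-bound at roughly raw disk speed, matching the conclusion above):

```python
def effective_throughput(disk_mb_s, decomp_mb_s, ratio):
    """Uncompressed-data rate seen by the reader, in MB/s.

    Assumes the decompressor consumes decomp_mb_s of *compressed*
    bytes per second, so the pipeline is limited by whichever of
    disk and CPU feeds compressed bytes more slowly.
    """
    return min(disk_mb_s, decomp_mb_s) * ratio

# Figures from the message above (illustrative):
raw = effective_throughput(100, float("inf"), 1)  # no compression -> 100
lzo = effective_throughput(100, 100, 2)           # i/o halved     -> 200
bz2 = effective_throughput(100, 20, 5)            # cpu-bound      -> 100
```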

I haven't seen any map task run above 4MB/sec on ASCII input files
(most of them are at 2MB/sec, running 2 map tasks on a 1-disk machine).
This is why I am hoping that getting 5x compression would mean a 4-5x speedup.

> http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

cool. thanks for the pointer.

> SequenceFileInputFormat supports both splitting and compression (zip and
> lzo), but it is a non-standard file-format, not easily accessed by
> non-Java programs.

I thought a bit more about how I should interface with Hadoop.  For
my application, the initial interface with Hadoop is ASCII files
produced by some servers.  Looking at "distcp/CopyFiles.java", it
seems doable to write something similar that would grab input files as
soon as they reach some size (perhaps 500MB), and then write the data
into Hadoop DFS as a sequence file where each record is a line of input.
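The batching step of that plan could be sketched like this (illustrative Python only — the name plan_batches is made up, and the actual record writing would go through Hadoop's Java SequenceFile API rather than anything shown here):

```python
def plan_batches(file_sizes, max_batch_bytes=500 * 1024 * 1024):
    """Group input files into batches of roughly max_batch_bytes each.

    file_sizes: list of (path, size_in_bytes) pairs.
    Returns a list of path groups; each group would then be rewritten
    into DFS as one sequence file, one record per input line.
    """
    batches, batch, total = [], [], 0
    for path, size in file_sizes:
        batch.append(path)
        total += size
        if total >= max_batch_bytes:
            batches.append(batch)
            batch, total = [], 0
    if batch:  # leftover files that never filled a whole batch
        batches.append(batch)
    return batches

# e.g. with a 500-byte threshold for illustration:
# plan_batches([("a", 300), ("b", 300), ("c", 100)], max_batch_bytes=500)
# groups "a" and "b" together and leaves "c" in its own batch
```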

> Yes, I've heard others mention that bzip2 is splittable.  It would be
> great to have an InputFormat for bzip2 included with Hadoop.

will keep this on the back burner until the need arises.  thanks for the input.

