hadoop-common-dev mailing list archives

From "Bwolen Yang" <wbwo...@gmail.com>
Subject compression and disk-bound application
Date Wed, 20 Jun 2007 23:25:37 GMT
For disk-bound map/reduce applications (those that do very little
computation and are mainly about collating a large amount of relevant
data and extracting a smaller summary for future computations), I was
wondering whether it makes sense for mappers to work directly on
compressed inputs.   i.e., if we can reduce the input size by a factor
of 4, then these applications will probably run close to 4x faster
(or need 4x fewer disks).

Looking at LineRecordReader.java, it looks like it ignores the "seek"
to the start of the split point when the input is compressed.  This is
probably because TextInputFormat marks anything compressed as not
splittable.
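
For reference, the check that disables splitting looks roughly like
this in TextInputFormat (paraphrasing from memory; the exact names may
differ in the current tree):

    // Any file that matches a registered CompressionCodec is treated
    // as a single, unsplittable blob, so it always becomes one split.
    protected boolean isSplitable(FileSystem fs, Path file) {
      return compressionCodecs.getCodec(file) == null;
    }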

However, data compressed by bzip2 is splittable because the format is
block based and has block markers (e.g., bzip2recover does partial
recovery of a corrupted compressed file by looking for block markers
and decompressing the good blocks).   I can write an InputFormat for
my particular application to take advantage of this; a rough sketch is
below.   I just thought to ask first in case people have already
thought of this and/or are working on it.  I would also appreciate
suggestions on whether it would benefit Hadoop in general, and if so,
where it should go.
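
Concretely, something like the following (a hypothetical
Bzip2TextInputFormat, untested; note the byte-aligned scan is a
simplification, since bzip2 blocks are bit-packed rather than
byte-aligned, so a real version would need a bit-level scan like
bzip2recover's):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical sketch: mark .bz2 files as splittable.  A matching
    // record reader would still be needed to seek from the split start
    // to the next bzip2 block before decompressing.
    public class Bzip2TextInputFormat extends TextInputFormat {

      protected boolean isSplitable(FileSystem fs, Path file) {
        return file.getName().endsWith(".bz2")
            || super.isSplitable(fs, file);
      }

      // Scan forward for the 48-bit bzip2 block magic (0x314159265359)
      // and return the byte offset where it begins, or -1 if not found.
      // Caveat: this assumes byte alignment, which real bzip2 streams
      // do not guarantee; see the note above.
      static long findBlockStart(InputStream in) throws IOException {
        final long MAGIC = 0x314159265359L;
        long window = 0;
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
          window = ((window << 8) | b) & 0xFFFFFFFFFFFFL;
          pos++;
          if (window == MAGIC) {
            return pos - 6;  // offset of the first magic byte
          }
        }
        return -1;
      }
    }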

bwolen
