hadoop-common-user mailing list archives

From Tarandeep Singh <tarand...@gmail.com>
Subject Re: Compression issues!!
Date Wed, 15 Jul 2009 06:12:13 GMT
You can put compressed data on HDFS and run a MapReduce job on it, but you
should use a codec that supports file splitting; otherwise the whole file
will be read by one mapper. If you have read about the MapReduce
architecture, you will know that a map function processes a chunk of data
(called a split). If a file is big and supports splitting (e.g. a plain
text file where records are separated by newlines, or a sequence file),
then the big file can be processed in parallel by multiple mappers, each
processing one split of the file. However, if the compression codec you
use does not support file splitting, then the whole file will be processed
by one mapper and you won't achieve parallelism.
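To make the point concrete, here is a self-contained sketch in plain Java (this is not Hadoop's actual InputFormat code; the helper method and the 64 MB block size are illustrative assumptions) of how split counts fall out of splittability:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Mimics the idea behind Hadoop's input-split computation: a splittable
    // file is chopped into block-sized splits, each of which can go to a
    // different mapper; a non-splittable file becomes one big split.
    static List<long[]> computeSplits(long fileSize, long splitSize,
                                      boolean splittable) {
        List<long[]> splits = new ArrayList<long[]>();
        if (!splittable) {
            // One mapper must read the whole file from the start.
            splits.add(new long[] {0, fileSize});
            return splits;
        }
        long offset = 0;
        while (offset < fileSize) {
            long len = Math.min(splitSize, fileSize - offset);
            splits.add(new long[] {offset, len}); // {start, length}
            offset += len;
        }
        return splits;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        long block = 64L * 1024 * 1024; // assume a 64 MB block size
        // Splittable: 16 mappers can work in parallel.
        System.out.println(computeSplits(oneGB, block, true).size());
        // Non-splittable: everything funnels through 1 mapper.
        System.out.println(computeSplits(oneGB, block, false).size());
    }
}
```

With a 1 GB file and 64 MB splits, the splittable case yields 16 splits and the non-splittable case yields 1, which is exactly the loss of parallelism described above.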

Check the Hadoop wiki for compression codecs that support file splitting.
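As a starting point, you can also ask a job to write its output as block-compressed sequence files, which stay splittable. A minimal configuration sketch (property names are from the 0.19/0.20-era mapred API, so verify them against your version's docs):

```xml
<!-- Illustrative mapred-site.xml / JobConf fragment -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <!-- BLOCK compresses batches of records, keeping the file splittable -->
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```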


On Tue, Jul 14, 2009 at 10:39 PM, Sugandha Naolekar wrote:

> Hello!
> A few days back, I had asked about compression of data placed in
> Hadoop. The apt replies I got suggested: place the data in HDFS first
> and then compress it, so that the data would be in sequence files.
> But my query is, I want to compress the data before placing it in
> HDFS, so that redundancy won't come into the picture.
> How do I do that? Also, will I have to use an external compression
> algorithm, or would APIs alone solve the purpose?
> --
> Regards!
> Sugandha
