hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: Compression issues!!
Date Wed, 15 Jul 2009 13:30:08 GMT
Particularly for highly compressible data such as web log files, the loss in
potential data locality is more than made up for by the increase in network
transfer speed. The other somewhat unexpected side benefit is that there are
fewer map tasks with less task startup overhead. If your data is not highly
compressible, or your jobs are CPU-bound, the cost-benefit ratio may not be

On Tue, Jul 14, 2009 at 11:12 PM, Tarandeep Singh <tarandeep@gmail.com> wrote:

> You can put compressed data on HDFS and run a MapReduce job on it. But you
> should use a codec that supports file splitting; otherwise the whole file
> will be read by one mapper. If you have read about the MapReduce
> architecture, you will know that a map function processes a chunk of data
> (called a split).
> If a file is big and supports splitting (e.g. a plain text file whose lines
> are separated by newlines, or a sequence file), then it can be processed in
> parallel by multiple mappers (each processing one split of the file).
> However, if the compression codec that you use does not support file
> splitting, then the whole file will be processed by one mapper and you won't
> achieve parallelism.
> Check the Hadoop wiki for compression codecs that support file splitting.
> -Tarandeep
> On Tue, Jul 14, 2009 at 10:39 PM, Sugandha Naolekar
> <sugandha.n87@gmail.com> wrote:
> > Hello!
> >
> > A few days back, I asked about compressing data placed in Hadoop. I
> > got these replies:
> >
> > Place the data in HDFS first and then compress it, so that the data is
> > in sequence files.
> >
> > But my question is: I want to compress the data before placing it in
> > HDFS, so that redundancy won't come into the picture.
> >
> > How do I do that? Also, will I have to use an external compression
> > algorithm, or would the APIs alone serve the purpose?
> >
> > --
> > Regards!
> > Sugandha
> >
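
To make the trade-off discussed in this thread concrete, here is a minimal
Python sketch (not Hadoop code). It gzips some made-up, repetitive
web-log-style data to show how compressible such data is, and it estimates how
many map tasks one input file would get depending on whether its format can be
split. The sample log line, the helper name estimated_map_tasks, and the 64 MB
block size are assumptions for illustration; the real split calculation in
Hadoop has more inputs than this. Gzip is used because it is a codec that
Hadoop cannot split.

```python
import gzip
import math

# Assumption: 64 MB HDFS block size; check dfs.block.size on your cluster.
BLOCK_SIZE = 64 * 1024 * 1024


def estimated_map_tasks(file_size, splittable, block_size=BLOCK_SIZE):
    """Rough estimate of the map tasks launched for one input file.

    Hypothetical helper: a splittable file gets about one mapper per block,
    while a file that cannot be split is read end to end by one mapper.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size / block_size))


# Made-up, highly repetitive data standing in for a web log.
line = b'10.0.0.1 - - [15/Jul/2009:13:30:08 +0000] "GET /index.html HTTP/1.1" 200 1043\n'
raw = line * 100_000
packed = gzip.compress(raw)

print("raw bytes:    ", len(raw))
print("gzipped bytes:", len(packed))

ten_gib = 10 * 1024 ** 3
print("10 GiB splittable file:    ", estimated_map_tasks(ten_gib, True), "map tasks")
print("10 GiB non-splittable file:", estimated_map_tasks(ten_gib, False), "map task")
```

A file compressed this way on the local machine could then be copied into HDFS
with `hadoop fs -put`, but as a single gzip file it would be handled by one
mapper, which is the loss of parallelism described above.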

Pro Hadoop, a book to guide you from beginner to hadoop mastery,
www.prohadoopbook.com a community for Hadoop Professionals
