hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Re: data in compression format affect mapreduce speed
Date Thu, 26 Aug 2010 06:08:54 GMT
On Thu, Aug 26, 2010 at 11:20 AM, shangan <shangan@corp.kaixin001.com> wrote:
> I agree with you for the most part, but I have some other questions. Mappers work
> on the local machine, so there are no network transfers during that phase; if the
> original data stored in HDFS is compressed, it only decreases the IO time. My main
> doubt is whether a mapper can work on just one part of the whole data set when the
> data is compressed, since it seems such data can't be split. I tried a "select sum()"
> in Hive and traced the job: the .tar.gz data could only be worked on by one single
> machine and was stuck there for quite a long time (it seemed to be waiting for other
> parts of the data to be copied from other machines), while data that was not
> compressed could be worked on by different machines in parallel. Do you know
> something about this?

Gzip-compressed files cannot be decompressed as independent split blocks, so only
one mapper runs over the whole file. The BZip2 algorithm supports splitting a file
and decompressing its individual blocks, so you may try that instead.
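
For example, a job's output can be written with the bzip2 codec so that the next
job reading those files gets splittable input. A minimal sketch against the
org.apache.hadoop.mapreduce API (the job name and the omitted mapper/reducer
setup are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Bzip2OutputExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "bzip2 output example");
    // Compress the job's output files with bzip2; unlike gzip output,
    // bzip2 output can later be split across several mappers.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    // ... set mapper, reducer, input/output paths, then submit.
  }
}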

LZO can be made to allow block splitting by indexing all the available
files first (a program and a set of input/output format classes for this
are provided by the hadoop-lzo project over at GitHub --
http://github.com/kevinweil/hadoop-lzo )
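
On the read side it would look roughly like the sketch below; the
LzoTextInputFormat class name and package are taken from the hadoop-lzo README
(an external dependency, so treat them as assumptions), and the .lzo files must
already have their .lzo.index side files built by the project's indexer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "lzo input example");
    // Indexed .lzo files are split at LZO block boundaries, so many
    // mappers can read one large file in parallel.
    job.setInputFormatClass(LzoTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... set mapper, reducer, output path, then submit.
  }
}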

When using compression it's usually also suggested to use SequenceFiles
and/or Avro data files for the data storage, as these are designed with
Hadoop's HDFS and MapReduce in mind and contain a form of block checkpoint
(sync marker) inside them, which lets them be split into blocks with any
form of compression codec applied. (Note: Avro applies deflate within its
own data file container format.)
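
As an illustration, a block-compressed SequenceFile can use a non-splittable
codec such as gzip and still be split by MapReduce, because the container
itself provides the sync points. A small sketch (the output path and record
are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    // BLOCK compression compresses batches of records and writes a sync
    // marker between them, which is what makes the file splittable.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/example.seq"), LongWritable.class,
        Text.class, SequenceFile.CompressionType.BLOCK, codec);
    writer.append(new LongWritable(1L), new Text("hello"));
    writer.close();
  }
}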

> 2010-08-26
> shangan
> From: Harsh J
> Sent: 2010-08-26 12:15:49
> To: common-user
> Cc:
> Subject: Re: data in compression format affect mapreduce speed
> Logically it 'should' increase time, as it's an extra step beyond the
> Mapper/Reducer. But while your processing time would slightly (very,
> very slightly) increase, your IO and network transfer time would
> decrease by a large margin -- giving you a clear impression that your
> total job time has decreased overall. The difference lies in writing
> out, say, 10 GB before versus 5-7 GB this time (a crude example).
> With the fast CPUs available these days, compressing and decompressing
> should hardly take a noticeable amount of extra time. It's almost
> negligible in the case of gzip, LZO or plain deflate.
> On Thu, Aug 26, 2010 at 9:13 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> Compressed data would increase processing time in mapper/reducer but
>> decrease the amount of data transferred between tasktracker nodes.
>> Normally you should consider applying some form of compression.
>> On Wed, Aug 25, 2010 at 7:32 PM, shangan <shangan@corp.kaixin001.com> wrote:
>>> Will data stored in a compressed format affect MapReduce job speed?
>>> Will it increase or decrease, or is the relationship between the two
>>> more complex? Can anybody give a detailed explanation?
>>> 2010-08-26
>>> shangan
> --
> Harsh J
> www.harshj.com
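
Picking up the earlier point about trading a little CPU time for large IO and
network savings: one concrete knob is compressing the intermediate map output
before it is spilled and shuffled. A hedged sketch using the 0.20-era property
names (check your release's documentation for the exact keys):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class MapOutputCompressionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress map output spilled to disk and shuffled to reducers;
    // it is decompressed transparently on the reduce side.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  DefaultCodec.class, CompressionCodec.class);
    // ... build the Job from this conf as usual.
  }
}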

Harsh J
