hive-user mailing list archives

From Edward Capriolo <>
Subject Re: Compressed data storage in HDFS - Error
Date Sat, 09 Jun 2012 00:54:42 GMT
Compression will make processing faster almost all the time. Gzip
compression can typically shrink a text file to about 40 percent of its
original size; Snappy, to about 60 percent, on average.
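For a rough feel of those ratios, here is a quick standard-library sketch. The sample text is made up, and highly repetitive log-like data compresses far better than the ~40 percent typical of mixed text, so treat the printed ratio as illustrative only:

```python
import gzip

# Repetitive, log-like text compresses very well; real-world data varies.
data = b"2012-06-08 12:00:00 INFO request served in 42 ms\n" * 10_000

compressed = gzip.compress(data)
ratio = len(compressed) / len(data)
print(f"original: {len(data)} bytes, gzipped: {len(compressed)} bytes "
      f"({ratio:.1%} of original)")
```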

Then, when you're dealing with, say, 1 TB of data, a 60 percent savings is
600 GB. If you think about the disk and network savings, that will eclipse
any CPU waste.

Advice: use Snappy for intermediate compression and Gzip for the final output.
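In Hive of this era, that advice translates to session settings along these lines. Property names varied between Hadoop versions (the `mapred.*` names below are the Hadoop 1.x ones), so treat this as a sketch rather than a drop-in config:

```sql
-- Compress intermediate (map) output with Snappy
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output with Gzip
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```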

On Friday, June 8, 2012, Mark Groveir <> wrote:
> Hi Sreenath,
> All the points made on this thread are very valid. However, I wanted to
add that you should keep in mind that Gzip compression is not splittable.
This is due to the very nature of the codec. So, if your input data
contains Gzip files larger than the HDFS block size, Hadoop won't be able
to split them, and each entire file will be sent to a single mapper. This
reduces the performance of the job.
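A back-of-the-envelope illustration of the splittability point. This mirrors the scheduling arithmetic only, not Hadoop's actual InputFormat API, and the block size and file sizes are assumptions:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # a common HDFS block size in 2012: 64 MB

def map_tasks(file_size: int, splittable: bool) -> int:
    """Roughly how many mappers a single input file produces."""
    if not splittable:  # e.g. a .gz file: one mapper, no matter the size
        return 1
    return math.ceil(file_size / BLOCK_SIZE)

one_gb = 1024 * 1024 * 1024
print(map_tasks(one_gb, splittable=True))   # plain text: 16 mappers
print(map_tasks(one_gb, splittable=False))  # gzip: 1 mapper
```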
> As Vinod mentioned, Snappy is getting some traction. Definitely worth a look.
> Good luck!
> Mark
> On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <> wrote:
>> But it may pay off by saving on network IO while copying the data during
the reduce phase, though that will vary from case to case. We had good
results using the Snappy codec for compressing map output. Snappy provides
reasonably good compression at a faster rate.
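Snappy itself is not in the Python standard library, but zlib's compression levels give a feel for the same trade-off Vinod describes: a lighter setting runs faster and still cuts the bytes shuffled over the network. The sample data is made up, and zlib is only a stand-in for Snappy here:

```python
import zlib

# Repetitive key/value text, loosely resembling map output; real data varies.
data = b"key1=value1&key2=value2&key3=value3\n" * 5_000

fast = zlib.compress(data, 1)  # cheap and fast: a Snappy-like trade-off
best = zlib.compress(data, 9)  # slower, usually smaller

print(len(data), len(fast), len(best))
```

Either setting shrinks the shuffled data dramatically; the cheap one simply burns less CPU doing it.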
>> Thanks,
>> Vinod
>> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <> wrote:
>>> Compression is an overhead when you have a CPU intensive job
>>> Debarshi Basak
>>> Tata Consultancy Services
>>> -----Bejoy Ks wrote: -----
>>> To: "" <>
>>> From: Bejoy Ks <>
>>> Date: 06/06/2012 03:37PM
>>> Subject: Re: Compressed data storage in HDFS - Error
>>> Hi Sreenath
>>> Output compression is more useful at the storage level: when a large
file is compressed it occupies fewer HDFS blocks, and thereby the cluster
becomes more scalable in terms of the number of files it can hold.
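To put rough numbers on that block savings (the sizes and the 64 MB HDFS block size are assumptions, and the 40 percent figure is Gzip's typical ratio from earlier in the thread):

```python
BLOCK_MB = 64  # assumed HDFS block size

def blocks_needed(size_mb: int) -> int:
    # Each file occupies ceil(size / block_size) blocks of NameNode metadata.
    return -(-size_mb // BLOCK_MB)  # ceiling division

raw_mb = 1024 * 1024             # 1 TB of raw text
gzipped_mb = int(raw_mb * 0.4)   # gzipped to ~40 percent of original

print(blocks_needed(raw_mb))      # 16384 blocks
print(blocks_needed(gzipped_mb))  # 6554 blocks
```

Fewer blocks means less NameNode metadata, which is where the scalability gain Bejoy mentions comes from.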
>>> Yes, the LZO libraries need to be present on all TaskTracker nodes as
well as on the node that hosts the Hive client.
>>> Regards
>>> Bejoy KS
>>> ________________________________
>>> From: Sreenath Menon <>
>>> To:; Bejoy Ks <>
>>> Sent: Wednesday, June 6, 2012 3:25 PM
>>> Subject: Re: Compressed data storage in HDFS - Error
>>> Hi Bejoy
>>> I would like to make this clear.
>>> There is no gain in processing throughput/time from compressing the data
stored in HDFS (not talking about intermediate compression)... right?
>>> And do I need to add the LZO libraries to Hadoop_Home/lib/native on
all the nodes (including the slave nodes)?
