hadoop-common-user mailing list archives

From Tim Broberg <Tim.Brob...@exar.com>
Subject RE: LZO Compression
Date Sun, 30 Oct 2011 17:01:17 GMT
Here are the issues that I'm aware of:

 *   Compression ratios are comparable.
 *   Snappy decompression is about twice as fast as LZO's.
 *   LZO is "splittable": it can be decompressed in pieces natively, without wrapping the data in an Avro or sequence file. This requires a separate pass to generate an index file that identifies where the compressed blocks sit in the main file (see the example command after this list).
 *   LZO has to be downloaded and installed separately because its GPL license is incompatible with Hadoop's Apache license.
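
For reference, generating that index is typically a one-off command along the following lines; the jar path and the .lzo file name here are only illustrative and depend on where hadoop-lzo is installed and what you want to index:

 $ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /user/mark/logs/access_log.lzo

This writes an access_log.lzo.index file alongside the original, which LZO-aware input formats can then use to split the .lzo file across map tasks.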

    - Tim.

________________________________________
From: Mark [static.void.dev@gmail.com]
Sent: Sunday, October 30, 2011 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: LZO Compression

Thanks for the info, very helpful.

What's the difference between LZO and Snappy? I like how Cloudera has
Snappy support, so it looks like I'm going to go with that, but I just
wanted to know the tradeoffs.

Thanks again

On 10/29/11 8:52 PM, Harsh J wrote:
> Hey Mark,
>
> (Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)
>
> On 30-Oct-2011, at 7:59 AM, Mark wrote:
>
>> Email was sent a bit prematurely.
>>
>> Anyway. How can one test that LZO compression is configured correctly? I've found
>> multiple sources on how to compile the hadoop-lzo jars and native files, but nowhere did I
>> see a definitive way to test that the installation/configuration is correct.
> You can run the compression codec test on each node, or run a job that reads or writes
> with that codec.
>
> Single node test example, using an available test jar:
>
> $ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop org.apache.hadoop.io.compress.TestCodec -count 1000 -codec com.hadoop.compression.lzo.LzoCodec
>
>> Also, when is this compression enabled? Is it enabled on every file I write? Do I
>> somehow have to specify that I want to use this format? For example, we have a rather large
>> directory of server logs ... /user/mark/logs. How can we enable compression on this directory?
>>
> Compression in HDFS is a purely client-side setting. You can't enable it 'globally'.
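>
> As a rough sketch of what "client-side" means here, a client can wrap its own output stream in a codec as it writes a file (the output path below is just a placeholder, and this assumes hadoop-lzo is on the classpath):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.compress.CompressionCodec;
> import org.apache.hadoop.util.ReflectionUtils;
> import com.hadoop.compression.lzo.LzoCodec;
> import java.io.OutputStream;
>
> // Run inside a method that throws IOException.
> // Compression happens in this client process; HDFS itself just stores the bytes it is given.
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
> CompressionCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);
> OutputStream out = codec.createOutputStream(fs.create(new Path("/user/mark/logs/sample.lzo")));
> out.write("hello, compressed world\n".getBytes());
> out.close();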
>
> For jobs, you may set {File}OutputFormat#setOutputCompressorClass(…) to the desired codec
> class to have final job outputs written with that codec (compression of the output streams is
> toggled by {File}OutputFormat#setCompressOutput(…)). For the transient map-output stage, you can
> use JobConf#setMapOutputCompressorClass(…) and toggle it with JobConf#setCompressMapOutput(…).
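>
> A minimal old-API sketch of those calls (the driver class name and the choice of LZO here are only illustrative):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobConf;
> import com.hadoop.compression.lzo.LzoCodec;
>
> JobConf job = new JobConf(new Configuration(), MyWordCount.class);
> // Compress the final job output with LZO.
> FileOutputFormat.setCompressOutput(job, true);
> FileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
> // Also compress the intermediate map output shipped to the reducers.
> job.setCompressMapOutput(true);
> job.setMapOutputCompressorClass(LzoCodec.class);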
>
> Reading compressed files back again is handled automagically by your Hadoop framework,
> and should require no settings.
>
> Hence, for a fully distributed test of your LZO install (which you have hopefully
> done with Todd's easy packaging tool at https://github.com/toddlipcon/hadoop-lzo-packager), you can
> run a simple parameterized (or mapred-site.xml configured) wordcount via an available example
> jar:
>
> $ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -Dmapred.output.compress=true inputDir outputDir
>
> Hope this helps!
>
> --
> Harsh J

