hadoop-common-dev mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: map output compression codec setting issue
Date Wed, 05 Sep 2007 18:30:47 GMT
Riccardo,

On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
>Hi Arun,
>
>thanks for your reply; I am CCing this e-mail to hadoop-dev. I will create
>the appropriate JIRA tickets today. Here are a few insights about my
>experience with Hadoop compression (all my comments apply to 0.13.0):
>

Thanks!

>1. Map output compression: besides the issue I mentioned to you guys about
>choosing two different codecs for the map output and the overall job output,
>it works very well for us. I have been using non-native map output
>compression on jobs that generate over 6 TB of data with no problems. Since
>I am on 0.13.0, because of HADOOP-1193 I could test native LZO on very
>small jobs only. Our benchmarks show no degradation in performance
>whatsoever when using native LZO.

That is good to hear. Please keep us posted on anything you notice with 0.14.* and beyond
(i.e. post-HADOOP-1193).

>2. Compression type configuration: we noticed a small issue with the
>configuration here. If "io.seqfile.compression.type" is set to NONE in
>hadoop-site.xml, M/R jobs will not do any compression and there is no way
>to override it programmatically. As a matter of fact, each worker machine
>ends up using the value read from its local hadoop conf folder. I like the
>fact that each worker reads this property locally when creating generic
>SequenceFile(s), but, IMHO, the behavior of M/R jobs should be set in the
>JobConf only. This issue is very easy to reproduce.

This is a known bug where the JobConf is overridden by hadoop-site.xml; please see:
http://issues.apache.org/jira/browse/HADOOP-785
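To illustrate the symptom outside Hadoop itself, here is a minimal sketch using plain java.util.Properties (not Hadoop's Configuration class; the property name is the one from your report, the rest is a stand-in):

```java
import java.util.Properties;

// Minimal sketch of the HADOOP-785 symptom: a value set programmatically
// on the job's config is clobbered when a worker re-reads its local
// hadoop-site.xml. Plain Properties stand in for Hadoop's Configuration.
public class SitePrecedence {
    static String effectiveType() {
        Properties conf = new Properties();
        // The job submitter asks for compression programmatically...
        conf.setProperty("io.seqfile.compression.type", "BLOCK");
        // ...but the worker then loads its local site file, whose value wins.
        Properties localSite = new Properties();
        localSite.setProperty("io.seqfile.compression.type", "NONE");
        conf.putAll(localSite);
        return conf.getProperty("io.seqfile.compression.type");
    }

    public static void main(String[] args) {
        System.out.println(effectiveType());
    }
}
```

The submitter's BLOCK setting is lost and the worker runs with NONE, which matches the behavior you describe.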

>3. Non-native GzipCodec: the codec returns Java's
>java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when
>native compression is not available. However, lines 197, 238, 299, and 357
>of SequenceFile (basically all the createWriter() methods that select a
>compression codec) will throw an IllegalArgumentException if the GzipCodec
>is selected but the native library is *not* available. Why is that?

The issue with java.util.zip.GZIPInputStream is that it doesn't let you access the underlying
decompressor, hence we cannot do a 'reset' and reuse it; this is required for SequenceFiles.

See http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
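For the curious, the reuse that GZIPInputStream blocks is easy to see with the raw java.util.zip classes, which do expose reset(). A small sketch (the helper method is mine, just for illustration):

```java
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Why SequenceFile needs access to the underlying decompressor: the raw
// Inflater exposes reset(), so one instance can decode record after record.
// GZIPInputStream hides its internal Inflater, so it cannot be reused.
public class InflaterReuse {
    static byte[] inflate(Inflater inf, byte[] compressed, int outLen) throws Exception {
        inf.reset();                 // the call GZIPInputStream does not expose
        inf.setInput(compressed);
        byte[] out = new byte[outLen];
        int n = inf.inflate(out);
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes("UTF-8");
        Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[64];
        byte[] compressed = Arrays.copyOf(buf, def.deflate(buf));

        Inflater inf = new Inflater();
        // The same Inflater decodes the "record" twice thanks to reset().
        System.out.println(new String(inflate(inf, compressed, data.length), "UTF-8"));
        System.out.println(new String(inflate(inf, compressed, data.length), "UTF-8"));
    }
}
```

With GZIPInputStream there is no way to get at that Inflater, so a fresh stream (and a fresh native buffer, in the native case) would be needed per record.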

>4. Reduce reported progress when consuming compressed map outputs: is
>generally incorrect, with reducers reporting over 220% completion. This is
>regardless of whether native compression is used or not.

This smells like a bug; please file a JIRA ASAP!
I'm guessing this could be because we are checking the size of the uncompressed key/value
pairs rather than the compressed sizes. Devaraj?
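If that guess is right, the arithmetic alone explains the 220% you saw. A back-of-the-envelope sketch (the 2.2x ratio is hypothetical, picked only to match your report):

```java
// Sketch of the suspected accounting bug: the reducer counts the
// *uncompressed* bytes it consumes but divides by the *compressed*
// total it was told to fetch, so progress sails past 100%.
public class ProgressBug {
    static double progress(long uncompressedBytesRead, long compressedTotal) {
        return (double) uncompressedBytesRead / compressedTotal;
    }

    public static void main(String[] args) {
        long compressedTotal = 1000000L;   // map output size on the wire
        long uncompressedRead = 2200000L;  // bytes seen after decompression
        System.out.printf("%.0f%%%n", progress(uncompressedRead, compressedTotal) * 100);
    }
}
```

A ~2.2:1 compression ratio would thus report exactly the ~220% completion observed, regardless of whether the codec is native.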

thanks,
Arun

>
>Best,
>
>Riccardo
>
>
>On 9/5/07, Arun C Murthy <arunc@yahoo-inc.com> wrote:
>>
>> Hi Riccardo,
>>
>> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
>> >Thanks Devaraj, good to hear from you.
>> >
>> >Actually, if you guys are interested, I have been testing Hadoop
>> >compression (native and non-native) for the last 5 days on a cluster of
>> >200 machines (running 0.12.3, with HDFS as the file system). I have a few
>> >insights you guys might be interested in. I am just trying to figure out
>> >what the proper channels would be; that is why I contacted you first.
>> >Thanks.
>> >
>>
>> You are absolutely correct. Please file a jira (and a patch if you are so
>> inclined! *smile*) to request a separate property for the 2 codecs.
>>
>> We'd love to hear any insights/opinion/ideas about the compression stuff
>> you've been working on, please don't hesitate to mail hadoop-dev@ or file
>> jira issues about any of them...
>>
>> thanks!
>> Arun
>>
>> >Riccardo
>> >
>> >
>> >On 9/4/07, Devaraj Das <ddas@yahoo-inc.com> wrote:
>> >>
>> >>  Hi Riccardo,
>> >> Thanks for contacting me. I am doing good and hope you are doing great
>> >> too!
>> >> I am copying this mail to Arun who is our compression expert. Arun pls
>> >> respond to the mail.
>> >> Thanks,
>> >> Devaraj
>> >>
>> >>  ------------------------------
>> >> *From:* Nt Never [mailto:ntnever@gmail.com]
>> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
>> >> *To:* ddas@yahoo-inc.com
>> >> *Subject:* map output compression codec setting issue
>> >>
>> >> Hi Devaraj,
>> >>
>> >> how have you been doing? I finally got around to doing some extensive
>> >> testing with Hadoop's compression. I am aware of HADOOP-1193 and
>> >> HADOOP-1545, so I am waiting for the release of 0.15.0 before I do more
>> >> benchmarks. However, I noticed what seems to be a bug in JobConf. The
>> >> property "mapred.output.compression.codec" is used both when setting and
>> >> when getting the map output compression codec, thus making it impossible
>> >> to use different codecs for the map outputs and the overall job outputs.
>> >> The methods that affect this behavior are in lines 341-371 of JobConf in
>> >> Hadoop 0.13.0:
>> >>
>> >> /**
>> >>  * Set the given class as the compression codec for the map outputs.
>> >>  * @param codecClass the CompressionCodec class that will compress the
>> >>  *                   map outputs
>> >>  */
>> >> public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass) {
>> >>   setCompressMapOutput(true);
>> >>   setClass("mapred.output.compression.codec", codecClass,
>> >>            CompressionCodec.class);
>> >> }
>> >>
>> >> /**
>> >>  * Get the codec for compressing the map outputs
>> >>  * @param defaultValue the value to return if it is not set
>> >>  * @return the CompressionCodec class that should be used to compress the
>> >>  *   map outputs
>> >>  * @throws IllegalArgumentException if the class was specified, but not found
>> >>  */
>> >> public Class<? extends CompressionCodec> getMapOutputCompressorClass(
>> >>     Class<? extends CompressionCodec> defaultValue) {
>> >>   String name = get("mapred.output.compression.codec");
>> >>   if (name == null) {
>> >>     return defaultValue;
>> >>   } else {
>> >>     try {
>> >>       return getClassByName(name).asSubclass(CompressionCodec.class);
>> >>     } catch (ClassNotFoundException e) {
>> >>       throw new IllegalArgumentException("Compression codec " + name +
>> >>                                          " was not found.", e);
>> >>     }
>> >>   }
>> >> }
>> >>
>> >> This could be easily fixed by using a different property, for example,
>> >> "map.output.compression.codec". Should I create an issue on JIRA for
>> >> this? Thanks.
>> >>
>> >> Riccardo
>> >>
>> >>
>>
