Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <22632008.459411285749214911.JavaMail.jira@thor>
Date: Wed, 29 Sep 2010 04:33:34 -0400 (EDT)
From: "Niels Basjes (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-2094)
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable
 implements unsafe default behaviour that is different from the documented
 behaviour.
In-Reply-To: <15850725.439721285666895824.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916076#action_12916076 ] 

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

bq. Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses. 

If you mean with "breaking" :  "Some subclasses will see an unexpected performance degradation" . then Yes, that will most likely occur ( first one I think of is SequenceFileInputFormat ).
I however do not expect any functional "breaking" of the output of these subclasses.

bq. Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect. 

Yes. 
The downside of this option is that if subclasses want to have detection depending on the compression I expect a lot of code duplication to occur. 
This code duplication is already occurring within the main code base in KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat and their old API counterparts (I found a total of 5 duplicates of the same isSplittable implementation).

bq. Perhaps the javadoc should just be clarified to better document this default?

Definitely an option. However this would not fix the effect in the existing subclasses.

I just did a quick manual code check of the current trunk and I found that the following classes are derived from FileInputFormat yet do not  implement the isSplittable method (and thus use "return true").
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
* ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java

I expect that the NLineInputFormat and the AutoInputFormat will affected by this "large gzip" bug.
So expect that simply fixing the isSplittable documentation would lead to the need to fix *at least* these two classes.

As far as I understand the SequenceFileInputFormat can only be compressed using a "splittable" compression, so the "return true;" from FileInputFormat will work fine there.

Overall I still prefer the clean option of  returning the correct value depending on the compression. That would effectively leave the behavior in most use cases unchanged. Yet in those cases where splitting is known to cause problems it would avoid those problems. Thus avoiding major issues like the ones we had and described in HADOOP-6901.
For the SequenceFileInputFormat it may be needed to implement the isSplittable as "return true;"

Effectively the set of changes I propose (in both the old and new API versions of these classes):
1) FileInputFormat . isSplittable  gets the implementation as seen in TextInputFormat
2) The isSplittable implementation is removed from  KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat (useless code duplication)
3) The isSplittable implementation "return true" is added to SequenceFileInputFormat. Given the fact that you cannot gzip a sequencefile I expect this to be an optional fix.


> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2094
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>            Reporter: Niels Basjes
>
> When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. 
> A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed.
> It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply "return true;". 
> This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. " . The actual implementation effectively does "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. "
> For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplittable in our class that does "return false; "
> Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
> # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.