Return-Path: Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org Received: (qmail 47320 invoked from network); 29 Sep 2010 08:34:00 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 29 Sep 2010 08:34:00 -0000 Received: (qmail 50597 invoked by uid 500); 29 Sep 2010 08:34:00 -0000 Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org Received: (qmail 50493 invoked by uid 500); 29 Sep 2010 08:33:57 -0000 Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mapreduce-issues@hadoop.apache.org Delivered-To: mailing list mapreduce-issues@hadoop.apache.org Received: (qmail 50485 invoked by uid 99); 29 Sep 2010 08:33:56 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Sep 2010 08:33:56 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.22] (HELO thor.apache.org) (140.211.11.22) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Sep 2010 08:33:55 +0000 Received: from thor (localhost [127.0.0.1]) by thor.apache.org (8.13.8+Sun/8.13.8) with ESMTP id o8T8XYOs014626 for ; Wed, 29 Sep 2010 08:33:35 GMT Message-ID: <22632008.459411285749214911.JavaMail.jira@thor> Date: Wed, 29 Sep 2010 04:33:34 -0400 (EDT) From: "Niels Basjes (JIRA)" To: mapreduce-issues@hadoop.apache.org Subject: [jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. In-Reply-To: <15850725.439721285666895824.JavaMail.jira@thor> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916076#action_12916076 ] Niels Basjes commented on MAPREDUCE-2094: ----------------------------------------- bq. Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses. If you mean with "breaking" : "Some subclasses will see an unexpected performance degradation" . then Yes, that will most likely occur ( first one I think of is SequenceFileInputFormat ). I however do not expect any functional "breaking" of the output of these subclasses. bq. Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect. Yes. The downside of this option is that if subclasses want to have detection depending on the compression I expect a lot of code duplication to occur. This code duplication is already occurring within the main code base in KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat and their old API counterparts (I found a total of 5 duplicates of the same isSplittable implementation). bq. Perhaps the javadoc should just be clarified to better document this default? Definitely an option. However this would not fix the effect in the existing subclasses. I just did a quick manual code check of the current trunk and I found that the following classes are derived from FileInputFormat yet do not implement the isSplittable method (and thus use "return true"). * ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java * ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java * ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java I expect that the NLineInputFormat and the AutoInputFormat will affected by this "large gzip" bug. So expect that simply fixing the isSplittable documentation would lead to the need to fix *at least* these two classes. As far as I understand the SequenceFileInputFormat can only be compressed using a "splittable" compression, so the "return true;" from FileInputFormat will work fine there. Overall I still prefer the clean option of returning the correct value depending on the compression. That would effectively leave the behavior in most use cases unchanged. Yet in those cases where splitting is known to cause problems it would avoid those problems. Thus avoiding major issues like the ones we had and described in HADOOP-6901. For the SequenceFileInputFormat it may be needed to implement the isSplittable as "return true;" Effectively the set of changes I propose (in both the old and new API versions of these classes): 1) FileInputFormat . isSplittable gets the implementation as seen in TextInputFormat 2) The isSplittable implementation is removed from KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat (useless code duplication) 3) The isSplittable implementation "return true" is added to SequenceFileInputFormat. Given the fact that you cannot gzip a sequencefile I expect this to be an optional fix. > org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. > ------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: MAPREDUCE-2094 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: task > Affects Versions: 0.20.1, 0.20.2, 0.21.0 > Reporter: Niels Basjes > > When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. > A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed. > It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply "return true;". > This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. " . The actual implementation effectively does "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. " > For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplittable in our class that does "return false; " > Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): > # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. > # "Force" developers to think about it and make this method abstract. > # Use a "safe" default (i.e. return false) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.