hadoop-mapreduce-issues mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
Date Thu, 07 Oct 2010 09:25:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918851#action_12918851 ]

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

I just noticed that the Yahoo Hadoop tutorial "[Module 5: Advanced MapReduce Features|http://developer.yahoo.com/hadoop/tutorial/module5.html]"
shows a code example for defining your own [FileInputFormat|http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat].
The example implements a subclass of FileInputFormat that uses LineRecordReader
without overriding isSplitable. I expect this tutorial code to lead people into this bug.

Because this bug only becomes apparent with large non-splittable (for example gzipped) input
files, it is also important to note that almost no one will have a (unit) test that
trips on it.
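
To illustrate why small test inputs hide the problem, here is a rough sketch of the split
arithmetic. It is a simplified model, not Hadoop's actual split computation: the class name,
the 64 MiB split size, and the plain ceiling division are all assumptions made for this example.

```java
// Simplified model of how many input splits (and so map tasks) a file yields.
// Assumption: plain ceiling division; the real FileInputFormat logic differs in detail.
final class SplitCountSketch {

    // Hypothetical split size chosen for this example (64 MiB).
    static final long SPLIT_SIZE = 64L * 1024 * 1024;

    // Number of splits for a file of the given positive size in bytes.
    static long countSplits(long fileSize, boolean splitable) {
        if (!splitable) {
            return 1; // the whole file goes to a single task
        }
        return (fileSize + SPLIT_SIZE - 1) / SPLIT_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long testFile = 10L * 1024;          // 10 KiB unit-test fixture
        long bigFile = 1024L * 1024 * 1024;  // 1 GiB production input

        System.out.println(countSplits(testFile, true));  // prints 1
        System.out.println(countSplits(bigFile, true));   // prints 16
        System.out.println(countSplits(bigFile, false));  // prints 1
    }
}
```

With these assumed numbers a small fixture yields a single split whether or not isSplitable
is overridden, so a test passes either way. Only a large file is cut into several splits,
and for a gzipped file that is where the repeated processing described in the issue shows up.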

> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2094
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>            Reporter: Niels Basjes
>
> When implementing a custom derivative of FileInputFormat we ran into the effect that
> a large gzipped input file would be processed several times.
> A file of nearly 1 GiB would be processed around 36 times in its entirety, producing garbage
> results and taking up a lot more CPU time than needed.
> It took a while to figure out. What we found is that the default implementation of
> the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat|http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup]
> is simply "return true;".
> This is a very unsafe default and it contradicts the JavaDoc of the method,
> which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed,
> it will not be." The actual implementation effectively does: "Is the given filename splitable?
> Always true, even if the file is stream compressed using an unsplittable compression codec."
> For our situation (where we always have gzipped input) we took the easy way out and simply
> implemented an isSplitable in our class that does "return false;".
> Now there are essentially 3 ways I can think of for fixing this (in order of what I would
> find preferable):
> # Implement something that looks at the compression used for the file (i.e. migrate
> the implementation from TextInputFormat to FileInputFormat). This would make the method do
> what the JavaDoc describes.
> # "Force" developers to think about it by making this method abstract.
> # Use a "safe" default (i.e. return false).
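
A minimal sketch of the first option, falling back to the safe default from the third when
nothing is known about the file. It is self-contained and does not use the Hadoop API: the
class name and the hard-coded suffix list are assumptions for illustration only. In Hadoop
itself the decision would presumably come from a compression codec lookup, as in
TextInputFormat, rather than from a fixed list of file name suffixes.

```java
import java.util.Locale;

// Illustrative stand-in for a compression-aware isSplitable check.
final class SplitabilitySketch {

    // Hypothetical list of suffixes treated as stream compressed (not splittable).
    private static final String[] STREAM_COMPRESSED = {".gz", ".deflate"};

    static boolean isSplitable(String fileName) {
        if (fileName == null || fileName.isEmpty()) {
            return false; // safe default: never split when nothing is known
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String suffix : STREAM_COMPRESSED) {
            if (lower.endsWith(suffix)) {
                return false; // stream compressed: must be read from the start
            }
        }
        return true; // uncompressed input can be split
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("input.txt"));      // prints true
        System.out.println(isSplitable("part-00000.gz"));  // prints false
        System.out.println(isSplitable(null));             // prints false
    }
}
```

The point of the design is that "true" has to be earned: a file is split only when the check
positively identifies it as safe to split, which is the behaviour the JavaDoc describes.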

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

