hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15403) FileInputFormat recursive=false fails instead of ignoring the directories.
Date Mon, 23 Apr 2018 18:02:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448602#comment-16448602 ]

Jason Lowe commented on HADOOP-15403:

bq. would a change in config be ok?

A change in the default value for a config is arguably the same thing as a code change that
changes the default behavior from the perspective of a user.

To be clear, I'm not saying we can't ever change the default behavior, but we need to be careful
about the ramifications.  If we do, it needs to be marked as an incompatible change and have
a corresponding release note that clearly explains the potential for silent data loss relative
to the old behavior and what users can do to restore the old behavior.

Given that the non-recursive behavior has been this way for quite a long time, either users
aren't running into this very often or they've set the value to recursive.  That leads me
to suggest adding the ability to ignore directories but _not_ making it the default.  Then we
don't have a backward incompatibility, and the Hive case you're trying can still work once
the config is updated (or Hive can run the job with that setting automatically if it makes
sense for that use case).
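
A minimal sketch of the opt-in approach suggested above, in plain Java with no Hadoop dependencies. The `Entry` class, the `ignoreDirs` flag, and the `totalSize` method are hypothetical names chosen for illustration; they are not the FileInputFormat API and not the contents of the attached patch. The point is only that with the flag left at false, a directory still raises the "Not a file" error, so existing jobs would see no change in behavior.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class IgnoreDirsSketch {
    /** Hypothetical stand-in for a file status entry. */
    static final class Entry {
        final String path;
        final boolean directory;
        final long length;

        Entry(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    /**
     * Sums the lengths of the given entries.
     * ignoreDirs == false (the assumed default): a directory fails the call, as before.
     * ignoreDirs == true (opt-in): directories are skipped.
     */
    static long totalSize(List<Entry> entries, boolean ignoreDirs) throws IOException {
        long total = 0;
        for (Entry e : entries) {
            if (e.directory) {
                if (ignoreDirs) {
                    continue; // opt-in: skip the directory instead of failing
                }
                throw new IOException("Not a file: " + e.path);
            }
            total += e.length;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        List<Entry> entries = Arrays.asList(
            new Entry("/tmp/table/part-0", false, 100),
            new Entry("/tmp/table/subdir", true, 0));

        // Opt-in: the directory is skipped and only the file is counted.
        System.out.println(totalSize(entries, true));

        // Default: the old failure is preserved.
        try {
            totalSize(entries, false);
        } catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

Keeping the flag off by default means nothing changes for jobs that rely on the failure to detect unexpected directories, which is the silent-data-loss concern raised above.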

> FileInputFormat recursive=false fails instead of ignoring the directories.
> --------------------------------------------------------------------------
>                 Key: HADOOP-15403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15403
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HADOOP-15403.patch
> We are trying to create a split in Hive that will only read files in a directory and not subdirectories.
> That fails with the below error.
> Given how this error comes about (two pieces of code interact, one explicitly adding directories to results without failing, and one failing on any directories in results), this seems like a bug.
> {noformat}
> Caused by: java.io.IOException: Not a file: file:/,...warehouse/simple_to_mm_text/delta_0000001_0000001_0000
> 	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329) ~[hadoop-mapreduce-client-core-3.1.0.jar:?]
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:553)
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:754)
> 	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:203)
> {noformat}
> This code, when recursion is disabled, adds directories to results 
> {noformat} 
> if (recursive && stat.isDirectory()) {
>   result.dirsNeedingRecursiveCalls.add(stat);
> } else {
>   result.locatedFileStatuses.add(stat);
> }
> {noformat} 
> However the getSplits code after that computes the size like this
> {noformat}
> long totalSize = 0;                           // compute total size
> for (FileStatus file: files) {                // check we have valid files
>   if (file.isDirectory()) {
>     throw new IOException("Not a file: "+ file.getPath());
>   }
>   totalSize +=
> {noformat}
> which would always fail combined with the above code.
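
The interaction described in the issue can be reproduced in isolation. The following is a simplified sketch in plain Java, not the actual Hadoop source: `Status`, `Result`, `classify`, and `totalSize` are hypothetical stand-ins that mirror only the two quoted snippets. It shows that with recursive=false a directory lands in the file list, and the later size check then rejects it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NonRecursiveListingSketch {
    /** Hypothetical stand-in for FileStatus. */
    static final class Status {
        final String path;
        final boolean directory;
        final long length;

        Status(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    /** Mirrors the two result lists named in the first quoted snippet. */
    static final class Result {
        final List<Status> dirsNeedingRecursiveCalls = new ArrayList<>();
        final List<Status> locatedFileStatuses = new ArrayList<>();
    }

    /** Same branching as the first snippet: a directory is set aside only when recursive is true. */
    static Result classify(List<Status> listing, boolean recursive) {
        Result result = new Result();
        for (Status stat : listing) {
            if (recursive && stat.directory) {
                result.dirsNeedingRecursiveCalls.add(stat);
            } else {
                result.locatedFileStatuses.add(stat);
            }
        }
        return result;
    }

    /** Same check as the second snippet: any directory in the list fails the call. */
    static long totalSize(List<Status> files) throws IOException {
        long total = 0;
        for (Status file : files) {
            if (file.directory) {
                throw new IOException("Not a file: " + file.path);
            }
            total += file.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Status> listing = Arrays.asList(
            new Status("/tmp/table/part-0", false, 100),
            new Status("/tmp/table/subdir", true, 0));

        // recursive=false: the directory is treated as a located file...
        Result r = classify(listing, false);
        try {
            totalSize(r.locatedFileStatuses);
        } catch (IOException ex) {
            // ...and the size computation rejects it.
            System.out.println(ex.getMessage());
        }
    }
}
```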

