hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1981) Improve getSplits performance by using listFiles, the new FileSystem API
Date Sat, 14 Aug 2010 00:56:17 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chris Douglas updated MAPREDUCE-1981:

    Status: Open  (was: Patch Available)

I'm pleased to see this feature propagate to MR. The approach looks correct, just a few comments:

* It looks like this change:
-    return result.toArray(new FileStatus[result.size()]);
+    return result.toArray(new LocatedFileStatus[result.size()]);
causes {{TestMapRed}} to fail. {{SequenceFileInputFormat}} (and, presumably, other subtypes
of {{FileInputFormat}}) may rely on the array returned from {{FileInputFormat}} having
type {{FileStatus[]}}.
* I think the HDFS fault injection is breaking the publishing of the HDFS artifact, so the mapred
tests currently do not pick up the change to the HDFS ClientProtocol and {{TestSubmitJob}}
fails to compile. However, the patch is current with HDFS trunk, and disabling the fault injection
before running mvn-install, etc. works. Is this problem being tracked in HDFS?
* The patch causes {{TestNoDefaultsJobConf}} to fail:
Testcase: testNoDefaults took 4.489 sec
  Caused an ERROR
No AbstractFileSystem for scheme: hdfs
org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: hdfs
  at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:143)
  at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:198)   
  at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:394)      
  at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:409)      
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:188)       
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:234)        
  at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:461)      
  at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:453)
  at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:354)   
  at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1037)                       
  at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1034)                       
  at java.security.AccessController.doPrivileged(Native Method)                 
  at javax.security.auth.Subject.doAs(Subject.java:396)                         
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1030)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:1034)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:536)           
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:781)              
  at org.apache.hadoop.conf.TestNoDefaultsJobConf.testNoDefaults(TestNoDefaultsJobConf.java:83)
* Unfortunately, {{FileInputFormat::addInputPathRecursively}} could have been overridden by users.
This should either be marked as an incompatible change, or the function should be deprecated
with its functionality preserved. It may also be worth confirming that no test relies on it.
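
To illustrate the first point, here is a minimal sketch of one way the array's runtime type could matter. It assumes (this is not confirmed above) that a subclass writes a plain {{FileStatus}} back into the array it gets from the superclass. {{Status}}, {{LocatedStatus}}, and {{ArrayTypeDemo}} are hypothetical stand-ins so the example runs without Hadoop on the classpath; they are not the Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FileStatus (not the Hadoop class).
class Status {
    final String path;
    Status(String path) { this.path = path; }
}

// Hypothetical stand-in for LocatedFileStatus (not the Hadoop class).
class LocatedStatus extends Status {
    LocatedStatus(String path) { super(path); }
}

class ArrayTypeDemo {
    // Returns an array whose runtime type is Status[], as before the change.
    static Status[] listAsPlain(List<Status> result) {
        return result.toArray(new Status[result.size()]);
    }

    // Returns an array whose runtime type is LocatedStatus[], as after the change.
    // The static return type is still Status[] because Java arrays are covariant.
    static Status[] listAsLocated(List<Status> result) {
        return result.toArray(new LocatedStatus[result.size()]);
    }

    // Mimics a caller that replaces an element with a plain Status.
    // Returns true if the store succeeded, false if the JVM rejected it.
    static boolean canStorePlain(Status[] files) {
        try {
            files[0] = new Status("replacement");
            return true;
        } catch (ArrayStoreException e) {
            // The array's runtime element type does not accept a plain Status.
            return false;
        }
    }

    public static void main(String[] args) {
        List<Status> result = new ArrayList<>();
        result.add(new LocatedStatus("/data/part-00000"));

        System.out.println(canStorePlain(listAsPlain(result)));   // prints "true"
        System.out.println(canStorePlain(listAsLocated(result))); // prints "false"
    }
}
```

Both methods compile against the same {{Status[]}} signature, so a break of this kind would only show up at run time. Whether this is what actually trips {{TestMapRed}} would need to be checked against the test output.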

> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>         Attachments: mapredListFiles.patch, mapredListFiles1.patch, mapredListFiles2.patch
> This jira will make FileInputFormat and CombineFileInputFormat use the new API, thus
> reducing the number of RPCs to the HDFS NameNode.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
