hadoop-common-user mailing list archives

From: Lance Amundsen <lc...@us.ibm.com>
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date: Wed, 24 Oct 2007 19:36:33 GMT
OK, that is encouraging.  I'll take another pass at it.  I succeeded
yesterday with an in-memory-only InputFormat, but only after I commented
out some of the split-referencing code, like the following in MapTask.java:

    // FileSplit-specific bookkeeping: these job properties only make
    // sense when the split actually describes a region of a file.
    if (instantiatedSplit instanceof FileSplit) {
      FileSplit fileSplit = (FileSplit) instantiatedSplit;
      job.set("map.input.file", fileSplit.getPath().toString());
      job.setLong("map.input.start", fileSplit.getStart());
      job.setLong("map.input.length", fileSplit.getLength());
    }

But maybe I simply need to override more methods in more of the embedded
classes.  You can see why I was wondering about the file system
dependencies.
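A minimal file-free split, for illustration, might look like the sketch
below.  It is written against the old org.apache.hadoop.mapred InputSplit
interface; the class name and the record-count field are illustrative, not
anything from Hadoop itself.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;

// A split that carries no file information at all, only a record count.
public class InMemorySplit implements InputSplit {
  private int numRecords;   // how many records this map should consume

  public InMemorySplit() {}                     // the framework instantiates
  public InMemorySplit(int numRecords) {        // splits reflectively, so a
    this.numRecords = numRecords;               // no-arg constructor is needed
  }

  public long getLength() { return numRecords; }            // size hint, not bytes
  public String[] getLocations() { return new String[0]; }  // no host preference

  // InputSplit extends Writable: the framework serializes each split and the
  // child task deserializes it, so write/readFields must round-trip exactly.
  public void write(DataOutput out) throws IOException {
    out.writeInt(numRecords);
  }
  public void readFields(DataInput in) throws IOException {
    numRecords = in.readInt();
  }
}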

Doug Cutting wrote on 10/24/2007 09:02:

Lance Amundsen wrote:
> I am starting to wonder if it might indeed be impossible to get map jobs
> running w/o writing to the file system.... as in, not w/o some major
> changes to the job and task tracker code.
> I was thinking about creating an InputFormat that does no file I/O, but
> is queue based.  As mappers start up, their getRecordReader calls get
> re-directed to a remote queue to pull one or more records off of.  But I
> am starting to wonder if the file system dependencies in the code are
> such that I could never completely avoid using files.  Specifically, even
> if I completely re-write an InputFormat, the framework is still going to
> try to do Filesystem stuff on everything I return (the extensive internal
> use of splits is baffling me some).

Nothing internally should depend on an InputSplit representing a file.
You do need to be able to generate the complete set of splits when the
job is launched.  So if you wanted each map task to poll a queue, you'd
need to know how long the queue is when the job is launched, so that you
could generate the right number of polling splits.
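Concretely, a queue-backed InputFormat along those lines might look like
the sketch below, reusing the InMemorySplit sketched earlier in this
thread.  It is illustrative only: the queue.input.length property and
fetchFromQueue() are placeholder stand-ins for the real queue hookup, and
it targets the old org.apache.hadoop.mapred interfaces, whose signatures
shifted somewhat across 0.x releases.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class QueueInputFormat implements InputFormat<LongWritable, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // The whole set of splits is generated here, at job launch, so the
    // queue length must be known (or capped) up front.
    int queueLength = job.getInt("queue.input.length", 0);  // placeholder property
    numSplits = Math.max(1, numSplits);
    InputSplit[] splits = new InputSplit[numSplits];
    for (int i = 0; i < numSplits; i++) {
      splits[i] = new InMemorySplit(queueLength / numSplits);  // split sketched earlier
    }
    return splits;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final long toRead = split.getLength();  // record budget for this map
    return new RecordReader<LongWritable, Text>() {
      private long read = 0;

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return read; }
      public float getProgress() { return toRead == 0 ? 1.0f : read / (float) toRead; }
      public void close() {}

      // Each call hands the mapper one record pulled from the remote queue.
      public boolean next(LongWritable key, Text value) throws IOException {
        if (read >= toRead) return false;
        key.set(read++);
        value.set(fetchFromQueue());  // stand-in for the real remote-queue poll
        return true;
      }

      private String fetchFromQueue() { return "record"; }  // placeholder
    };
  }
}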

