hadoop-common-user mailing list archives

From Lance Amundsen <lc...@us.ibm.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Thu, 25 Oct 2007 17:19:59 GMT
So I managed to get my fast InputFormat working. It still uses the
FS, but in a way that improves mapper startup by over 2X.  And last
night I got a prototype working that allows the map task to run under the
JVM of the TaskTracker, rather than spawning a new JVM.

The initial performance looks really, really good.  I just ran a 1000-map,
single-input-record job (mappers doing no work, however) in a one-master,
one-slave setup on my laptop.  It completed in a couple thousand
seconds, or a couple of seconds per map.  Earlier I did a smaller 100-map job
on a stable, quiesced system and it came in at about 130 seconds.

So this prototype can start and end map tasks in 1-2 seconds each, and should
scale flatly with respect to the number of nodes in the setup.

From: "Owen O'Malley"
Date: 10/24/2007 01:05
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

On Oct 24, 2007, at 12:42 PM, Doug Cutting wrote:

> Lance Amundsen wrote:
>> OK, that is encouraging.  I'll take another pass at it.  I succeeded
>> yesterday with an in-memory only InputFormat, but only after I
>> commented out some of the split referencing code, like the following
>> in MapTask.java:
>>     if (instantiatedSplit instanceof FileSplit) {
>>       FileSplit fileSplit = (FileSplit) instantiatedSplit;
>>       job.set("map.input.file", fileSplit.getPath().toString());
>>       job.setLong("map.input.start", fileSplit.getStart());
>>       job.setLong("map.input.length", fileSplit.getLength());
>>     }
> Yes, that code should not exist, but it shouldn't affect you
> either. You should be subclassing InputSplit, not FileSplit, so
> this code shouldn't operate on your splits.

That code doesn't do anything for non-file splits, so it
absolutely shouldn't break anything. Applications depend on those
attributes to know which split they are working on, and there isn't a
better fix until we move to context objects. I know that non-file
splits work because there are unit tests to make sure they don't
break anything.
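
For illustration, here is a minimal sketch of why that guard is a no-op
for splits that are not file splits. It does not use the Hadoop classes:
Split, FileLikeSplit, InMemorySplit and describe() are hypothetical
stand-ins invented for this example, and a plain Map stands in for the job
configuration. Only the three "map.input.*" keys come from the quoted
MapTask.java snippet.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: every type below is a hypothetical stand-in, NOT a Hadoop class.
class SplitGuardSketch {

    // Stand-in for the general split abstraction.
    interface Split {}

    // Stand-in for a file-backed split (path, start offset, length).
    static final class FileLikeSplit implements Split {
        final String path;
        final long start;
        final long length;

        FileLikeSplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Stand-in for an in-memory split that has no file behind it.
    static final class InMemorySplit implements Split {
        final int recordCount;

        InMemorySplit(int recordCount) {
            this.recordCount = recordCount;
        }
    }

    // Mirrors the shape of the quoted guard: the map.input.* attributes are
    // set only when the split is file-backed; any other split leaves the
    // (stand-in) configuration untouched.
    static Map<String, String> describe(Split split) {
        Map<String, String> conf = new HashMap<>();
        if (split instanceof FileLikeSplit) {
            FileLikeSplit fileSplit = (FileLikeSplit) split;
            conf.put("map.input.file", fileSplit.path);
            conf.put("map.input.start", Long.toString(fileSplit.start));
            conf.put("map.input.length", Long.toString(fileSplit.length));
        }
        return conf;
    }

    public static void main(String[] args) {
        // File-backed split: three attributes are recorded.
        System.out.println(describe(new FileLikeSplit("/data/part-0", 0L, 1024L)).size());
        // In-memory split: the guard is skipped, nothing is recorded.
        System.out.println(describe(new InMemorySplit(1)).isEmpty());
    }
}
```

Under these assumptions, a split type that does not extend the file-backed
class never enters the branch, which matches the point that the code should
not operate on such splits.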

-- Owen
