hadoop-common-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Thu, 25 Oct 2007 17:38:23 GMT
I did a patch last year that got similar improvements but still using an 
external process. (I really like the idea of keeping user code out of the 
JobTracker and the TaskTracker. It makes things more stable.) See HADOOP-249. 
It reuses the JVM for a task, which avoids the JVM restart hit. This hit is 
really bad for cases such as yours. It also avoids the performance hit of 
doing socket I/O for progress and task info, and instead uses the process 
pipe, which also gives a big performance improvement.

Unfortunately, it was never incorporated and now the patch no longer applies. 
It's really not a big change, but the Hadoop code path to spawn the JVM is a 
bit convoluted, which made it hard to make the change and makes it hard to 
bring the patch up to date.
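
To illustrate the pipe idea only: below is a minimal sketch, not the HADOOP-249 code, of a parent reading progress reports from a child's output pipe instead of a socket. The class name, the method name, and the "PROGRESS <fraction>" line format are all assumptions made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the parent consumes progress lines written by a child
// task to its standard output, which the parent sees as a pipe.
public class PipeProgressSketch {
    // Assumed line format for illustration only: "PROGRESS <fraction>".
    private static final String PREFIX = "PROGRESS ";

    // Reads the stream to its end and returns the last progress value seen,
    // or 0.0 if the child never reported any. Other lines are ignored.
    public static float readLastProgress(InputStream childOutput) {
        float last = 0.0f;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(childOutput, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(PREFIX)) {
                    last = Float.parseFloat(line.substring(PREFIX.length()).trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return last;
    }
}
```

In a real parent process the stream would come from Process.getInputStream() on a child started with ProcessBuilder, so no socket setup is needed per task.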


On Thursday 25 October 2007 10:19:59 Lance Amundsen wrote:
> So I managed to get my fast InputFormat working.... it does still use the
> FS, but in such a way that it improves mapper startup by over 2X.  And last
> night I got a prototype working that allows the map task to run under the
> JVM of the TaskTracker, rather than spawning a new JVM.
> The initial performance looks really, really good.  I just ran a 1000-map,
> single-input-record job (with the mappers doing no work, however) in a
> one-master, one-slave setup... on my laptop....  It completed in a couple
> thousand seconds, or a couple seconds per map.  Earlier I did a smaller
> 100-map job on a stable, quiesced system and it came in at about 130 seconds.
> So this prototype can start and end map jobs in 1-2 seconds, and should
> scale flatly with respect to nodes in the setup.
> From: "Owen O'Malley" <oom@yahoo-inc.com>
> Date: 10/24/2007 01:05 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
> Please respond to: hadoop-user@lucene.apache.org
> On Oct 24, 2007, at 12:42 PM, Doug Cutting wrote:
> > Lance Amundsen wrote:
> >> OK, that is encouraging.  I'll take another pass at it.  I succeeded
> >> yesterday with an in-memory only InputFormat, but only after I
> >> commented out some of the split referencing code, like the following
> >> in MapTask.java
> >>     if (instantiatedSplit instanceof FileSplit) {
> >>       FileSplit fileSplit = (FileSplit) instantiatedSplit;
> >>       job.set("map.input.file", fileSplit.getPath().toString());
> >>       job.setLong("map.input.start", fileSplit.getStart());
> >>       job.setLong("map.input.length", fileSplit.getLength());
> >>     }
> >
> > Yes, that code should not exist, but it shouldn't affect you
> > either. You should be subclassing InputSplit, not FileSplit, so
> > this code shouldn't operate on your splits.
> That code doesn't do anything if they are non-file splits, so it
> absolutely shouldn't break anything. Applications depend on those
> attributes to know which split they are working on and there isn't a
> better fix until we move to context objects. I know that non-file
> splits work because there are unit tests to make sure they don't
> break anything.
> -- Owen
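
The point made by Doug and Owen above, that the instanceof guard only fires for file splits, can be checked with a small self-contained sketch. The InputSplit and FileSplit types here are simplified stand-ins written for this example, not the Hadoop classes, MemorySplit is a hypothetical in-memory split, and a plain Map stands in for the job configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class SplitGuardSketch {
    // Stand-in for Hadoop's InputSplit; NOT the real interface.
    public interface InputSplit {
        long getLength();
    }

    // Stand-in for Hadoop's FileSplit; NOT the real class.
    public static final class FileSplit implements InputSplit {
        private final String path;
        private final long start;
        private final long length;

        public FileSplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        public String getPath() { return path; }
        public long getStart() { return start; }
        @Override public long getLength() { return length; }
    }

    // Hypothetical in-memory split that implements InputSplit directly.
    public static final class MemorySplit implements InputSplit {
        private final long records;

        public MemorySplit(long records) { this.records = records; }
        @Override public long getLength() { return records; }
    }

    // Mirrors the guard quoted from MapTask.java: the map.input.* attributes
    // are set only when the split is a FileSplit; other splits are untouched.
    public static Map<String, String> describe(InputSplit split) {
        Map<String, String> job = new HashMap<>();
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            job.put("map.input.file", fileSplit.getPath());
            job.put("map.input.start", Long.toString(fileSplit.getStart()));
            job.put("map.input.length", Long.toString(fileSplit.getLength()));
        }
        return job;
    }
}
```

Calling describe with a MemorySplit returns an empty map, while a FileSplit yields the three map.input.* entries, which is why a split that subclasses InputSplit directly should pass through the guard unaffected.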
