Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
To: hadoop-user@lucene.apache.org
From: Lance Amundsen
Date: Mon, 22 Oct 2007 20:31:32 -0700

What I am trying to do is see what it would take to modify the Hadoop framework for transaction-based processing, so right now I am just trying to get to the point where I can start looking at the hard stuff. I am still blocked by simple control issues. The workload I am measuring at this point is nothing more than a print statement, i.e. nothing. Startup costs are currently linear with respect to the number of nodes, depending on the input model. I first need to find or create a model that has a flat startup cost for n nodes, at which point I can start to tackle the actual path-length and latency issues of the startup itself. This is all just investigative at this point, but I can already envision changes that would allow process startup and teardown in less than 1 second, growing nearly flatly as nodes increase.

Ted Dunning wrote on 10/22/2007 07:29 PM (please respond to hadoop-user@lucene.apache.org):

You probably have determined by now that there is a parameter that determines how many concurrent maps there are:

  mapred.tasktracker.tasks.maximum = 3
  "The maximum number of tasks that will be run simultaneously by a
  task tracker."

Btw, I am still curious about your approach. Isn't it normally better to measure marginal costs such as this startup cost by linear regression as you change parameters?
It seems that otherwise you will likely be misled by what happens at the boundaries, when what you really want is what happens in the normal operating region.

On 10/22/07 5:53 PM, "Lance Amundsen" wrote:
> ...
>
> Next I want to increase the concurrent # of tasks being executed for each
> node... currently it seems like 2 or 3 is the upper limit (at least on the
> earlier binaries I was running).
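For anyone following along: the parameter Ted mentions is normally overridden per node in the site configuration file. A minimal sketch, assuming the single combined limit used by Hadoop releases of this era (the value 8 is an arbitrary example, not a recommendation):

```xml
<!-- hadoop-site.xml (sketch): raise the per-TaskTracker task limit -->
<configuration>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>8</value>
    <description>The maximum number of tasks that will be run
      simultaneously by a task tracker.</description>
  </property>
</configuration>
```

Each TaskTracker reads this at startup, so the daemons need a restart for the new limit to take effect.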
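Ted's regression suggestion can be sketched as follows: time the same no-op job at several cluster sizes, fit a line, and read the marginal per-node startup cost off the slope rather than off any single boundary point. The (nodes, seconds) samples below are invented purely for illustration:

```python
# Ordinary least-squares fit of job wall time vs. cluster size.
# slope  ~ marginal startup cost added per node
# intercept ~ fixed overhead independent of cluster size

def fit_line(points):
    """Least-squares fit of y = intercept + slope * x over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical measurements: wall time of a do-nothing job at each size.
samples = [(1, 12.0), (2, 19.5), (4, 35.0), (8, 66.0), (16, 128.0)]

fixed, per_node = fit_line(samples)
print("fixed overhead ~ %.1f s" % fixed)      # ~ 4.1 s
print("cost per node  ~ %.1f s" % per_node)   # ~ 7.7 s
```

A flat-startup model would show up here as a slope near zero; the linear-with-nodes behavior Lance describes shows up as a large, stable slope across the operating region.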