hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Wed, 17 Oct 2007 19:49:01 GMT

In practice, most jobs involve many more records than there are available
mappers (even for large clusters).

This means every mapper handles many records, so mapper startup cost is
amortized over all of them.

It would still be nice to have a smaller startup cost, but the limiting
factor is likely to be the job tracker shipping all of the jar files to the
task trackers, not the map construction time itself.

If you really care about map instantiation time, you could start by running
the maps in the same VM.  That doesn't sound like a good trade-off to me,
which in turn tells me that I don't care that much about startup costs.

It is not all that surprising if small jobs are not something that can be
sped up.  The fact that parallelism is generally easier to attain for large
problems has been noticed for some time.
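(Editorial note: the "same VM" option above can be sketched as a config
fragment, assuming the Hadoop 0.x-era property names for the local job
runner; the exact names may differ in other versions.)

```
<!-- Run the whole job in a single local VM (LocalJobRunner), so no
     per-task remote startup cost is paid.  Property name as in
     Hadoop 0.x; sketch only. -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```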


On 10/17/07 11:47 AM, "Lance Amundsen" <lca13@us.ibm.com> wrote:

> For right now, I am testing boundary conditions related to startup costs.
> I want to build a mapper interface that performs relatively flatly WRT the
> number of mappers.  My goal is to dramatically reduce the startup cost for
> one mapper, and then make sure that that cost does not increase
> dramatically as nodes, maps, and records are increased.
> 
> Example: let's say I have 10K one-second jobs and I want the whole thing to
> run in 2 seconds.  I currently see no way for Hadoop to achieve this, but I
> also see how to get there, and this level of granularity would be one of
> the requirements..... I believe.
> 
> Lance
> 
> IBM Software Group - Strategy
> Performance Architect
> High-Performance On Demand Solutions (HiPODS)
> 
> 650-678-8425 cell
> 
> 
> 
> 
> Ted Dunning <tdunning@veoh.com> wrote on 10/17/2007 11:34 AM to
> hadoop-user@lucene.apache.org (Subject: Re: InputFiles, Splits, Maps,
> Tasks Questions 1.3 Base):
> 
> On 10/17/07 10:37 AM, "Lance Amundsen" <lca13@us.ibm.com> wrote:
> 
>> 1 file per map, 1 record per file, isSplitable(true or false):  yields 1
>> record per mapper
> 
> Yes.
> 
>> 1 file total, n records, isSplitable(true):  Yields variable n records
>> per variable m mappers
> 
> Yes.
> 
>> 1 file total, n records, isSplitable(false):  Yields n records into 1
>> mapper
> 
> Yes.
> 
>> What I am immediately looking for is a way to do:
>> 
>> 1 file total, n records, isSplitable(true): Yields 1 record into n
>> mappers
>> 
>> But ultimately I need to fully control the file/record distributions.
> 
> Why in the world do you need this level of control?  Isn't that the point
> of frameworks like Hadoop? (to avoid the need for this)
> 
> 
> 
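
(Editorial note: the three behaviors Lance enumerates above follow from the
split-size rule used by FileInputFormat-style input formats.  The sketch
below is illustrative, not Hadoop source: it shows the old-API arithmetic
where the requested map count is only a hint, capped by the block size.
The file size, split count, and 64 MB block size are assumed example
values.  With isSplitable == false the whole file becomes one split, so
all n records go to one mapper; one record per mapper would need a custom
InputFormat that emits one split per record.)

```java
// Illustrative sketch of the FileInputFormat-style split-size rule:
//   goalSize  = totalSize / numSplits   (numSplits is only a hint)
//   splitSize = max(minSize, min(goalSize, blockSize))
public class SplitMath {
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 1L << 30;   // 1 GB input file (assumed)
        int numSplits = 8;           // requested map count (hint only)
        long blockSize = 64L << 20;  // 64 MB block size (0.x-era default)
        long goal = totalSize / numSplits;   // 128 MB per requested split
        // The block-size cap wins, so actual splits are 64 MB each and
        // the job gets 16 mappers, not the 8 that were requested.
        System.out.println(splitSize(goal, 1L, blockSize)); // 67108864
    }
}
```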

