hadoop-common-user mailing list archives

From Lance Amundsen <lc...@us.ibm.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Thu, 18 Oct 2007 22:03:02 GMT
There are lots of references to decreasing the DFS block size to increase
the maps-to-records ratio.  What is the easiest way to do this?  Is it
possible with the standard SequenceFile class?
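
One route those references describe, sketched minimally under the 0.x-era
API (the path and record contents are invented): the block size is a
per-file, create-time setting, so it can be lowered for just the input
file, and FileInputFormat will then generate roughly one split per block.

    // Sketch, not from the thread: write a SequenceFile whose blocks are
    // 1 MB instead of the 64 MB default, so FileInputFormat produces many
    // more splits (and hence maps) for the same data.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallBlockSeqFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.block.size is read client-side when the file is created,
        // so it can be overridden per file without touching the cluster.
        conf.setLong("dfs.block.size", 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/tmp/small-blocks.seq"),   // invented path
            LongWritable.class, Text.class);
        for (long i = 0; i < 100000; i++) {
          writer.append(new LongWritable(i), new Text("record-" + i));
        }
        writer.close();
      }
    }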

Lance

IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)

650-678-8425 cell




                                                                           
On 10/17/07 12:49 PM, "Ted Dunning" <tdunning@veoh.com> wrote:

In practice, most jobs involve many more records than there are available
mappers (even for large clusters).

This means every mapper handles many records and mapper startup is
amortized pretty widely.

It would still be nice to have a smaller startup cost, but the limiting
factor is likely to be the job tracker shipping all of the jar files to the
task trackers, not actually the map construction time.

If you really care about map instantiation time, you could start by making
the map run in the same VM.  That doesn't sound like a good trade-off to me,
which in turn tells me that I don't care about startup costs so much.
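
A minimal sketch of that same-VM route, assuming the mapred API of the day
(class name invented, job setup elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LocalRun {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        // LocalJobRunner: tasks run inside this JVM, so there is no
        // per-task JVM launch cost, and no cluster parallelism either.
        job.set("mapred.job.tracker", "local");
        job.set("fs.default.name", "file:///");
        // ...set mapper, reducer, and input/output paths as usual, then:
        JobClient.runJob(job);
      }
    }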

It is not all that surprising if small jobs are not something that can be
sped up.  The fact that parallelism is generally easier to attain for large
problems has been noticed for some time.


On 10/17/07 11:47 AM, "Lance Amundsen" <lca13@us.ibm.com> wrote:

> For right now, I am testing boundary conditions related to startup costs.
> I want to build a mapper interface that performs relatively flatly WRT
> numbers of mappers.  My goal is to dramatically improve startup cost for
> one mapper, and then make sure that that startup cost does not increase
> dramatically as nodes, maps, and records are increased.
>
> Example: let's say I have 10K one-second jobs and I want the whole thing
> to run in 2 seconds.  I currently see no way for Hadoop to achieve this,
> but I also see how to get there, and this level of granularity would be
> one of the requirements..... I believe.
>
> Lance
>
> IBM Software Group - Strategy
> Performance Architect
> High-Performance On Demand Solutions (HiPODS)
>
> 650-678-8425 cell
>
>
>
>
>
> On 10/17/07 11:34 AM, "Ted Dunning" <tdunning@veoh.com> wrote:
>
> On 10/17/07 10:37 AM, "Lance Amundsen" <lca13@us.ibm.com> wrote:
>
>> 1 file per map, 1 record per file, isSplitable(true or false):  Yields 1
>> record per mapper
>
> Yes.
>
>> 1 file total, n records, isSplitable(true):  Yields variable n records
>> per variable m mappers
>
> Yes.
>
>> 1 file total, n records, isSplitable(false):  Yields n records into 1
>> mapper
>
> Yes.
>
>> What I am immediately looking for is a way to do:
>>
>> 1 file total, n records, isSplitable(true): Yields 1 record into n
>> mappers
>>
>> But ultimately I need full control over the file/record distributions.
>
> Why in the world do you need this level of control?  Isn't that the point
> of frameworks like Hadoop? (to avoid the need for this)
>
>
>
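
The "1 file total, n records: 1 record into n mappers" case quoted above
amounts to a custom InputFormat whose getSplits() emits one split per
record.  A rough sketch under the old org.apache.hadoop.mapred API; the
class name is invented and it assumes line-delimited records:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OneRecordPerMapInputFormat extends TextInputFormat {

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits)
          throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
          Path path = file.getPath();
          FileSystem fs = path.getFileSystem(job);
          BufferedReader in =
              new BufferedReader(new InputStreamReader(fs.open(path)));
          long offset = 0;
          String line;
          while ((line = in.readLine()) != null) {
            long length = line.getBytes().length + 1;  // +1 for the '\n'
            // One split per record: each map task gets exactly one line.
            splits.add(new FileSplit(path, offset, length, (String[]) null));
            offset += length;
          }
          in.close();
        }
        return splits.toArray(new InputSplit[splits.size()]);
      }
    }

The catch is visible in the sketch: the client reads the entire input just
to find record boundaries, and the result is exactly the one-task-per-record
regime whose startup overhead this thread is about.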



