hadoop-common-user mailing list archives

From Johan Oskarsson <jo...@oskarsson.nu>
Subject Re: Hadoop overhead
Date Wed, 16 Jan 2008 10:50:53 GMT
I simply followed the wiki's advice that "The right level of parallelism 
for maps seems to be around 10-100 maps/node", 
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces

We have 8 cores in each machine, so perhaps 100 mappers ought to be 
about right. It's set to 157 in the config, but Hadoop used ~200 for the 
job; I don't know why. Fewer tasks would of course help in this case, 
but what about when we process large datasets? Especially if a mapper fails.

I also set up the reducers at ~1 per core, slightly fewer.
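
For reference, this is roughly how the job gets configured -- a minimal 
sketch against the old org.apache.hadoop.mapred API, with placeholder 
class and path names rather than our real job. Note that setNumMapTasks() 
is only a hint: the actual number of map tasks is decided by the input 
splits, which is presumably why a configured 157 still came out as ~200 
at runtime.

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapred.FileInputFormat;
   import org.apache.hadoop.mapred.FileOutputFormat;
   import org.apache.hadoop.mapred.JobClient;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapred.SequenceFileInputFormat;

   // Placeholder job setup, not our real code.
   public class TaskCountSketch {
     public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(TaskCountSketch.class);
       conf.setJobName("task-count-sketch");

       conf.setInputFormat(SequenceFileInputFormat.class);
       FileInputFormat.setInputPaths(conf, new Path("/data/input"));
       FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

       // Only a hint to the framework: the real map count comes from the
       // number of input splits the InputFormat produces.
       conf.setNumMapTasks(157);

       // Honoured exactly: ~1 reducer per core across the cluster.
       conf.setNumReduceTasks(70);

       JobClient.runJob(conf);
     }
   }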

/Johan

Ted Dunning wrote:
> Why so many mappers and reducers relative to the number of machines you
> have?  This just causes excess heartache when running the job.
> 
> My standard practice is to run with a small factor larger than the number of
> cores that I have (for instance 3 tasks on a 2 core machine).  In fact, I
> find it most helpful to have the cluster defaults rule the choice except in
> a few cases where I want one reducer or a few more than the standard 4
> reducers.
> 
> 
> On 1/15/08 9:15 AM, "Johan Oskarsson" <johan@oskarsson.nu> wrote:
> 
>> Hi.
>>
>> I believe someone posted about this a while back, but it's worth
>> mentioning again.
>>
>> I just ran a job on our 10-node cluster where the input data was
>> ~70 empty sequence files. With our default settings this ran ~200
>> mappers and ~70 reducers.
>>
>> The job took almost exactly two minutes to finish.
>>
>> How can we reduce this overhead?
>>
>> * Pick number of mappers and reducers in a more dynamic way,
>>    depending on the size of the input?
>> * JVM reuse, one JVM per job instead of one per task?
>>
>> Any other ideas?
>>
>> /Johan
> 
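
On the JVM reuse idea quoted above: a per-job knob for this appears in 
later Hadoop releases as mapred.job.reuse.jvm.num.tasks, exposed through 
JobConf.setNumTasksToExecutePerJvm. A minimal sketch, assuming a Hadoop 
version that has that setting:

   import org.apache.hadoop.mapred.JobConf;

   public class JvmReuseSketch {
     // Sketch only: applies the JVM-reuse setting from later Hadoop releases.
     public static JobConf withJvmReuse(JobConf conf) {
       // -1 = reuse a task JVM for an unlimited number of tasks of the
       // same job; a positive N runs up to N tasks per JVM before it exits.
       conf.setNumTasksToExecutePerJvm(-1);
       // Equivalent property form:
       // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
       return conf;
     }
   }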

