hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [jira] Commented: (HADOOP-38) default splitter should incorporate fs block size
Date Tue, 14 Feb 2006 23:06:43 GMT
Eric Baldeschwieler wrote:
> You may simply want to specify the input size per job (maybe in  
> blocks?) and let the framework sort things out.

You can achieve that in my proposal by increasing the minSplitSize to 
something larger than the block size.  So that's already possible.  All 
that I'm suggesting is that the default is to try to make things one 
block per split, unless that results in too few splits.
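
To make the default described above concrete, here is a minimal sketch in
Python (used for brevity; it is not taken from the Hadoop source). The message
does not spell out a formula, so the rule below is an assumption that is merely
consistent with the description: aim for one block per split, shrink splits
when that would give too few of them, and never go below minSplitSize. The
names `split_size`, `goal_size` and `desired_splits` are hypothetical.

```python
MIB = 1024 * 1024


def split_size(total_size, block_size, min_split_size, desired_splits):
    """Pick a split size in bytes (illustrative rule, assumed here).

    goal_size is the size each split would need for the input to yield
    roughly `desired_splits` splits.
    """
    goal_size = total_size // max(desired_splits, 1)
    # One block per split by default; smaller if a block per split would
    # give too few splits; never smaller than min_split_size.
    return max(min_split_size, min(goal_size, block_size))


# 1 GiB of input with 32 MiB blocks (sizes chosen only for illustration).
total = 1024 * MIB
block = 32 * MIB

default = split_size(total, block, 1, desired_splits=4)
more_splits = split_size(total, block, 1, desired_splits=64)
big_splits = split_size(total, block, 128 * MIB, desired_splits=4)
```

Under these assumptions `default` is one block, `more_splits` is half a block
because a block per split would give only 32 splits, and `big_splits` shows
how raising minSplitSize above the block size sets the input size per task.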

> A possible optimization would be to read discontinuous blocks into one
> map job if you want to pump several blocks worth of data into each
> job.  Given the map/reduce mechanism, this should work, yes?

I think you mean multiple blocks per task.  That has potential 
restartability issues, since it's a lot like using bigger blocks: a 
failed task has more work to redo.  And the tasktracker still has to 
have some representation of every block in memory, so I'm not sure it 
makes the data structure much smaller, which is my primary concern with 
large numbers of tasks.

A reason to usually keep the number of map tasks much greater than the 
number of CPUs is to reduce the impact of restarting map tasks, since 
most user computation is done while mapping.
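
The restart argument can be illustrated with a small piece of arithmetic,
sketched here in Python. It assumes, purely for illustration, that all map
tasks do the same amount of work; the function name `rerun_fraction` is
hypothetical.

```python
def rerun_fraction(num_map_tasks, failed_tasks=1):
    """Fraction of the total map work that is redone when `failed_tasks`
    equal-sized map tasks have to be restarted from scratch."""
    if num_map_tasks <= 0:
        raise ValueError("num_map_tasks must be positive")
    return min(failed_tasks, num_map_tasks) / num_map_tasks


few_tasks = rerun_fraction(4)     # about one task per CPU on a 4-CPU cluster
many_tasks = rerun_fraction(400)  # many more tasks than CPUs
```

With four tasks a single restart repeats a quarter of the map work; with four
hundred it repeats a quarter of a percent.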

A reason to usually keep the number of reduce tasks only slightly 
greater than the number of CPUs is to use all resources while not 
generating too many output files, since these might be combined with the 
output of other maps to form inputs, and we'd rather have fewer large 
inputs to split up than more smaller ones.
