hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-38) default splitter should incorporate fs block size
Date Tue, 14 Feb 2006 21:19:34 GMT
You may simply want to specify the input size per job (maybe in  
blocks?) and let the framework sort things out.

A possible optimization would be to read discontinuous blocks into  
one map job if you want to pump several blocks worth of data into  
each job.  Given the map/reduce mechanism, this should work, yes?

On Feb 14, 2006, at 12:49 PM, Bryan Pendleton (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-38? 
> page=comments#action_12366381 ]
>
> Bryan Pendleton commented on HADOOP-38:
> ---------------------------------------
>
> The idea sounds sound, but is blocksize the best unit? There's a  
> certain overhead for each additional task added to a job - for jobs  
> with really large input, this could cause really large task lists.  
> Is there going to be any code for pre-replicating blocks? Maybe  
> sequences, so there'd be a natural "first choice" node for many  
> chunkings of larger than one block? Obviously, as datanodes come  
> and go this might not always work ideally, but it could help in the  
> 80% case.
>
>> default splitter should incorporate fs block size
>> -------------------------------------------------
>>
>>          Key: HADOOP-38
>>          URL: http://issues.apache.org/jira/browse/HADOOP-38
>>      Project: Hadoop
>>         Type: Improvement
>>   Components: mapred
>>     Reporter: Doug Cutting
>
>>
>> By default, the file splitting code should operate as follows.
>>   inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
>>   output is <file,start,length>*
>>   totalSize = sum of all file sizes;
>>   desiredSplitSize = totalSize / numMapTasks;
>>   if (desiredSplitSize > fsBlockSize)             /* new */
>>     desiredSplitSize = fsBlockSize;
>>   if (desiredSplitSize < minSplitSize)
>>     desiredSplitSize = minSplitSize;
>>   chop input files into desiredSplitSize chunks & return them
>> In other words, the numMapTasks is a desired minimum.  We'll try  
>> to chop input into at least numMapTasks chunks, each ideally a  
>> single fs block.
>> If there's not enough input data to create numMapTasks tasks, each  
>> with an entire block, then we'll permit tasks whose input is  
>> smaller than a filesystem block, down to a minimum split size.
>> This handles cases where:
>>   - each input record takes a lot of time to process.  In this  
>> case we want to make sure we use all of the cluster.  Thus it is  
>> important to permit splits smaller than the fs block size.
>>   - input i/o dominates.  In this case we want to permit the  
>> placement of tasks on hosts where their data is local.  This is  
>> only possible if splits are fs block size or smaller.
>> Are there other common cases that this algorithm does not handle  
>> well?
>> The part marked 'new' above is not currently implemented, but I'd  
>> like to add it.
>> Does this sound reasonble?
>
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators:
>    http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>    http://www.atlassian.com/software/jira
>


Mime
View raw message