hadoop-common-user mailing list archives

From Owen O'Malley <owen.omal...@gmail.com>
Subject Re: How to speed up the copy phrase?
Date Fri, 28 Aug 2009 01:02:44 GMT
There is an index with the offset of each reduce's first byte. The index is
written to disk, but is also cached by the task tracker.

-- Owen
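
[Editor's illustration, not part of the original message.] The idea above can be sketched roughly as follows: a map writes all partitions into one file, records the offset of each reduce's first byte, and a server later uses that index to return only the byte range a given reducer asks for. This is a minimal in-memory sketch, not Hadoop's actual on-disk index format; the names write_map_output and read_partition are hypothetical.

```python
import io

def write_map_output(partitions):
    """Concatenate per-reduce partitions into one blob.

    Returns (blob, offsets), where offsets[i] is the position of the
    first byte belonging to reduce i. Hypothetical helper, for illustration.
    """
    data = io.BytesIO()
    offsets = []
    for part in partitions:
        offsets.append(data.tell())  # first byte of this reduce's partition
        data.write(part)
    return data.getvalue(), offsets

def read_partition(blob, offsets, reduce_id):
    """Return only the bytes for one reduce, using the offset index."""
    start = offsets[reduce_id]
    # A partition ends where the next one begins, or at end of file.
    if reduce_id + 1 < len(offsets):
        end = offsets[reduce_id + 1]
    else:
        end = len(blob)
    return blob[start:end]

if __name__ == "__main__":
    blob, offsets = write_map_output([b"aaa", b"", b"bbbbb"])
    print(offsets)
    print(read_partition(blob, offsets, 2))
```

Keeping the offsets in memory (as the task tracker cache does, per the message above) would avoid re-reading the index from disk for every reducer request; the sketch only models the lookup itself.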

On Aug 27, 2009, at 17:12, George Porter <gmporter@gmail.com> wrote:

> Interesting.  In this case, how does Jetty dole out the proper
> partitions of the intermediate data to the appropriate reducers if
> they are located in the same files?
>
> Thanks,
> George
>
> On Thu, Aug 27, 2009 at 11:31 AM, Arun C Murthy <acm@yahoo-inc.com> wrote:
>>
>> On Aug 24, 2009, at 5:49 PM, Aaron Kimball wrote:
>>
>>> If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe 40
>>> if you want it to run in two waves. (Assuming 1 core/node. Multiply by N for
>>> N cores...) As it is, each node has 500-ish map tasks that it has to read
>>> from and for each of these, it needs to generate 500 separate reduce task
>>> output files.  That's going to take Hadoop a long time to do.
>>
>> Maps do not produce one output file per reduce; the entire map-output is in
>> a single file.
>>
>> Arun
>>