hadoop-common-user mailing list archives

From "Saptarshi Guha" <saptarshi.g...@gmail.com>
Subject Re: Combiner run specification and questions
Date Thu, 08 Jan 2009 06:32:37 GMT
> So as long as the correctness of the computation doesn't
> rely on a transformation performed in the combiner, it should be OK. In

Right, I had the same thought.
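To make the point above concrete, here is a minimal sketch (plain Python, not Hadoop code) of the property being discussed: the combiner may run zero or more times on the map output, so the final reduce result must not depend on whether, or how often, it ran. A sum has this property because combine and reduce agree.

```python
def combine(values):
    # Combiner: collapses a batch of values for a key into one partial value.
    return [sum(values)]

def reduce_fn(values):
    # Reducer: produces the final result for the key.
    return sum(values)

def run(values, combiner_passes):
    # Simulate the framework applying the combiner 0..n times
    # to the map output before the reduce sees it.
    for _ in range(combiner_passes):
        values = combine(values)
    return reduce_fn(values)

vals = [1, 2, 3, 4, 5]
# Same answer whether the combiner runs zero, one, or several times.
assert run(vals, 0) == run(vals, 1) == run(vals, 3) == 15
```

If combine and reduce disagreed (say, the combiner dropped values), the zero-runs case would silently give a different answer, which is exactly the correctness risk above.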



>
> However, this restriction limits the scalability of your solution. It might
> be necessary to work around R's limitations by breaking up large
> computations into intermediate steps, possibly by explicitly instantiating
> and running the combiner in the reduce.
>

So, I explicitly call the combiner? However, at times the reducer
needs all the values, so calling the combiner would not always work
here. That said, if I recall correctly (from reading the Google paper),
one does not expect a **humongous** number of values for a single key.
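For the cases where it does work, "explicitly instantiating and running the combiner in the reduce" could look roughly like this sketch (plain Python, hypothetical helper names): fold the value stream in fixed-size chunks so a memory-limited runtime (such as R driven over streaming) never holds all values at once. This only applies when the computation is incrementally combinable; a reduce that genuinely needs all values at once cannot be chunked this way.

```python
def combine(partial, chunk):
    # Assumed associative combine step; here, a running sum.
    return partial + sum(chunk)

def reduce_in_chunks(values, chunk_size=1000):
    # Run the combine step by hand inside the reduce, one chunk at a time,
    # so at most chunk_size values are in memory for the key.
    partial = 0
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            partial = combine(partial, chunk)
            chunk = []
    return combine(partial, chunk)  # fold in the final partial chunk
```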


>> 1) I am guaranteed a reducer.
>> So,
>>>
>>> The combiner, if defined, will run zero or more times on records emitted
>>> from the map, before being fed to the reduce.
>>
>>
>> This zero case possibility worries me. However you mention, that it occurs
>>>
>>> collector spills in the map
>>
>> I have noticed this happening - what does 'spilling' mean?
>
> Records emitted from the map are serialized into a buffer, which is
> periodically written to disk when it is (sufficiently) full. Each of these
> batch writes is a "spill". In casual usage, it refers to any time when
> records need to be written to disk. The merge of intermediate files into the
> final map output and merging in-memory segments to disk in the reduce are
> two examples. -C
>
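A toy sketch of the mechanism described above (plain Python, not the actual Hadoop implementation): map outputs accumulate in an in-memory buffer; when the buffer is sufficiently full it is sorted and written out as one "spill", and the spill files are merged into the final map output at the end.

```python
import heapq

class SpillBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.spills = []   # stands in for sorted on-disk spill files

    def emit(self, record):
        # Called for each record the map emits.
        self.buffer.append(record)
        if len(self.buffer) >= self.capacity:
            self.spill()

    def spill(self):
        # One batch write: sort the buffer and flush it as a spill file.
        if self.buffer:
            self.spills.append(sorted(self.buffer))
            self.buffer = []

    def merge(self):
        # Final merge of the sorted spills into the map output.
        self.spill()
        return list(heapq.merge(*self.spills))
```

With a capacity of 2, emitting 3, 1, 2 produces one spill of [1, 3], a final flush of [2], and a merged output of [1, 2, 3] - two spills in total, which is why a combiner defined on the job may run more than once (or not at all, if nothing spills early).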

Thanks for the explanation.
Regards
Saptarshi
