hadoop-common-user mailing list archives

From "Saptarshi Guha" <saptarshi.g...@gmail.com>
Subject Re: Combiner run specification and questions
Date Thu, 08 Jan 2009 06:32:37 GMT
> So as long as the correctness of the computation doesn't
> rely on a transformation performed in the combiner, it should be OK. In

Right, I had the same thought.

> However, this restriction limits the scalability of your solution. It might
> be necessary to work around R's limitations by breaking up large
> computations into intermediate steps, possibly by explicitly instantiating
> and running the combiner in the reduce.

So, I explicitly call the combiner? However, at times the reducer
needs all the values, so calling the combiner would not always work
here. However, if I recall correctly (from reading the Google paper),
one does not expect a **humongous** number of values for a single key.
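As a sketch of what "explicitly running the combiner in the reduce" could look like: the reducer folds incoming values in fixed-size batches using the same associative combine function, so it never holds more than a batch of partial results in memory. This is not Hadoop API code; the method and class names here are illustrative only, and the pattern assumes the combine operation is associative.

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class CombineInReduce {
    // Illustrative sketch: fold an iterator of values in batches,
    // applying an associative combine function (the same logic a
    // combiner would run) so the reduce never materializes all values.
    static <V> V reduceWithCombiner(Iterator<V> values,
                                    BinaryOperator<V> combine,
                                    int batchSize) {
        Deque<V> batch = new ArrayDeque<>();
        while (values.hasNext()) {
            batch.addLast(values.next());
            if (batch.size() == batchSize) {
                // Collapse the batch into a single partial result.
                V acc = batch.pollFirst();
                while (!batch.isEmpty()) {
                    acc = combine.apply(acc, batch.pollFirst());
                }
                batch.addLast(acc);
            }
        }
        // Collapse whatever remains.
        V acc = batch.pollFirst();
        while (!batch.isEmpty()) {
            acc = combine.apply(acc, batch.pollFirst());
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> vals = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        int sum = reduceWithCombiner(vals.iterator(), Integer::sum, 3);
        System.out.println(sum); // 28
    }
}
```

Of course, this only helps for combinable (associative) operations; a reduce that genuinely needs every value at once cannot be broken up this way, which is exactly the limitation above.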

>> 1) I am guaranteed a reducer.
>> So,
>>> The combiner, if defined, will run zero or more times on records emitted
>>> from the map, before being fed to the reduce.
>> This zero case possibility worries me. However you mention, that it occurs
>>> collector spills in the map
>> I have noticed this happening - what does 'spilling' mean?
> Records emitted from the map are serialized into a buffer, which is
> periodically written to disk when it is (sufficiently) full. Each of these
> batch writes is a "spill". In casual usage, it refers to any time when
> records need to be written to disk. The merge of intermediate files into the
> final map output and merging in-memory segments to disk in the reduce are
> two examples. -C
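For what it's worth, the size of that in-memory buffer and the fill threshold that triggers a spill are tunable. In Hadoop of this era the relevant job configuration properties are, if I'm not mistaken, io.sort.mb and io.sort.spill.percent; something along these lines (values here are just examples, not recommendations):

```
<property>
  <name>io.sort.mb</name>
  <value>200</value>  <!-- buffer size in MB for map output serialization -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value> <!-- fraction of the buffer filled before a spill starts -->
</property>
```

A larger buffer means fewer spills, hence fewer chances for the combiner to run zero times versus many.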

Thanks for the explanation.
