hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: When exactly is combiner invoked?
Date Thu, 28 Jan 2010 14:13:20 GMT
Hi Le,
I don't think mapreduce can completely combine all the records with the same key into one
record. one situation is when "min.num.spills.for.combine" is too high, while you get less
records than that which share the same key, the combiner will not be invoked on these records.

Actually, I think mapreduce is doing a merge sort and at the last round of merging, it load
one bucket from each of the spilled files into memory. Combiner could only see and combine
the records reside in memory currently. If a record comes after the previous part has been
written back to disk, there is no chance for it to be combined with the previous part. 


----- 原始邮件 ----
发件人: Le Zhao <lezhao@cs.cmu.edu>
收件人: common-user@hadoop.apache.org
发送日期: 2010/1/27 (周三) 5:23:51 下午
主   题: Re: When exactly is combiner invoked?

Gang, Jeff and Amogh,

Thanks for all the replies.

It seems no matter how many times internally combiners are invoked, the output for one specific
map task will be *totally* partitioned and combined.  Then, the data is shuffled/sent to reducers.

That's good to know, because if combining isn't fully done on one map's output, there might
be problems.  (E.g. for indexing a document, <word, docid> pairs are the mapper's output,
and if records for the same document may end up not fully combined.  The inverted index may
end up having duplicate records for the same <word, docid> tuple.  So reducer has to
do extra work.)

Also, good idea to keep combiner light weight.


Amogh Vasekar wrote:
> Hi,
> To elaborate a little on Gang's point, the buffer threshold is limited by io.sort.spill.percent,
during which spills are created. If the number of spills is more than min.num.spills.for.combine,
combiner gets invoked on the spills created before writing to disk.
> I'm not sure what exactly you intend to say by "finish processing an input record". Typically,
the processing (map) ends with a output.collect.
> Amogh


View raw message