hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-3594) Guaranteeing that combiner is called at least once
Date Thu, 19 Jun 2008 00:46:18 GMT
In pymapred.py we use an "in-mapper" equivalent of a combiner for aggregating
the counts.  Just as Doug suggests, it is based on a large hashtable and
is probably more efficient than using a standard combiner.
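
A minimal sketch of the idea, assuming a word-count style job. This is not the actual pymapred.py code, and the function names are hypothetical. The mapper keeps its counts in a local hashtable and emits one pair per distinct key only after it has consumed its whole input split, so nothing is serialized until the end:

```python
from collections import defaultdict


def map_with_in_mapper_combining(lines):
    """Hypothetical mapper: aggregate counts in a local hashtable."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    # Emit once per distinct key, after the whole split has been read.
    for word, count in counts.items():
        yield word, count


def reduce_counts(pairs):
    """Hypothetical reducer: sum the partial counts for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)
```

The trade-off is memory: the hashtable has to hold every distinct key seen in the split, so a real implementation would probably flush it when it grows past some size limit.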

On 6/18/08 4:31 PM, "Doug Cutting (JIRA)" <jira@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/HADOOP-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606147#action_12606147 ]
> Doug Cutting commented on HADOOP-3594:
> --------------------------------------
> I spoke with Ben, and he argued that Pig could implement its combiner in its
> mapper to address this, and that would probably be faster too, since a HashMap
> could be used to buffer tuples and they would not need to be serialized and
> deserialized, as they are with a combiner.
>> Guaranteeing that combiner is called at least once
>> --------------------------------------------------
>>                 Key: HADOOP-3594
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-3594
>>             Project: Hadoop Core
>>          Issue Type: Bug
>>            Reporter: Olga Natkovich
>>             Fix For: 0.18.0
>> In 0.18, Hadoop decides how many times to call the combiner on both the map
>> and reduce sides. The possible number is between 0 and N.
>> While having multiple invocations can be useful, not invoking the combiner at
>> all can have serious consequences for a range of functions called algebraic
>> (http://classweb.gmu.edu/kersch/inft864/Readings/Shoshani/DataCube/DataCubeTechReport.pdf).
>> The main properties of such functions are that the intermediate
>> and final computations are different and that the first invocation transforms
>> the data to a different form. The most common example of this is the AVERAGE
>> function. While it is possible to work around this issue by annotating each
>> tuple, it seems that it would be much easier and faster if Hadoop always
>> guaranteed at least a single invocation.
>> Not having this guarantee will break all sorts of existing combiners.
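
To illustrate the AVERAGE case and the "annotating each tuple" workaround mentioned above, here is a rough sketch in plain Python. It does not use the Hadoop API, and all function names are hypothetical. If the mapper already emits (sum, count) pairs, the combiner's input and output have the same form, so the result is the same whether the combiner runs 0, 1, or N times:

```python
def map_emit(value):
    # Annotate each value up front as a (sum, count) pair.
    return (value, 1)


def combine(partials):
    # Input and output are both (sum, count), so this step is optional
    # and can safely be applied any number of times.
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def reduce_average(partials):
    # Final step: merge whatever partials arrive, then divide.
    total, count = combine(partials)
    return total / count


values = [1, 2, 3, 10]
pairs = [map_emit(v) for v in values]

# Combiner never invoked.
avg_zero = reduce_average(pairs)
# Combiner invoked once per (hypothetical) map-side spill.
avg_once = reduce_average([combine(pairs[:3]), combine(pairs[3:])])
# Combiner invoked again on its own output.
avg_twice = reduce_average([combine([combine(pairs[:3]), combine(pairs[3:])])])
```

All three results agree. By contrast, a combiner that turned raw values into averages would only be correct if it ran exactly once and the reducer knew the counts, which is the situation the issue describes.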
