pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PIG-274) changing combiner behavior past hadoop 18
Date Mon, 26 Jan 2009 20:25:59 GMT

     [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Olga Natkovich resolved PIG-274.

    Resolution: Fixed

This work has been done

> changing combiner behavior past hadoop 18
> -----------------------------------------
>                 Key: PIG-274
>                 URL: https://issues.apache.org/jira/browse/PIG-274
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
> In hadoop 18, the way commbiners are handled is changing. The hadoop team agreed to keep
things backward compatible for now but will depricate the current behavior in the future (likely
in hadoop 19) so pig needs to adjust to the new behavior. This should be done in the post
2.0 code base.
> Old behavior: combiner is called once and only once per map task
> New behavior: combiner can be run 0 or more times on both map and reduce sides. 0 times
happens if only a single <K, V> fits into sort buffer. Multiple time can happen in case
of a hierarchical merge.
> The main issue that causes problem for pig is that we would not know in advance whether
the combiner will run 0,1 or more times. This causes several issues:
> (1) Lets assume that we compute count. If we enable combiner, reducer expects to get
numbers not values as its input. Hadoop team suggested that we could annotate each tuple with
a byte that tells if it want through combiner. This could be expensive computatinally as well
as will use extra memory. One things to notice is that some algebraics (like SUM, MIN, MAX)
don't care whether the data was precombined as they always to the same thing. Perhaps we can
make algebaic functions declare if they care or not. Then we only anotate the ones that need
> (2) Since combiner can be called 1 or more times, getInitial and getIntermediate have
to do the same thing. So again, we need to change the interface to reflcat that.
> (3) current combiner code assumes that it only works with 1 input. When it runs on the
reduce side, it can be dealing with tuples from multiple inputs. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message