pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2774) Fix merge join to work with many duplicate left keys
Date Thu, 28 Jun 2012 21:21:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403480#comment-13403480
] 

Thejas M Nair commented on PIG-2774:
------------------------------------

bq. we might have other operations queued up after the join 

In 2nd approach, the operations within map task don't complicate things. But to handle a reduce
after the merge-join, we would need to introduce another map task that does a union of merge-join
results. For example, if the merge-join is followed by a group+agg , then the follow transformation
to plan would be needed. 
Map(Merge-join + group+agg ops) + Reduce(group+agg ops)  
     => Map (merge-join wave 1 + group+agg ops)  + Map (merge-join wave 2 + group+agg opps)
+ Map(union of 1st 2 maps) + Reduce(group+agg ops)

This transformation can't happen dynamically - we can't decide to skip the reduce while in
the map phase. 


To handle this case dynamically, looks like the first approach is one that actually would
work! The user or a metadata system possibly identify the skew problem and recommend using
a 'skew-merge' join next time query is run on similar data.


                
> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>
>                 Key: PIG-2774
>                 URL: https://issues.apache.org/jira/browse/PIG-2774
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Aneesh Sharma
>
> A merge join can throw an OOM error if the number of duplicate left tuples is large as
it accumulates all of them in memory. There are two solutions around this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side index.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message