pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2774) Fix merge join to work with many duplicate left keys
Date Thu, 28 Jun 2012 20:46:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403461#comment-13403461

Thejas M Nair commented on PIG-2774:

bq. I'd like to avoid having the user encode these details in the pig script. 

Floating some more ideas -

A more performant way of doing this would be to stop accumulating tuples for a join key value
from left relation into memory when a certain memory threshold is exceeded. Once join of these
tuples against the right relation is done, discard the accumulated left rel tuples for the
join key and and load a new set, go back to the start of relations with this join key in right
relation and continue.
To go back more efficiently to the start of join key in right relation we can keep track of
its record offset. This approach will have no additional writes and have less IO overall.
The right relation block hopefully gets in to OS cache.
But this approach can result in some map tasks being much slower than others.

Another option is to write the left side join key values that didn't fit into memory onto
hdfs in separate files, one file for each chunch that is expected to fit into memory, and
have another round of MR job do merge join on these files. ( I think hive has a skew join
impl on similar lines). This would involve changing the MR plan at runtime.

> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>                 Key: PIG-2774
>                 URL: https://issues.apache.org/jira/browse/PIG-2774
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Aneesh Sharma
> A merge join can throw an OOM error if the number of duplicate left tuples is large as
it accumulates all of them in memory. There are two solutions around this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side index.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message