pig-dev mailing list archives

From "Satish Subhashrao Saley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner
Date Mon, 01 Oct 2018 21:32:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634662#comment-16634662 ]

Satish Subhashrao Saley commented on PIG-5342:
----------------------------------------------

Updated patch.

> Add setting to turn off bloom join combiner
> -------------------------------------------
>
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were the join key. Combining involved doing a distinct on the bag of values, which has memory issues for more than 10 million records. That needs to be flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right outer join with the smaller dataset on the right. Replicated join only supports left outer join.
>  
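
The memory concern in item 2 can be illustrated with a small sketch. This is not the code from the attached patches; it is one possible reading of "flipped", written in Python for brevity. The function names, the record layout, and the `bloom_index` partitioning are all hypothetical. The first function keeps a whole bag of join keys per bloom filter index in order to deduplicate them; the second assumes records arrive sorted, as a shuffle would deliver them, and only remembers the previous record.

```python
def combine_by_index(pairs):
    """Old layout (assumed): records are (bloom_index, join_key).

    Distinct per index means holding every distinct key for that index
    in memory at once.
    """
    bags = {}
    for index, join_key in pairs:
        bags.setdefault(index, set()).add(join_key)
    return bags


def combine_distinct(sorted_records):
    """Flipped layout (assumed): records are (join_key, bloom_index), sorted.

    Duplicates are adjacent, so only the previous record is kept in memory.
    """
    previous = None
    first = True
    for record in sorted_records:
        if first or record != previous:
            yield record
        previous = record
        first = False


def bloom_index(join_key, num_filters=4):
    # Hypothetical partitioning function; a byte sum keeps it deterministic.
    return sum(join_key.encode()) % num_filters


keys = ["a", "b", "a", "c", "b", "a"]
old = combine_by_index((bloom_index(k), k) for k in keys)
new = list(combine_distinct(sorted((k, bloom_index(k)) for k in keys)))
```

Both paths end up with the same distinct (index, key) pairs; the difference is that the second never materializes a per-index bag, which is the property the description asks for when scaling to billions of records.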



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)