pig-dev mailing list archives

From "Satish Subhashrao Saley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner
Date Mon, 01 Oct 2018 21:32:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Satish Subhashrao Saley updated PIG-5342:
    Attachment: PIG-5342-6.patch

> Add setting to turn off bloom join combiner
> -------------------------------------------
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
> 1) Add a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were the join key. Combining involved doing a distinct on the bag of values, which runs into memory issues beyond 10 million records. The layout needs to be flipped and a distinct combiner used so that it scales to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right outer join with the smaller dataset on the right. Replicated join only supports left outer join.
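
For readers unfamiliar with the change described in item 2, here is a rough Python sketch of the two combiner layouts. It is not taken from the attached patches and does not use any Pig API; NUM_FILTERS, filter_index, combine_by_index and combine_distinct are hypothetical names chosen for illustration, and the CRC-based partitioning is an assumption. The first function keeps a full set of join keys per filter index in memory, while the second only compares neighbouring records in key-sorted input, which is why it should scale better.

```python
import zlib
from itertools import groupby

NUM_FILTERS = 4  # hypothetical number of bloom filter partitions


def filter_index(join_key):
    # Assumed partitioning scheme: a stable hash of the key modulo NUM_FILTERS.
    return zlib.crc32(str(join_key).encode("utf-8")) % NUM_FILTERS


def combine_by_index(join_keys):
    """Old layout: key = filter index, value = join key.

    Deduplicating means holding every distinct join key for an index
    in one in-memory bag (a set here)."""
    bags = {}
    for join_key in join_keys:
        bags.setdefault(filter_index(join_key), set()).add(join_key)
    return sorted((idx, key) for idx, keys in bags.items() for key in keys)


def combine_distinct(join_keys):
    """Flipped layout: key = join key, value = filter index.

    sorted() stands in for the framework's sort by key; after that,
    duplicates are adjacent and one record per run is emitted without
    building a per-index bag."""
    for join_key, _run in groupby(sorted(join_keys)):
        yield filter_index(join_key), join_key


if __name__ == "__main__":
    keys = [3, 1, 3, 2, 1, 3]
    # Both layouts yield the same distinct (index, key) pairs.
    print(sorted(combine_distinct(keys)) == combine_by_index(keys))
```

When every join key is already unique, neither function removes anything, which matches the motivation in item 1 for being able to switch the combiner off.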

This message was sent by Atlassian JIRA
