pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4963) Add a Bloom join
Date Thu, 26 Jan 2017 05:57:24 GMT

    [ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839298#comment-15839298 ]

Rohini Palaniswamy commented on PIG-4963:
-----------------------------------------

bq. But I feel it is clearer if the plan shows a filter + regular local rearrange. The execution
plan of the latter is more understandable.
   Actually, in this case the Bloom filter cannot be applied before local rearrange. Local rearrange
is the operator that separates the record into key and value for the join, and the Bloom filter is then
applied on the key. So it has to be either part of the local rearrange operator, as currently
implemented, or a separate operator after local rearrange, which would be a lot more confusing.
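
The ordering described above can be illustrated with a minimal Python sketch. It is not Pig's
implementation: TinyBloom, local_rearrange and rearrange_with_bloom are hypothetical names, and the
SHA-256 based hashing is an assumption made only to keep the example self-contained. It shows only
that the key has to be extracted first, because the Bloom filter test runs on the key.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter for illustration only (hypothetical, not Pig's class)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = repr(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def local_rearrange(record, key_index):
    """Split a record (a tuple) into (join key, remaining fields)."""
    key = record[key_index]
    value = record[:key_index] + record[key_index + 1:]
    return key, value


def rearrange_with_bloom(records, key_index, bloom):
    """Extract the key first, then drop records whose key the filter rejects."""
    for record in records:
        key, value = local_rearrange(record, key_index)
        if bloom.might_contain(key):
            yield key, value


# Build the filter from the join keys of a small relation, then apply it
# while rearranging the big relation.
bloom = TinyBloom(num_bits=1024, num_hashes=3)
for small_key in (1, 3):
    bloom.add(small_key)

big = [(1, "x"), (2, "y"), (3, "z")]
survivors = list(rearrange_with_bloom(big, 0, bloom))
```

Keys 1 and 3 always survive, since a Bloom filter has no false negatives. Key 2 is normally dropped
but may occasionally pass as a false positive, which the join itself then eliminates.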


> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom function. However,
> using it for big data that required a huge vector size turned out to be very inefficient and
> led to OOM.
>    I had initially calculated that a vector size of 100 million would need a bytearray of around
> 12MB ((100000000 + 7) / 8 = 12500000 bytes), and that this would be the scalar value broadcast,
> which would not take much space. The problem is that the 12MB was written out for every input record
> with BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter
> vector. With POPartialAgg, this runs into OOM issues.
> If we added a bloom join implementation that can be combined with hash or skewed join,
> it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent
> to the bigger tables as a scalar, and the data filtered before the hash or skewed join is applied.
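
The sizing arithmetic in the description above can be checked with a short Python sketch. The
function names and the record count of 1000 are hypothetical, chosen only for illustration. The
model assumes, as the description states, that one full bit vector is written per input record
before aggregation.

```python
def bloom_vector_bytes(vector_size_bits):
    """Bytes needed to hold vector_size_bits bits, rounded up to a whole byte."""
    return (vector_size_bits + 7) // 8


def pre_aggregation_bytes(vector_size_bits, num_records):
    """Total bytes if a full vector is emitted for every input record."""
    return bloom_vector_bytes(vector_size_bits) * num_records


final_size = bloom_vector_bytes(100_000_000)
print(final_size)  # 12500000

# Hypothetical: 1000 input records, each emitting its own full vector.
print(pre_aggregation_bytes(100_000_000, 1000))  # 12500000000
```

The final broadcast vector stays at about 12MB, but under this model the intermediate data grows
linearly with the number of input records, which is consistent with the OOM behaviour reported.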



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
