pig-dev mailing list archives

From "Dmitriy V. Ryaboy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2328) Add builtin UDFs for building and using bloom filters
Date Wed, 19 Oct 2011 23:41:10 GMT

    [ https://issues.apache.org/jira/browse/PIG-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131164#comment-13131164 ]

Dmitriy V. Ryaboy commented on PIG-2328:
----------------------------------------

Right on.

Quick questions from reading the patch:

Correct me if I am wrong, but this doesn't work if you use 2 different bloom filters in a
single task.

Why does the patch use a "contains" test for jenkins and murmur?

In addition to defining a bloom filter by array size and number of hash functions, you can
construct bloom filters from the expected number of elements and the desired accuracy (the
math is on Wikipedia, or you can check my Bloom::Faster Perl module :)). That's probably more
generally useful, since people don't necessarily know how to choose the right number of functions.
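
For reference, the sizing math mentioned above can be sketched in a few lines of Java. This is a minimal illustration of the standard formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the expected number of elements and p is the acceptable false-positive rate. The class and method names (BloomSizing, optimalBits, optimalHashes) are hypothetical and are not taken from the PIG-2328 patch or from Bloom::Faster.

```java
/** Sizing helper for a bloom filter. Hypothetical class, not part of the PIG-2328 patch. */
final class BloomSizing {
    private BloomSizing() {}

    /** Bits needed for n expected elements at false-positive rate p: m = -n ln(p) / (ln 2)^2. */
    static long optimalBits(long n, double p) {
        if (n <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions for n elements in m bits: k = (m / n) ln 2, at least 1. */
    static int optimalHashes(long n, long m) {
        if (n <= 0 || m <= 0) {
            throw new IllegalArgumentException("need n > 0 and m > 0");
        }
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2.0)));
    }

    public static void main(String[] args) {
        long n = 1000;    // expected number of distinct keys
        double p = 0.01;  // acceptable false-positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("bits=" + m + " hashes=" + k);
    }
}
```

With a constructor along these lines, a user would state only how many keys they expect and how many false positives they can tolerate, and the array size and number of functions would be derived for them.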

Why does the Bloom function only work on files? It'd be pretty simple to do
Bloom((bytearray) mybloom.$0, item), where mybloom is the Bloom relation.



                
> Add builtin UDFs for building and using bloom filters
> -----------------------------------------------------
>
>                 Key: PIG-2328
>                 URL: https://issues.apache.org/jira/browse/PIG-2328
>             Project: Pig
>          Issue Type: New Feature
>          Components: internal-udfs
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: 0.10
>
>         Attachments: PIG-bloom.patch
>
>
> Bloom filters are a common way to select a limited set of records before moving data
> for a join or other heavyweight operation.  Pig should add UDFs to support building and using
> bloom filters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
