incubator-crunch-dev mailing list archives

From Rahul <>
Subject Re: BloomFilters in Crunch
Date Tue, 21 Aug 2012 04:31:22 GMT

If you look at the current piece of code, then it can be. But in general
I want it to work on a PCollection. This was just a sample testbed where
I was playing with it.
If it works on a PCollection it will be more useful; I am thinking
of an aggregation function that can do this.

Also, what you said about building filters for a bunch of files/folders
looks like an interesting use case to me. I can add something along the
lines of piggybank and put it there. J
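
To make the aggregation idea concrete, here is a rough sketch of why a Bloom
filter fits an aggregation function: two filters of the same size merge with a
bitwise OR, so partial filters can be combined in any order. It also keys the
filters by file name, as suggested in the quoted message. This is plain Java,
not the Crunch or Hadoop API; SimpleBloomFilter, BloomFilterSketch,
buildPerFile and the hashing scheme are all hypothetical names and choices
made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal Bloom filter; NOT the Hadoop or Crunch class. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a over the UTF-8 bytes, used as the second hash (an assumption). */
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    /** Double hashing: index_i = h1 + i * h2 (mod numBits). */
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** May return false positives, never false negatives. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** Bitwise OR is associative and commutative, so merge order is irrelevant. */
    SimpleBloomFilter merge(SimpleBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("filters must share size and hash count");
        }
        bits.or(other.bits);
        return this;
    }
}

final class BloomFilterSketch {
    /** Builds one filter per file name from {fileName, value} pairs. */
    static Map<String, SimpleBloomFilter> buildPerFile(
            List<String[]> records, int numBits, int numHashes) {
        Map<String, SimpleBloomFilter> filters = new HashMap<>();
        for (String[] record : records) {
            filters.computeIfAbsent(record[0],
                    name -> new SimpleBloomFilter(numBits, numHashes))
                   .add(record[1]);
        }
        return filters;
    }
}
```

In a real pipeline the per-file map would presumably be replaced by a keyed
collection, with merge() acting as the combine step; how that maps onto
Crunch's own types is left open here.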


On 20-08-2012 20:29, Josh Wills wrote:
> Hey Rahul,
> Very cool use case. A thought: isn't the name of the file that
> contains the bloom filter a better key than the boolean? That way, I
> could point the input at an entire directory of files and have it
> build bloom filters for all of them for me.
> It seems useful to me in general, but I'm not quite sure where to put
> it-- it's more useful than an example, but not such a common use case
> that we would put it in core. We need something like the equivalent of
> Pig's piggybank.
> J
> On Mon, Aug 20, 2012 at 12:58 AM, Rahul <> wrote:
>> Hi,
>> Today I tried to create BloomFilters using Crunch; attached is the testcase
>> for the same. I do not know if there is a better way of accomplishing the
>> same.
>> I think APIs to create/load BloomFilters could be a good add-on to Crunch's
>> existing set. If people feel like it could be added then I can make a patch
>> for the same.
>> regards,
>> Rahul
