crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabriel Reid <>
Subject Re: What defines "smaller" on a BloomFilterJoin?
Date Wed, 09 Mar 2016 15:22:08 GMT
Hi Sean,

Good question. Unfortunately, the answer is mostly "it depends".

Your thoughts about the PTable with the smalle number of distinct keys
going on the left side (even if it contains a larger number of records) are
mostly correct. However, don't forget that the full left-side of the join
will be used in the underlying join strategy (i.e. the
DefaultJoinStrategy), so all values for a single key will need to fit in
memory at one time on the left side of the join, regardless of the fact
that it's a bloom filter join. The bloom filter is only used to reduce the
amount of data being sorted for the right side of the join.

My initial guess for the example case that you outlined is that the default
join strategy might still be a better fit (assuming that you use the
smaller of the two tables on the left side of the join). However, I also
think that it all depends on how big the big side is, and how small the
small slide is.

The main thing to consider is that the bloom filter join only filters out
data that is used on the right side of the join, and apart from that it has
all of the same limitations and/or performance implications of the default
join strategy.

- Gabriel

On Wed, Mar 9, 2016 at 4:03 PM, Griffin,Sean <> wrote:

> Crunch,
> The Crunch User Guide states that in a join the smaller collection should
> go on the left and the larger collection should go on the right.  I assume
> this is for performance reasons and seems simple enough.  I am interpreting
> that in this statement that the “smaller” PTable is the one with the least
> number of KV pairs.  However, a Bloom Filter join is to be used when the
> “vast majority of the keys in the right-hand side table will not match the
> keys in the left-hand side table.”  This got me wondering – is the
> definition of “smaller” in this type of join not the number of KV pairs but
> the number of distinct keys?
> I noticed that when a BloomFilterJoin is used in a MRPipeline, a M/R job
> is kicked off to create the bloom filter hashes and write them to HDFS.
> This job is processed on the left-hand side of the join, and of course a
> smaller input data set will make this job execute faster.  But, the output
> of that job would be the smaller set of distinct keys, and it’s that set of
> keys that is used in the join to the right-hand side table.
> As an example in case it’s not clear, if PTable 1 has 1,000 entries with
> 10 distinct keys and PTable 2 has 100 entries with 100 distinct keys, and
> all the keys in PTable 1 match keys in PTable 2, which should go on the
> left vs. right?  In my real-world example, my PTable 1 has millions of
> entries while my PTable 2 has a few thousand, but PTable 2 does have more
> distinct keys.
> Based on what I can gather, even though PTable 1 has less distinct keys it
> should still go on the right side of the join because of its significantly
> larger size, but I wanted to verify.  I was also curious what the impact is
> of swapping these.  As I stated above, I assume this has performance
> implications, but I’m also wondering if it might have memory implications
> as well?
> Thanks,
> *Sean Griffin *
> Director, Revenue Cycle Reporting & Analytics
> | 816-201-1599
> <>
> CONFIDENTIALITY NOTICE This message and any included attachments are from
> Cerner Corporation and are intended only for the addressee. The information
> contained in this message is confidential and may constitute inside or
> non-public information under international, federal, or state securities
> laws. Unauthorized forwarding, printing, copying, distribution, or use of
> such information is strictly prohibited and may be unlawful. If you are not
> the addressee, please promptly delete this message and notify the sender of
> the delivery error by e-mail or you may call Cerner's corporate offices in
> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

View raw message