crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Griffin,Sean" <SGRIF...@CERNER.COM>
Subject What defines "smaller" on a BloomFilterJoin?
Date Wed, 09 Mar 2016 15:03:15 GMT
The Crunch User Guide states that in a join the smaller collection should go on the left and
the larger collection should go on the right.  I assume this is for performance reasons and
seems simple enough.  I am interpreting that in this statement that the “smaller” PTable
is the one with the least number of KV pairs.  However, a Bloom Filter join is to be used
when the “vast majority of the keys in the right-hand side table will not match the keys
in the left-hand side table.”  This got me wondering – is the definition of “smaller”
in this type of join not the number of KV pairs but the number of distinct keys?

I noticed that when a BloomFilterJoin is used in a MRPipeline, a M/R job is kicked off to
create the bloom filter hashes and write them to HDFS.  This job is processed on the left-hand
side of the join, and of course a smaller input data set will make this job execute faster.
 But, the output of that job would be the smaller set of distinct keys, and it’s that set
of keys that is used in the join to the right-hand side table.

As an example in case it’s not clear, if PTable 1 has 1,000 entries with 10 distinct keys
and PTable 2 has 100 entries with 100 distinct keys, and all the keys in PTable 1 match keys
in PTable 2, which should go on the left vs. right?  In my real-world example, my PTable 1
has millions of entries while my PTable 2 has a few thousand, but PTable 2 does have more
distinct keys.

Based on what I can gather, even though PTable 1 has less distinct keys it should still go
on the right side of the join because of its significantly larger size, but I wanted to verify.
 I was also curious what the impact is of swapping these.  As I stated above, I assume this
has performance implications, but I’m also wondering if it might have memory implications
as well?


Sean Griffin
Director, Revenue Cycle Reporting & Analytics<> | 816-201-1599

CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation
and are intended only for the addressee. The information contained in this message is confidential
and may constitute inside or non-public information under international, federal, or state
securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such
information is strictly prohibited and may be unlawful. If you are not the addressee, please
promptly delete this message and notify the sender of the delivery error by e-mail or you
may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

View raw message