crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Sanek <rob...@wealthfront.com>
Subject Using ShardedJoinStrategy with LEFT_OUTER_JOIN JoinType
Date Tue, 29 Sep 2015 20:46:33 GMT
I'm trying to use the ShardedJoinStrategy for a join where ~80% of the
underlying data has the same key; using DefaultJoinStrategy results in one
reducer holding up the entire pipeline. My intuition was that I could use
the ShardedJoinStrategy to work around the duplicate key problem.

The issue that I am having is that when I try to do a left join, an
UnsupportedOperationException is throw. Looking into the code reveals that
a right join will not throw this exception. Why can I not use the
LEFT_OUTER_JOIN
JoinType, but Crunch allows using a RIGHT_OUTER_JOIN? If I restructure my
code to functionally do the same thing but use a right join instead, will I
see performance degradation?

Possibly relevant: I also found this article
<https://labs.spotify.com/2014/12/19/torching-our-reducers-taught-us-this-essential-lesson/>
from Spotify Engineering that described how they did a sharded left join
"from scratch" using Crunch; they claim they are not using
ShardedJoinStrategy because of an "underlying bug".

Note: I posted this
<https://groups.google.com/a/cloudera.org/forum/#!topic/crunch-users/hEQuCMzkT-Y>
on Google Groups as well before seeing that this mailing list was more
active.

-- 
Rob Sanek

Mime
View raw message