crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Sanek <>
Subject Using ShardedJoinStrategy with LEFT_OUTER_JOIN JoinType
Date Tue, 29 Sep 2015 20:46:33 GMT
I'm trying to use the ShardedJoinStrategy for a join where ~80% of the
underlying data has the same key; using DefaultJoinStrategy results in one
reducer holding up the entire pipeline. My intuition was that I could use
the ShardedJoinStrategy to work around the duplicate key problem.

The issue that I am having is that when I try to do a left join, an
UnsupportedOperationException is throw. Looking into the code reveals that
a right join will not throw this exception. Why can I not use the
JoinType, but Crunch allows using a RIGHT_OUTER_JOIN? If I restructure my
code to functionally do the same thing but use a right join instead, will I
see performance degradation?

Possibly relevant: I also found this article
from Spotify Engineering that described how they did a sharded left join
"from scratch" using Crunch; they claim they are not using
ShardedJoinStrategy because of an "underlying bug".

Note: I posted this
on Google Groups as well before seeing that this mailing list was more

Rob Sanek

View raw message