crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: Using ShardedJoinStrategy with LEFT_OUTER_JOIN JoinType
Date Tue, 29 Sep 2015 21:02:49 GMT
Gabriel should probably weigh in, but I believe that reversing your PTables
and using the RIGHT_OUTER_JOIN strategy should work correctly. I believe
the reason the left outer join isn't supported is because each value in the
"left" arg to ShadedJoinStrategy.join(left, right, type) is replicated N
times (where N is the sharding number), whereas each value in the right arg
is only written once (and randomly goes to one of the N sharded replicas
generated for the left-side values.) That means that we expect some of the
left-side values to not have right-side entries just b/c of the way the
algorithm is implemented, and not because of the actual values in the
underlying data, so we wouldn't be able to tell if a left-side value was
missing a right-side entry b/c of the algorithm or b/c of the data.


On Tue, Sep 29, 2015 at 4:46 PM, Robert Sanek <>

> I'm trying to use the ShardedJoinStrategy for a join where ~80% of the
> underlying data has the same key; using DefaultJoinStrategy results in
> one reducer holding up the entire pipeline. My intuition was that I could
> use the ShardedJoinStrategy to work around the duplicate key problem.
> The issue that I am having is that when I try to do a left join, an
> UnsupportedOperationException is throw. Looking into the code reveals
> that a right join will not throw this exception. Why can I not use the LEFT_OUTER_JOIN
> JoinType, but Crunch allows using a RIGHT_OUTER_JOIN? If I restructure my
> code to functionally do the same thing but use a right join instead, will I
> see performance degradation?
> Possibly relevant: I also found this article
> <>
> from Spotify Engineering that described how they did a sharded left join
> "from scratch" using Crunch; they claim they are not using
> ShardedJoinStrategy because of an "underlying bug".
> Note: I posted this
> <!topic/crunch-users/hEQuCMzkT-Y>
> on Google Groups as well before seeing that this mailing list was more
> active.
> --
> Rob Sanek

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>

View raw message