crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Using ShardedJoinStrategy with LEFT_OUTER_JOIN JoinType
Date Wed, 30 Sep 2015 07:25:54 GMT
I'll just weigh in to say that Josh's explanation was exactly correct :-).

There are limitations in terms of which of the outer join strategies
can be performed in most of the alternative JoinStrategy
implementations, all for these kinds of reasons.

I'm curious what the underlying bug was that was encountered at
Spotify -- I don't remember seeing it being reported, although I might
have missed it or forgotten it.

- Gabriel


On Tue, Sep 29, 2015 at 11:02 PM, Josh Wills <jwills@cloudera.com> wrote:
> Gabriel should probably weigh in, but I believe that reversing your PTables
> and using the RIGHT_OUTER_JOIN strategy should work correctly. I believe the
> reason the left outer join isn't supported is because each value in the
> "left" arg to ShadedJoinStrategy.join(left, right, type) is replicated N
> times (where N is the sharding number), whereas each value in the right arg
> is only written once (and randomly goes to one of the N sharded replicas
> generated for the left-side values.) That means that we expect some of the
> left-side values to not have right-side entries just b/c of the way the
> algorithm is implemented, and not because of the actual values in the
> underlying data, so we wouldn't be able to tell if a left-side value was
> missing a right-side entry b/c of the algorithm or b/c of the data.
>
> J
>
> On Tue, Sep 29, 2015 at 4:46 PM, Robert Sanek <robert@wealthfront.com>
> wrote:
>>
>> I'm trying to use the ShardedJoinStrategy for a join where ~80% of the
>> underlying data has the same key; using DefaultJoinStrategy results in one
>> reducer holding up the entire pipeline. My intuition was that I could use
>> the ShardedJoinStrategy to work around the duplicate key problem.
>>
>> The issue that I am having is that when I try to do a left join, an
>> UnsupportedOperationException is throw. Looking into the code reveals that a
>> right join will not throw this exception. Why can I not use the
>> LEFT_OUTER_JOIN JoinType, but Crunch allows using a RIGHT_OUTER_JOIN? If I
>> restructure my code to functionally do the same thing but use a right join
>> instead, will I see performance degradation?
>>
>> Possibly relevant: I also found this article from Spotify Engineering that
>> described how they did a sharded left join "from scratch" using Crunch; they
>> claim they are not using ShardedJoinStrategy because of an "underlying bug".
>>
>> Note: I posted this on Google Groups as well before seeing that this
>> mailing list was more active.
>>
>> --
>> Rob Sanek
>
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills

Mime
View raw message