crunch-user mailing list archives

From Bryan Baugher <bjb...@gmail.com>
Subject Re: Joins and null values
Date Thu, 19 Feb 2015 04:06:55 GMT
Hmm, I'm trying to get the elements of set A which are not in set B.
Set#comm(..) could work but seems like the wrong choice. I'm currently
doing a left outer join and then filtering down to the results that have
only left-side values. Does that seem like the best choice, or are there
more gems hidden in the Crunch library?
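
For readers following along, the "left outer join, then keep only the rows with no right-side match" approach can be sketched in plain Java. This is only an illustration of the semantics on in-memory lists, not Crunch code: the class AntiJoinSketch, the Joined holder, and the methods leftOuterJoin and onlyInLeft are hypothetical names made up for this example, and the join is simplified so that each left element produces exactly one row even if B contains duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AntiJoinSketch {

    /** One row of a left outer join; right is null when nothing in B matched. */
    static final class Joined<T> {
        final T left;
        final T right;

        Joined(T left, T right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Simplified left outer join of A and B on element equality. */
    static <T> List<Joined<T>> leftOuterJoin(List<T> a, List<T> b) {
        Set<T> rightSide = new HashSet<>(b);
        List<Joined<T>> rows = new ArrayList<>();
        for (T x : a) {
            // Every element of A is kept; the right side is filled in only on a match.
            rows.add(new Joined<>(x, rightSide.contains(x) ? x : null));
        }
        return rows;
    }

    /** Keep only the rows with no right-side value: the elements of A not in B. */
    static <T> List<T> onlyInLeft(List<T> a, List<T> b) {
        List<T> result = new ArrayList<>();
        for (Joined<T> row : leftOuterJoin(a, b)) {
            if (row.right == null) {
                result.add(row.left);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y", "z");
        List<String> b = Arrays.asList("y", "q");
        System.out.println(onlyInLeft(a, b)); // prints [x, z]
    }
}
```

The sketch uses null to mean "no match on the right", which is also why null values inside the data itself would be ambiguous in this kind of join. The o.a.c.lib.Set class mentioned in the quoted reply may offer a more direct set-difference operation alongside comm(..); that is worth confirming against its Javadoc rather than taking from this note.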

On Wed Feb 18 2015 at 4:55:29 PM Josh Wills <jwills@cloudera.com> wrote:

> If I got that right, then I think o.a.c.lib.Set does what you want. LMK.
>
> On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Oh, I'm dumb-- you mean you want like a left-join like thing where you
>> can find all values in collection A that aren't in collection B, etc., etc.?
>>
>> J
>>
>> On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
>>> PCollection<T> right, int parallelism) in some way?
>>>
>>> J
>>>
>>> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <bjbq4d@gmail.com> wrote:
>>>
>>>>
>>>> Maybe,
>>>>
>>>> PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>>>
>>>> You could make additional methods for the different join strategies or
>>>> maybe an enum perhaps?
>>>>
>>>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <jwills@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Bryan,
>>>>>
>>>>> I like the idea of throwing exceptions when there are null values in
>>>>> one of the collections in a join. Not sure if there are any other
>>>>> implications of that I should think through first.
>>>>>
>>>>> On the convenience methods for PCollection joins, what do you have in
>>>>> mind?
>>>>>
>>>>> J
>>>>>
>>>>>
>>>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <bjbq4d@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> The other day I ran into the issue mentioned here[1] about joining
>>>>>> data with null values. This took a while to figure out, until I broke
>>>>>> down and went to look at the docs to see if I was doing something
>>>>>> obviously wrong. I used null values because I basically want to join
>>>>>> two PCollections.
>>>>>>
>>>>>> Can Crunch either throw an exception or log errors if I do something
>>>>>> like this? Similarly, would it be possible to get convenience methods
>>>>>> for doing joins on PCollections?
>>>>>>
>>>>>> [1] - http://crunch.apache.org/user-guide.html#joins
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>
>>
>
