crunch-user mailing list archives

From Bryan Baugher <bjb...@gmail.com>
Subject Re: Joins and null values
Date Thu, 19 Feb 2015 05:01:51 GMT
Ahh yes reading the whole doc would help. Thanks!

On Wed Feb 18 2015 at 10:38:56 PM David Ortiz <dortiz@videologygroup.com>
wrote:

>  You most definitely want Set.difference(setA, setB);
>
>
> -------- Original message --------
> From: Bryan Baugher
> Date: 02/18/2015 11:07 PM (GMT-05:00)
> To: user@crunch.apache.org
> Subject: Re: Joins and null values
>
>  Hmm, I'm trying to get the elements of set A which are not in set B.
> Set#comm(..) could work but seems like the wrong choice. I'm currently
> doing a left outer join and then filtering down to the results that have
> only a left-side value. Does that seem like the best choice, or are there
> more gems hidden in the crunch library?
>
> On Wed Feb 18 2015 at 4:55:29 PM Josh Wills <jwills@cloudera.com> wrote:
>
>> If I got that right, then I think o.a.c.lib.Set does what you want. LMK.
>>
>> On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Oh, I'm dumb-- you mean you want like a left-join like thing where you
>>> can find all values in collection A that aren't in collection B, etc., etc.?
>>>
>>>  J
>>>
>>> On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <jwills@cloudera.com> wrote:
>>>
>>>> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
>>>> PCollection<T> right, int parallelism) in some way?
>>>>
>>>>  J
>>>>
>>>> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <bjbq4d@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Maybe,
>>>>>
>>>>>  PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>>>>
>>>>>  You could make additional methods for the different join strategies,
>>>>> or perhaps use an enum?
>>>>>
>>>>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <jwills@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Bryan,
>>>>>>
>>>>>>  I like the idea of throwing exceptions when there are null values
>>>>>> in one of the collections in a join. Not sure if there are any other
>>>>>> implications of that I should think through first.
>>>>>>
>>>>>>  On the convenience methods for PCollection joins, what do you have
>>>>>> in mind?
>>>>>>
>>>>>>  J
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <bjbq4d@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>  The other day I ran into the issue mentioned here [1] about joining
>>>>>>> data with null values. This took a while to figure out, until I broke
>>>>>>> down and went to look at the docs to see if I was doing something
>>>>>>> obviously wrong. I used null values because I basically want to join
>>>>>>> two PCollections.
>>>>>>>
>>>>>>>  Can crunch either throw an exception or log errors if I do
>>>>>>> something like this? Similarly would it be possible to get convenience
>>>>>>> methods for doing joins on PCollections?
>>>>>>>
>>>>>>>  [1] - http://crunch.apache.org/user-guide.html#joins
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>>  Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>
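For illustration only: the original message asks whether a join could throw an exception when it is handed null values. The sketch below shows that fail-fast idea on plain in-memory Java lists. It does not use Crunch and does not describe anything Crunch does today. The class NullGuardSketch and the method requireNoNulls are hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.List;

public class NullGuardSketch {

    // Hypothetical helper: reject an input that contains null before it is
    // handed to a join, so the problem surfaces immediately with a clear
    // message instead of showing up later as confusing join output.
    static <T> List<T> requireNoNulls(List<T> values, String name) {
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i) == null) {
                throw new IllegalArgumentException(
                    name + " contains a null at index " + i + "; refusing to join");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // A clean input passes through unchanged.
        System.out.println(requireNoNulls(Arrays.asList("a", "b"), "left"));

        // An input with a null is rejected up front.
        try {
            requireNoNulls(Arrays.asList("a", null), "right");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether such a check belongs in the library or in user code is exactly the open question Josh raises in his reply; the sketch only shows what the requested behavior would look like from the caller's side.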

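For illustration only: the thread compares two ways of getting the elements of A that are not in B, namely a left outer join followed by a filter, and the set difference from o.a.c.lib.Set. The sketch below reproduces the join-then-filter approach on plain in-memory Java lists so the semantics are easy to see. It does not use Crunch, and SetDifferenceSketch, leftOuterJoin and difference are hypothetical names made up for this example. It assumes the inputs contain no null elements, since null is used here to mark "no match on the right".

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SetDifferenceSketch {

    // Left outer join of two collections on the element itself.
    // Each distinct left element yields one row; the row's value is the
    // matching right element, or null when the right side has no match.
    static <T> List<Map.Entry<T, T>> leftOuterJoin(List<T> left, List<T> right) {
        Set<T> rightSet = new HashSet<>(right);
        List<Map.Entry<T, T>> rows = new ArrayList<>();
        for (T l : new LinkedHashSet<>(left)) { // de-duplicate, keep order
            T match = rightSet.contains(l) ? l : null;
            rows.add(new SimpleImmutableEntry<>(l, match));
        }
        return rows;
    }

    // A minus B: keep only the rows whose right side is missing.
    static <T> List<T> difference(List<T> a, List<T> b) {
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, T> row : leftOuterJoin(a, b)) {
            if (row.getValue() == null) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "b", "c", "d");
        List<String> b = List.of("b", "d", "e");
        System.out.println(difference(a, b)); // prints [a, c]
    }
}
```

Because a missing right side is represented by null, a genuine null element in either input would be indistinguishable from "no match", which is one reason null values and joins mix badly.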