crunch-user mailing list archives

From David Ortiz <dor...@videologygroup.com>
Subject RE: Joins and null values
Date Thu, 19 Feb 2015 04:38:27 GMT
You most definitely want Set.difference(setA, setB);


-------- Original message --------
From: Bryan Baugher
Date: 02/18/2015 11:07 PM (GMT-05:00)
To: user@crunch.apache.org
Subject: Re: Joins and null values

Hmm, I'm trying to get the elements of set A that are not in set B. Set#comm(..) could work
but seems like the wrong choice. I'm currently doing a left outer join and then filtering
down to the results that have only left-side values. Does that seem like the best choice, or
are there more gems hidden in the Crunch library?
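
For illustration, the approach described above (left outer join, then keep only the rows with no right-side value) can be modeled in plain Java. This is a minimal sketch that does not use Crunch at all: the class and method names are hypothetical, and in-memory lists stand in for the two collections.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model only; none of these names are Crunch APIs.
class LeftJoinDifferenceSketch {

    // Left outer join of two key collections: each left element is paired with
    // itself when it also occurs on the right, or with null when it does not.
    static <T> List<Map.Entry<T, T>> leftOuterJoin(List<T> left, List<T> right) {
        Set<T> rightKeys = new HashSet<>(right);
        List<Map.Entry<T, T>> joined = new ArrayList<>();
        for (T element : left) {
            T match = rightKeys.contains(element) ? element : null;
            joined.add(new SimpleEntry<T, T>(element, match));
        }
        return joined;
    }

    // Keep only rows with no right-side match, i.e. the elements of A not in B.
    static <T> List<T> difference(List<T> a, List<T> b) {
        List<T> onlyInA = new ArrayList<>();
        for (Map.Entry<T, T> row : leftOuterJoin(a, b)) {
            if (row.getValue() == null) {
                onlyInA.add(row.getKey());
            }
        }
        return onlyInA;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y", "z");
        List<String> b = Arrays.asList("y", "w");
        System.out.println(difference(a, b)); // prints [x, z]
    }
}
```

Note that this model keeps duplicates from the left side; a true set difference would also deduplicate them.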

On Wed Feb 18 2015 at 4:55:29 PM Josh Wills <jwills@cloudera.com> wrote:
If I got that right, then I think o.a.c.lib.Set does what you want. LMK.

On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <jwills@cloudera.com> wrote:
Oh, I'm dumb-- you mean you want like a left-join like thing where you can find all values
in collection A that aren't in collection B, etc., etc.?

J

On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <jwills@cloudera.com> wrote:
Different from o.a.c.lib.Cartesian.cross(PCollection<U> left, PCollection<T> right,
int parallelism) in some way?

J

On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <bjbq4d@gmail.com> wrote:

Maybe,

PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>

You could make additional methods for the different join strategies or maybe an enum perhaps?
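
To make the enum variant of this proposal concrete, here is a hypothetical plain-Java sketch of what a join over two collections of the same type might return. Nothing here is Crunch code: `CollectionJoinSketch`, `JoinKind`, and `join` are illustrative names, lists stand in for PCollections, and `Map.Entry` stands in for `Pair`. It also assumes each collection is treated as a set, which the proposal itself does not state.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; these names are illustrative and not the Crunch API.
class CollectionJoinSketch {

    enum JoinKind { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // Joins two collections on element equality. A missing side is null.
    static <T> List<Map.Entry<T, T>> join(List<T> left, List<T> right, JoinKind kind) {
        Set<T> leftSet = new LinkedHashSet<>(left);
        Set<T> rightSet = new LinkedHashSet<>(right);
        List<Map.Entry<T, T>> rows = new ArrayList<>();
        for (T element : leftSet) {
            if (rightSet.contains(element)) {
                // Present on both sides: emitted by every join kind.
                rows.add(new SimpleEntry<T, T>(element, element));
            } else if (kind == JoinKind.LEFT_OUTER || kind == JoinKind.FULL_OUTER) {
                rows.add(new SimpleEntry<T, T>(element, null));
            }
        }
        if (kind == JoinKind.RIGHT_OUTER || kind == JoinKind.FULL_OUTER) {
            for (T element : rightSet) {
                if (!leftSet.contains(element)) {
                    rows.add(new SimpleEntry<T, T>(null, element));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y", "z");
        List<String> b = Arrays.asList("y", "w");
        System.out.println(join(a, b, JoinKind.LEFT_OUTER)); // prints [x=null, y=y, z=null]
        System.out.println(join(a, b, JoinKind.FULL_OUTER)); // prints [x=null, y=y, z=null, null=w]
    }
}
```

Because null marks the missing side, a collection that itself contains null elements would be ambiguous under this design.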

On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <jwills@cloudera.com> wrote:
Hey Bryan,

I like the idea of throwing exceptions when there are null values in one of the collections
in a join. Not sure if there are any other implications of that I should think through first.

On the convenience methods for PCollection joins, what do you have in mind?

J


On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <bjbq4d@gmail.com> wrote:
Hi everyone,

The other day I ran into the issue mentioned here [1] about joining data with null values.
It took a while to figure out, until I broke down and went to look at the docs to see if I
was doing something obviously wrong. I used null values because I basically want to join
two PCollections.

Can Crunch either throw an exception or log errors if I do something like this? Similarly,
would it be possible to get convenience methods for doing joins on PCollections?

[1] - http://crunch.apache.org/user-guide.html#joins
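
The thread does not spell out the failure mode, so the following is only one plausible reading of why null values are troublesome in a join, modeled in plain Java rather than Crunch. It assumes an outer join that uses null to mark "no match"; under that assumption, a row whose value really is null cannot be told apart from an unmatched row. All names are hypothetical, and Crunch's actual behavior may differ.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only; none of these names are Crunch APIs.
class NullValueJoinSketch {

    // Left outer join on key: the right value is null when the key has no match.
    static <K, V> List<Map.Entry<K, V>> leftOuterJoin(List<K> leftKeys, Map<K, V> right) {
        List<Map.Entry<K, V>> joined = new ArrayList<>();
        for (K key : leftKeys) {
            joined.add(new SimpleEntry<K, V>(key, right.get(key)));
        }
        return joined;
    }

    // Count rows that look unmatched, i.e. rows whose right value is null.
    static <K, V> int countLooksUnmatched(List<Map.Entry<K, V>> rows) {
        int count = 0;
        for (Map.Entry<K, V> row : rows) {
            if (row.getValue() == null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b");

        // Right side built from a bare collection by attaching null values.
        Map<String, Boolean> nullValued = new HashMap<>();
        nullValued.put("a", null);

        // Same right side, but with a non-null placeholder value instead.
        Map<String, Boolean> placeholder = new HashMap<>();
        placeholder.put("a", Boolean.TRUE);

        System.out.println(countLooksUnmatched(leftOuterJoin(left, nullValued)));  // prints 2
        System.out.println(countLooksUnmatched(leftOuterJoin(left, placeholder))); // prints 1
    }
}
```

With null values, key "a" appears unmatched even though it is present on both sides; a non-null placeholder removes the ambiguity in this model.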



--
Director of Data Science
Cloudera<http://www.cloudera.com>
Twitter: @josh_wills<http://twitter.com/josh_wills>



