From: Josh Wills
Date: Wed, 18 Feb 2015 14:53:05 -0800
Subject: Re: Joins and null values
To: user@crunch.apache.org

Oh, I'm dumb-- you mean you want a left-join-like thing where you can find
all values in collection A that aren't in collection B, etc.?

J

On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <jwills@cloudera.com> wrote:

> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
> PCollection<T> right, int parallelism) in some way?
>
> J
>
> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <bjbq4d@gmail.com> wrote:
>
>> Maybe,
>>
>> PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>
>> You could make additional methods for the different join strategies, or
>> maybe an enum?
>>
>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Bryan,
>>>
>>> I like the idea of throwing exceptions when there are null values in one
>>> of the collections in a join. Not sure if there are any other implications
>>> of that I should think through first.
>>>
>>> On the convenience methods for PCollection joins, what do you have in
>>> mind?
>>>
>>> J
>>>
>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <bjbq4d@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> The other day I ran into the issue mentioned here[1] about joining data
>>>> with null values. This took a while to figure out, until I broke down and
>>>> went to look at the docs to see if I was doing something obviously wrong.
>>>> I used null values because I basically want to join two PCollections.
>>>>
>>>> Can Crunch either throw an exception or log errors if I do something
>>>> like this? Similarly, would it be possible to get convenience methods for
>>>> doing joins on PCollections?
>>>>
>>>> [1] - http://crunch.apache.org/user-guide.html#joins

--
Director of Data Science
Cloudera
Twitter: @josh_wills