Return-Path: Delivered-To: apmail-hadoop-pig-dev-archive@www.apache.org Received: (qmail 54074 invoked from network); 16 Apr 2010 18:29:25 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 16 Apr 2010 18:29:25 -0000 Received: (qmail 23759 invoked by uid 500); 16 Apr 2010 18:29:25 -0000 Delivered-To: apmail-hadoop-pig-dev-archive@hadoop.apache.org Received: (qmail 23740 invoked by uid 500); 16 Apr 2010 18:29:25 -0000 Mailing-List: contact pig-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: pig-dev@hadoop.apache.org Delivered-To: mailing list pig-dev@hadoop.apache.org Received: (qmail 23725 invoked by uid 99); 16 Apr 2010 18:29:25 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Apr 2010 18:29:25 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of hc.busy@gmail.com designates 209.85.221.187 as permitted sender) Received: from [209.85.221.187] (HELO mail-qy0-f187.google.com) (209.85.221.187) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Apr 2010 18:29:19 +0000 Received: by qyk17 with SMTP id 17so1399183qyk.9 for ; Fri, 16 Apr 2010 11:28:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:received:message-id:subject:from:to:content-type; bh=IZho+rT1AR4HeMyLsenm4I0AuDZrRYmfWlcjf5/KWkk=; b=Z2qXXH5qO9wh1MyhqFLF7XRZ3jdaHpsLyuzxcByKqkht5f6f9VemAoVaHGVy9JP5eZ 8B2D0Pqxac9IwXkQEOxZdJ3gvnosdgcb5070OzXC8J6GHz9O8eQ9XX0mhzMqeKoJIxUx 7v/xQZHiShFDyswe2tv0mU9vkJ/vkH5nEABwA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=xEDCNi5puLLElxEMfdMXPTBKJyzf8wMV3cpE7W7TdszFOiXNCms/e2WYR8m6IaRAHc 0ctcx+jdDpvFiXiAfI46u7diPPfYb36ePku5xzr9wCz2MDGtRrtjE70gMUXfxAs39r36 NAi1M5mhbiU3EqUt4IyUA8ZP8JgxtgRsqt0rk= MIME-Version: 1.0 Received: by 10.229.127.218 with HTTP; Fri, 16 Apr 2010 11:28:58 -0700 (PDT) In-Reply-To: <697CDE6B-6D98-4EE2-8842-160BACD1735C@yahoo-inc.com> References: <1C98B349-6FB6-497F-BE33-C63A3D51C3E7@richrelevance.com> <697CDE6B-6D98-4EE2-8842-160BACD1735C@yahoo-inc.com> Date: Fri, 16 Apr 2010 11:28:58 -0700 Received: by 10.229.248.137 with SMTP id mg9mr2841526qcb.33.1271442538157; Fri, 16 Apr 2010 11:28:58 -0700 (PDT) Message-ID: Subject: Re: incorrect Inner Join result for multi column join with null values in join key From: hc busy To: pig-user@hadoop.apache.org, pig-dev@hadoop.apache.org Content-Type: multipart/alternative; boundary=0016e6464330d6e05504845ec864 X-Virus-Checked: Checked by ClamAV on apache.org --0016e6464330d6e05504845ec864 Content-Type: text/plain; charset=ISO-8859-1 Cool! can't wait until CDH has >0.7... Kinda surprised that nobody encountered this problem before... Can I file a ticket? On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates wrote: > > On Apr 16, 2010, at 9:37 AM, hc busy wrote: > > What scott noticed is present when the multiple column join key is used in >> a >> distributed setting. The trap is that when you unit test the behavior/PIG >> script and it does the join right in a local environment and then you get >> F'ed after u deploy to production in distributed enviro. >> > > In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid > that will fix these issues with development and deployment differences. > > Alan. > > > >> >> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey > >wrote: >> >> CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 >>> weeks >>> ago. >>> >>> Joins on multiple columns have null key values matching. >>> >>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, >>> f1:int, f2:int); >>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, >>> f1:int, f2:int); >>> --- both the above are the same >>> >>> dump IN; >>> (,1,2,3) >>> (,-5,5,5) >>> ( ,100,200,300) >>> ( ,0,200,300) >>> (a,4,5,6) >>> (a,7,8,9) >>> (b,10,11,12) >>> (b,11,11,12) >>> >>> IN_NULLS = FILTER IN BY ind is NULL; >>> dump IN_NULLS; >>> (,1,2,3) >>> (,-5,5,5) >>> >>> J1 = JOIN IN by (ind), IN2 by (ind); >>> dump J1; >>> ( ,0,200,300, ,0,200,300) >>> (a,4,5,6,a,4,5,6) >>> (a,4,5,6,a,7,8,9) >>> (a,7,8,9,a,4,5,6) >>> (a,7,8,9,a,7,8,9) >>> ( ,100,200,300, ,100,200,300) >>> (b,10,11,12,b,10,11,12) >>> (b,10,11,12,b,11,11,12) >>> (b,11,11,12,b,10,11,12) >>> (b,11,11,12,b,11,11,12) >>> >>> The above is the expected result of the self-join on the first column. >>> >>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts); >>> dump J2; >>> ( ,0,200,300, ,0,200,300) >>> ( ,100,200,300, ,100,200,300) >>> (a,4,5,6,a,4,5,6) >>> (a,7,8,9,a,7,8,9) >>> (b,10,11,12,b,10,11,12) >>> (b,11,11,12,b,11,11,12) >>> (,-5,5,5,,-5,5,5) >>> (,1,2,3,,1,2,3) >>> >>> >>> The above is incorrect, since it matched the rows that have NULL for the >>> ind field. >>> >>> There is a work-around, by explicitly filtering for null on the join >>> columns before the join, but the above still looks incorrect to me. >>> I suspect it is fixed in 0.6 or later, but I have not been able to find a >>> JIRA ticket or message on this list about this. >>> >>> >>> >>> >>> > --0016e6464330d6e05504845ec864--