pig-dev mailing list archives

From "Koji Noguchi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-5370) Union onschema + columnprune dropping used fields
Date Wed, 28 Nov 2018 06:21:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-5370:
------------------------------
    Attachment: pig-5370-v1.patch

I can think of two different approaches.

(i) Disallow overlapping uids even when they sit at different nesting levels,
 and force IdentityColumn. This way, all uids will be unique.

(ii) Change LOUnion uidMapping logic from (output_uid, input_uid) lists to
 (output_uid, nested_uids).

Attaching a patch that tries (ii). If possible, I'd like to avoid (i), since forcing
 IdentityColumn creates more uids to keep track of.

Taking one relation as an example:
{noformat}
B: (Name: LOForEach Schema: A#36:bag{#37:tuple(a1#9:int,a2#*10*:chararray,a3#*11*:int)},a2#*10*:chararray,a3#*11*:int)
{noformat}
Before the patch, the input_uid list
 36,9,10,11,10,11
was used for uidMapping. Note that uids 10 and 11 each appear twice: once inside
bag A and once at the top level.

After the patch, it'll use nested_uids,
 _36, _36_9, _36_10, _36_11, _10, _11

This way, there won't be any incorrect list lookup.
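
To make the difference concrete, here is a rough Python sketch (not taken from the patch, which is Java). The schema model, `SCHEMA_B`, `flat_uids` and `nested_uids` are all hypothetical names made up for illustration, and the bag's inner tuple (uid 37) is skipped to match the uid lists above. It shows how looking up uid 10 in the flat list lands on the field inside the bag, while the nested keys stay unique.

```python
# Hypothetical model of relation B's schema: each field is (uid, [child fields]).
# Assumption: the bag's inner tuple (uid 37) is skipped, as in the lists above.
SCHEMA_B = [
    (36, [(9, []), (10, []), (11, [])]),  # A: bag{tuple(a1, a2, a3)}
    (10, []),                             # a2 (top level)
    (11, []),                             # a3 (top level)
]

def flat_uids(fields):
    """Depth-first list of bare uids (the pre-patch style of list)."""
    out = []
    for uid, children in fields:
        out.append(uid)
        out.extend(flat_uids(children))
    return out

def nested_uids(fields, prefix=""):
    """Depth-first list of uids prefixed with the uids of their parents."""
    out = []
    for uid, children in fields:
        key = f"{prefix}_{uid}"
        out.append(key)
        out.extend(nested_uids(children, key))
    return out

flat = flat_uids(SCHEMA_B)
nested = nested_uids(SCHEMA_B)

print(flat)    # [36, 9, 10, 11, 10, 11]
print(nested)  # ['_36', '_36_9', '_36_10', '_36_11', '_10', '_11']

# A lookup of top-level a2 (uid 10) in the flat list hits the copy inside the bag.
print(flat.index(10))       # 2
# The nested key for top-level a2 is unambiguous.
print(nested.index("_10"))  # 4
```

With the flat list, the first match for uid 10 is position 2 (a2 inside bag A) rather than position 4 (top-level a2), which is the kind of incorrect list lookup described above.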

 [~daijy], would this approach work? 


> Union onschema + columnprune dropping used fields 
> --------------------------------------------------
>
>                 Key: PIG-5370
>                 URL: https://issues.apache.org/jira/browse/PIG-5370
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>         Attachments: pig-5370-v1.patch
>
>
> After PIG-5312, the query below started failing.
> {code}
> A = load 'input1.txt' as (a1:int, a2:chararray, a3:int);
> B = FOREACH (GROUP A by (a1,a2)) {
>     A_FOREACH = FOREACH A GENERATE a2,a3;
>     GENERATE A, FLATTEN(A_FOREACH) as (a2,a3);
> }
> C = load 'input2.txt' as (A:bag{tuple:(a1: int,a2: chararray,a3:int)},a2: chararray,a3:int);
> D = UNION ONSCHEMA B, C;
> dump D;
> {code}
> {code:title=input1.txt}
> 1       a       3
> 2       b       4
> 2       c       5
> 1       a       6
> 2       b       7
> 1       c       8
> {code}
> {code:title=input2.txt}
> {(10,a0,30),(20,b0,40)} zzz     222
> {code}
> {noformat:title=Expected output}
> ({(10,a0,30),(20,b0,40)},zzz,222)
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}
> {noformat:title=Actual (incorrect) output}
> ({(10,a0,30),(20,b0,40)})    ****ONLY 1 Field ****
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
