pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (PIG-3310) ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations
Date Thu, 30 May 2013 18:58:21 GMT

     [ https://issues.apache.org/jira/browse/PIG-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai resolved PIG-3310.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.12
         Assignee: Clément Stenac
     Hadoop Flags: Reviewed

Patch committed to trunk. Thanks Clément, Koji!
                
> ImplicitSplitInserter does not generate new uids for nested schema fields, leading to
miscomputations
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3310
>                 URL: https://issues.apache.org/jira/browse/PIG-3310
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.11.1
>         Environment: Reproduced on 0.10.1, 0.11.1 and trunk
>            Reporter: Clément Stenac
>            Assignee: Clément Stenac
>             Fix For: 0.12
>
>         Attachments: generate-uid-for-nested-fields.patch
>
>
> Hi,
> Consider the following example
> {code}
> inp = LOAD '$INPUT' AS (memberId:long, shopId:long, score:int);
> tuplified = FOREACH inp GENERATE (memberId, shopId) AS tuplify, score;
> D1 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId as shopId,
score AS score;
> D2 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId as shopId,
score AS score;
> J = JOIN D1 By shopId, D2 by shopId;
> K = FOREACH J GENERATE D1::memberId AS member_id1, D2::memberId AS member_id2, D1::shopId
as shop;
> EXPLAIN K;
> DUMP K;
> {code}
> It is a bit weird written like that, but it provides a minimal reproduction case (in
the real case, the "tuplified" phase came from a multi-key grouping).
> On input data:
> {code}
> 1       1001    101
> 1       1002    103
> 1       1003    102
> 1       1004    102
> 2       1005    101
> 2       1003    101
> 2       1002    123
> 3       1042    101
> 3       1005    101
> 3       1002    133
> {code}
> This will give a wrongful output like ..
> {code}
> (1,1001,1001)
> (1,1002,1002)
> (1,1002,1002)
> (1,1002,1002)
> {code}
> The second column should be a member id so (1,2,3,4,5).
> In the initial case, there was a FILTER (member_id1 < member_id2) after K, and computation
failed because of PushUpFilter optimization mistakenly moving the LOFilter operation before
the join, at a place where it tried to work on a tuple and failed.
> My understanding of the issue is that when the ImplicitSplitInserter creates the LOSplitOutputs,
it will correctly reset the schema, and the LOSplitOutput will regenerate uids for the fields
of D1 and D2 ... but will not do that on the tuple members.
> The logical plan after the ImplicitSplitINserter will look like (simplified)
> {code}
>    |---D1: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[127]ColumnPrune:OutputUids=[125,
124]
>         |---tuplified: (Name: LOSplitOutput Schema: tuplify#127:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[127]
>            |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
>     |---D2: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[130]ColumnPrune:OutputUids=[125,
124]
>         |---tuplified: (Name: LOSplitOutput Schema: tuplify#130:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[130]
>            |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
> {code}
> tuplified correctly gets a new uid (127 and 130) but the members of the tuple don't.
When they get reprojected, both branches have the same uid and the join looks like:
> {code}
> |---J: (Name: LOJoin(HASH) Schema: D1::memberId#124:long,D1::shopId#125:long,D2::memberId#139:long,D2::shopId#132:long)ColumnPrune:InputUids=[125,
124, 132]ColumnPrune:OutputUids=[125, 124, 132]
>         |   |
>         |   shopId:(Name: Project Type: long Uid: 125 Input: 0 Column: 1)
>         |   |
>         |   shopId:(Name: Project Type: long Uid: 125 Input: 1 Column: 1)
> {code}
> If for example instead of reprojecting "memberId", we project "memberId+0", a new node
is created, and ultimately the two branches of the join will correctly get separate uids.
> My understanding is that LOSplitOutput.getSchema() should recurse on nested schema fields.
However, I only have a light understanding of all of the logical plan handling, so I may be
completely wrong.
> Attached is a draft of patch and a test reproducing the issue. Unfortunately, I haven't
been able to run all unit tests with the "fix" (I have some weird hangs)
> I'd be happy if you could indicate if that looks like completely the wrong way to fix
the issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message