hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PIG-874) Problems in pushing down foreach with flatten
Date Mon, 03 May 2010 17:07:56 GMT

     [ https://issues.apache.org/jira/browse/PIG-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Olga Natkovich resolved PIG-874.
--------------------------------

    Resolution: Fixed

This is getting addressed with the optimizer re-work

> Problems in pushing down foreach with flatten
> ---------------------------------------------
>
>                 Key: PIG-874
>                 URL: https://issues.apache.org/jira/browse/PIG-874
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Santhosh Srinivasan
>
> If the graph contains more than one foreach connected to an operator, pushing down foreach
with flatten is not possible with the current optimizer pattern matching algorithm and current
implementation of rewire. The following mechanism of pushing foreach with flatten does not
work.
> 1. Search for foreach (with flatten) connected to an operator
> 2. If checks pass then unflatten the flattened column in the foreach
> 3. Create a new foreach that flattens the mapped column (the original column number could
have changed) and insert the new foreach after the old foreach's successor.
> An example to illustrate the problem:
> {code}
> A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
> B = foreach A generate $0, $1, flatten($2);
> C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
> D = foreach C generate $0, $1, flatten($2);
> E = join B by $0, D by $0 using "replicated";
> F = limit E 10;
> {code}
> In the code snipped (see above), the optimizer will find two matches, B->E and D->E.
For the first pattern match (B->E), $2 will be unflattened and a new foreach will be introduced
after the join.
> {code}
> A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
> B = foreach A generate $0, $1, $2;
> C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
> D = foreach C generate $0, $1, flatten($2);
> E = join B by $0, D by $0 using "replicated";
> E1 = foreach E generate $0, $1, flatten($2), $3, $4, $5, $6;
> F = limit E1 10;
> {code}
> For the second match (D->E), the same transformation is applied. However, this transformation
will not work for the following reason. The new foreach is now inserted between the E and
E1. When E1 is rewired, rewire is unable to map $6 in E1 as it never exists in E. In order
to fix such situations, the pattern matching should return a global match instead of a local
match.
> Reference: PIG-873

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message