hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-158) Rework logical plan
Date Thu, 01 May 2008 15:00:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593549#action_12593549

Pi Song commented on PIG-158:

Questions and comments as usual:-

1. I still don't quite understand this bit:-
>Generate is also used for a couple of things:
>i. In Group and Cogroup, the group by columns are wrapped in LOGenerate
>ii. In nested statements in foreach, the expression operators are wrapped in LOGenerate.
In the example below, b refers to the column in c. >LOGenerate wraps up b as distinct expects
a relational operator.

Let's formalize a bit to see what really these operator do:-
LOForEach : Bag x (f: Tuple -> Tuple ) -> Bag    where f is applied to every tuple in
the input bag (the concept is like map in Haskell)
LOGenerate : Tuple -> Tuple     (and this should be where the nested plans go)
So, from the normal form "Foreach X generate $0, $1, $2 ", we can write:-
LOForEach : Bag x (LOGenerate: Tuple -> Tuple ) -> Bag 

For COGroup:-
LOCOGroup :  [ Bag_i, f_i: Bag -> Bag ]^n -> Bag      (f_i is f subscripted with i.
Same for Bag_i)
where f_i is applied to the input Bag to get the intermediate Bag which contains Tuple [groupedColumn,
Thus, f_i cannot be LOGenerate (otherwise type mismatch) but LOForeach coupled with LOGenerate
will be ok.

That's why I don't understand what you said. I think ForEach and Generate still have to go
together anyway even in COGroup. Except that if the definition of LOCOGroup here is different
(if say it also does iterating the input bags). But if we do so, we're not reusing the functionality
of ForEach.

2. One more thing, if you say LOGenerate will be wrapped in LOCOGroup (regardless of the outcome
of the first question). Then LOCOGroup may not need to handle all the nested plan directly.

private MultiMap<LogicalOperator, LogicalPlan> mGroupByPlans;
might become
private Map<LogicalOperator, LogicalPlan> mGroupByPlans;
and then you put LOGenerate in LogicalPlan

This seems to be supported by the definition of LOCOGroup above which has a function (from
Tuple to Tuple) as a parameter. Also as shown by the definition of LOForEach above, LOForEach
should have a nested plan which contains LOGenerate. (I don't have time to think through this
suggestion yet. Might be wrong. Though I believe it's quite right because things from math
induction are usually right)

3. I think LOProject doesn't need an inner plan. The only scenario LOProject being used is
to extract data from tuple which doesn't involve any nested calculations. In contrast LOGenerate
needs a list of plans (which is already good) because this operator can do nested expressions.

4. LOGenerate - as we allow users to define aliases like this "generate $1, $2, ($2*$3) +
5, $2+1  as (name,age,gpa,gpa2)", I think the rule should be:-
Case 1: Aliases are given. We associate given aliases with output column by column
Case 2: Aliases are not given. If the column is use as it is (eg. $1, $2 in example), we just
propagate aliases to the output. If user manipulates the column somehow (eg. $2+1 in example),
the output column should not be given an alias.

5. LOForEach has "private ArrayList<LogicalOperator> mOperators". Can ForEach be associated
with more than one input?

6. In some operators, you haven't removed unused inner ExpressionOperators yet. 

7. I still think LOGenerate should have "List<Boolean> isFlattenList" because you can
do flatten only in LOGenerate therefore flatten is not a property of operator

> Rework logical plan
> -------------------
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: logical_operators.patch, logical_operators_rev_1.patch, logical_operators_rev_2.patch,
logical_operators_rev_3.patch, parser_changes.patch, parser_changes_v1.patch, parser_changes_v2.patch,
parser_changes_v3.patch, ParserErrors.txt, visitorWalker.patch
> Rework the logical plan in line with http://wiki.apache.org/pig/PigExecutionModel

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message