hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-158) Rework logical plan
Date Mon, 21 Apr 2008 18:21:24 GMT

    [ https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590987#action_12590987
] 

Santhosh Srinivasan commented on PIG-158:
-----------------------------------------

Pi,

Thanks for the comments/questions. My replies are inline with [Santhosh]

1) So seems like every operator will handle its own output schema right? That makes it easier
for me.

[Santhosh] Yes, each operator handles its own output schema.

2) in LOProject.getSchema() :-

mExp is an ExpressionOperator. From what you said before .getSchema() will always return null
except if we're dealing with tuple operators. I think this will make more sense if mExp becomes
RelationOperator (We only do projection operations on relational operators, don't we?)

[Santhosh]  The getSchema() method for expression operators will mostly return null except
when tuple operations are performed. Projection operator is used on expression operators and
not on relational operators. Projection can occur in multiple places, UDFs, aliases to expressions,
etc. For example, $1.($0, $1, $2) will translate to LOProject(List<1>, LOProject(List<0,1,2>,
null)).

In the example Pig Script given below, $0 and $1 are aliased as url and rank respectively.
The aliases are associated with the schema of the column projection. In order to use these
aliases in subsequent statements, LOProject needs to support schemas.

{code}
a = load 'a';
b = foreach a generate $0 as url, $1 as rank;
c = foreach b generate url; 
{code}

3) LOSplitOutput

shouldn't the output schema be the same as the input schema ? (So, use mPlan.getPredecessors()).
Typing is done by induction so you're not supposed to use type from downstream operator to
infer type in upstream operator.

[Santhosh] Thanks, I will fix that

4) In think LOSort should have "List<LogicalPlan> cols" because we should be able to
sort by more than one field and each field can be nested expressions

[Santhosh] List<ExpressionOperators> cols seems to be a better fit than List<LogicalPlan>
cols. Given that each field can be nested expressions, the expression operator can then be
a LOProject which is a nested expression operator.

When I make changes for nested plans, I will come back to the discussion of including a list
of LogicalPlan.


5) LOCross also uses mPlan.getSuccessors()

[Santhosh] Thanks, I will fix that

6) LOSplit should have "List<LogicalPlan> conds" because we should be able to split
be nested expressions

[Santhosh] A split operator is similar to a filter operator except that there are multiple
outputs. The condition on which the split occurs is hence an expression.

{code}
split a into b by $1 > '3', c by $1 < '1';
{code}

The above statement is equivalent to the following pig statements.

{code}
b = filter a by $1 > '3';
c = filter a by $1 < '1';
{code}

When I make changes for nested plans, I will come back to the discussion of including a list
of LogicalPlan.

7) All "private Log log = LogFactory.getLog(getClass());" should be made static

[Santhosh] Good point. I spent a few moments thinking about static v/s member variable. Static
is better.

8) LogicalOperator :-

{code}
* A boolean variable to remember if input has to be flattened Used only in
     * the context of generate
     */
    private boolean mIsFlatten = false;
{code}

So, shouldn't this be in LOGenerate ?

[Santhosh] Flatten is a property of the participating expressions and not an attribute of
the operator.  In the example below, $1 is flattened but $2 is not. If flatten is an attribute
of LOGenerate, we will not be able to figure out which column to flatten.

E.g.: 

{code}
a = load 'input1';
b = load 'input2
c =  group a by $1, b by $1;
d = foreach b generate flatten($1), $2;
{code}

9) I still think we should have LORelationalOperator superclass that has getSchema()/setSchema()
and remove those two methods from ExpressionOperator.

[Santhosh] You made this suggestion last week too :) Let me get these things out and come
back to it in a couple of days.

10) LOForEach should have "List<LogicalPlan> evals" because each output column should
be calculated used nested expressions.

[Santhosh] LOForEach will have a nested LogicalPlan instead of a list of LogicalPlan. The
idea is to incorporate the nested plan patch and then have LOForEach contain a nested Pig
Script.

11) LOCogroup should have "ArrayList<LogicalPlan> mGroupByCols", same reason as above.

[Santhosh] I think the reasoning in my response to LOSort having List<ExpressionOperator>
holds good here. When I make changes for nested plans, I will come back to the discussion
of including a list of LogicalPlan.


12) What is LORegexp for? "Regex match" or "Regex replace" ? If it's "Regex match" so getType()
should return DataType.BOOLEAN

[Santhosh] LORegexp is used for Regex Matches. I agree that getType should return DataType.BOOLEAN

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: logical_operators.patch, logical_operators_rev_1.patch, logical_operators_rev_2.patch,
logical_operators_rev_3.patch, parser_changes.patch, ParserErrors.txt, visitorWalker.patch
>
>
> Rework the logical plan in line with http://wiki.apache.org/pig/PigExecutionModel

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message