hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-158) Rework logical plan
Date Fri, 06 Jun 2008 01:02:49 GMT

    [ https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602870#action_12602870 ]

Santhosh Srinivasan commented on PIG-158:
-----------------------------------------

Eliminating the Generate Operator

It was recommended earlier (thanks, Pi) that we eliminate the Generate operator in the
Foreach ... Generate context.

In the types branch, the Generate operator (on both the logical and the physical side) is
a container for the expressions that are projected. It wraps each projected expression
inside a nested plan. The resulting list of plans can mix expressions that take their input
from Generate's predecessor with expressions that take it directly from the Foreach input.
The examples that follow illustrate these points.

{code}

--Example 1

a = load 'input1';
b = group a by $0;
c = foreach b {
	d = distinct a;
	generate group, sum(d.$1);
}

{code}

Logical plan after parsing:

ForEach Test-Plan-Builder-655
|   |
|   Generate Test-Plan-Builder-654
|   |   |
|   |   Project Test-Plan-Builder-650
|   |   |
|   |   UserFunc Test-Plan-Builder-653
|   |   |
|   |   |---Project Test-Plan-Builder-652
|   |
|   |---Distinct Test-Plan-Builder-649
|       |
|       |---Project Test-Plan-Builder-648
|
|---CoGroup Test-Plan-Builder-647
    |   |
    |   Project Test-Plan-Builder-646
    |
    |---Load Test-Plan-Builder-645


The Generate operator has two nested plans: one for Project(group, b) and one for the
aggregate (sum). There are two points to observe:

1. The projection of 'group' does not require the input 'd'.
2. The root of the second plan, Project(1, project(d, b)), requires the input 'd', which is
connected to Generate but is not an input within the nested plan.

The first expression should therefore be part of the Foreach operator itself. The second is a
problem on the physical side: when getNext is called on the root of the nested plan, the
input is sought from Generate, whereas the input that is actually required is the output of
Distinct (d).

Let us look at another example. Here, the input 'd' is used twice in the generate. This is a
case of an implicit split: the output of 'd' has to be fed to both the sum and the count.

{code}

--Example 2

a = load 'input1';
b = group a by $0;
c = foreach b {
	d = distinct a;
	generate sum(d.$1), count(d.$1);
}

{code}
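
Why the implicit split matters can be seen in a small Python analogy. This is only a sketch and not Pig code: it assumes the output of 'd' behaves like a single-pass stream of tuples, and the names distinct and bag are hypothetical. If both aggregates pull from the same stream, the second one finds it already drained; splitting the stream gives each aggregate its own copy.

```python
from itertools import tee


def distinct(rows):
    """Yield each row once, in first-seen order (stand-in for 'd = distinct a')."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


# Hypothetical contents of the bag 'a' for one group.
bag = [("k", 1), ("k", 2), ("k", 1), ("k", 3)]

# No split: sum and count share one single-pass stream.
d = distinct(bag)
shared_sum = sum(row[1] for row in d)
shared_count = sum(1 for _ in d)  # the stream was already consumed by sum

# Explicit split: each aggregate reads its own copy of d's output.
d_for_sum, d_for_count = tee(distinct(bag))
split_sum = sum(row[1] for row in d_for_sum)
split_count = sum(1 for _ in d_for_count)

print(shared_sum, shared_count)
print(split_sum, split_count)
```

In the shared case the count sees no rows at all, which is the kind of wrong answer an unhandled implicit split would produce.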

In order to remove the Generate operator, the nested plans which are currently part of the
Generate will be promoted to be a part of the Foreach operator with the following changes:

1. Any expression that is part of the generate (root of the nested plan) which does not require
generate's input will be moved into a nested plan of Foreach.

2. The remaining expressions of generate will be attached as leaves of generate's input by
duplicating the input graph, so that each such expression gets its own copy.

Going back to example 1, the logical plan for Foreach will have two nested plans. The first
nested plan will contain Project(group, b). The second nested plan will have 'd' as the root
and the aggregate function sum as the leaf.

Example 2 will translate to two nested plans both of which will have 'd' as the input. The
leaves of the individual plans will be the aggregate functions sum and count respectively.
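
The two rules above, applied to both examples, can be sketched in Python as follows. This is a simplified model under stated assumptions, not the Pig implementation: each plan is a flat list of operator names in data-flow order, and GenerateExpr and promote_to_foreach are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GenerateExpr:
    """One expression of generate (hypothetical model).

    ops    -- the expression's operators, in data-flow order
    source -- alias of the nested input it needs, or None if it reads
              directly from the foreach input
    """
    ops: List[str]
    source: Optional[str] = None


def promote_to_foreach(exprs: List[GenerateExpr],
                       nested_aliases: Dict[str, List[str]]) -> List[List[str]]:
    """Return the nested plans of Foreach once Generate is removed."""
    plans = []
    for expr in exprs:
        if expr.source is None:
            # Rule 1: no dependence on generate's input, so the expression
            # becomes a nested plan of Foreach on its own.
            plans.append(list(expr.ops))
        else:
            # Rule 2: duplicate the input graph and attach the expression
            # as its leaf.
            plans.append(list(nested_aliases[expr.source]) + list(expr.ops))
    return plans


# 'd = distinct a' modelled as a two-operator chain.
d_plan = ["Project(a)", "Distinct"]

example1 = promote_to_foreach(
    [GenerateExpr(["Project(group)"]),
     GenerateExpr(["Project($1)", "UserFunc(sum)"], source="d")],
    {"d": d_plan},
)

example2 = promote_to_foreach(
    [GenerateExpr(["Project($1)", "UserFunc(sum)"], source="d"),
     GenerateExpr(["Project($1)", "UserFunc(count)"], source="d")],
    {"d": d_plan},
)

for plan in example1 + example2:
    print(" -> ".join(plan))
```

In this model the implicit split of example 2 disappears because each aggregate ends up with its own copy of the Distinct chain, at the cost of evaluating that chain twice.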

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: is_null.patch, logical_operators.patch, logical_operators_rev_1.patch,
>                      logical_operators_rev_2.patch, logical_operators_rev_3.patch,
>                      parser_changes.patch, parser_changes_v1.patch, parser_changes_v2.patch,
>                      parser_changes_v3.patch, parser_changes_v4.patch, ParserErrors.txt,
>                      udf_fix.patch, udf_funcSpec.patch, udf_return_type.patch,
>                      user_func_and_store.patch, visitorWalker.patch
>
>
> Rework the logical plan in line with http://wiki.apache.org/pig/PigExecutionModel

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

