hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: The plan generated for this nested plan is not as per we had discussed
Date Mon, 30 Jun 2008 18:20:58 GMT
Analysis below.

Shravan M Narayanamurthy wrote:
> Hi Guys,
> I think we need to find a proper set of rules for the project's 
> schema. The following script kind of covers all the scenarios:
> A = load 'a';
> B = group A by $0;
> C = foreach B {
> C1 = filter A by $0>5;
> C2 = distinct C1;
> C3 = distinct A;
> generate group, udf1(*), udf2(C2), udf3(C2.$1), udf4(C3), udf5(C3.$1);
> }
>
> I think we had not thought about the projection in the inner plan of 
> filter. With this constraint, we need a new set of rules. Can you post 
> an algorithm that will work to set the return types of the projects?
>
> Thanks & Regards,
> --Shravan
>
> <snip>
In this case, the foreach should have the following plans:

0 - proj(0)

1 - proj( * ) -> udf1

2 - proj(1) -> filter -> distinct -> proj( * ) -> udf2

3 - proj(1) -> filter -> distinct -> proj(1) -> udf3

4 - proj(1) -> distinct -> proj( * ) -> udf4

5 - proj(1) -> distinct -> proj(1) -> udf5
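
To make the six plans concrete, here is a rough sketch in plain Python (not Pig code) of what each plan would hand to its UDF for a single grouped tuple. The function names, the sample row, and the representation of bags as lists are all assumptions made for illustration; in particular, projecting one column of a bag is modelled as a bag of one-field tuples.

```python
# Illustrative model only: bags are Python lists, tuples are Python tuples.
# None of these names come from the Pig code base.

def distinct(bag):
    """Drop duplicate tuples, keeping first-seen order."""
    seen, out = set(), []
    for t in bag:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def foreach_inputs(group_tuple):
    """Return the value each of plans 0..5 would pass on for one row of B."""
    key, bag = group_tuple[0], group_tuple[1]
    c1 = [t for t in bag if t[0] > 5]   # filter A by $0 > 5
    c2 = distinct(c1)                   # distinct C1
    c3 = distinct(bag)                  # distinct A
    return [
        key,                            # plan 0: proj(0)
        group_tuple,                    # plan 1: proj(*) -> udf1
        c2,                             # plan 2: ... -> proj(*) -> udf2
        [(t[1],) for t in c2],          # plan 3: ... -> proj(1) -> udf3
        c3,                             # plan 4: ... -> proj(*) -> udf4
        [(t[1],) for t in c3],          # plan 5: ... -> proj(1) -> udf5
    ]

# One row of B: group key 7, bag of the A tuples in that group.
row = (7, [(7, 'x'), (7, 'x'), (7, 'y')])
inputs = foreach_inputs(row)
```

Note that plan 1 receives the whole (group, bag) tuple unchanged, while plans 2 through 5 receive bags built from column 1.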

In plans 2 and 3, filter will have an inner plan of:

proj(0) -> gt, const(5) -> gt
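
Read as a small expression tree, the inner plan has proj(0) and const(5) as the two inputs of gt. The following Python sketch is only a model of that shape; the classes Proj, Const and Gt and the helper apply_filter are hypothetical and are not Pig's operator classes.

```python
# Hypothetical mini-model of the filter's inner plan: two leaves feeding gt.
from dataclasses import dataclass
from typing import Any

@dataclass
class Proj:
    col: int
    def eval(self, tup):
        return tup[self.col]            # project one column of the tuple

@dataclass
class Const:
    value: Any
    def eval(self, tup):
        return self.value               # ignores the tuple

@dataclass
class Gt:
    lhs: Any
    rhs: Any
    def eval(self, tup):
        return self.lhs.eval(tup) > self.rhs.eval(tup)

# proj(0) -> gt, const(5) -> gt
filter_plan = Gt(Proj(0), Const(5))

def apply_filter(bag, plan):
    """Keep the tuples for which the inner plan evaluates to true."""
    return [t for t in bag if plan.eval(t)]

bag = [(3, 'a'), (7, 'b'), (9, 'c')]
kept = apply_filter(bag, filter_plan)
```

Here the project inside the filter works on one tuple at a time, so it returns a single field rather than a bag.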

In discussing the scenario, Santhosh and I saw one issue: in plan 1, the 
proj( * ) will incorrectly try to accumulate a bag for udf1 when it 
should just pass the tuple through.  Santhosh is going to fix that by 
changing the project to check whether it has a predecessor and, if so, 
whether that predecessor is a relational operator, instead of looking 
at its input to see if it's a relational operator.
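
A minimal sketch of that predecessor check, in Python rather than the actual Java, might look like the following. The Op class, the relational flag, and project_mode are made-up names for illustration, and the rule that a project with no predecessor passes the tuple through is my reading of the fix, not the committed implementation.

```python
# Hypothetical model of the proposed rule; not the real Pig operator classes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    name: str
    relational: bool = False            # True for filter, distinct, etc.
    predecessor: Optional["Op"] = None  # operator feeding this one in the plan

def project_mode(project):
    """Decide how a project behaves based on its predecessor in the plan."""
    pred = project.predecessor
    if pred is not None and pred.relational:
        return "accumulate_bag"         # collect the relational op's tuples
    return "pass_tuple"                 # no relational predecessor: pass through

# Plan 1: proj(*) -> udf1 (the project has no predecessor)
plan1_proj = Op("proj(*)")

# Plan 2: proj(1) -> filter -> distinct -> proj(*) -> udf2
p = Op("proj(1)")
f = Op("filter", relational=True, predecessor=p)
d = Op("distinct", relational=True, predecessor=f)
plan2_proj = Op("proj(*)", predecessor=d)
```

Under this rule the proj( * ) in plan 1 passes the tuple to udf1, while the proj( * ) after distinct in plan 2 builds a bag for udf2.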

I didn't follow your comment on the issue with the project in the filter 
plan.  It looked fine to me.

Alan.
