hadoop-pig-dev mailing list archives

From "Aniket Mokashi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1434) Allow casting relations to scalars
Date Fri, 25 Jun 2010 21:04:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882711#action_12882711 ]

Aniket Mokashi commented on PIG-1434:
-------------------------------------

The proposal for scalars is as follows:
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y uses C as a scalar and track it internally as a
scalar. Operations such as C * C are therefore also allowed. The limitation is that C must
hold a value that is convertible to long (once stored into the file). A cast such as (int) C
is also allowed and succeeds if the cast operation succeeds.
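
As a rough illustration of the semantics described above, here is a minimal Python sketch. It
is a toy model and not Pig code: the function name as_scalar and the list-of-tuples stand-in
for the stored contents of C are assumptions made for this example. The single-value check
follows comment (1) in the issue description quoted below.

```python
def as_scalar(rows):
    """Toy model of reading a stored relation back as a long scalar.

    `rows` is a hypothetical stand-in for the stored contents of relation C:
    a list of tuples of text fields.
    """
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("relation must hold exactly one value to be used as a scalar")
    # int() raises ValueError when the stored text is not long-convertible.
    return int(rows[0][0])


# C = foreach B generate COUNT(A);   -> one row with one field
c = as_scalar([("3",)])

# Y = foreach A generate C;          -> every row of A sees the same scalar
a = [("x", 1), ("y", 2), ("z", 3)]
y = [(c,) for _ in a]

# Arithmetic on the scalar, such as C * C, works on the converted value.
c_squared = c * c
```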

As Daniel mentioned earlier, there are two challenges in introducing scalars:
1. Adding the implicit store: We cannot add it too early (during parsing), because we would
then get a redundant (implicit) store operation for the rest of the commands in the script.
If we add it too late, the merge algorithm does not find the store and discards the branch
that compiles and executes it.
To solve this, whenever we process a store plan after the parsing stage, we detect scalars in
the plan and add the branches that produce those scalars to the current plan. We also attach
LOStores for the scalars and merge the required plan.
2. Tracking the implicit dependency: A reference to scalar C needs to be converted into an
implicit ReadScalar operation, and it also needs to add a dependency on the map-reduce job
that generates the scalar value. We track this dependency by adding LOScalar and POScalar
operators that carry a reference to the scalar they depend on. When we compile the map-reduce
plan, we replace POScalar with a POUserFunc that loads the scalar value, and we mark the
dependency between the two map-reduce jobs.
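
The two steps above can be sketched with a toy model in Python. This is only an illustration
under stated assumptions, not the patch: the dictionary-based plan, the job names, and the
functions expand_store_plan and order_jobs are all hypothetical and do not correspond to Pig's
LogicalPlan, LOStore, or map-reduce compiler APIs.

```python
def expand_store_plan(script, target):
    """Return (aliases in the merged plan, scalars needing an implicit store).

    `script` maps alias -> {"inputs": [...], "scalars": [...]}; it is a
    hypothetical stand-in for the logical plan of the whole script.
    """
    in_plan, implicit_stores = set(), []
    pending = [target]
    while pending:
        alias = pending.pop()
        if alias in in_plan:
            continue
        in_plan.add(alias)
        pending.extend(script[alias]["inputs"])
        for scalar in script[alias]["scalars"]:
            if scalar not in implicit_stores:
                implicit_stores.append(scalar)
            # Pull the branch that produces the scalar into this plan.
            pending.append(scalar)
    return in_plan, implicit_stores


def order_jobs(stores, scalars_read):
    """Order jobs so each scalar is written before any job that reads it.

    stores:       job name -> alias that job stores
    scalars_read: job name -> scalar aliases that job reads
    """
    producer_of = {alias: job for job, alias in stores.items()}
    ordered, visiting = [], set()

    def visit(job):
        if job in ordered:
            return
        if job in visiting:
            raise ValueError("cyclic scalar dependency at " + job)
        visiting.add(job)
        for alias in scalars_read.get(job, []):
            visit(producer_of[alias])
        visiting.discard(job)
        ordered.append(job)

    for job in stores:
        visit(job)
    return ordered


# The script from the proposal: Y reads A directly and uses C as a scalar.
script = {
    "A": {"inputs": [], "scalars": []},
    "B": {"inputs": ["A"], "scalars": []},
    "C": {"inputs": ["B"], "scalars": []},
    "Y": {"inputs": ["A"], "scalars": ["C"]},
}
aliases, implicit = expand_store_plan(script, "Y")
stores = {"job_" + alias: alias for alias in ["Y"] + implicit}
order = order_jobs(stores, {"job_Y": script["Y"]["scalars"]})
```

In this model the implicit store for C is added only when the plan for "store Y" is processed,
so a plan that does not use a scalar gets no extra store, and the job that writes C is ordered
before the job that reads it.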

I am attaching the patch with the above-mentioned changes.

A few known issues:
To track the dependencies of scalars, we need access to the map of operators from one type of
plan to the other, but this map is generated by visitors. The same visitors are responsible
for converting LOScalar -> POScalar -> POUserFunc. So, if a visitor visits an LOScalar before
the LO associated with the scalar (C in the example), we do not find the PO associated with C.
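
One possible way around this ordering problem, shown here only as a hedged sketch and not as
what the attached patch does, is to defer the scalar lookup until the visitor has finished
building the map. The function translate, the string names, and the "PO(...)" placeholders are
hypothetical and stand in for the real operator classes.

```python
def translate(visit_order, scalar_refs):
    """Translate logical operators to physical ones, resolving scalars last.

    visit_order: logical operator names in the order the visitor meets them
    scalar_refs: scalar operator name -> logical operator it refers to
    """
    lo_to_po = {}
    deferred = []
    for lo in visit_order:
        lo_to_po[lo] = "PO(" + lo + ")"
        if lo in scalar_refs:
            # The target may not be translated yet, so do not look it up here.
            deferred.append(lo)
    # Second pass: every visited operator now has an entry in lo_to_po.
    return {lo: lo_to_po[scalar_refs[lo]] for lo in deferred}


# The scalar is visited before C, the case that breaks a single-pass lookup.
resolved = translate(["scalar_in_Y", "Y", "C"], {"scalar_in_Y": "C"})
```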

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case
> that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe
> we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be
> to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

