hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1434) Allow casting relations to scalars
Date Wed, 23 Jun 2010 17:45:51 GMT

    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881784#action_12881784 ]

Daniel Dai commented on PIG-1434:

We have decided to change part of the implementation to solve the following problems:
1. Deciding when to add the store. Currently we parse statement by statement, and only when we
see a store do we merge that branch into the integrated logical plan. If we add the store for C
too late, the merge algorithm cannot see it and discards the branch. If we add it too early
(while parsing Y, in the example) and Y is never stored or dumped later, we end up with a
redundant store for C.
2. The implicit dependency C -> Y. C creates a side file and Y uses it. However, this is not
normal data flow and should not be represented as a connection in the logical plan.
We are now exploring the following implementation:
1. Add LOScalar and POScalar to represent a scalar expression.
2. When parsing Y, put an LOScalar as a placeholder in the ForEach inner plan.
3. When parsing the store of Y, we know we need to merge the store branch. At the same time, we
check whether the branch (Y) contains a scalar. If so, we find what the scalar refers to (C), add
a store to that branch, and merge that branch into the integrated logical plan.
4. Add a map-reduce layer optimizer, ScalarOptimizer. It looks for a map-reduce job that contains
a POScalar and a map-reduce job that contains the operator the POScalar refers to, and creates a
dependency between these two jobs. ScalarOptimizer should run before MultiQueryOptimizer.
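
Steps 3 and 4 can be sketched with a small Python model. This is only an illustration under stated assumptions, not the Pig implementation: Branch, merge_on_store, scalar_dependencies, and the job names job1/job2 are all hypothetical, and a scalar placeholder is modelled as nothing more than the alias it refers to.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    alias: str
    inputs: list = field(default_factory=list)       # normal data-flow inputs
    scalar_refs: list = field(default_factory=list)  # aliases read via a scalar placeholder
    stored: bool = False


def merge_on_store(alias, branches):
    """Step 3: merge the stored branch plus any branch one of its scalars refers to."""
    branches[alias].stored = True
    merged = []
    pending = [alias]
    while pending:
        name = pending.pop()
        if name in merged:
            continue
        merged.append(name)
        branch = branches[name]
        pending.extend(branch.inputs)
        for ref in branch.scalar_refs:
            branches[ref].stored = True  # add the store that writes the side file
            pending.append(ref)
    return sorted(merged)


def scalar_dependencies(jobs):
    """Step 4: a job that reads a scalar depends on the job that produces it."""
    producer = {alias: job for job, info in jobs.items() for alias in info["produces"]}
    deps = set()
    for job, info in jobs.items():
        for ref in info["scalar_refs"]:
            if producer[ref] != job:
                deps.add((producer[ref], job))
    return sorted(deps)


branches = {
    "A": Branch("A"),
    "B": Branch("B", inputs=["A"]),
    "C": Branch("C", inputs=["B"]),
    "X": Branch("X"),
    "Y": Branch("Y", inputs=["X"], scalar_refs=["C"]),
}
merged = merge_on_store("Y", branches)

jobs = {
    "job1": {"produces": ["A", "B", "C"], "scalar_refs": []},
    "job2": {"produces": ["X", "Y"], "scalar_refs": ["C"]},
}
deps = scalar_dependencies(jobs)
print(merged, deps)
```

In this model the store for C is added only when Y is actually stored, which avoids the redundant store, and the C -> Y relationship shows up only as an ordering between jobs rather than as a data-flow edge in the plan.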

> Allow casting relations to scalars
> ----------------------------------
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case
> that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe
> we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be
> to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF
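
The quoted idea of storing C into a file and converting it to a scalar, with an error when the relation does not hold a single value, might look roughly like the following Python sketch. It is not the UDF mentioned in the issue; read_scalar, the one-record-per-line file layout, and the sample values are assumptions made for illustration.

```python
import os
import tempfile


def read_scalar(path):
    """Return the single record stored at path; fail if there is not exactly one."""
    with open(path) as handle:
        records = [line.rstrip("\n") for line in handle if line.strip()]
    if len(records) != 1:
        raise ValueError(f"expected exactly one record, found {len(records)}")
    return records[0]


with tempfile.TemporaryDirectory() as tmp:
    side_file = os.path.join(tmp, "C")
    with open(side_file, "w") as handle:
        handle.write("4\n")  # stands in for the stored COUNT(A)
    count = int(read_scalar(side_file))  # plays the role of (long) C
    ratios = [value / count for value in [2, 6]]  # stands in for $1 / (long) C
```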

