hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-159) Make changes to the parser to support new types functionality
Date Sun, 18 May 2008 12:14:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597801#action_12597801 ]

Pi Song commented on PIG-159:
-----------------------------

Comments on v8:-

1) Double doesn't work at all when you create a typed complex constant. To fix this, add "(dataType == DOUBLE) ||" in DataType.isAtomic.
2) In LOCross.getSchema, you concatenate all the fields from "Collection<LogicalOperator> pred = mPlan.getPredecessors(this);". This is not good because mPlan.getPredecessors doesn't preserve the order of the inputs. Use mInputs instead.
3) Here is my query:-
{noformat}
a = load 'a' as (field1: integer, field2: long);
b = load 'a' as (field1: bytearray, field2: double);
c = group a by field1, b by field1  ;
{noformat}
When I parse this using the latest query parser, I get:-
{noformat}
(74:LOCogroup={group: integer,a: {field1: integer,field2: long},b: {field1: bytearray,field2: double}}==>80)
        <COGroup Inner Plan>
        (72:LOProject=integer==>TERMINAL)
        <COGroup Inner Plan>
        (73:LOProject=bytearray==>81)
        (81:LOCast=integer==>TERMINAL)
(80:LOForEach={field2: double,bytearray}==>TERMINAL)
        <ForEach Inner Plan>
        (79:LOGenerate=(field2: double,bytearray)==>TERMINAL)
                <Generate Inner Plan>
                (76:LOProject=double==>TERMINAL)
                <Generate Inner Plan>
                (77:LOProject=(field1: integer,field2: long)==>78)
                (78:LOUserFunc=bytearray==>TERMINAL)
{noformat}
I don't know where LOUserFunc comes from.
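
To illustrate the ordering concern in point 2 above, here is a minimal Java sketch. It is not Pig code: `InputOrderSketch`, `crossSchema`, and the field names are made up for illustration, and schemas are reduced to lists of field names. The point is only that concatenating fields over an ordered list (the role mInputs would play) gives a schema that follows the declared input order, which an unordered Collection cannot promise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in; schemas are reduced to lists of field names.
class InputOrderSketch {

    // Concatenates the fields of every input in the order the inputs are listed.
    // The ordered list plays the role of mInputs in the comment above.
    static List<String> crossSchema(List<List<String>> orderedInputs) {
        List<String> fields = new ArrayList<>();
        for (List<String> input : orderedInputs) {
            fields.addAll(input);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("a_field1", "a_field2");
        List<String> b = Arrays.asList("b_field1", "b_field2");
        // Swapping the inputs changes the resulting schema, so order matters.
        System.out.println(crossSchema(Arrays.asList(a, b)));
        System.out.println(crossSchema(Arrays.asList(b, a)));
    }
}
```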

4) The way LOProject is used seems a bit weird to me. I found that when you do something like this:-
{noformat}
c = group a by field1, b by field1  ;
d = foreach c generate grp, a.(field1, field2), b.(field1, field2)  ;
{noformat}
you will have in generate's inner plans:-

Project(0  sentinel=true ) 
Project(0,1 sentinel=false)
Project(0,1 sentinel=false)

The second and the third are the same. Because you use projects to select columns from inner bags, they don't contain information to refer back to the columns those bags come from! Having mSentinal seems to make this more difficult to understand, because Project now has two different meanings: 1) actual projection and 2) bridging between plans. Wouldn't it be better to introduce a new LO to work as the sentinel?
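
A rough sketch of what a separate sentinel operator could look like, in Java. All names here (`SentinelSketch`, `Bridge`, `Projection`) are hypothetical and are not existing Pig classes; the assumption is simply that a projection keeps a reference to a dedicated bridging operator, so two projections over the same column numbers remain distinguishable.

```java
import java.util.Arrays;
import java.util.List;

class SentinelSketch {

    // Hypothetical operator whose only job is to bridge an outer plan input
    // into an inner plan. It never selects columns itself.
    static final class Bridge {
        final String inputAlias;

        Bridge(String inputAlias) {
            this.inputAlias = inputAlias;
        }
    }

    // Hypothetical projection that always records which bridge it reads from.
    static final class Projection {
        final Bridge source;
        final List<Integer> columns;

        Projection(Bridge source, Integer... columns) {
            this.source = source;
            this.columns = Arrays.asList(columns);
        }

        @Override
        public String toString() {
            return "Project(" + source.inputAlias + " " + columns + ")";
        }
    }

    public static void main(String[] args) {
        Projection fromA = new Projection(new Bridge("a"), 0, 1);
        Projection fromB = new Projection(new Bridge("b"), 0, 1);
        // Same column numbers, but the two projections no longer look identical.
        System.out.println(fromA);
        System.out.println(fromB);
    }
}
```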

5) I think it's time to think about aggregate functions in foreach generate. We just have to add a List<AggregateApec> to either ForEach or Generate. I'm not sure which one is better, but ForEach seems to handle more whole-bag things, so it seems more suitable to me.
{noformat}
class AggregateApec {
   AggregateOperator agg ;
   int col ;
}
{noformat}
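
To make the proposal a little more concrete, here is one possible reading of it as a Java sketch. Only the `AggregateApec` shape (an operator plus a column index) comes from the message above; the `AggregateOperator` interface, the `SUM` and `COUNT` constants, the `evaluate` method, and the use of `long` values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AggregateSketch {

    // Assumed contract: an aggregate folds all values of one column into one value.
    interface AggregateOperator {
        long apply(List<Long> columnValues);
    }

    // Same shape as proposed in the message: an operator plus a column index.
    static final class AggregateApec {
        final AggregateOperator agg;
        final int col;

        AggregateApec(AggregateOperator agg, int col) {
            this.agg = agg;
            this.col = col;
        }
    }

    // Two illustrative aggregates.
    static final AggregateOperator SUM = values -> {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    };
    static final AggregateOperator COUNT = values -> (long) values.size();

    // Hypothetical ForEach-side evaluation: one output value per spec,
    // computed over the whole bag (a list of tuples).
    static List<Long> evaluate(List<AggregateApec> specs, List<List<Long>> bag) {
        List<Long> out = new ArrayList<>();
        for (AggregateApec spec : specs) {
            List<Long> column = new ArrayList<>();
            for (List<Long> tuple : bag) {
                column.add(tuple.get(spec.col));
            }
            out.add(spec.agg.apply(column));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Long>> bag = Arrays.asList(
                Arrays.asList(1L, 10L),
                Arrays.asList(2L, 20L),
                Arrays.asList(3L, 30L));
        List<AggregateApec> specs = Arrays.asList(
                new AggregateApec(SUM, 1),
                new AggregateApec(COUNT, 0));
        System.out.println(evaluate(specs, bag));
    }
}
```

Keeping the list on the ForEach side, as suggested, means the aggregates see the whole bag at once rather than one tuple at a time.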

6) Nested expressions in COGroup don't work. For example:-
{noformat}
c = cogroup a by (field1+field2)*field1, b by field1  ;
{noformat}
will throw an error message because the parser thinks "(" is the beginning of a tuple. Maybe we just need more lookahead?
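
One way to picture the extra lookahead is the Java sketch below. It is not the actual parser code and does not use the parser generator's own lookahead facilities; `LookaheadSketch` and `startsTuple` are hypothetical names, and the input is assumed to be already split into tokens. The heuristic assumed here: an opening "(" starts a tuple only if a comma appears directly inside it before its matching ")".

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSketch {

    // Returns true if the "(" at position start opens a tuple, i.e. a comma
    // appears at nesting depth 1 before the matching ")". Otherwise the
    // parenthesis is treated as grouping an expression.
    static boolean startsTuple(List<String> tokens, int start) {
        if (!"(".equals(tokens.get(start))) {
            throw new IllegalArgumentException("token at start is not '('");
        }
        int depth = 0;
        for (int i = start; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("(")) {
                depth++;
            } else if (t.equals(")")) {
                depth--;
                if (depth == 0) {
                    return false; // matching ")" reached without a top-level comma
                }
            } else if (t.equals(",") && depth == 1) {
                return true; // comma directly inside the outer parentheses
            }
        }
        throw new IllegalArgumentException("unbalanced parentheses");
    }

    public static void main(String[] args) {
        List<String> expr = Arrays.asList(
                "(", "field1", "+", "field2", ")", "*", "field1", ",", "b");
        List<String> tuple = Arrays.asList("(", "field1", ",", "field2", ")");
        System.out.println(startsTuple(expr, 0));
        System.out.println(startsTuple(tuple, 0));
    }
}
```

For the failing query, the comma after `field1` sits outside the parentheses, so the scan stops at the matching ")" and treats `(field1+field2)` as a grouped expression.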

> Make changes to the parser to support new types functionality
> -------------------------------------------------------------
>
>                 Key: PIG-159
>                 URL: https://issues.apache.org/jira/browse/PIG-159
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: parser_chages_v5.patch, parser_chages_v6.patch, parser_chages_v7.patch, parser_chages_v8.patch, parser_chages_v9.patch
>
>
> In order to support the new types functionality described in http://wiki.apache.org/pig/PigTypesFunctionalSpec, the parser needs to change in the following ways:
> 1) AS needs to support types in addition to aliases.  So where previously it was legal to say:
> a = load 'myfile' as a, b, c;
> it will now also be legal to say
> a = load 'myfile' as a integer, b float, c chararray;
> 2) Non-string constants need to be supported.  This includes non-string atomic types (integer, long, float, double) and the non-atomic types bags, tuples, and maps.
> 3) A cast operator needs to be added so that fields can be explicitly cast.
> 4) Changes to DEFINE, to allow users to declare arguments and return types for UDFs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

