hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
Date Wed, 08 Sep 2010 18:17:34 GMT

     [ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1437:
----------------------------

         Assignee: Xuefu Zhang
    Fix Version/s: 0.9.0

> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
>                 Key: PIG-1437
>                 URL: https://issues.apache.org/jira/browse/PIG-1437
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Xuefu Zhang
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code} 
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced subsequently in
> the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than
> group-by, this will be a huge win.
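
The equivalence described above, and the condition attached to it, can be illustrated with a small model outside Pig. The following is only a sketch in Python, not Pig's implementation: the helper names (group_by_all_columns, flatten_group, distinct, group_with_count) and the sample rows are hypothetical, and results are sorted purely to make them comparable, since neither form promises an output order here.

```python
from collections import defaultdict


def group_by_all_columns(rows):
    """Model of `B = group A by (name,age);`: map each key to its bag of rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row].append(row)  # the key is the whole (name, age) tuple
    return bags


def flatten_group(rows):
    """Model of `C = foreach B generate flatten(group);`: emit only the keys."""
    return sorted(group_by_all_columns(rows).keys())


def distinct(rows):
    """Model of `B = distinct A;`."""
    return sorted(set(rows))


def group_with_count(rows):
    """Model of a script that also reads the bags (a per-group row count).

    The bag contents matter here, so the rewrite to distinct would lose data.
    """
    return sorted(key + (len(bag),) for key, bag in group_by_all_columns(rows).items())


# Hypothetical sample data with one duplicated row.
rows = [("alice", 30), ("bob", 25), ("alice", 30), ("bob", 41)]

print(flatten_group(rows) == distinct(rows))  # both forms yield the same tuples
print(group_with_count(rows))                 # needs the bags, so not rewritable
```

When only the group key is projected, both forms produce the same set of tuples, which is what makes the rewrite safe. Once anything inside the bags is used, such as the count in the last function, the grouped form carries information that distinct discards.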

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

