hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
Date Mon, 02 Nov 2009 01:36:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404 ]

Ashutosh Chauhan commented on PIG-1038:
---------------------------------------

I think it's a useful optimization. I presume this will be implemented as a visitor in MapReduceLauncher
that visits the compiled MR plan. The design looks good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.
How are you planning to discover this? By matching some pattern, such as an LR in the map plan
followed by POPackage, POForeach, POSort in the reduce plan?

This is somewhat orthogonal but related to this issue. We have a rule-based optimizer framework
in the front end; it seems to me that a similar framework is needed in the backend too, both to
refactor all the optimizer visitors we currently have and to make it easy to add similar
optimizations in the future.
There are seven optimizations in the front end expressed through rules. On the other hand, after
the addition of this one we will have nine optimization visitors in the backend. Maybe we can think
about this to avoid a lot of rework every time such an optimization is added.

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
> If nested foreach plan contains sort/distinct, it is possible to use hadoop secondary
> sort instead of SortedDataBag and DistinctDataBag to optimize the query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a
> special version of distinct, which does not do the sorting.
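
To make the idea in Eg1 and Eg2 concrete, here is a minimal Python sketch of what a secondary sort buys us. It is not Pig or Hadoop code: the function names are hypothetical, and the shuffle is only simulated by sorting on the composite key ($0, $1) while grouping on $0 alone. Under that assumption the bag for each group arrives already ordered by $1, so the nested order can be dropped, and the nested distinct only needs to compare adjacent values.

```python
from itertools import groupby


def shuffle_with_secondary_sort(records, group_col=0, sort_col=1):
    """Simulated shuffle (hypothetical helper, not a Hadoop API).

    Records are sorted on the composite key (group key, secondary key),
    but grouped on the group key only, which is the effect a secondary
    sort is assumed to have on the reducer input.
    """
    ordered = sorted(records, key=lambda r: (r[group_col], r[sort_col]))
    for key, rows in groupby(ordered, key=lambda r: r[group_col]):
        yield key, list(rows)


def nested_order(records):
    """Eg1: each bag is already ordered by $1, so no extra sort is done."""
    return list(shuffle_with_secondary_sort(records))


def nested_distinct(records):
    """Eg2: values of $1 arrive sorted, so duplicates are adjacent.

    A single pass comparing each value with the previous one is enough;
    this stands in for the 'special version of distinct' without sorting.
    """
    result = []
    for key, rows in shuffle_with_secondary_sort(records):
        distinct = []
        for row in rows:
            if not distinct or distinct[-1] != row[1]:
                distinct.append(row[1])
        result.append((key, distinct))
    return result
```

For example, with the records ("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 2), nested_distinct returns [("a", [1, 3]), ("b", [1, 2])] without building a per-group sorted or distinct bag.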

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

