pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-5211) Optimize Nested Limited Sort
Date Tue, 04 Apr 2017 07:53:41 GMT

    [ https://issues.apache.org/jira/browse/PIG-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954743#comment-15954743
] 

Daniel Dai commented on PIG-5211:
---------------------------------

Looks pretty good so far. NestedLimitOptimizer needs some fine-tuning: the existence of both LOLimit
and LOSort is not enough; we must make sure LOLimit comes right after LOSort. Alternatively, you can follow
LimitOptimizer and push LOLimit all the way up, which is more sophisticated (I am not insisting
on this, though). Also, SecondaryKeyOptimizer does not currently recognize a limited nested sort,
so it may optimize the limited sort into an MR/Tez secondary sort, in which case
the limit is lost. So, inside SecondaryKeyOptimizer, we should disable the optimization when the nested sort
is a limited sort. You can use the following script as a test case in which SecondaryKeyOptimizer
gets involved:
{code}
a = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
b = group a by name;
c = foreach b {
    c1 = order a by age;
    c2 = limit c1 5;
    generate c2;
}
explain c;
{code}
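
The adjacency condition described above (LOLimit must directly follow LOSort, not merely exist in the same nested plan) can be sketched as follows. This is only an illustration under stated assumptions: `Op`, `then`, and `isLimitedSort` are hypothetical stand-ins and are not Pig's actual logical plan classes or the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedLimitCheck {
    /** Hypothetical stand-in for a logical operator; NOT Pig's real operator class. */
    static final class Op {
        final String kind;                       // e.g. "LOSort", "LOLimit", "LOFilter"
        final List<Op> successors = new ArrayList<>();

        Op(String kind) { this.kind = kind; }

        /** Connects this operator to next and returns next, so calls can be chained. */
        Op then(Op next) {
            successors.add(next);
            return next;
        }
    }

    /** True only if the sort feeds exactly one operator and that operator is a limit. */
    static boolean isLimitedSort(Op sort) {
        return "LOSort".equals(sort.kind)
            && sort.successors.size() == 1
            && "LOLimit".equals(sort.successors.get(0).kind);
    }

    public static void main(String[] args) {
        // order ... ; limit ...  -> limit is right after sort
        Op adjacent = new Op("LOSort");
        adjacent.then(new Op("LOLimit"));
        System.out.println(isLimitedSort(adjacent));   // prints "true"

        // order ... ; filter ... ; limit ...  -> both exist, but not adjacent
        Op separated = new Op("LOSort");
        separated.then(new Op("LOFilter")).then(new Op("LOLimit"));
        System.out.println(isLimitedSort(separated));  // prints "false"
    }
}
```

The second case is the one a simple "both operators exist" check would get wrong: applying the limit inside the sort there would change which rows reach the filter.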

> Optimize Nested Limited Sort
> ----------------------------
>
>                 Key: PIG-5211
>                 URL: https://issues.apache.org/jira/browse/PIG-5211
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jin Sun
>            Assignee: Jin Sun
>             Fix For: 0.17.0
>
>         Attachments: PIG-5211-1.patch
>
>
> Currently, in a FOREACH clause, if both LIMIT and ORDER BY are present, Pig stores all elements
> and sorts them. It should use a priority queue to be more space-efficient.
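
For illustration, the priority-queue idea in the description above might look like the sketch below: keep at most k elements in a max-heap while scanning, so memory stays at O(k) instead of holding the whole bag. The class and method names (`BoundedTopK`, `smallestK`) are assumptions made for this example and do not come from the patch; ties are not kept in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class BoundedTopK {
    /**
     * Returns the k smallest elements of input (by cmp) in ascending order,
     * holding at most k elements in memory at any time.
     */
    static <T> List<T> smallestK(Iterable<T> input, int k, Comparator<T> cmp) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Reversed comparator turns the queue into a max-heap:
        // the head is the largest element kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(k, Collections.reverseOrder(cmp));
        for (T item : input) {
            if (heap.size() < k) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) < 0) {
                // item beats the current worst survivor, so replace it
                heap.poll();
                heap.add(item);
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp);  // sorts only the k survivors
        return result;
    }

    public static void main(String[] args) {
        // Ages within one group, as in "order a by age; limit c1 5"
        List<Integer> ages = Arrays.asList(23, 19, 31, 18, 20, 25, 22);
        System.out.println(smallestK(ages, 5, Comparator.<Integer>naturalOrder()));
        // prints [18, 19, 20, 22, 23]
    }
}
```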



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
