pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
Date Tue, 10 Nov 2009 18:09:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775969#action_12775969 ]

Ashutosh Chauhan commented on PIG-1038:

Another place where Hadoop's secondary sort is useful in Pig is sorting the index entries
for Merge Join. In the indexing job of Merge Join, index entries sampled from the map tasks
are grouped in a single reduce task, where they are sorted before being written to disk.
Currently, Pig does the sorting itself, but Hadoop's secondary sort can be used instead. This
may not yield much of a performance gain, since the index is small in any case, but it may be
a good test case for the secondary key optimization. Whether it applies depends on how you are
discovering the pattern, as I asked in my previous question. If a POSort immediately follows
a POPackage or POJoinPackage in the reducer, and some other conditions are met, we can apply
the secondary key sorting optimization.
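
To make the Merge Join case concrete, here is a rough Python simulation of the idea, not Pig or Hadoop code. It assumes each index entry is a (join key, file name, offset) tuple, which is an illustrative layout only, and it models the shuffle as a merge of per-map sorted runs. The function names are made up for this sketch.

```python
import heapq


def map_task_output(entries):
    # Model of the map side: the framework sorts each map task's output
    # by its key. Here the whole entry tuple acts as the sort key.
    return sorted(entries)


def single_reducer(sorted_runs):
    # Model of the shuffle into one reducer: the pre-sorted runs are merged,
    # so entries arrive in order and no explicit sort step is needed.
    return list(heapq.merge(*sorted_runs))


# Hypothetical sampled index entries from two map tasks.
runs = [
    map_task_output([("m", "part-0", 40), ("c", "part-0", 0)]),
    map_task_output([("x", "part-1", 0), ("a", "part-1", 75)]),
]
index = single_reducer(runs)
```

In this model the reducer only writes what it receives, which is the work the comment suggests moving from Pig to Hadoop.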

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>         Attachments: PIG-1038-1.patch, PIG-1038-2.patch
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop's secondary
> sort instead of SortedDataBag and DistinctDataBag to optimize the query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
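
A small Python simulation of what Eg1 relies on, assuming two-field rows ($0, $1) with made-up values. This is only a model of the shuffle, not Pig's implementation: records are sorted on the composite key ($0, $1) but grouped on $0 alone, so each group's bag is already ordered by $1.

```python
from itertools import groupby

# Hypothetical input rows ($0, $1).
rows = [("k1", 3), ("k2", 9), ("k1", 1), ("k2", 4), ("k1", 2)]

# Model of the shuffle: sort on the composite key ($0, $1).
shuffled = sorted(rows, key=lambda r: (r[0], r[1]))

result = []
# Group on $0 only; the secondary field $1 stays in sorted order.
for group, bag in groupby(shuffled, key=lambda r: r[0]):
    # The bag is already ordered by $1, so no per-group sort is needed.
    result.append((group, list(bag)))
```

Because the ordering comes from the shuffle, the per-group "order A by $1" has nothing left to do in this model.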
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a
> special version of distinct, which does not do the sorting.
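
One way to picture the "special version of distinct" from Eg2 is the Python sketch below. It assumes the values already arrive sorted, as they would after a secondary sort on A.$1, and drops duplicates by comparing each value with the previous one. The function name is hypothetical, and this is not Pig's actual operator.

```python
def distinct_sorted(values):
    """Yield distinct values from an already sorted iterable.

    Equal values are adjacent in sorted input, so comparing with the
    previous value is enough; no sorting or hash set is required.
    """
    sentinel = object()  # marks "no previous value yet"
    previous = sentinel
    for value in values:
        if previous is sentinel or value != previous:
            yield value
        previous = value


# Hypothetical sorted stream of A.$1 values for one group.
distinct_values = list(distinct_sorted([1, 1, 2, 3, 3, 3, 7]))
```

The sketch keeps only one previous value in memory, which is why sorted input allows distinct to skip the sorting step.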
