hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-980) Optimizing nested order bys
Date Fri, 25 Sep 2009 23:48:15 GMT

    [ https://issues.apache.org/jira/browse/PIG-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759815#action_12759815 ]

Alan Gates commented on PIG-980:

A common pattern for Pig Latin scripts is:

A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
};

Currently Pig executes this by using POSort on the reduce side, which
collects all of the records out of the bag produced by POPackage into a
SortedBag.  If this bag is large, it will spill both as part of POPackage
collecting it and as part of POSort sorting it.

None of this is necessary, however.  Hadoop allows users to specify a sort
order for data going to the reducer in addition to a partition key.  This
can be done by defining the Comparator for the job to compare all the
fields you want sorted, and the Partitioner to look only at the field you
want to partition on.  So in this case the partitioner would be set to look
at $0, and the comparator at $0 and $1.
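
As a rough illustration of the partitioner/comparator split described
above, here is a minimal sketch in plain Java that simulates the shuffle
without using any Hadoop classes.  All of the names in it
(SecondarySortSketch, Rec, partition, SORT_ORDER, shuffle) are hypothetical
and are not part of Pig or Hadoop; a real job would plug a Partitioner and
a sort comparator into the job configuration instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical stand-in for a two-field tuple: key is $0, value is $1. */
    static final class Rec {
        final String key;
        final int value;

        Rec(String key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + key + "," + value + ")";
        }
    }

    /** Partition on $0 only, so a whole group lands on one reducer. */
    static int partition(Rec r, int numReducers) {
        return Math.floorMod(r.key.hashCode(), numReducers);
    }

    /** Sort on $0 first and then $1, so each group arrives ordered by $1. */
    static final Comparator<Rec> SORT_ORDER =
            Comparator.comparing((Rec r) -> r.key)
                      .thenComparingInt(r -> r.value);

    /** Simulated shuffle: route every record, then sort each reducer's input. */
    static List<List<Rec>> shuffle(List<Rec> input, int numReducers) {
        List<List<Rec>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (Rec r : input) {
            reducers.get(partition(r, numReducers)).add(r);
        }
        for (List<Rec> part : reducers) {
            part.sort(SORT_ORDER);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
                new Rec("a", 3), new Rec("b", 2), new Rec("a", 1),
                new Rec("b", 5), new Rec("a", 2));
        List<List<Rec>> reducers = shuffle(input, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

Because each reducer's input is already ordered by $1 within a key, the
reduce side can walk through a group in order instead of collecting it into
a SortedBag first.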

Beyond avoiding unnecessary sorts and spills, this will also allow us to
use the proposed Accumulator interface (see PIG-979) for these types of
scripts.

> Optimizing nested order bys
> ---------------------------
>                 Key: PIG-980
>                 URL: https://issues.apache.org/jira/browse/PIG-980
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
>            Assignee: Ying He
> Pig needs to take advantage of secondary sort in Hadoop to optimize nested order bys.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
