pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-988) Better implementation of distinct aggs
Date Thu, 01 Oct 2009 17:48:23 GMT

    [ https://issues.apache.org/jira/browse/PIG-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761284#action_12761284 ]

Alan Gates commented on PIG-988:

Consider a script like:

A = load 'bla';
B = group A by $0;
C = foreach B {
       D = A.$1;
       E = distinct D;
       generate group, COUNT(E);
};

This is a count distinct, and a fairly common thing to do.  Currently Pig uses the combiner
to remove as many duplicate values from D as possible, but a final distinct pass is still
required on the reducer, and DistinctBag is used for that pass.  In this particular case,
it would be possible to instead use Hadoop's secondary sort to sort the incoming records on
the full tuple, and then use a different implementation of DistinctBag that expects the incoming
records to be sorted and removes duplicates.  Because equal tuples then arrive next to each
other, such a bag only needs to compare each record with the one before it.
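
As a rough illustration of the sorted-input idea, here is a minimal Java sketch.  It is not
Pig's DistinctBag and does not use any Pig or Hadoop API; the class and method names
(SortedDistinctSketch, countDistinctSorted) are made up for this example.  It only assumes
that the input has already been sorted, as secondary sort would arrange, so that equal
values are adjacent.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch only: not Pig's DistinctBag, and no Pig/Hadoop classes are used.
class SortedDistinctSketch {

    // Counts distinct values in an input whose equal elements are adjacent
    // (for example, because the input was sorted beforehand).
    // Only the previous element is remembered, so memory use is constant.
    static <T> long countDistinctSorted(Iterator<T> sorted) {
        long count = 0;
        boolean first = true;   // distinguishes "no previous element" from a null element
        T previous = null;
        while (sorted.hasNext()) {
            T current = sorted.next();
            if (first || !Objects.equals(previous, current)) {
                count++;            // a new run of equal values starts here
                previous = current;
                first = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "a", "b", "c", "c", "c");
        System.out.println(countDistinctSorted(values.iterator())); // prints 3
    }
}
```

The point of the sketch is that no hash set or in-memory sort is needed on the reducer
side once the records arrive in order; the cost moves into the shuffle sort instead.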

Note that this could not be used in conjunction with the order by optimization proposed in

> Better implementation of distinct aggs
> --------------------------------------
>                 Key: PIG-988
>                 URL: https://issues.apache.org/jira/browse/PIG-988
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
> Distinct aggregates by definition cannot use the combiner (though the distinct can be
> and is done in the combiner).  Since this is a common use case it would be good to optimize.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
