hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-171) Top K
Date Tue, 08 Jul 2008 17:22:31 GMT

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611703#action_12611703 ]

Alan Gates commented on PIG-171:

Daniel, the patch looks good.  A few small comments:

1) In LOLimit, I think Santhosh has gone back and changed all the getSchema calls to
just check mIsSchemaComputed, removing the check for whether mSchema is null.

2) In POLimit, it's swallowing nulls.  I don't think it should.  Nulls should be returned
and counted as one of the returned records.
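
As a rough illustration of point 2, here is a minimal sketch in Python (not Pig's Java, and not the actual POLimit code) of a limit operator that passes null records through and counts them toward the limit. The function name `limit` and the use of `None` to stand for a null record are assumptions made for this example.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit(records: Iterable[Optional[T]], n: int) -> Iterator[Optional[T]]:
    """Yield at most n records; a None (null) record is returned and counted."""
    if n <= 0:
        return
    emitted = 0
    for record in records:
        # Deliberately no "skip if None" branch: nulls are ordinary records.
        yield record
        emitted += 1
        if emitted >= n:
            return


rows = ["a", None, "b", None, "c"]
print(list(limit(rows, 3)))  # ['a', None, 'b']
```

A version that swallowed nulls would return `['a', 'b', 'c']` for the same input, which silently changes which rows a LIMIT 3 produces.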

This patch also makes use of the combiner.  I want to add general combiner functionality next
week, so I'm going to hold off applying this until I've figured out in general how I want
to push things into the combiner.

> Top K
> -----
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>         Attachments: limit1.patch, limit2.patch
> Frequently, users are interested in top results (especially the Top K rows). This can be
> implemented efficiently in Pig/MapReduce to deliver rapid results with low network
> bandwidth and memory usage.
> The key point is to prune the data on the map side and keep only a small set of rows that
> meet the Top criteria. We can do this in an Algebraic function (combiner) with multi-value
> output, so only a small data set leaves each mapper node.
> The same idea is applicable to solve variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
> In other words, the implementation is similar to combiners for aggregate functions, but
> instead of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify
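
The map-side pruning idea in the issue description can be sketched as follows. This is a plain Python illustration under stated assumptions, not Pig's Algebraic interface and not the code in the attached patches: the function names `initial`, `intermediate`, and `final` are hypothetical labels for the map, combiner, and reduce stages, and the rows are reduced to bare integers for brevity.

```python
import heapq
from typing import Iterable, List


def initial(rows: Iterable[int], k: int) -> List[int]:
    """Map side: keep only the k largest values seen by one mapper."""
    return heapq.nlargest(k, rows)


def intermediate(partials: Iterable[List[int]], k: int) -> List[int]:
    """Combiner: merge partial top-k lists; the result is again a top-k list."""
    merged = [value for partial in partials for value in partial]
    return heapq.nlargest(k, merged)


def final(partials: Iterable[List[int]], k: int) -> List[int]:
    """Reduce side: the same merge, producing the global top k."""
    return intermediate(partials, k)


# Three hypothetical map-side splits of the input.
splits = [[5, 1, 9, 3], [7, 7, 2], [8, 4, 6, 10]]
k = 3

map_outputs = [initial(split, k) for split in splits]
combined = [intermediate(map_outputs[:2], k), intermediate(map_outputs[2:], k)]
top_k = final(combined, k)
print(top_k)  # [10, 9, 8]
```

Each stage emits at most k values, so the data leaving a mapper is bounded by k regardless of input size. Unlike a SUM or COUNT combiner, every stage outputs a list rather than a single value, which is the "multiple values" point made in the description.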

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
