hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-171) Top K
Date Fri, 23 May 2008 13:43:56 GMT

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599367#action_12599367 ]

Pi Song commented on PIG-171:
-----------------------------

One more thing. When we sort, we start by generating percentile ranks. If we also counted
the number of records at that stage, we would have another opportunity to get rid of unneeded
data.
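
One possible reading of this suggestion, sketched in Python under stated assumptions: if the pass that yields the percentile boundaries also produces an exact record count per range bucket, then a descending Top K only needs the highest buckets whose counts add up to at least K. The names count_per_bucket and first_needed_bucket and the boundary values are hypothetical, and nothing here reflects how Pig's sort actually works.

```python
from bisect import bisect_right
from typing import List, Sequence


def count_per_bucket(values: Sequence[int], boundaries: Sequence[int]) -> List[int]:
    """Count records per range bucket; bucket i holds boundaries[i-1] <= v < boundaries[i]."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_right(boundaries, v)] += 1
    return counts


def first_needed_bucket(counts: Sequence[int], k: int) -> int:
    """Lowest bucket index that can still contain one of the k largest values."""
    remaining = k
    for i in range(len(counts) - 1, -1, -1):  # walk from the highest bucket down
        remaining -= counts[i]
        if remaining <= 0:
            return i
    return 0  # fewer than k records overall: every bucket is needed


values = [12, 45, 7, 88, 63, 21, 95, 34, 70, 5]
boundaries = [25, 50, 75]  # hypothetical percentile boundaries (assumption)
k = 3

counts = count_per_bucket(values, boundaries)
cutoff = first_needed_bucket(counts, k)

# Records in buckets below the cutoff cannot be in the top k and can be dropped.
kept = [v for v in values if bisect_right(boundaries, v) >= cutoff]
top_k = sorted(kept, reverse=True)[:k]
```

The pruning is only exact when the per-bucket counts are exact. With counts estimated from a sample, the cutoff would need a safety margin.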

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
>
> Frequently, users are interested in top results (especially the Top K rows). This can be
implemented efficiently in Pig/MapReduce settings to deliver rapid results with low network
bandwidth and memory usage.
>  
>  The key point is to prune data on the map side and keep only a small set of rows that meet
the Top criteria. We can do it in an Algebraic function (combiner) with multiple-value output. Only
a small data set leaves the mapper node.
> The same idea applies to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate functions, but instead
of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify
details.
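
The pruning idea described above can be sketched as follows. This is a minimal Python illustration, not the sample implementation promised in the issue: each split keeps only its own K best rows (the map/combiner step), and the final step merges those short lists and prunes once more. The (key, score) row shape and the names partial_top_k and merge_top_k are assumptions made for the example.

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (key, score)


def partial_top_k(rows: Iterable[Row], k: int) -> List[Row]:
    """Map/combiner side: keep only the k highest-scoring rows of one split."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])


def merge_top_k(partials: Iterable[List[Row]], k: int) -> List[Row]:
    """Reduce side: merge the already-pruned lists and prune once more."""
    merged = (row for partial in partials for row in partial)
    return heapq.nlargest(k, merged, key=lambda row: row[1])


# Three input splits, as if each were handled by a separate mapper.
splits = [
    [("a", 5), ("b", 9), ("c", 1)],
    [("d", 7), ("e", 3)],
    [("f", 8), ("g", 2), ("h", 6)],
]
k = 2

partials = [partial_top_k(split, k) for split in splits]  # at most k rows each
result = merge_top_k(partials, k)
print(result)  # [('b', 9), ('f', 8)]
```

Because the top K of a union equals the top K of the per-split top K lists, the same pruning function can be applied any number of times before the final merge, which is what makes it suitable for a combiner. At most K rows per split cross the network.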

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

