hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amir Youssefi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-171) Top K
Date Wed, 21 May 2008 23:31:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598862#action_12598862
] 

Amir Youssefi commented on PIG-171:
-----------------------------------

As Pi Song mentioned LIMIT n,m (or LIMIT n TO m) doesn't show many optimization opportunities
on map side. We have to send all top m rows from mapper to reducer. 

Now consider we have those sent to one reducer. Next step is to run BOTTOM (m-n+1) function
over results of maps already sent to reducer. We can have a BOTTOM AWARE MERGER (which is
reverse of results on one node and we are done. 

I don't recommend making it more complicated but if somebody is concerned about chocking one
reducer with large m row data-sets we can send results of the first step to multiple reducers
and then running a second Map Reduce job with TOP (m-n+1) with comparator of -1 * MY_COMPARATOR()
and single reducer to get the final results.
 
I suggest we think about LIMIT n,m as extension of LIMIT n and do that later.

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
>
> Frequently, users are interested on Top results (especially Top K rows) . This can be
implemented efficiently in Pig /Map Reduce settings to deliver rapid results and low Network
Bandwidth/Memory usage.
>  
>  Key point is to prune all data on the map side and keep only small set of rows with
Top criteria . We can do it in Algebraic function (combiner) with multiple value output. Only
a small data-set gets out of mapper node.
> The same idea is applicable to solve variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
>   - TOP K ORDER BY.
> Another words implementation is similar to combiners for aggregate functions but instead
of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify
details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message