hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-171) Top K
Date Fri, 25 Jul 2008 19:37:31 GMT

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617023#action_12617023 ]

daijy edited comment on PIG-171 at 7/25/08 12:37 PM:
----------------------------------------------------------

Some explanation of the output of the "first n" sample:
1. The output is not always the first n records in the input file.

2. The output set is either:
  * the first n records from each map, followed by the top n records among those, if there is
a limit operator on the map side; or
  * the top n records globally, otherwise.

3. The output set is sorted by Hadoop before it is presented to the user.

Here "first n" means the unsorted result that comes directly from the input; "top n" means the
sorted result, where the sort order is defined by the Hadoop sort key.

If limit is combined with other operators, the order could be more complex; this is just a
simplified guideline.
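
The rules above can be illustrated with a small simulation. This is only a sketch under stated
assumptions, not Pig's actual implementation: the function name limit_output, the modelling of
each map's input as a Python list, and the choice of ascending order for the sort key are all
hypothetical.

```python
from itertools import islice


def limit_output(splits, n, map_side_limit):
    """Simulate which records a limit of n could return (hypothetical model).

    splits: one list of records per map task, in input order.
    map_side_limit: True if a limit operator runs on the map side.
    """
    if map_side_limit:
        # "First n" from each map: unsorted, straight from the input.
        candidates = [rec for split in splits for rec in islice(split, n)]
    else:
        # No map-side limit: every record is a candidate.
        candidates = [rec for split in splits for rec in split]
    # "Top n": sort by key (assumed ascending here), then keep n records.
    return sorted(candidates)[:n]


splits = [[9, 1, 5], [2, 8, 0]]
print(limit_output(splits, 2, map_side_limit=True))   # [1, 2]
print(limit_output(splits, 2, map_side_limit=False))  # [0, 1]
```

In both cases the result differs from the first two input records (9 and 1), and the two
modes differ from each other, which matches points 1 and 2.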

      was (Author: daijy):
    Some explanation on output of first n sample:
1. The output is not always the first n records in input file
2. The output set could be
  * Take first n records from each map, then top n records among those records, if there is
limit operator on map side
  * Top n records globally, otherwise
2. The output set is sorted by hadoop before presenting to the user

Here "first n" means the unsorted result come directly from input, "top n" means sorted result,
the sort order is defined by hadoop sort key.

If limit is combined with other operator, the order could be more complex, here is just a
simplified guildline.
  
> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: types_branch
>            Reporter: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: limit1.patch, limit2.patch, limit3.patch
>
>
> Frequently, users are interested in top results (especially the top K rows). This can be
implemented efficiently in Pig/MapReduce settings to deliver rapid results with low network
bandwidth and memory usage.
>  
> The key point is to prune all data on the map side and keep only a small set of rows that meet
the top criteria. We can do this in an Algebraic function (combiner) with multiple-value output.
Only a small data set leaves the mapper node.
> The same idea is applicable to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate functions, but instead
of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify
the details.
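
The map-side pruning idea in the description can be sketched as follows. This is a minimal
illustration in plain Python, not the patch attached to the issue and not Pig's Algebraic
interface: the names top_k_partial and top_k are hypothetical, and "top" is assumed to mean
the largest values.

```python
import heapq


def top_k_partial(rows, k, key=lambda r: r):
    """Combiner-style step: keep only the k largest rows of one chunk."""
    return heapq.nlargest(k, rows, key=key)


def top_k(map_inputs, k, key=lambda r: r):
    """Two-stage Top K: prune per mapper, then prune the merged partials."""
    # Map side: each mapper emits at most k rows instead of all of its input.
    partials = [top_k_partial(rows, k, key) for rows in map_inputs]
    # Reduce side: merge the small partial results and prune once more.
    merged = [row for part in partials for row in part]
    return top_k_partial(merged, k, key)


map_inputs = [[3, 7, 1], [9, 2], [4, 8, 6]]
print(top_k(map_inputs, 2))  # [9, 8]
```

Because the same pruning step can be applied to raw rows and to already pruned partial
results, it behaves like a combiner that outputs up to k values rather than one.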

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

