hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-171) Top K
Date Fri, 28 Mar 2008 17:44:24 GMT

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583106#action_12583106 ]

Alan Gates commented on PIG-171:

A few questions/comments:

1) In your example of TOP(123, A) rows, what does the A mean?

2) I don't understand the differentiation between the three bullet points you give in the
description.  Could you elaborate and give examples of how each would be used?

3) You propose doing this as a UDF, but that only gives you some of what we really want. 
This will allow Pig to use the combiner.  Eventually, to offer full functionality, we'll want
to be able to do this on non-grouped/ordered data (just being able to see the first X records
of a file is great for experimentation and development).  This doesn't mean we can't support
it as a UDF for now, and promote it later.  But it does mean we need to think carefully about
how we want to do it.

4) You're counting on using the combiner to make this efficient.  But in the current implementation
the combiner won't be used except in very specific circumstances (a group by followed by a
foreach that includes the group).  General use of the combiner won't be in place until the
pipeline rework is ready.

5) Syntax question: do we want to use TOPK or LIMIT?  I tend to think of TOPK as implying
top results of an aggregation, vs LIMIT just meaning a certain number of rows, not necessarily
implying any grouping.  Maybe others don't use this distinction.  LIMIT also allows an offset
(give me rows 10000-20000) in addition to allowing just the first X rows.  I don't care which
we use, but it seems like we ought to discuss it in case some people have strong views one
way or another.
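
To make the distinction in point 5 concrete, here is a minimal Python sketch (not Pig syntax). The function names `limit` and `top_k_per_group` and the sample rows are hypothetical. The sketch assumes only that LIMIT means "a slice of rows with an optional offset" and that TOPK means "the K highest-ranked rows within each group".

```python
import heapq
from collections import defaultdict
from itertools import islice


def limit(rows, count, offset=0):
    """LIMIT-style: return `count` rows starting at `offset`; no grouping or ordering."""
    return list(islice(rows, offset, offset + count))


def top_k_per_group(rows, k, group_key, order_key):
    """TOPK-style: for each group, keep the k rows with the largest order_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return {g: heapq.nlargest(k, members, key=order_key)
            for g, members in groups.items()}


# Hypothetical (user, score) rows.
rows = [("a", 3), ("b", 7), ("a", 9), ("b", 1), ("a", 5), ("b", 4)]

first_two = limit(rows, 2)          # the first 2 rows as stored
window = limit(rows, 2, offset=3)   # 2 rows starting at 0-based offset 3
best = top_k_per_group(rows, 2,
                       group_key=lambda r: r[0],
                       order_key=lambda r: r[1])
```

Under these assumptions LIMIT never inspects the row contents, while TOPK depends on both a grouping key and an ordering key, which is the difference the syntax choice would have to communicate.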

> Top K
> -----
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
> Frequently, users are interested in Top results (especially Top K rows). This can be
> implemented efficiently in Pig/Map Reduce settings to deliver rapid results and low network
> bandwidth/memory usage.
> The key point is to prune all data on the map side and keep only a small set of rows that meet the
> Top criteria. We can do it in an Algebraic function (combiner) with multiple-value output. Only
> a small data set gets out of the mapper node.
> The same idea is applicable to solve variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
> In other words, the implementation is similar to combiners for aggregate functions, but instead
> of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify
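
The map-side pruning idea in the description can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proposed Pig implementation: the names `top_k_partial` and `top_k_merge` and the sample splits are hypothetical, and the real version would be written against Pig's Algebraic function interface in Java.

```python
import heapq


def top_k_partial(rows, k, key=lambda row: row):
    """Map/combine step: keep only the k largest rows of one input split."""
    return heapq.nlargest(k, rows, key=key)


def top_k_merge(partials, k, key=lambda row: row):
    """Reduce step: merge partial results; the output again has at most k rows."""
    merged = (row for partial in partials for row in partial)
    return heapq.nlargest(k, merged, key=key)


# Three hypothetical input splits, each processed independently on the map side.
splits = [[5, 1, 9, 3], [7, 8, 2], [4, 6]]
partials = [top_k_partial(split, 3) for split in splits]

# Only the small partial lists leave the "mappers"; the merge sees at most
# k rows per split instead of the full data.
result = top_k_merge(partials, 3)
```

The approach works because the top K of the whole input is always contained in the union of the per-split top K lists, so the partial step can be applied repeatedly (as a combiner would be) without changing the final answer.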

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
