hadoop-pig-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: [jira] Commented: (PIG-171) Top K
Date Sun, 08 Jun 2008 15:56:19 GMT
If I want to do a sample, I will typically filter by a uniform random
number, not take the top k.
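
A minimal sketch of that kind of filter, in Python purely for illustration.
The names `sample_by_filter`, `fraction` and `seed` are hypothetical and
are not part of Pig. Each row is kept independently with the given
probability, so the sample size is only approximately fraction * N:

```python
import random

def sample_by_filter(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # private generator so a seed gives repeatable runs
    # rng.random() is uniform on [0.0, 1.0), so the test passes with
    # probability `fraction` for every row, regardless of row order.
    return [row for row in rows if rng.random() < fraction]

rows = list(range(10_000))
sample = sample_by_filter(rows, 0.10, seed=42)
```

Unlike taking the first K rows, this does not depend on how the input
happens to be ordered.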

And if I do take the top K, K is usually fairly small so sorting it by
conventional mechanisms later is fine by me.
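
To make that concrete, here is a rough sketch (Python again, with the
hypothetical helper name `top_k_unordered`) that selects the K largest
values in one pass using a bounded min-heap, returns them in no
particular order, and leaves the final ordering to an ordinary sort of
the small result:

```python
import heapq

def top_k_unordered(values, k):
    """Return the k largest values in no particular order, using O(k) memory."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest kept value, so swap it in
            heapq.heapreplace(heap, v)
    return heap

scores = [7, 42, 3, 19, 88, 5, 61, 23]
best = top_k_unordered(scores, 3)     # the 3 largest values, unordered
ranked = sorted(best, reverse=True)   # conventional sort of only k items
```

Sorting `best` costs O(K log K), which is negligible when K is small.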

On Sun, Jun 8, 2008 at 4:05 AM, Pi Song (JIRA) <jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603362#action_12603362]
>
> Pi Song commented on PIG-171:
> -----------------------------
>
> Ted (From mailing-list):
> bq. An efficient implementation of top K without full histogramming would
> still be very, very useful.
>
> Logically (not from experience), I am still concerned about TOP K without
> an order. Does it really have a good use? The formal definition of TOP K
> always comes with a scoring function. Naturally, we also say we want TOP K
> ordered by something.
>
> The only use case I can think of for TOP K without an order is working
> with sample data. But TOP K will not give a statistically good
> representation. My view is that it would be better to design the language
> so that it does not allow people to do the wrong thing.
>
> If people want to run approximate queries, I think we should provide a
> proper way, such as adding:
>
> {code}
> X = SAMPLE 10% OF A ;
> Y = SAMPLE 100 OF B ;
> {code}
>
> What do you think?
>
> > Top K
> > -----
> >
> >                 Key: PIG-171
> >                 URL: https://issues.apache.org/jira/browse/PIG-171
> >             Project: Pig
> >          Issue Type: New Feature
> >            Reporter: Amir Youssefi
> >            Assignee: Amir Youssefi
> >
> > Frequently, users are interested in top results (especially the Top K
> > rows). This can be implemented efficiently in Pig/MapReduce settings to
> > deliver rapid results with low network bandwidth and memory usage.
> >
> > The key point is to prune all data on the map side and keep only a small
> > set of rows that meet the Top criteria. We can do this in an Algebraic
> > function (combiner) with multiple-value output. Only a small data set
> > leaves the mapper node.
> > The same idea can be applied to variants of this problem:
> >   - An Algebraic Function for 'Top K Rows'
> >   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense
> >     Rank K')
> >   - TOP K ORDER BY.
> > In other words, the implementation is similar to combiners for aggregate
> > functions, but instead of one value we get multiple ones.
> > I will add a sample implementation for Top K Rows and possibly TOP K
> > ORDER BY to clarify details.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
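
The map-side pruning described in the quoted issue can be sketched
roughly as follows. This is plain Python rather than Pig's Algebraic
interface, and `partial_top_k`, `merge_top_k` and the sample partitions
are hypothetical names and data chosen for illustration. Each partition
emits at most K rows, and merging those partial results gives the same
answer as a top K over all rows:

```python
import heapq

def partial_top_k(rows, k, key):
    """Map/combiner side: keep only the k best rows of one partition."""
    return heapq.nlargest(k, rows, key=key)

def merge_top_k(partials, k, key):
    """Reduce side: combine the small partial results into the final top k."""
    merged = (row for part in partials for row in part)
    return heapq.nlargest(k, merged, key=key)

def score(row):
    return row[1]  # rank rows by their second field

partitions = [
    [("a", 5), ("b", 9), ("c", 1)],
    [("d", 7), ("e", 3)],
    [("f", 8), ("g", 2), ("h", 6)],
]
partials = [partial_top_k(p, 2, score) for p in partitions]
result = merge_top_k(partials, 2, score)
```

At most K rows per partition cross the network, which is where the
bandwidth and memory savings come from.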


-- 
ted
