Return-Path: Delivered-To: apmail-incubator-pig-dev-archive@locus.apache.org Received: (qmail 67544 invoked from network); 21 May 2008 23:32:17 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 21 May 2008 23:32:17 -0000 Received: (qmail 31713 invoked by uid 500); 21 May 2008 23:32:18 -0000 Delivered-To: apmail-incubator-pig-dev-archive@incubator.apache.org Received: (qmail 31691 invoked by uid 500); 21 May 2008 23:32:18 -0000 Mailing-List: contact pig-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: pig-dev@incubator.apache.org Delivered-To: mailing list pig-dev@incubator.apache.org Received: (qmail 31680 invoked by uid 99); 21 May 2008 23:32:18 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 May 2008 16:32:18 -0700 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 May 2008 23:31:40 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id F3F68234C120 for ; Wed, 21 May 2008 16:31:55 -0700 (PDT) Message-ID: <1804044932.1211412715998.JavaMail.jira@brutus> Date: Wed, 21 May 2008 16:31:55 -0700 (PDT) From: "Amir Youssefi (JIRA)" To: pig-dev@incubator.apache.org Subject: [jira] Commented: (PIG-171) Top K In-Reply-To: <1255317822.1206676824297.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598862#action_12598862 ] Amir Youssefi commented on PIG-171: ----------------------------------- As Pi Song mentioned LIMIT n,m (or LIMIT n TO m) doesn't show many optimization opportunities on map side. We have to send all top m rows from mapper to reducer. Now consider we have those sent to one reducer. Next step is to run BOTTOM (m-n+1) function over results of maps already sent to reducer. We can have a BOTTOM AWARE MERGER (which is reverse of results on one node and we are done. I don't recommend making it more complicated but if somebody is concerned about chocking one reducer with large m row data-sets we can send results of the first step to multiple reducers and then running a second Map Reduce job with TOP (m-n+1) with comparator of -1 * MY_COMPARATOR() and single reducer to get the final results. I suggest we think about LIMIT n,m as extension of LIMIT n and do that later. > Top K > ----- > > Key: PIG-171 > URL: https://issues.apache.org/jira/browse/PIG-171 > Project: Pig > Issue Type: New Feature > Reporter: Amir Youssefi > Assignee: Amir Youssefi > > Frequently, users are interested on Top results (especially Top K rows) . This can be implemented efficiently in Pig /Map Reduce settings to deliver rapid results and low Network Bandwidth/Memory usage. > > Key point is to prune all data on the map side and keep only small set of rows with Top criteria . We can do it in Algebraic function (combiner) with multiple value output. Only a small data-set gets out of mapper node. > The same idea is applicable to solve variants of this problem: > - An Algebraic Function for 'Top K Rows' > - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K') > - TOP K ORDER BY. > Another words implementation is similar to combiners for aggregate functions but instead of one value we get multiple ones. > I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.