Date: Fri, 20 Apr 2012 13:35:17 -0400 (EDT)
From: Mark Grover
To: user@hive.apache.org
Subject: Re: Lifecycle and Configuration of a hive UDF

Hi Ranjan and Justin,

As I understand it, a UDF operates on only one row of data at a time. It can therefore run entirely on the map side, with no reducer involved. Depending on where you store the result of the query, though, the query may still have reducers that do some work. A simple query like the one Ranjan mentioned,

    select MyUDF(field1,field2) from table;

should have the UDF's evaluate() method (the per-row method Ranjan refers to as execute()) called in the map phase.

Now to Justin's question: the rank function (http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx) seems to have a syntax like:

    RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )

The rank function works on a collection of rows (distributed by some column, the same one you would use in your partition_by_clause in MS SQL). You can accomplish that using a UDAF (read more about them at https://cwiki.apache.org/Hive/genericudafcasestudy.html) or by writing a custom reducer (read about that at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform). I don't think rank can be done using a UDF.

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
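
To make the custom-reducer suggestion concrete, here is a minimal sketch of a streaming script that could be plugged in through Hive's TRANSFORM clause. It is only an illustration under stated assumptions, not a tested production reducer: the file name rank.py, the function name rank_rows, and the two-column layout (partition key, then ordering value) are all made up for the example. It assumes Hive hands the script tab-separated rows that are already grouped by the key and sorted by the value within each key, and it follows the SQL RANK convention that ties share a rank and leave a gap afterwards.

```python
import sys
from typing import Iterable, Iterator


def rank_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'key<TAB>value<TAB>rank' for each input row.

    Assumes rows arrive grouped by key and already sorted by value
    within each key (hypothetical two-column layout: key, value).
    """
    prev_key = None
    prev_value = None
    row_number = 0  # position of the current row within its key group
    rank = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        new_group = key != prev_key
        if new_group:
            row_number = 0  # restart numbering for a new partition key
        row_number += 1
        # Ties keep the previous rank; otherwise the rank jumps to the
        # row number, which leaves gaps after ties as SQL RANK does.
        if new_group or value != prev_value:
            rank = row_number
        prev_key, prev_value = key, value
        yield f"{key}\t{value}\t{rank}"


if __name__ == "__main__":
    # Hive streams rows to the script on stdin and reads results from stdout.
    for out in rank_rows(sys.stdin):
        print(out)
```

On the Hive side, the script would be registered with ADD FILE and invoked with SELECT TRANSFORM(...) USING 'python rank.py' over a subquery that uses DISTRIBUTE BY on the partition column and SORT BY on the partition and ordering columns, so that every row for a given key reaches the same reducer in order. The script compares values as strings for equality only, so the sorting itself has to be done by Hive.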
----- Original Message -----
From: "Justin Coffey"
To: user@hive.apache.org
Sent: Thursday, April 19, 2012 10:29:11 AM
Subject: Re: Lifecycle and Configuration of a hive UDF

Hello All,

I second this question. I have a MS SQL "rank" function which I would like to run; the results it gives appear to suggest it is executed mapper side as opposed to reducer side, even when run with "cluster by" constraints.

-Justin

On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi <ranjan@powerreviews.com> wrote:

Hi,

What's the lifecycle of a hive udf? If I call

    select MyUDF(field1,field2) from table;

then MyUDF is instantiated once per mapper, and within each mapper execute(field1, field2) is called for each reducer? I hope this is the case, but I can't find anything about this in the documentation.

So I'd like to have some run-time configuration of my UDF: I'm curious how people do this. Is there a way I can send it a value or have it access a file, etc.? How about performing a query against the hive store?

Thanks,
Ranjan

--
jqcoffey@gmail.com