Date: Tue, 15 Feb 2011 23:54:58 +0000 (UTC)
From: "John Sichi (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <1659269926.19110.1297814098084.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1882489090.16051.1297724157527.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

    [
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995066#comment-12995066 ]

John Sichi commented on HIVE-1994:
----------------------------------

If stateful is set to true, the UDF should also be treated as non-deterministic (even if the deterministic annotation explicitly returns true).

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>
> Because Hive does not yet support window functions from SQL/OLAP, people have started hacking around it by writing stateful UDFs for things like cumulative sum. An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate semantics from the existing deterministic annotation). I'm proposing the name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses such as WHERE/ON/ORDER/GROUP
> * When a stateful UDF is present in a query, there's an implication that its SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a DISTRIBUTE/CLUSTER/SORT clause, then run inside the corresponding reducer to make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting; we don't want these optimizations to cause the invocation to be skipped in a confusing way, so we should just ban it outright (which is what SQL/OLAP does for window functions).
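
A rough sketch of two points made above, in plain Java with no Hive dependency: the rule from the comment that stateful = true overrides deterministic = true, and the way AND short-circuiting can silently skip a stateful invocation. Everything here is an assumption made for illustration. The UDFType annotation is a local stand-in declared only so the sketch compiles on its own, RowSequence is a simplified counter rather than the contrib implementation, and StatefulSketch, isEffectivelyDeterministic and invocationsUnderShortCircuit are hypothetical names, not Hive APIs.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for the proposed annotation (assumption: not Hive's own class).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDFType {
    boolean deterministic() default true;
    boolean stateful() default false;
}

// Simplified row-sequence style UDF: every call returns the next number.
// It is deliberately (and wrongly) left marked deterministic.
@UDFType(deterministic = true, stateful = true)
class RowSequence {
    private long counter = 0;

    long evaluate() {
        return ++counter;
    }
}

final class StatefulSketch {
    private StatefulSketch() {}

    // Rule from the comment: a stateful UDF is treated as non-deterministic,
    // even when deterministic is explicitly true.
    static boolean isEffectivelyDeterministic(Class<?> udfClass) {
        UDFType type = udfClass.getAnnotation(UDFType.class);
        if (type == null) {
            return true; // assumption: unannotated classes use the defaults
        }
        return type.deterministic() && !type.stateful();
    }

    // Mimics evaluating `flag AND row_sequence() > 0` once per row and
    // reports how many times the stateful function actually ran.
    static long invocationsUnderShortCircuit(boolean[] flags) {
        RowSequence seq = new RowSequence();
        for (boolean flag : flags) {
            // && never calls seq.evaluate() when flag is false.
            boolean ignored = flag && seq.evaluate() > 0;
        }
        // The next value minus one equals the number of earlier calls.
        return seq.evaluate() - 1;
    }
}
```

With four input rows of which two have flag set to false, the counter advances only twice, so the sequence numbers no longer line up with the rows. That is the kind of confusing skip the ban on WHERE/ON usage is meant to rule out.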
> For the second one, I'm not entirely certain about the details since some of it is lost in the mists of Hive prehistory, but at least if we have the annotation, we'll be able to preserve backwards compatibility as we start adding new cost-based optimizations which might otherwise break it. A specific example would be inserting a materialization step (e.g. for global query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer SELECT containing the stateful UDF invocation; this could be a problem if the mappers in the second job subdivide the buckets generated by the first job. So we wouldn't do anything immediately, but the presence of the annotation will help us going forward.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira