hive-dev mailing list archives

From "John Sichi (JIRA)" <j...@apache.org>
Subject [jira] Created: (HIVE-1994) Support new annotation @UDFType(stateful = true)
Date Mon, 14 Feb 2011 22:55:57 GMT
Support new annotation @UDFType(stateful = true)
------------------------------------------------

                 Key: HIVE-1994
                 URL: https://issues.apache.org/jira/browse/HIVE-1994
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor, UDF
            Reporter: John Sichi
            Assignee: John Sichi


Because Hive does not yet support window functions from SQL/OLAP, people have started hacking
around it by writing stateful UDFs for things like cumulative sum.  An example is row_sequence
in contrib.

To clearly mark these, I think we should add a new annotation (with separate semantics from
the existing deterministic annotation).  I'm proposing the name stateful for lack of a better
idea, but I'm open to suggestions.
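
To make the proposal concrete, here is a minimal sketch of what such an annotation and a row_sequence-style UDF might look like. It is deliberately free of Hive dependencies: UDFTypeSketch, RowSequenceSketch and StatefulUdfDemo are hypothetical names invented for this illustration, not Hive classes, and the real annotation may end up shaped differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for the proposed annotation (not Hive's actual class).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDFTypeSketch {
    // Mirrors the existing deterministic flag.
    boolean deterministic() default true;

    // The proposed new flag, with semantics separate from deterministic.
    boolean stateful() default false;
}

// A stateful function in the spirit of row_sequence: every call returns
// the next number, so the result depends on how often it was invoked.
@UDFTypeSketch(stateful = true)
class RowSequenceSketch {
    private long next = 0;

    long evaluate() {
        return ++next;
    }
}

class StatefulUdfDemo {
    // How a planner could ask whether a function class is marked stateful.
    static boolean isStateful(Class<?> udfClass) {
        UDFTypeSketch type = udfClass.getAnnotation(UDFTypeSketch.class);
        return type != null && type.stateful();
    }

    public static void main(String[] args) {
        RowSequenceSketch seq = new RowSequenceSketch();
        System.out.println(seq.evaluate());
        System.out.println(seq.evaluate());
        System.out.println(isStateful(RowSequenceSketch.class));
    }
}
```

The runtime retention is what lets a planner discover the flag by reflection; whether stateful should also imply non-deterministic is left open here.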

The semantics are as follows:

* A stateful UDF can only be used in the SELECT list, not in other clauses such as WHERE/ON/ORDER/GROUP.
* When a stateful UDF is present in a query, there's an implication that its SELECT needs
to be treated as similar to TRANSFORM, i.e. when there's a DISTRIBUTE/CLUSTER/SORT clause, the SELECT
must run inside the corresponding reducer to make sure that the results are as expected.
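
The first rule above could be enforced with a check along these lines. This is only a sketch: ClauseSketch and StatefulUsageCheck are hypothetical names, and a real implementation would hook into Hive's semantic analysis rather than take a boolean flag.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the query clauses named in the proposal.
enum ClauseSketch { SELECT, WHERE, ON, ORDER_BY, GROUP_BY }

class StatefulUsageCheck {
    // Under the proposed rule, only the SELECT list may hold a stateful UDF.
    private static final Set<ClauseSketch> ALLOWED = EnumSet.of(ClauseSketch.SELECT);

    // Throws if a stateful UDF appears in a clause other than SELECT.
    static void check(String udfName, boolean stateful, ClauseSketch clause) {
        if (stateful && !ALLOWED.contains(clause)) {
            throw new IllegalArgumentException(
                "Stateful UDF " + udfName + " is not allowed in " + clause);
        }
    }

    public static void main(String[] args) {
        check("row_sequence", true, ClauseSketch.SELECT); // accepted
        try {
            check("row_sequence", true, ClauseSketch.WHERE);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```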

For the first one, an example of why we need this is AND/OR short-circuiting; we don't want
these optimizations to cause the invocation to be skipped in a confusing way, so we should
just ban it outright (which is what SQL/OLAP does for window functions).
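
The short-circuiting concern can be reproduced in plain Java, whose && operator also skips its right-hand side when the left-hand side is false. In this sketch (ShortCircuitSketch is a hypothetical name, and the predicate is made up for illustration) the counter advances only for rows where the first operand is true, so the sequence numbers no longer correspond to row positions.

```java
class ShortCircuitSketch {
    private long next = 0;

    // Stand-in for a stateful UDF such as row_sequence.
    long rowSequence() {
        return ++next;
    }

    // Evaluates "flag AND row_sequence() > 0" for every row and reports
    // how many times the stateful function was actually invoked.
    static long callsMade(boolean[] flags) {
        ShortCircuitSketch udf = new ShortCircuitSketch();
        for (boolean flag : flags) {
            // The right-hand side is skipped whenever flag is false.
            boolean keep = flag && udf.rowSequence() > 0;
        }
        return udf.next;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false, true, false};
        System.out.println(callsMade(flags) + " calls for " + flags.length + " rows");
    }
}
```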

For the second one, I'm not entirely certain about the details since some of it is lost in
the mists of Hive prehistory, but at least if we have the annotation, we'll be able to preserve
backwards compatibility as we start adding new cost-based optimizations which might otherwise
break it.  A specific example would be inserting a materialization step (e.g. for global query
optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer SELECT containing the stateful
UDF invocation; this could be a problem if the mappers in the second job subdivide the buckets
generated by the first job.  So we wouldn't do anything immediately, but the presence of the
annotation will help us going forward.
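
A toy illustration of the bucket-subdivision problem, assuming a cumulative-sum UDF whose state starts fresh in each task. BucketSplitSketch and the sample values are made up for this sketch; it models tasks as plain method calls, not as real Hive jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BucketSplitSketch {
    // Runs a cumulative sum over the rows one task sees, starting from
    // fresh state, as a newly instantiated stateful UDF would.
    static List<Long> runTask(List<Long> rows) {
        long total = 0;
        List<Long> out = new ArrayList<>();
        for (long value : rows) {
            total += value;
            out.add(total);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> bucket = Arrays.asList(1L, 2L, 3L, 4L);

        // One task sees the whole bucket.
        System.out.println("whole bucket: " + runTask(bucket));

        // Two tasks each see half of the same bucket and restart the sum.
        List<Long> split = new ArrayList<>(runTask(bucket.subList(0, 2)));
        split.addAll(runTask(bucket.subList(2, 4)));
        System.out.println("subdivided:   " + split);
    }
}
```

When the bucket is processed whole, the running total carries across all four rows; when it is subdivided, the second task restarts from zero, which is the kind of silent change in results the annotation would let the optimizer avoid.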


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message