hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by ArvindPrabhakar
Date Tue, 13 Jul 2010 23:29:21 GMT
The "Hive/GenericUDAFCaseStudy" page has been changed by ArvindPrabhakar.


  == Writing the source ==
- As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`,
relative to the Hive root directory. Please see the `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java`
for a detailed example of a UDAF.
+ This section gives a high-level outline of how to implement your own generic UDAF. For a
concrete example, look at any of the existing UDAF sources in `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/`.
+ At a high level, there are two parts to implementing a generic UDAF. The first is to write
an ''evaluator'', and the second is to create a ''resolver''. The evaluator is the actual implementation
of the generic UDAF and contains the processing logic. The resolver, on the other hand, provides
the mechanism by which the query processing framework accesses the evaluator.
+ All evaluators must extend the abstract base class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.
This class declares a few abstract methods that the extending class must implement.
These methods establish the processing semantics followed by the UDAF. Refer to the
javadocs of the abstract methods for their exact specifications.
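+ To make the idea of "processing semantics" concrete, here is a minimal, self-contained sketch of the evaluator life cycle. The class `SketchEvaluator` is a hypothetical stand-in written for this illustration: it borrows method names from GenericUDAFEvaluator (`iterate`, `terminatePartial`, `merge`, `terminate`) but is not the Hive class, and it leaves out the ObjectInspector and evaluation-mode handling that a real evaluator needs. `SketchSumEvaluator` and `SketchSumDemo` are likewise invented names.

```java
// Hypothetical, simplified stand-in for the evaluator contract. NOT the Hive class:
// the real GenericUDAFEvaluator also deals with ObjectInspectors and evaluation modes.
abstract class SketchEvaluator {
    /** Creates an empty aggregation state for one group. */
    abstract long[] newBuffer();
    /** Folds one raw input row into the buffer (map side). */
    abstract void iterate(long[] buffer, Object[] row);
    /** Emits a partial result that can be handed to another task. */
    abstract Object terminatePartial(long[] buffer);
    /** Folds a partial result produced elsewhere into the buffer (reduce side). */
    abstract void merge(long[] buffer, Object partial);
    /** Produces the final value for the group. */
    abstract Object terminate(long[] buffer);
}

/** Example aggregate: the sum of all non-null numeric values. */
class SketchSumEvaluator extends SketchEvaluator {
    @Override long[] newBuffer() { return new long[] {0L}; }

    @Override void iterate(long[] buffer, Object[] row) {
        if (row[0] != null) {
            buffer[0] += ((Number) row[0]).longValue();
        }
    }

    @Override Object terminatePartial(long[] buffer) { return Long.valueOf(buffer[0]); }

    @Override void merge(long[] buffer, Object partial) {
        if (partial != null) {
            buffer[0] += ((Long) partial).longValue();
        }
    }

    @Override Object terminate(long[] buffer) { return Long.valueOf(buffer[0]); }
}

class SketchSumDemo {
    /** Simulates two map tasks whose partial results are merged by one reducer. */
    static long run(Object[][] split1, Object[][] split2) {
        SketchEvaluator eval = new SketchSumEvaluator();

        long[] map1 = eval.newBuffer();
        for (Object[] row : split1) eval.iterate(map1, row);

        long[] map2 = eval.newBuffer();
        for (Object[] row : split2) eval.iterate(map2, row);

        long[] reduce = eval.newBuffer();
        eval.merge(reduce, eval.terminatePartial(map1));
        eval.merge(reduce, eval.terminatePartial(map2));
        return (Long) eval.terminate(reduce);
    }

    public static void main(String[] args) {
        Object[][] a = { {1L}, {2L}, {null} };
        Object[][] b = { {10L} };
        System.out.println(run(a, b));
    }
}
```

+ The point of the split between `iterate`/`terminatePartial` and `merge`/`terminate` is that the same evaluator can run on raw rows in one task and on partial results in another. The javadocs remain the authority on the exact contract.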
+ The resolver is implemented either by implementing the interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2
or by extending the abstract class org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.
There is also an interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver that can be implemented,
but it is deprecated as of the 0.6.0 release. The key difference between the GenericUDAFResolver and
GenericUDAFResolver2 interfaces is that the latter gives the implementation access to
extra information about the function invocation, such as the presence of the DISTINCT
qualifier or invocation with the wildcard syntax, as in FUNCTION(*). Resolvers that
implement the deprecated GenericUDAFResolver interface cannot tell an invocation such as
FUNCTION() from FUNCTION(*), since the information about the wildcard is not available
to them. Similarly, they cannot tell FUNCTION(EXPR) from FUNCTION(DISTINCT EXPR), since the
information about the DISTINCT qualifier is not available either.
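+ The difference between the two resolver interfaces can be sketched as follows. All types here (`SketchParameterInfo`, `SketchInvocation`, `SketchLegacyResolver`, `SketchResolver2`, `SketchCountResolver`) are hypothetical stand-ins invented for this illustration. The accessor names `isDistinct()` and `isAllColumns()` are assumed to mirror Hive's parameter-info object, parameter types are reduced to plain strings, and the "evaluator" returned is just a descriptive label.

```java
// Hypothetical stand-in for the invocation details passed to a Resolver2-style resolver.
interface SketchParameterInfo {
    String[] getParameterTypes();
    boolean isDistinct();    // true for FUNCTION(DISTINCT EXPR)
    boolean isAllColumns();  // true for FUNCTION(*)
}

final class SketchInvocation implements SketchParameterInfo {
    private final String[] types;
    private final boolean distinct;
    private final boolean allColumns;

    SketchInvocation(String[] types, boolean distinct, boolean allColumns) {
        this.types = types;
        this.distinct = distinct;
        this.allColumns = allColumns;
    }

    @Override public String[] getParameterTypes() { return types; }
    @Override public boolean isDistinct() { return distinct; }
    @Override public boolean isAllColumns() { return allColumns; }
}

// Stand-in for the deprecated interface: only the argument types are visible.
interface SketchLegacyResolver {
    String getEvaluator(String[] parameterTypes);
}

// Stand-in for the newer interface: the full invocation details are visible.
interface SketchResolver2 extends SketchLegacyResolver {
    String getEvaluator(SketchParameterInfo info);
}

class SketchCountResolver implements SketchResolver2 {
    @Override public String getEvaluator(String[] parameterTypes) {
        // FUNCTION() and FUNCTION(*) both arrive here as an empty type list,
        // so this overload has no way to tell them apart.
        return parameterTypes.length == 0 ? "count rows" : "count non-null values";
    }

    @Override public String getEvaluator(SketchParameterInfo info) {
        if (info.getParameterTypes().length == 0 && !info.isAllColumns()) {
            throw new IllegalArgumentException("FUNCTION() needs an argument or *");
        }
        String base = getEvaluator(info.getParameterTypes());
        return info.isDistinct() ? base + " (DISTINCT requested)" : base;
    }
}

class SketchCountDemo {
    public static void main(String[] args) {
        SketchCountResolver r = new SketchCountResolver();
        System.out.println(r.getEvaluator(new SketchInvocation(new String[0], false, true)));
        System.out.println(r.getEvaluator(new SketchInvocation(new String[] {"int"}, true, false)));
    }
}
```

+ In this sketch only the overload that receives `SketchParameterInfo` can reject a bare FUNCTION() while still accepting FUNCTION(*); the legacy overload treats the two identically.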
+ Note that while resolvers that implement the GenericUDAFResolver2 interface are given
the extra information about the presence of the DISTINCT qualifier or invocation with the
wildcard syntax, they can ignore it completely if it is of no significance to them.
The underlying data manipulation that ensures the DISTINCT nature of the expression values is
done by the framework, not by the evaluator or resolver. UDAF implementations that
do not care about this extra information can simply extend the AbstractGenericUDAFResolver
class, which insulates the implementation from it. This also offers an easy
way to migrate previously written UDAF implementations to the new resolver interface
without rewriting them, since the change from implementing the GenericUDAFResolver
interface to extending the AbstractGenericUDAFResolver class is fairly minimal. Implementations
that are already part of an inheritance hierarchy may be harder to migrate, since it may not
be easy to change their base class.
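+ The migration path described above can be pictured with a small sketch. Everything in it is a hypothetical stand-in invented for illustration (`SketchInfo`, `SketchCall`, `SketchOldResolver`, `SketchNewResolver`, `SketchAbstractResolver`, `SketchHistogramResolver`). It assumes the abstract base class works by forwarding the richer call to the older type-only method; the actual AbstractGenericUDAFResolver may behave differently in its details, so check its source before relying on this.

```java
// Hypothetical stand-ins; parameter types are plain strings and the
// "evaluator" is a descriptive label rather than a real evaluator object.
interface SketchInfo {
    String[] getParameterTypes();
    boolean isDistinct();
    boolean isAllColumns();
}

final class SketchCall implements SketchInfo {
    private final String[] types;
    private final boolean distinct;

    SketchCall(String[] types, boolean distinct) {
        this.types = types;
        this.distinct = distinct;
    }

    @Override public String[] getParameterTypes() { return types; }
    @Override public boolean isDistinct() { return distinct; }
    @Override public boolean isAllColumns() { return false; }
}

interface SketchOldResolver {
    String getEvaluator(String[] parameterTypes);
}

interface SketchNewResolver extends SketchOldResolver {
    String getEvaluator(SketchInfo info);
}

// Assumed behaviour of the insulating base class: drop the extra details
// and forward only the parameter types to the older method.
abstract class SketchAbstractResolver implements SketchNewResolver {
    @Override public String getEvaluator(SketchInfo info) {
        return getEvaluator(info.getParameterTypes());
    }
}

// Before migration this was "implements SketchOldResolver";
// the method body is unchanged, only the class declaration differs.
class SketchHistogramResolver extends SketchAbstractResolver {
    @Override public String getEvaluator(String[] parameterTypes) {
        if (parameterTypes.length != 1) {
            throw new IllegalArgumentException("Exactly one argument is expected");
        }
        return "histogram over " + parameterTypes[0];
    }
}

class SketchMigrationDemo {
    public static void main(String[] args) {
        SketchNewResolver r = new SketchHistogramResolver();
        // The DISTINCT flag is ignored by this resolver; the label is the same either way.
        System.out.println(r.getEvaluator(new SketchCall(new String[] {"double"}, true)));
    }
}
```

+ The sketch also shows the limitation mentioned above: because Java allows only one superclass, a resolver that already extends some other class cannot switch to the abstract base class this way and would have to implement the newer interface directly.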
  == Modifying the function registry ==
