hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by JohnSichi
Date Mon, 23 Aug 2010 18:21:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GenericUDAFCaseStudy" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy?action=diff&rev1=5&rev2=6

--------------------------------------------------

  
  This tutorial walks through the development of the `histogram()` UDAF, which computes a
histogram with a fixed, user-specified number of bins, using a constant amount of memory and
time linear in the input size. It demonstrates a number of features of Generic UDAFs, such
as a complex return type (an array of structures), and type checking on the input. The assumption
is that the reader wants to write a UDAF for eventual submission to the Hive open-source project,
so steps such as modifying the function registry in Hive and writing `.q` tests are also included.
If you just want to write a UDAF and debug and deploy it locally, see [[http://wiki.apache.org/hadoop/Hive/HivePlugins|this page]].
  
- '''NOTE:''' In this tutorial, we walk through the creation of a `histogram()` function.
In upcoming releases of Hive, this will appear as the built-in function `histogram_numeric()`.
+ '''NOTE:''' In this tutorial, we walk through the creation of a `histogram()` function.
Starting with the 0.6.0 release of Hive, this appears as the built-in function `histogram_numeric()`.
  
  <<TableOfContents(3)>>
  
@@ -35, +35 @@

  
  The resolver handles type checking and operator overloading for UDAF queries. The type checking
ensures that the user isn't passing a '''double''' expression where an '''integer''' is expected,
for example, and the operator overloading allows you to have different UDAF logic for different
types of arguments. 
  
- The resolver class must extend '''org.apache.hadoop.hive.ql.udf.GenericUDAFResolver2'''.
There is also an interface org.apache.hadoop.hive.ql.udf.GenericUDAFResolver that can be implemented,
but is deprecated as of 0.6.0 release. The key difference between GenericUDAFResolver and
GenericUDAFResovler2 interface is the fact that the later allows the evaluator implementation
to access extra information regarding the function invocation such as the presence of DISTINCT
qualifier or the invocation with the wildcard syntax such as FUNCTION(*). UDAFs that implement
the deprecated GenericUDAFResolver interface will not be able to tell the difference between
an invocation such as FUNCTION() or FUNCTION(*) since the information regarding specification
of the wildcard is not available. Similarly, these implementations will also not be able to
tell the difference between FUNCTION(EXPR) vs FUNCTION(DISTINCT EXPR) since the information
regarding the presence of the DISTINCT qualifier is also not available.
+ The resolver class must implement '''org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2'''
(see [[#Resolver Interface Evolution]] for backwards compatibility information). We recommend
that you extend the AbstractGenericUDAFResolver base class in order to insulate your UDAF
from future interface changes in Hive.
  
- Note that while the resolvers which implement the GenericUDAFResolver2 interface are provided
the extra information regarding the presence of DISTINCT qualifier of invocation with the
wildcard syntax, they can choose to ignore it completely if it is of no significance to them.
The underlying data manipulation to ensure DISTINCT nature of the expression values is actually
done by the framework and not by the evaluator or resolver. For UDAF implementations that
do not care about this extra information, they could simply extend from the AbstractGenericUDAFResolver
interface which insulates the implementation from this information. It also offers an easy
way to transition previously written UDAF implementations to migrate to the new resolver interface
without having to re-write the implementation since the change from implementing GenericUDAFResolver
interface to extending AbstractGenericUDAFResolver class is fairly minimal. There may be issues
with implementations that are part of a inheritance hierarchy since it may not be easy to
change the base class.
+ Look at one of the existing UDAFs for the '''import'''s you will need.
  
- For now, we'll just assume that you're implementing the GenericUDAFResolver interface. Although
it is deprecated from 0.6.0 onwards, the core functionality is not significantly different.
Look at one of the existing UDAFs for the '''import'''s you will need.
  {{{
  #!Java
- public class GenericUDAFHistogramNumeric implements GenericUDAFResolver {
+ public class GenericUDAFHistogramNumeric extends AbstractGenericUDAFResolver {
    static final Log LOG = LogFactory.getLog(GenericUDAFHistogramNumeric.class.getName());
  
    @Override
-   public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
+   public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
      // Type-checking goes here!
  
      return new GenericUDAFHistogramNumericEvaluator();
@@ -58, +57 @@

  }
  }}}
  
- The code above shows the basic skeleton of a UDAF. The first line sets up a Log object that
you can use to write warnings and errors to be fed into the Hive log. The GenericUDAFResolver
class has a single overridden method: '''getEvaluator''', which takes an array of type information
objects as its parameters. For the histogram UDAF, we want two parameters: the numeric column
over which to compute the histogram, and the number of histogram bins requested. The very
first thing to do is to check that we have exactly two parameters (lines 2-5). Then, we check
that the first parameter is a primitive type, and not an array or map, for example (lines
8-12). However, not only do we want it to be a primitive type column, but we also want it
to be numeric, which means that we need to throw an exception if a STRING type is given (lines
13-27). BOOLEAN is excluded because the "histogram" estimation problem can be solved with
a simple COUNT() query. Lines 29-40 illustrate similar type checking for the second parameter
to the histogram() UDAF -- the number of histogram bins. In this case, we insist that the
number of histogram bins is an integer. 
+ The code above shows the basic skeleton of a UDAF. The first line sets up a Log object that
you can use to write warnings and errors to the Hive log. The resolver class overrides a single
method, '''getEvaluator''', which receives information about how the UDAF is being invoked.
Of most interest is info.getParameters(), which returns an array of type information objects
corresponding to the SQL types of the invocation parameters. For the histogram UDAF, we want
two parameters: the numeric column over which to compute the histogram, and the number of
histogram bins requested. The very first thing to do is to check that we have exactly two
parameters (lines 3-6 below). Then, we check that the first parameter has a primitive type,
and not an array or map, for example (lines 9-13). However, we want the column to be not only
primitive but also numeric, which means that we need to throw an exception if a STRING type is
given (lines 14-28). BOOLEAN is excluded because the "histogram" estimation problem can be
solved with a simple COUNT() query. Lines 30-41 illustrate similar type checking for the
second parameter to the histogram() UDAF, the number of histogram bins, which we insist
must be an integer.
  
  {{{
  #!Java
-   public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
+   public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
+     TypeInfo[] parameters = info.getParameters();
      if (parameters.length != 2) {
        throw new UDFArgumentTypeException(parameters.length - 1,
            "Please specify exactly two arguments.");
@@ -262, +262 @@

   * If you're stuck looking for an algorithm to adapt to the terminatePartial/merge paradigm,
divide-and-conquer and parallel algorithms are predictably good places to start.
   * Remember that the tests do a `diff` on the expected and actual output, and fail if there
is any difference at all. An example of where this can fail horribly is a UDAF like `ngrams()`,
where the output is a list of sorted (word,count) pairs. In some cases, different sort implementations
might place words with the same count at different positions in the output. Even though the
output is correct, the test will fail. In these cases, it's better to output (for example)
only the counts, or some appropriate statistic on the counts, like the sum.
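
As a rough illustration of the terminatePartial/merge paradigm mentioned in the first tip, here is a minimal plain-Java sketch that computes a mean from per-partition partial states. It deliberately uses no Hive classes: `MeanSketch`, `Partial`, and `aggregate` are hypothetical names invented for this example, and only the method names `iterate`, `merge`, and `terminate` are borrowed from the evaluator vocabulary. The point is that the partial state carries (sum, count) rather than the mean itself, so two partitions can be combined without revisiting the rows.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for a UDAF evaluator; hypothetical, not a Hive class. */
public class MeanSketch {
    /** Partial aggregation state: everything merge() needs to combine two partitions. */
    public static final class Partial {
        public double sum;
        public long count;
    }

    /** Folds one input row into the running state. */
    public static void iterate(Partial agg, double value) {
        agg.sum += value;
        agg.count += 1;
    }

    /** Folds another partition's partial state into this one. */
    public static void merge(Partial agg, Partial other) {
        agg.sum += other.sum;
        agg.count += other.count;
    }

    /** Produces the final result; an empty input yields NaN in this sketch. */
    public static double terminate(Partial agg) {
        return agg.count == 0 ? Double.NaN : agg.sum / agg.count;
    }

    /** Aggregates one partition on its own, as a single task would. */
    public static Partial aggregate(List<Double> rows) {
        Partial p = new Partial();
        for (double v : rows) {
            iterate(p, v);
        }
        return p;
    }

    public static void main(String[] args) {
        Partial left = aggregate(Arrays.asList(1.0, 2.0, 3.0));
        Partial right = aggregate(Arrays.asList(4.0, 5.0));
        merge(left, right);
        System.out.println(terminate(left)); // prints 3.0
    }
}
```

Merging averages directly would give the wrong answer when partitions differ in size, which is why the state keeps the sum and the count separately. The same question (what is the smallest state that can be merged?) is a useful starting point when adapting other algorithms.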
  
+ == Resolver Interface Evolution ==
+ 
+ The old interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver was deprecated as of the
0.6.0 release. The key difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces
is that the latter allows the evaluator implementation to access extra information
regarding the function invocation, such as the presence of the DISTINCT qualifier or invocation
with the wildcard syntax, as in FUNCTION(*). UDAFs that implement the deprecated GenericUDAFResolver
interface cannot tell the difference between invocations such as FUNCTION()
and FUNCTION(*), since the information about the wildcard is not available.
Similarly, these implementations cannot tell the difference between FUNCTION(EXPR)
and FUNCTION(DISTINCT EXPR), since the information about the DISTINCT qualifier
is also not available.
+ 
+ Note that while resolvers which implement the GenericUDAFResolver2 interface are provided
the extra information regarding the presence of the DISTINCT qualifier or invocation with the
wildcard syntax, they can choose to ignore it completely if it is of no significance to them.
The underlying data filtering to compute DISTINCT values is actually done by Hive's core query
processor and not by the evaluator or resolver; the information is provided to the resolver
only for validation purposes. The AbstractGenericUDAFResolver base class offers an easy way
to migrate previously written UDAF implementations to the new resolver interface
without rewriting them, since the change from implementing the GenericUDAFResolver
interface to extending the AbstractGenericUDAFResolver class is fairly minimal. (There may be
issues with implementations that are part of an inheritance hierarchy, since it may not be
easy to change the base class.)
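
To make the "validation purposes" point concrete, here is a minimal sketch of a resolver-style check that rejects invocations it does not want to support. It is self-contained plain Java: `InvocationCheckSketch`, `InvocationInfo`, and `validate` are hypothetical stand-ins, not Hive classes. The two flags are meant to mirror what GenericUDAFParameterInfo reports through isDistinct() and isAllColumns(), and the policy of rejecting DISTINCT and FUNCTION(*) is an assumption chosen only for illustration. A real resolver would throw a SemanticException instead of returning a message.

```java
/** Hypothetical stand-in for a resolver's invocation check; not a Hive class. */
public class InvocationCheckSketch {
    /** Stand-in for the invocation details a newer resolver receives. */
    public static final class InvocationInfo {
        public final int parameterCount;
        public final boolean distinct;   // true for FUNCTION(DISTINCT expr)
        public final boolean allColumns; // true for FUNCTION(*)

        public InvocationInfo(int parameterCount, boolean distinct, boolean allColumns) {
            this.parameterCount = parameterCount;
            this.distinct = distinct;
            this.allColumns = allColumns;
        }
    }

    /** Returns null when the invocation is acceptable, otherwise an error message. */
    public static String validate(InvocationInfo info) {
        if (info.allColumns) {
            return "The wildcard form FUNCTION(*) is not supported.";
        }
        if (info.distinct) {
            return "DISTINCT is not supported.";
        }
        if (info.parameterCount != 2) {
            return "Please specify exactly two arguments.";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(validate(new InvocationInfo(2, false, false)));
        System.out.println(validate(new InvocationInfo(0, false, true)));
        System.out.println(validate(new InvocationInfo(2, true, false)));
    }
}
```

A resolver built on the older interface sees only the parameter types, so the first two checks could not be written there at all.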
+ 
