hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by MayankLahiri
Date Mon, 28 Jun 2010 19:31:57 GMT
The "Hive/GenericUDAFCaseStudy" page has been changed by MayankLahiri.
The comment on this change is: initial version of GenericUDAF tutorial.


= Writing GenericUDAFs: A Tutorial =

User-Defined Aggregation Functions (UDAFs) are an excellent way to integrate advanced data processing
into Hive. Hive allows two varieties of UDAF: simple and generic. Simple UDAFs, as the name
implies, are rather simple to write, but incur performance penalties because of their use of
[[http://java.sun.com/docs/books/tutorial/reflect/index.html | Java Reflection]], and do not
allow features such as variable-length argument lists. Generic UDAFs allow all these features,
but are perhaps not quite as intuitive to write as simple UDAFs.

This tutorial walks through the development of the `histogram()` UDAF, which computes a histogram
with a fixed, user-specified number of bins, using a constant amount of memory and time linear
in the input size. It demonstrates a number of features of Generic UDAFs, such as a complex
return type (an array of structures), and type checking on the input. The assumption is that
the reader wants to write a UDAF for eventual submission to the Hive open-source project,
so steps such as modifying the function registry in Hive and writing `.q` tests are also included.
If you just want to write a UDAF, debug and deploy locally, see [[http://wiki.apache.org/hadoop/Hive/HivePlugins
| this page]].

'''NOTE:''' In this tutorial, we walk through the creation of a `histogram()` function. In
future releases of Hive (as of July 2010), it will appear as the built-in function `histogram_numeric()`.


== Preliminaries ==

Make sure you have the latest Hive trunk by running `svn up` in your Hive directory. More
detailed instructions on downloading and setting up Hive can be found at [[http://wiki.apache.org/hadoop/Hive/GettingStarted
| Getting Started ]]. Your local copy of Hive should run via `build/dist/bin/hive`
from the Hive root directory, and you should have some tables of data loaded into your local
instance for testing whatever UDAF you have in mind. For this example, assume that a table
called `normal` exists with a single `double` column called `val`, containing a large number
of random numbers drawn from the standard normal distribution.
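For reference, such a table could be set up along the following lines; the data file path is illustrative, and any local source of normally distributed doubles (one per line) will do:

```sql
-- Hypothetical setup for the example table used throughout this tutorial.
-- The input path below is illustrative, not part of the Hive distribution.
CREATE TABLE normal (val DOUBLE);
LOAD DATA LOCAL INPATH '/tmp/normal_samples.txt' INTO TABLE normal;
```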

The files we will be editing or creating are as follows, relative to the Hive root:

|| `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java` |||| the
main source file, to be created by you.||
|| `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java` |||| the function registry
source file, to be edited by you to register our new `histogram()` UDAF into Hive's built-in
function list.||
|| `ql/src/test/queries/clientpositive/udaf_histogram.q` |||| a file of sample queries for
testing `histogram()` on sample data, to be created by you.||
|| `ql/src/test/results/clientpositive/udaf_histogram.q.out` |||| the expected output from
your sample queries, to be created by `ant` in a later step. ||
|| `ql/src/test/results/clientpositive/show_functions.q.out` |||| the expected output from
the SHOW FUNCTIONS Hive query. Since we're adding a new `histogram()` function, this expected
output will change to reflect the new function. This file will be modified by `ant` in a later
step. ||

== Writing the source ==

As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`,
relative to the Hive root directory. See `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java`
for a detailed example of a completed Generic UDAF.
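Before wrestling with Hive's evaluator interfaces, it can help to see the core algorithm in isolation. The sketch below implements the constant-memory histogram heuristic this tutorial's function is built around: keep at most `nbins` (center, count) pairs sorted by center, and when an insertion would exceed the budget, merge the two closest adjacent bins into their weighted average. It has no Hive dependencies, and all class and method names here are illustrative, not Hive's.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal, self-contained sketch of a fixed-bin streaming histogram.
 * Memory is O(nbins) regardless of input size; time is linear in the input.
 * Names are illustrative; Hive's actual UDAF wraps logic like this inside
 * a GenericUDAFEvaluator.
 */
public class StreamingHistogram {
    private final int nbins;
    // Each element is a {center, count} pair; the list stays sorted by center.
    private final List<double[]> bins = new ArrayList<double[]>();

    public StreamingHistogram(int nbins) {
        this.nbins = nbins;
    }

    /** Add one observation, merging bins if the budget is exceeded. */
    public void add(double v) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < v) {
            i++;                                 // find the insertion point
        }
        if (i < bins.size() && bins.get(i)[0] == v) {
            bins.get(i)[1] += 1;                 // exact match: bump the count
        } else {
            bins.add(i, new double[] { v, 1 });
            if (bins.size() > nbins) {
                merge();                         // enforce the fixed bin budget
            }
        }
    }

    /** Merge the two adjacent bins whose centers are closest together. */
    private void merge() {
        int best = 0;
        double bestGap = Double.MAX_VALUE;
        for (int i = 0; i < bins.size() - 1; i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        double[] a = bins.get(best);
        double[] b = bins.get(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count;  // weighted mean of centers
        a[1] = count;
        bins.remove(best + 1);
    }

    public int numBins() {
        return bins.size();
    }

    public double totalCount() {
        double t = 0;
        for (double[] b : bins) {
            t += b[1];
        }
        return t;
    }
}
```

The same merge step is also what makes the function work as an aggregate: partial histograms computed on different mappers can be combined by inserting one histogram's bins into the other and re-merging down to `nbins`.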

== Modifying the function registry ==
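While this section is still to be written, the registration itself amounts to a single line added to `FunctionRegistry.java`, alongside the existing `registerGenericUDAF` calls already in that file. The exact surrounding code depends on your Hive revision, so treat this as a sketch:

```java
// Inside FunctionRegistry's static initializer, next to the other
// registerGenericUDAF(...) calls (exact placement varies by revision):
registerGenericUDAF("histogram", new GenericUDAFHistogram());
```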

== Creating the tests ==
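A `.q` file is simply a list of Hive queries whose output is compared against a checked-in `.q.out` file. A sketch of what `udaf_histogram.q` might contain follows; the specific queries and the bin count are illustrative, and you should exercise whatever argument combinations your function supports:

```sql
-- Illustrative contents for udaf_histogram.q; adjust to your function's semantics.
DESCRIBE FUNCTION histogram;
DESCRIBE FUNCTION EXTENDED histogram;

SELECT histogram(val, 10) FROM normal;
```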

== Compiling, testing ==
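The usual cycle, run from the Hive root of a full checkout, looks roughly like the commands below. The `-Doverwrite=true` flag records the query output as the new expected `.q.out`; inspect the generated file by hand before committing, since a passing overwrite run proves nothing by itself.

```sh
# Run from the Hive root; requires a full Hive checkout.
ant package                              # compile Hive plus the new UDAF

ant test -Dtestcase=TestCliDriver \
    -Dqfile=udaf_histogram.q \
    -Doverwrite=true                     # first run: record expected output

ant test -Dtestcase=TestCliDriver \
    -Dqfile=udaf_histogram.q             # second run: verify against it
```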

= Checklist for open source submission =

 * Create an account on the [[ https://issues.apache.org/jira/browse/HIVE | Hive JIRA ]]
and create an issue for your new patch under the `Query Processor` component. Solicit discussion
and incorporate feedback.
 * Create your UDAF, integrate it into your local Hive copy.
 * Run `ant package` from the Hive root to compile Hive and your new UDAF.
 * Create `.q` tests and their corresponding `.q.out` output.
 * Modify the function registry if adding a new function.
 * Run `ant checkstyle`, ensure that your source files conform to the coding convention.
 * Run `ant test`, ensure that tests pass.
 * Run `svn up`, ensure no conflicts with the main repository.
 * Run `svn add` for whatever new files you have created.
 * Ensure that you have added `.q` and `.q.out` tests.
 * Ensure that you have run the `.q` tests for all new functionality.
 * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
 * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where NNNN is the issue
number JIRA has assigned to you.
 * Attach your file to the JIRA issue, describe your patch in the comments section.
 * Ask for a code review in the comments.
 * Click '''Submit patch''' on your issue after you have completed the steps above.
 * It is also advisable to '''watch''' your issue to monitor new comments.
