hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by MayankLahiri
Date Thu, 19 Aug 2010 18:57:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GenericUDAFCaseStudy" page has been changed by MayankLahiri.
The comment on this change is: finished the GenericUDAF writing case study.


+ What do all these functions do? The following is a brief summary of each function, in (roughly)
chronological order of being called. It's '''very''' important to remember that the computation
of your aggregation must be arbitrarily divisible over the data. Think of it like writing
a divide-and-conquer algorithm where the partitioning of the data is completely out of your
control and handled by Hive. More formally, given any subset of the input rows, you should
be able to compute a partial result, and also be able to merge any pair of partial results
into another partial result. This naturally makes it difficult to port over many existing
algorithms, but should guarantee researchers jobs for quite some time.
+ || '''Function''' || '''Purpose''' ||
+ || init || Called by Hive to initialize an instance of your UDAF evaluator class. ||
+ || getNewAggregationBuffer || Return an object that will be used to store temporary aggregation
results. ||
+ || iterate || Process a new row of data into the aggregation buffer ||
+ || terminatePartial || Return the contents of the current aggregation in a serializable
way ||
+ || merge || Merge a partial aggregation returned by '''terminatePartial''' into the current
aggregation ||
+ || terminate || Return the final result of the aggregation to Hive ||
+ For writing the `histogram()` function, the following is the strategy that was adopted.
+ ==== getNewAggregationBuffer ====
+ The aggregation buffer for a histogram is a list of (x,y) pairs that represent the histogram's
bin centers and heights. In addition, the aggregation buffer also stores two integers with
the maximum number of bins (a user-specified parameter), and the current number of bins used.
The aggregation buffer is initialized to a 'not ready' state with the number of bins set to
0. This is because Hive makes no distinction between a constant parameter supplied to a UDAF
and a column from a table; thus, we have no way of knowing how many bins the user wants in
their histogram until the first call to `iterate()`.
+ ==== iterate ====
+ The first thing we do in `iterate()` is to check whether the histogram object in our aggregation
buffer is initialized. If it is not, we parse our the second argument to `iterate()`, which
is the number of histogram bins requested by the user. We do this exactly once and initialize
the histogram object. Note that error checking is performed here -- if the user supplied a
negative number or zero for the number of histogram bins, a `HiveException` is thrown at this
point and computation terminates.
+ Next, we parse out the actual input data item (a number) and add it to our histogram estimation
in the aggregation buffer. See the `GenericUDAFHistogramNumeric.java` file for details on
the heuristic used to construct a histogram.
+ ==== terminatePartial ====
+ The current histogram approximation is serialized as a list of `DoubleWritable` objects.
The first two doubles in the list indicate the maximum number of histogram bins specified
by the user and number of bins current used. The remaining entries are (x,y) pairs from the
current histogram approximation.
+ ==== merge ====
+ At this point, we have a (possibly uninitialized) histogram estimation, and have been requested
to merge it with another estimation performed on a separate subset of the rows. If '''N'''
is the number of histogram bins specified by the user, the current heuristic first builds
a histogram with all '''2N''' bins from both estimations, and then iteratively merges the
closest pair of bins until only '''N''' bins remain.
+ ==== terminate ====
+ The final return type from the `histogram()` function is an array of (x,y) pairs representing
histogram bin centers and heights. These can be `explode()`ed into a separate table, or parsed
using a script and passed to Gnuplot (for example) to visualize the histogram.
  == Modifying the function registry ==
+ Once the code for the UDAF has been written and the source file placed in `ql/src/java/org/apache/hadoop/hive/ql/udf/generic`,
it's time to modify the function registry and incorporate the new function into Hive's list
of functions. This simply involves editing `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java`
to import your UDAF class and register it's name.
+ Please note that you will have to run the following command to update the output of the
`show functions` Hive call:
+ {{{ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true}}}
+ == Compiling and running ==
+ {{{
+ ant package
+ build/dist/bin/hive
+ }}}
  == Creating the tests ==
- == Compiling, testing ==
+ System-level tests consist of writing some sample queries that operate on sample data, generating
the expected output from the queries, and making sure that things don't break in the future
in terms of expected output. Note that the expected output is passed through `diff` with the
actual output from Hive, so nondeterministic algorithms will have to compute some sort of
statistic and then only keep the most significant digits (for example).
+ These are the simple steps needed for creating test cases for your new UDAF/UDF:
+  1. Create a file in `ql/src/test/queries/clientpositive/udaf_XXXXX.q` where `XXXXX` is
your UDAF's name.
+  2. Put some queries in the `.q` file -- hopefully enough to cover the full range of functionality
and special cases.
+  3. For sample data, put your own in `hive/data/files` and load it using `LOAD DATA LOCAL
INPATH...`, or reuse one of the files already there (grep for LOAD in the queries directory
to see table names).
+  4. `touch ql/src/test/results/clientpositive/udaf_XXXX.q.out`
+  5. Run the following command to generate the output into the `.q.out` result file.
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q -Doverwrite=true
+ }}}
+  6. Run the following command to make sure your test runs fine.
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q
+ }}}
  = Checklist for open source submission =
@@ -178, +240 @@

   * Run `ant package` from the Hive root to compile Hive and your new UDAF.
   * Create `.q` tests and their corresponding `.q.out` output.
   * Modify the function registry if adding a new function.
-  * Run `ant checkstyle`, ensure that your source files conform to the coding convention.
+  * Run `ant checkstyle` and examine `build/checkstyle/checkstyle-errors.html`, ensure that
your source files conform to the Sun Java coding convention (with the 100 character line length
   * Run `ant test`, ensure that tests pass.
   * Run `svn up`, ensure no conflicts with the main repository.
   * Run `svn add` for whatever new files you have created.
   * Ensure that you have added `.q` and `.q.out` tests.
   * Ensure that you have run the `.q` tests for all new functionality.
-  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
+  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated. Run `ant test
-Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true` to do this.
   * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where NNNN is the
issue number the JIRA has assigned to you.
   * Attach your file to the JIRA issue, describe your patch in the comments section.
   * Ask for a code review in the comments.
   * Click '''Submit patch''' on your issue after you have completed the steps above.
   * It is also advisable to '''watch''' your issue to monitor new comments.
+ = Tips, Tricks, Best Practices =
+  * Hive can have unexpected behavior sometimes. It is best to first run `ant clean` if you're
seeing something weird, ranging from unexplained exceptions to strings being incorrectly double-quoted.
+  * When serializing the aggregation buffer in a `terminatePartial()` call, if your UDAF
only uses a few variables to represent the buffer (such as average), consider serializing
them into a list of doubles, for example, instead of complicated named structures.
+  * Strongly cast generics wherever you can.
+  * Abstract core functionality from multiple UDAFs into its own class. Examples are `histogram_numeric()`
and `percentile_approx()`, which both use the same core histogram estimation functionality.
+  * If you're stuck looking for an algorithm to adapt to the terminatePartial/merge paradigm,
divide-and-conquer and parallel algorithms are predictably good places to start.
+  * Remember that the tests do a `diff` on the expected and actual output, and fail if there
is any difference at all. An example of where this can fail horribly is a UDAF like `ngrams()`,
where the output is a list of sorted (word,count) pairs. In some cases, different sort implementations
might place words with the same count at different positions in the output. Even though the
output is correct, the test will fail. In these cases, it's better to output (for example)
only the counts, or some appropriate statistic on the counts, like the sum.

View raw message