drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6074) Corrections to UDF tutorial documentation page
Date Sun, 07 Jan 2018 08:32:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315105#comment-16315105
] 

Paul Rogers commented on DRILL-6074:
------------------------------------

Much more detailed information on UDFs can be found [here|https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information].

> Corrections to UDF tutorial documentation page
> ----------------------------------------------
>
>                 Key: DRILL-6074
>                 URL: https://issues.apache.org/jira/browse/DRILL-6074
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Documentation
>            Reporter: Paul Rogers
>            Assignee: Bridget Bevens
>            Priority: Minor
>              Labels: doc-impacting
>
> Consider the [UDF Tutorial|http://drill.apache.org/docs/tutorial-develop-a-simple-function/].
Some of the details are a bit off.
> Step 3:
> bq. The function will be generated dynamically, as you can see in the DrillSimpleFuncHolder,
and the input parameters and output holders are defined using holders by annotations. Define
the parameters using the \@Param annotation.
> Better: Drill uses your function template to in-line your function code into Drill's
own generated code. The \@Param annotation identifies the input arguments. The order of the
annotated fields indicates the order of the function parameters. Each parameter field must
be one of Drill's holder types.
> bq. Use a holder classes to provide a buffer to manage larger objects in an efficient
way: VarCharHolder or NullableVarCharHolder.
> Better: Our function template tells Drill to handle nulls, so all three of our arguments
can be declared using the VarCharHolder type.
> (Then, fix the code to use that type. The bit about larger objects is probably obsolete:
holders are the only way to work with any value: large or otherwise.)
> bq. NOTE: Drill doesn’t actually use the Java heap for data being processed in a query
but instead keeps this data off the heap and manages the life-cycle for us without using the
Java garbage collector.
> Better: NOTE: VARCHAR data is stored in direct memory. The DrillBuf object in the VarCharHolder
provides access to the data for the VARCHAR.
> (For context: simple types, such as INT, are stored on the heap when passed to a UDF,
so we don't want to make a blanket statement.)
> Step 4.
> bq. Also, using the \@Output annotation, define the returned value as VarCharHolder type.
Because you are manipulating a VarChar, you also have to inject a buffer that Drill uses for
the output.
> Better: Identify the function's return value using the \@Output annotation. Like parameters,
the output must be a holder type. Drill, however, does not provide the output buffer; we have
to request one using the \@Inject annotation. The injected field must be of type DrillBuf.
Then, in our code, we set the output holder to point to the injected buffer.
> Step 5. The code is inefficient and not a good example. Replace this:
> {code}
>     out.end = outputValue.getBytes().length;
>     buffer.setBytes(0, outputValue.getBytes());
> {code}
> With this:
> {code}
>     byte[] result = outputValue.getBytes();
>     out.end = result.length;
>     buffer.setBytes(0, result);
> {code}
> While we are at it, we might as well make another line a bit more readable.
> {code}
>     String outputValue = (new StringBuilder(maskSubString)).append(stringValue.substring(numberOfCharToReplace)).toString();
> {code}
> Should be rewritten as:
> {code}
>     String outputValue = new StringBuilder(maskSubString)
>         .append(stringValue.substring(numberOfCharToReplace))
>         .toString();
> {code}
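For readers without the tutorial source at hand, the string-building step can be sketched in plain Java. This is only an illustration: the class `MaskSketch`, the method `maskPrefix`, and the clamping of the count are assumptions made for the example, and the code is not a Drill UDF (a real UDF body cannot be structured as a static helper like this).

```java
// Plain-Java sketch of the masking logic; no Drill classes are involved.
final class MaskSketch {
    // Hypothetical helper: replaces the first `count` characters of `value` with `mask`.
    static String maskPrefix(String value, String mask, int count) {
        // Clamp the count so substring() never runs past the end of the input.
        int numberOfCharToReplace = Math.max(0, Math.min(count, value.length()));

        // Build the masked prefix by repeating the mask string.
        StringBuilder maskSubString = new StringBuilder();
        for (int i = 0; i < numberOfCharToReplace; i++) {
            maskSubString.append(mask);
        }

        // Append the untouched remainder of the input, one call per line.
        return maskSubString
            .append(value.substring(numberOfCharToReplace))
            .toString();
    }
}
```

With these assumptions, `MaskSketch.maskPrefix("abcdef", "*", 3)` yields `***def`.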
> Then in the list of steps:
> bq. Gets the number of character to replace
> The word "character" should be "characters" (plural)
> And:
> bq. Creates and populates the output buffer
> Better:
> * Copies the new string into the temporary DrillBuf
> * Sets up the output holder to point to the data in the DrillBuf
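The two steps above can be sketched outside of Drill as follows. Everything here is a stand-in chosen for illustration: `HolderStandIn` mimics a holder with `buffer`, `start` and `end` fields, a `java.nio.ByteBuffer` plays the role of the DrillBuf, and UTF-8 is assumed as the encoding. The sketch also assumes the buffer is already large enough for the encoded string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Stand-in for a VARCHAR holder: a buffer plus start/end offsets. Not a Drill class.
final class HolderStandIn {
    ByteBuffer buffer;
    int start;
    int end;
}

final class OutputSketch {
    // Copies outputValue into buffer, then points the holder at the copied bytes.
    static void writeOutput(String outputValue, ByteBuffer buffer, HolderStandIn out) {
        // Encode the string once and reuse the resulting array.
        byte[] result = outputValue.getBytes(StandardCharsets.UTF_8);

        // Step 1: copy the bytes into the temporary buffer, starting at offset 0.
        for (int i = 0; i < result.length; i++) {
            buffer.put(i, result[i]);
        }

        // Step 2: set up the holder so it describes the data now in the buffer.
        out.buffer = buffer;
        out.start = 0;
        out.end = result.length;
    }

    // Reads the holder's byte range back as a string, to check the round trip.
    static String readBack(HolderStandIn holder) {
        byte[] bytes = new byte[holder.end - holder.start];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = holder.buffer.get(holder.start + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The point of the sketch is the split of responsibilities: the buffer owns the bytes, while the holder only records where they start and end.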
> Then:
> bq. Even to a seasoned Java developer, the eval() method might look a bit strange because
Drill generates the final code on the fly to fulfill a query request. This technique leverages
Java’s just-in-time (JIT) compiler for maximum speed.
> Better: Even to a seasoned Java developer, the eval() method might look a bit strange.
It is best to think of the UDF declaration as a Domain-Specific Language (DSL) that Drill
uses to describe the function. Drill uses the declaration to in-line your function into generated
code. That is, Drill does not call your function code; instead Drill extracts the code and
copies it into Drill's own generated code.
> (Note: the bit about the JIT compiler is plain wrong. Drill's code generation has nothing
to do with Java's JIT compiler.)
> Basic Coding Rules
> bq. To leverage Java’s just-in-time (JIT) compiler for maximum speed, you need to adhere
to some basic rules.
> Better: Drill's code generation mechanism supports a restricted subset of Java, meaning
that you must adhere to some basic rules.
> bq. Do not use imports. Instead, use the fully qualified class name as required by the
Google Guava API packaged in Apache Drill and as shown in "Step 3: Declare input parameters".
> (This mixes up a couple of ideas.) Better: Do not use imports. Instead, use the fully
qualified class name.
> bq. Manipulate the ValueHolders classes, for example VarCharHolder and IntHolder, as
structs by calling helper methods, such as getStringFromVarCharHolder and toStringFromUTF8
as shown in "Step 5: Implement the eval() function".
> bq. Do not call methods such as toString because this causes serious problems.
> Better: Do not call any methods on the holder classes. The holders will be optimized
away by Drill's scalar replacement mechanism.
> Some additional restrictions:
> * All class fields (member variables) must be preceded by one of the three annotations
discussed above (\@Param, \@Output or \@Inject), or by the \@Workspace annotation which identifies
internal temporary fields. (If you omit the annotations, then queries using your function will
fail at runtime.)
> * Do not use static fields (for example, to declare constants). If you must declare constants,
declare them in a class other than the UDF class.
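The advice about constants might look like the following in plain Java. The names `MaskConstants`, `DEFAULT_MASK` and `MaskBodySketch` are hypothetical, and `MaskBodySketch` only stands in for a function body; a real UDF would be a Drill function class with annotated fields and, given the no-imports rule, would refer to the constants class by its fully qualified name.

```java
// Hypothetical holder for constants, kept outside the UDF class.
final class MaskConstants {
    static final String DEFAULT_MASK = "*";

    private MaskConstants() {
        // Not meant to be instantiated.
    }
}

// Stand-in for a function body that needs the constant. It declares no
// static fields of its own and only reads the one in MaskConstants.
final class MaskBodySketch {
    static String maskAll(String value) {
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            masked.append(MaskConstants.DEFAULT_MASK);
        }
        return masked.toString();
    }
}
```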
> Prepare the Package
> bq. Because Drill generates the source, ...
> Better: Because Drill copies your code into its own generated code, ...
> Basic Coding Rules
> Build and Deploy the Function
> Test the New Function
> The above three lines probably want to be headings; they appear as normal text.
> bq. Add the JAR files to Drill, by copying them to the following location: <Drill
installation directory>/jars/3rdparty
> Perhaps add the following: Be sure to copy the jars into the above folder each time you
rebuild, reinstall or upgrade Drill. If running in a cluster, copy the jars to the Drill installation
on every node.
> As an alternative, you can create a site directory as described (need link. Do we describe
this anywhere except in the Drill-on-YARN PR?) Copy your files into the {{$DRILL_SITE/jars}}
folder. This way, you need not remember to copy the jars each time you reinstall Drill.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
