drill-dev mailing list archives

From Jim Bates <jba...@maprtech.com>
Subject Re: Some questions on UDFs
Date Sun, 05 Jul 2015 01:40:42 GMT
I did get a new RepeatedBigIntHolder built and added a BigIntVector to it.
I'll try it in the UDF tomorrow and see if there is a difference between
the ways I found to get a BufferAllocator.

.
.
.
@Inject DrillBuf buffer;
@Workspace RepeatedBigIntHolder yList;
.
.
.
@Override
public void setup() {
.
.
.
// org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
org.apache.drill.exec.memory.BufferAllocator allocator =
    new org.apache.drill.exec.memory.TopLevelAllocator();
yList = new RepeatedBigIntHolder();
yList.vector = new org.apache.drill.exec.vector.BigIntVector(
    org.apache.drill.exec.record.MaterializedField.create(
        new org.apache.drill.common.expression.SchemaPath(
            "bigints", org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
        org.apache.drill.common.types.Types.optional(
            org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
    allocator);
.
.
.
}



On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jbates@maprtech.com> wrote:

> I still have issues finding the correct way to create and use a
> RepeatedHolder, and Writers are a non-starter for Workspace values. I can
> make do with creating a concatenated string in a VarCharHolder for small
> data sets to get past this in the short term and finish testing the output
> values I expect, but I won't be able to do anything at scale until I figure
> out how to make a repeated list.
>
> On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jbates@maprtech.com> wrote:
>
>> Well... Converting from string to integers anyway... Too many 4th of July
>> hot dogs. Going into nitrate overload. :)
>>
>> I am pulling an array of string values from json data. The string values
>> are actually integers. I am converting to integers and summing each
>> array entry to the final tally.
>>
>> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jbates@maprtech.com> wrote:
>>
>>> Ted,
>>>
>>> Yes, I started out just getting a basic count to work. I am trying to
>>> keep the workflow as close to a basic user as possible. As such, I am
>>> building and using the MapR Apache Drill sandbox to test.
>>>
>>>
>>>    1. Always look at the drillbit.log file to see if Drill had any
>>>    issues loading your UDF. That was where I learned that all workspace
>>>    values need to be holders:
>>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
>>>       function class
>>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1, field
>>>       xList. Aggregate function 'MyLinearRegression1' workspace variable 'xList'
>>>       is of type 'interface
>>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>>>       Please change it to Holder type.
>>>    2. Error messages:
>>>       - If you get an error in this format, it means that Drill cannot
>>>       find your function, so it probably didn't load it. Go back to step 1:
>>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>>>          match found for function signature MyFunctionName(<ANY>)
>>>       - If you get an error in this format, it means that the function
>>>       is there but Drill could not find a signature that matched the
>>>       parameter types or the number of parameters you were passing it. The
>>>       exact wording will change, but "Missing function implementation" is
>>>       the key phrase to look for:
>>>          - Error: SYSTEM ERROR:
>>>          org.apache.drill.exec.exception.SchemaChangeException: Failure while
>>>          trying to materialize incoming schema.  Errors:
>>>          - Error in expression at index -1.  Error: Missing function
>>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full expression: --UNKNOWN
>>>          EXPRESSION--
>>>    3. In your function definition for aggregate functions you need
>>>    to set null processing to internal and isRandom to false. Example
>>>    below:
>>>       - @FunctionTemplate(name = "MyFunctionName", scope =
>>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>>>       isBinaryCommutative = false, costCategory =
>>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
>>>
>>> Below is an example from the Apache Drill tutorial data sets contained
>>> in the MapR Apache Drill sandbox. I am pulling an array of string values
>>> from json data. The string values are actually integers. I am converting to
>>> string and summing each array entry to the final tally. This in no way
>>> represents what this data was for, but it did become a handy way for me to
>>> peck out the "correct" way to build an aggregation UDF function.
>>>
>>> @FunctionTemplate(name = "MyArraySum",
>>>     scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
>>>     nulls = FunctionTemplate.NullHandling.INTERNAL,
>>>     isRandom = false,
>>>     isBinaryCommutative = false,
>>>     costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
>>> public static class MyArraySum implements DrillAggFunc {
>>>
>>>   @Param RepeatedVarCharHolder listToSearch;
>>>   @Workspace NullableBigIntHolder count;
>>>   @Workspace NullableBigIntHolder sum;
>>>   @Workspace NullableVarCharHolder vc;
>>>   @Output BigIntHolder out;
>>>
>>>   @Override
>>>   public void setup() {
>>>     count.value = 0;
>>>     sum.value = 0;
>>>   }
>>>
>>>   @Override
>>>   public void add() {
>>>     // Number of elements in the current row's array.
>>>     int c = listToSearch.end - listToSearch.start;
>>>     int val = 0;
>>>     try {
>>>       for (int i = 0; i < c; i++) {
>>>         // start/end are offsets into the shared vector, so index from start.
>>>         listToSearch.vector.getAccessor().get(listToSearch.start + i, vc);
>>>         String inputStr =
>>>             org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
>>>                 vc.start, vc.end, vc.buffer);
>>>         val = Integer.parseInt(inputStr);
>>>         sum.value = sum.value + val;
>>>       }
>>>     } catch (Exception e) {
>>>       // A value that fails to parse ends the loop; the rest of this array is skipped.
>>>       val = 0;
>>>     }
>>>     count.value = count.value + 1;
>>>   }
>>>
>>> Example select statement:
>>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
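
The start and end fields read in add() above can be understood as offsets into one shared vector of values, not as a per-row array that begins at 0. Below is a minimal plain-Java sketch of that indexing. RepeatedSlice and ArraySumSketch are hypothetical stand-ins for illustration, not Drill classes, and a String[] stands in for the VarChar vector. Skipping unparsable values is also an assumption, not what the UDF above does.

```java
/** Hypothetical stand-in for a repeated holder: a window [start, end) over a shared array. */
final class RepeatedSlice {
    final String[] vector; // all rows' values stored back to back
    final int start;       // index of this row's first value
    final int end;         // one past this row's last value

    RepeatedSlice(String[] vector, int start, int end) {
        this.vector = vector;
        this.start = start;
        this.end = end;
    }
}

final class ArraySumSketch {
    /** Sums the integer strings in one row, indexing from start rather than from 0. */
    static long sumSlice(RepeatedSlice row) {
        long sum = 0;
        for (int i = row.start; i < row.end; i++) {
            try {
                sum += Long.parseLong(row.vector[i].trim());
            } catch (NumberFormatException e) {
                // Assumption: skip values that are not integers instead of aborting the row.
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] shared = {"1", "2", "3", "40", "x", "2"};
        RepeatedSlice row0 = new RepeatedSlice(shared, 0, 3);
        RepeatedSlice row1 = new RepeatedSlice(shared, 3, 6);
        System.out.println(sumSlice(row0)); // 6
        System.out.println(sumSlice(row1)); // 42
    }
}
```

The second row only sums correctly because the loop starts at row.start; a loop from 0 would re-read the first row's values.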
>>>
>>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <ted.dunning@gmail.com>
>>> wrote:
>>>
>>>> Jim,
>>>>
>>>> I think that you may be having trouble with aggregators in general.
>>>>
>>>> Have you been able to build *any* aggregator of anything?  I haven't.
>>>>
>>>> When I try to build an aggregator of ints or doubles, I get a very
>>>> persistent problem with Drill even seeing my aggregates:
>>>>
>>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
>>>> cp.`employee.json`;*
>>>>
>>>> Jul 04, 2015 4:19:35 PM
>>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>>
>>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>>> found for function signature sum_int(<ANY>)
>>>>
>>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
>>>> <init>
>>>>
>>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>>> column 8 to line 1, column 27: No match found for function signature
>>>> sum_int(<ANY>)
>>>>
>>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
>>>> found for function signature sum_int(<ANY>)*
>>>>
>>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010]
>>>> (state=,code=0)*
>>>>
>>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
>>>> cp.`employee.json`*;
>>>>
>>>> Jul 04, 2015 4:19:45 PM
>>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>>
>>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>>> found for function signature sum_int(<NUMERIC>)
>>>>
>>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
>>>> <init>
>>>>
>>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>>> column 8 to line 1, column 40: No match found for function signature
>>>> sum_int(<NUMERIC>)
>>>>
>>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
>>>> found for function signature sum_int(<NUMERIC>)*
>>>>
>>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010]
>>>> (state=,code=0)*
>>>>
>>>> 0: jdbc:drill:zk=local>
>>>>
>>>>
>>>> It looks like there is some undocumented subtlety about how to register
>>>> an aggregator.
>>>>
>>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jbates@maprtech.com> wrote:
>>>>
>>>> > I'm working on the same thing. I want to aggregate a list of values. It
>>>> > has been a search-and-guess game for the most part. I'm still stuck in
>>>> > the process of getting the values all into a list. The writers look
>>>> > interesting, but for aggregation functions it looks like the input is
>>>> > the param and output objects can't hold the aggregation steps. The
>>>> > Workspace is where that happens. If I try to use a Writer in a
>>>> > workspace it won't load and tells me to change it to Holders, which was
>>>> > why I was using them to start with. Maybe I'm missing the architecture
>>>> > of the agg function. It looked like it was....
>>>> >
>>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
>>>> > through @Workspace vars in add -> finalize @Output in output.
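
That flow can be sketched in plain Java. The AggSketch interface and SumAgg class below are hypothetical stand-ins that only mirror the setup, add, output order described above. They are not Drill's DrillAggFunc, and the reset step is an assumption added so the same object can be reused for another group.

```java
/** Hypothetical lifecycle mirroring the flow above; not a Drill interface. */
interface AggSketch {
    void setup();          // initialize workspace state
    void add(long param);  // fold one incoming value into the workspace
    long output();         // finalize the result from the workspace
    void reset();          // assumption: clear state before the next group
}

final class SumAgg implements AggSketch {
    private long sum;   // workspace value
    private long count; // workspace value

    @Override public void setup() { sum = 0; count = 0; }
    @Override public void add(long param) { sum += param; count++; }
    @Override public long output() { return sum; }
    @Override public void reset() { setup(); }

    long rowsSeen() { return count; }

    public static void main(String[] args) {
        SumAgg agg = new SumAgg();
        agg.setup();
        for (long v : new long[] {3, 4, 5}) {
            agg.add(v);
        }
        System.out.println(agg.output()); // 12
    }
}
```

All intermediate state lives in the two workspace fields; the incoming value and the returned result are only touched in add and output respectively.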
>>>> >
>>>> > So I'm back to trying to figure out how to create a
>>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
>>>> >
>>>> >
>>>> >
>>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> >
>>>> > > I am working on trying to build any kind of list-constructing
>>>> > > aggregator and having absolute fits.
>>>> > >
>>>> > > To simplify life, I decided to just build a generic list builder that
>>>> > > is a scalar function that returns a list containing its argument. Thus
>>>> > > zoop(3) => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
>>>> > >
>>>> > > The ComplexWriter looks like the place to go. As usual, the complete
>>>> > > lack of comments in most of Drill makes this very hard since I have to
>>>> > > guess what works and what doesn't.
>>>> > >
>>>> > > In my code, I note that ComplexWriter has a nice rootAsList() method. I
>>>> > > used this in zip and it works nicely to construct lists for output. I
>>>> > > note that the resulting ListWriter has a method copyReader(FieldReader
>>>> > > var1) which looks really good.
>>>> > >
>>>> > > Unfortunately, the only implementation of copyReader() is in
>>>> > > AbstractFieldWriter and it looks like this:
>>>> > >
>>>> > > public void copyReader(FieldReader reader) {
>>>> > >     this.fail("Copy FieldReader");
>>>> > > }
>>>> > >
>>>> > > I would like to formally say at this point "WTF"?
>>>> > >
>>>> > > In digging in further, I see other methods that look handy like
>>>> > >
>>>> > > public void write(IntHolder holder) {
>>>> > >     this.fail("Int");
>>>> > > }
>>>> > >
>>>> > > And then in looking at implementations, it looks like there is a
>>>> > > combinatorial explosion because every type seems to need a write
>>>> > > method for every other type.
>>>> > >
>>>> > > What is the thought here?  How can I copy an arbitrary value into a
>>>> > > list?
>>>> > >
>>>> > > My next thought was to build code that dispatches on type.  There is a
>>>> > > method called getType() on the FieldReader.  Unfortunately, that drives
>>>> > > into code generated by protoc and I see no way to dispatch on the type
>>>> > > of an incoming value.
>>>> > >
>>>> > >
>>>> > > How is this supposed to work?
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <baid.mehant@gmail.com> wrote:
>>>> > >
>>>> > > > For a detailed example of using the ComplexWriter interface, you can
>>>> > > > take a look at the Mappify
>>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
>>>> > > > (kvgen) function. The function itself is very simple; however, it
>>>> > > > makes use of the utility methods in MappifyUtility
>>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
>>>> > > > and MapUtility
>>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
>>>> > > > which perform most of the work.
>>>> > > >
>>>> > > > Currently we don't have a generic infrastructure to handle errors
>>>> > > > coming out of functions. However, there is UserException, which when
>>>> > > > raised will make sure that Drill does not gobble up the error message
>>>> > > > in that exception. So you can probably throw a UserException with the
>>>> > > > failing input in your function to make sure it propagates to the user.
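
The idea of carrying the failing input in the exception can be sketched in plain Java. This is only an illustration under assumptions: FailingInputSketch and parseOrFail are hypothetical names, and IllegalArgumentException is a stand-in for Drill's UserException, whose construction is not shown in this thread.

```java
final class FailingInputSketch {
    /** Parses one value; on failure rethrows with the failing input in the message. */
    static long parseOrFail(String input) {
        try {
            return Long.parseLong(input);
        } catch (NumberFormatException e) {
            // Stand-in for raising Drill's UserException; the original cause is preserved.
            throw new IllegalArgumentException(
                "Cannot convert value '" + input + "' to an integer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrFail("42")); // 42
        try {
            parseOrFail("4x2");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Cannot convert value '4x2' to an integer
        }
    }
}
```

Rethrowing with the input in the message, instead of swallowing the error in a catch block, keeps the bad value visible to whoever ran the query.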
>>>> > > >
>>>> > > > Thanks
>>>> > > > Mehant
>>>> > > >
>>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <jacques@apache.org> wrote:
>>>> > > >
>>>> > > > > *Holders are for both input and output.  You can also use
>>>> > > > > ComplexWriter for output and FieldReader for input if you want to
>>>> > > > > write or read a complex value.
>>>> > > > >
>>>> > > > > I don't think we've provided a really clean way to construct a
>>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
>>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
>>>> > > > > would recommend using the ComplexWriter output pattern for now.  This
>>>> > > > > will be a little less efficient but substantially less brittle.  I
>>>> > > > > suggest you open up a JIRA for using a Repeated*Holder as an output.
>>>> > > > >
>>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> > > > >
>>>> > > > > > Holders are for input, I think.
>>>> > > > > >
>>>> > > > > > Try the different kinds of writers.
>>>> > > > > >
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jbates@maprtech.com> wrote:
>>>> > > > > >
>>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was working
>>>> > > > > > > on a custom aggregator function using DrillAggFunc. In this I can
>>>> > > > > > > do simple things, but if I want to build a list of values and do
>>>> > > > > > > something with it in the final output method, I think I need to
>>>> > > > > > > use RepeatedHolders in the @Workspace. To do that I need to create
>>>> > > > > > > a new one in the setup method. I can't get one built. They all
>>>> > > > > > > require a BufferAllocator to be passed in to build it. I have not
>>>> > > > > > > found a way to get an allocator yet. Any suggestions?
>>>> > > > > > >
>>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> > > > > > >
>>>> > > > > > > > If you look at the zip function in
>>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
>>>> > > > > > > > find an example of building a structure.
>>>> > > > > > > >
>>>> > > > > > > > The basic idea is that your output is denoted as
>>>> > > > > > > >
>>>> > > > > > > >         @Output
>>>> > > > > > > >         BaseWriter.ComplexWriter writer;
>>>> > > > > > > >
>>>> > > > > > > > The pattern for building a list of lists of integers is like
>>>> > > > > > > > this:
>>>> > > > > > > >
>>>> > > > > > > >         writer.setValueCount(n);
>>>> > > > > > > >         ...
>>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
>>>> > > > > > > >         outer.start(); // [ outer list
>>>> > > > > > > >         ...
>>>> > > > > > > >         // for each inner list {
>>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
>>>> > > > > > > >             inner.start(); // [ inner list
>>>> > > > > > > >             // for each inner list element {
>>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
>>>> > > > > > > >             }
>>>> > > > > > > >             inner.end();   // ] inner list
>>>> > > > > > > >         }
>>>> > > > > > > >         outer.end(); // ] outer list
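
The start/end bracketing in that pattern can be tried out without Drill. The sketch below uses a hypothetical ListSketch class that only mimics the nesting behaviour of a list writer; its method names echo the pattern above, but it is not Drill's BaseWriter.ListWriter and makes no claim about how Drill stores the data.

```java
/** Hypothetical list writer that mimics the start/end bracketing; not Drill's ListWriter. */
final class ListSketch {
    private final java.util.List<Object> items = new java.util.ArrayList<>();
    private boolean open;

    void start() { open = true; }   // "[" begin this list
    void end() { open = false; }    // "]" close this list

    void writeInt(int value) {
        requireOpen();
        items.add(value);
    }

    ListSketch list() {             // add a nested list element and return it
        requireOpen();
        ListSketch inner = new ListSketch();
        items.add(inner);
        return inner;
    }

    private void requireOpen() {
        if (!open) {
            throw new IllegalStateException("call start() before writing");
        }
    }

    @Override public String toString() { return items.toString(); }

    /** Builds a list of lists of integers using the same nesting as the pattern above. */
    static ListSketch fromRows(int[][] rows) {
        ListSketch outer = new ListSketch();
        outer.start();
        for (int[] row : rows) {
            ListSketch inner = outer.list();
            inner.start();
            for (int v : row) {
                inner.writeInt(v);
            }
            inner.end();
        }
        outer.end();
        return outer;
    }

    public static void main(String[] args) {
        System.out.println(fromRows(new int[][] {{1, 2}, {3}})); // [[1, 2], [3]]
    }
}
```

Each inner list is opened and closed inside the outer list's own start/end pair, which is the part of the pattern that is easy to get wrong.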
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <jbates@maprtech.com> wrote:
>>>> > > > > > > >
>>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying
>>>> > > > > > > > > to document and understand each of the options available in a
>>>> > > > > > > > > Drill UDF: the different FunctionScopes, the ones that are
>>>> > > > > > > > > allowed and the ones that are not; the impact of different
>>>> > > > > > > > > cost categories; the different steps needed to handle any of
>>>> > > > > > > > > the supported data types and structures in Drill.
>>>> > > > > > > > >
>>>> > > > > > > > > Here are a few of my current roadblocks. Any pointers would be
>>>> > > > > > > > > greatly appreciated.
>>>> > > > > > > > >
>>>> > > > > > > > >
>>>> > > > > > > > >    1. I've been trying to understand how to correctly use
>>>> > > > > > > > >    RepeatedHolders of whatever type. For this discussion let's
>>>> > > > > > > > >    start with a RepeatedBigIntHolder. I'm trying to figure out
>>>> > > > > > > > >    the best way to create a new one. I have not figured out
>>>> > > > > > > > >    where in the existing Drill code someone does this. If I use
>>>> > > > > > > > >    a RepeatedBigIntHolder as a Workspace object, it is null to
>>>> > > > > > > > >    start with. I created a new one in the startup section of
>>>> > > > > > > > >    the UDF but the vector was null. I can find no reference on
>>>> > > > > > > > >    creating a new BigIntVector. There is a way to create a
>>>> > > > > > > > >    BigIntVector, and I did find an example of creating a new
>>>> > > > > > > > >    VarCharVector, but I can't do that using the Drill jar files
>>>> > > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos and
>>>> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType
>>>> > > > > > > > >    classes do not appear to be accessible from the Drill jar
>>>> > > > > > > > >    files.
>>>> > > > > > > > >    2. What is the best way to close out a UDF in the event it
>>>> > > > > > > > >    generates an exception? Are there specific steps one should
>>>> > > > > > > > >    follow to make a clean exit in a catch block that are
>>>> > > > > > > > >    beneficial to Drill?
>>>> > > > > > > > >
>>>> > > > > > > >
>>>> > > > > > >
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>
