drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Some questions on UDFs
Date Sun, 05 Jul 2015 18:36:13 GMT
That was impressively non-obvious.



On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jbates@maprtech.com> wrote:

> I did get a new RepeatedBigIntHolder built and added a BigIntVector to it.
> I'll try it in the UDF tomorrow and see if there is a difference in the
> ways I found to get a BufferAllocator.
>
> .
> .
> .
> @Inject DrillBuf buffer;
> @Workspace RepeatedBigIntHolder yList;
> .
> .
> .
> @Override
> public void setup() {
> .
> .
> .
> // org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> org.apache.drill.exec.memory.BufferAllocator allocator =
>     new org.apache.drill.exec.memory.TopLevelAllocator();
> yList = new RepeatedBigIntHolder();
> yList.vector = new org.apache.drill.exec.vector.BigIntVector(
>     org.apache.drill.exec.record.MaterializedField.create(
>         new org.apache.drill.common.expression.SchemaPath("bigints",
>             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
>         org.apache.drill.common.types.Types.optional(
>             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
>     allocator);
> .
> .
> .
> }
>
>
>
> On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jbates@maprtech.com> wrote:
>
> > I still have issues finding the correct way to create and use a
> > RepeatedHolder, and Writers are a non-starter for Workspace values. I can
> > make do with creating a concatenated string in a VarCharHolder for small
> > data sets to get past this in the short term and finish testing the
> > output values I expect, but I won't be able to do anything at scale until
> > I figure out how to make a repeated list.
> >
> > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jbates@maprtech.com> wrote:
> >
> >> Well... Converting from string to integers anyway... Too many 4th of July
> >> hot dogs. Going into nitrate overload. :)
> >>
> >> I am pulling an array of string values from json data. The string values
> >> are actually integers. I am converting to integers and summing each
> >> array entry to the final tally.
> >>
> >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jbates@maprtech.com> wrote:
> >>
> >>> Ted,
> >>>
> >>> Yes, I started out just getting a basic count to work. I am trying to
> >>> keep the workflow as close to a basic user as possible. As such, I am
> >>> building and using the MapR Apache Drill sandbox to test.
> >>>
> >>>
> >>>    1. Always look at the drillbit.log file to see if Drill had any
> >>>    issues loading your UDF. That was where I learned that all
> >>>    workspace values needed to be holders:
> >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> >>>       function class
> >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> >>>       variable 'xList' is of type 'interface
> >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> >>>       Please change it to Holder type.
> >>>    2. Error messages:
> >>>       - If you get an error in this format, it means that Drill cannot
> >>>       find your function, so it probably didn't load it. Go back to
> >>>       step 1:
> >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> >>>          match found for function signature MyFunctionName(<ANY>)
> >>>       - If you get an error in this format, it means that the function
> >>>       is there but Drill could not find a signature that matched the
> >>>       parameter types or the number of parameters you were passing it.
> >>>       The exact wording will change, but "Missing function
> >>>       implementation" is the key phrase to look for:
> >>>          - Error: SYSTEM ERROR:
> >>>          org.apache.drill.exec.exception.SchemaChangeException:
> >>>          Failure while trying to materialize incoming schema.  Errors:
> >>>          Error in expression at index -1.  Error: Missing function
> >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> >>>          expression: --UNKNOWN EXPRESSION--
> >>>    3. In your function definition for aggregate functions you need
> >>>    to set null processing to internal and your isRandom to false.
> >>>    Example below:
> >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> >>>       isBinaryCommutative = false, costCategory =
> >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
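
The two error formats in the list above can be told apart by their key phrases. The following is only a minimal sketch of that triage in plain Java: the class name UdfErrorHint, the enum values and the classify method are hypothetical names made up for illustration and are not part of Drill. Only the two matched phrases come from the messages quoted above, and as noted there the exact wording may change between versions.

```java
// Hypothetical helper, not part of Drill. It only looks for the two key
// phrases quoted in the error messages above.
final class UdfErrorHint {
    enum Kind { NOT_LOADED, SIGNATURE_MISMATCH, UNKNOWN }

    static Kind classify(String message) {
        if (message == null) {
            return Kind.UNKNOWN;
        }
        // Drill could not find the function at all: check the log for load failures.
        if (message.contains("No match found for function signature")) {
            return Kind.NOT_LOADED;
        }
        // The function loaded, but no signature matched the argument types or count.
        if (message.contains("Missing function implementation")) {
            return Kind.SIGNATURE_MISMATCH;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        String parseError = "PARSE ERROR: From line 1, column 8 to line 1, column 44: "
            + "No match found for function signature MyFunctionName(<ANY>)";
        System.out.println(classify(parseError));
    }
}
```

NOT_LOADED corresponds to "go back to step 1", and SIGNATURE_MISMATCH means the @Param types or parameter count need another look.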
> >>>
> >>> Below is an example from the Apache Drill tutorial data sets contained
> >>> in the MapR Apache Drill sandbox. I am pulling an array of string values
> >>> from json data. The string values are actually integers. I am converting
> >>> to string and summing each array entry to the final tally. This in no way
> >>> represents what this data was for, but it did become a handy way for me
> >>> to peck out the "correct" way to build an aggregation UDF function.
> >>>
> >>> @FunctionTemplate(name = "MyArraySum", scope =
> >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> >>> isBinaryCommutative = false, costCategory =
> >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
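> >>> public static class MyArraySum implements DrillAggFunc {
> >>>
> >>>   @Param RepeatedVarCharHolder listToSearch;
> >>>   @Workspace NullableBigIntHolder count;
> >>>   @Workspace NullableBigIntHolder sum;
> >>>   @Workspace NullableVarCharHolder vc;
> >>>   @Output BigIntHolder out;
> >>>
> >>>   @Override
> >>>   public void setup() {
> >>>     count.value = 0;
> >>>     sum.value = 0;
> >>>   }
> >>>
> >>>   @Override
> >>>   public void add() {
> >>>     int val = 0;
> >>>     try {
> >>>       // start and end are offsets into the shared data vector, so the
> >>>       // elements of the current row are at indexes [start, end)
> >>>       for (int i = listToSearch.start; i < listToSearch.end; i++) {
> >>>         listToSearch.vector.getAccessor().get(i, vc);
> >>>         String inputStr =
> >>>             org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
> >>>                 vc.start, vc.end, vc.buffer);
> >>>         val = Integer.parseInt(inputStr);
> >>>         sum.value = sum.value + val;
> >>>       }
> >>>     } catch (Exception e) {
> >>>       // a non-numeric entry ends the loop for this row
> >>>       val = 0;
> >>>     }
> >>>     count.value = count.value + 1;
> >>>   }
> >>>
> >>>   // Assumed completion: output() and reset() were not shown in this
> >>>   // message; DrillAggFunc requires both.
> >>>   @Override
> >>>   public void output() {
> >>>     out.value = sum.value;
> >>>   }
> >>>
> >>>   @Override
> >>>   public void reset() {
> >>>     count.value = 0;
> >>>     sum.value = 0;
> >>>   }
> >>> }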
> >>>
> >>> Example select statement:
> >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
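
For readers without a Drill build at hand, the setup / add / output / reset flow of the aggregate above can be mimicked in plain Java. This is only a sketch under stated assumptions: ArraySumSketch and RepeatedSlice are hypothetical stand-ins invented for illustration, not Drill classes. The slice models a repeated holder as a [start, end) window over one flat list of values shared by all rows, which is why the loop runs from start to end rather than from 0.

```java
import java.util.List;

// Plain-Java stand-in for the aggregate life cycle; no Drill classes are used.
final class ArraySumSketch {
    // Hypothetical stand-in for a Repeated*Holder: one row's values are the
    // half-open window [start, end) of a flat value list shared by all rows.
    static final class RepeatedSlice {
        final List<String> vector;
        final int start;
        final int end;

        RepeatedSlice(List<String> vector, int start, int end) {
            this.vector = vector;
            this.start = start;
            this.end = end;
        }
    }

    private long sum;    // plays the role of a @Workspace value
    private long count;  // number of rows seen

    void setup() {
        sum = 0;
        count = 0;
    }

    // Called once per input row, like add() in an aggregate function.
    void add(RepeatedSlice row) {
        for (int i = row.start; i < row.end; i++) {
            try {
                sum += Long.parseLong(row.vector.get(i));
            } catch (NumberFormatException e) {
                // in this sketch, entries that are not integers are skipped
            }
        }
        count++;
    }

    long output() { return sum; }

    long rowCount() { return count; }

    void reset() { setup(); }

    public static void main(String[] args) {
        List<String> flat = List.of("1", "2", "3", "10", "x", "20");
        ArraySumSketch agg = new ArraySumSketch();
        agg.setup();
        agg.add(new RepeatedSlice(flat, 0, 3)); // row 1: 1, 2, 3
        agg.add(new RepeatedSlice(flat, 3, 6)); // row 2: 10, x, 20
        System.out.println(agg.output());       // prints 36
    }
}
```

Unlike the UDF above, this sketch skips a bad entry and keeps going with the rest of the row; which behaviour is wanted depends on the data.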
> >>>
> >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <ted.dunning@gmail.com>
> >>> wrote:
> >>>
> >>>> Jim,
> >>>>
> >>>> I think that you may be having trouble with aggregators in general.
> >>>>
> >>>> Have you been able to build *any* aggregator of anything?  I haven't.
> >>>>
> >>>> When I try to build an aggregator of int's or doubles, I get a very
> >>>> persistent problem with Drill even seeing my aggregates:
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> >>>> cp.`employee.json`;*
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> >>>> found for function signature sum_int(<ANY>)
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> >>>> column 8 to line 1, column 27: No match found for function signature
> >>>> sum_int(<ANY>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
> >>>> found for function signature sum_int(<ANY>)*
> >>>>
> >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
> >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
> >>>> cp.`employee.json`*;
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> >>>> found for function signature sum_int(<NUMERIC>)
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> >>>> column 8 to line 1, column 40: No match found for function signature
> >>>> sum_int(<NUMERIC>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
> >>>> found for function signature sum_int(<NUMERIC>)*
> >>>>
> >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
> >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local>
> >>>>
> >>>>
> >>>> It looks like there is some undocumented subtlety about how to register
> >>>> an aggregator.
> >>>>
> >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jbates@maprtech.com> wrote:
> >>>>
> >>>> > I'm working on the same thing. I want to aggregate a list of values. It
> >>>> > has been a search-and-guess game for the most part. I'm still stuck in
> >>>> > the process of getting the values all into a list. The writers look
> >>>> > interesting, but for aggregation functions it looks like the input is
> >>>> > the param and output objects can't hold the aggregation steps. The
> >>>> > Workspace is where that happens. If I try to use a Writer in a
> >>>> > workspace it won't load and tells me to change it to Holders, which was
> >>>> > why I was using them to start with. Maybe I'm missing the architecture
> >>>> > of the agg function. It looked like it was....
> >>>> >
> >>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
> >>>> > through @Workspace vars in add -> finalize @Output in output.
> >>>> >
> >>>> > So I'm back to trying to figure out how to create a
> >>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>> >
> >>>> > > I am working on trying to build any kind of list-constructing
> >>>> > > aggregator and having absolute fits.
> >>>> > >
> >>>> > > To simplify life, I decided to just build a generic list builder that
> >>>> > > is a scalar function that returns a list containing its argument.
> >>>> > > Thus zoop(3) => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) =>
> >>>> > > [[1,2,3]].
> >>>> > >
> >>>> > > The ComplexWriter looks like the place to go. As usual, the complete
> >>>> > > lack of comments in most of Drill makes this very hard since I have
> >>>> > > to guess what works and what doesn't.
> >>>> > >
> >>>> > > In my code, I note that ComplexWriter has a nice rootAsList() method.
> >>>> > > I used this in zip and it works nicely to construct lists for output.
> >>>> > > I note that the resulting ListWriter has a method
> >>>> > > copyReader(FieldReader var1) which looks really good.
> >>>> > >
> >>>> > > Unfortunately, the only implementation of copyReader() is in
> >>>> > > AbstractFieldWriter and it looks like this:
> >>>> > >
> >>>> > > public void copyReader(FieldReader reader) {
> >>>> > >     this.fail("Copy FieldReader");
> >>>> > > }
> >>>> > >
> >>>> > > I would like to formally say at this point "WTF"?
> >>>> > >
> >>>> > > In digging in further, I see other methods that look handy like
> >>>> > >
> >>>> > > public void write(IntHolder holder) {
> >>>> > >     this.fail("Int");
> >>>> > > }
> >>>> > >
> >>>> > > And then in looking at implementations, it looks like there is a
> >>>> > > combinatorial explosion because every type seems to need a write
> >>>> > > method for every other type.
> >>>> > >
> >>>> > > What is the thought here?  How can I copy an arbitrary value into a
> >>>> > > list?
> >>>> > >
> >>>> > > My next thought was to build code that dispatches on type.  There is
> >>>> > > a method called getType() on the FieldReader.  Unfortunately, that
> >>>> > > drives into code generated by protoc and I see no way to dispatch on
> >>>> > > the type of an incoming value.
> >>>> > >
> >>>> > >
> >>>> > > How is this supposed to work?
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <baid.mehant@gmail.com> wrote:
> >>>> > >
> >>>> > > > For a detailed example on using the ComplexWriter interface you can
> >>>> > > > take a look at the Mappify
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> >>>> > > > (kvgen) function. The function itself is very simple; however, it
> >>>> > > > makes use of the utility methods in MappifyUtility
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> >>>> > > > and MapUtility
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> >>>> > > > which perform most of the work.
> >>>> > > >
> >>>> > > > Currently we don't have a generic infrastructure to handle errors
> >>>> > > > coming out of functions. However, there is UserException, which when
> >>>> > > > raised will make sure that Drill does not gobble up the error
> >>>> > > > message in that exception. So you can probably throw a UserException
> >>>> > > > with the failing input in your function to make sure it propagates
> >>>> > > > to the user.
> >>>> > > >
> >>>> > > > Thanks
> >>>> > > > Mehant
> >>>> > > >
> >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <jacques@apache.org> wrote:
> >>>> > > >
> >>>> > > > > *Holders are for both input and output.  You can also use
> >>>> > > > > ComplexWriter for output and FieldReader for input if you want to
> >>>> > > > > write or read a complex value.
> >>>> > > > >
> >>>> > > > > I don't think we've provided a really clean way to construct a
> >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
> >>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
> >>>> > > > > would recommend using the ComplexWriter output pattern for now.
> >>>> > > > > This will be a little less efficient but substantially less
> >>>> > > > > brittle.  I suggest you open up a jira for using a
> >>>> > > > > Repeated*Holder as an output.
> >>>> > > > >
> >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>> > > > >
> >>>> > > > > > Holders are for input, I think.
> >>>> > > > > >
> >>>> > > > > > Try the different kinds of writers.
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jbates@maprtech.com> wrote:
> >>>> > > > > >
> >>>> > > > > > > I've got using a RepeatedHolder as a @Param working. I was
> >>>> > > > > > > working on a custom aggregator function using DrillAggFunc. In
> >>>> > > > > > > this I can do simple things, but if I want to build a list of
> >>>> > > > > > > values and do something with it in the final output method, I
> >>>> > > > > > > think I need to use RepeatedHolders in the @Workspace. To do
> >>>> > > > > > > that I need to create a new one in the setup method. I can't
> >>>> > > > > > > get one built. They all require a BufferAllocator to be passed
> >>>> > > > > > > in to build it. I have not found a way to get an allocator
> >>>> > > > > > > yet. Any suggestions?
> >>>> > > > > > >
> >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>> > > > > > >
> >>>> > > > > > > > If you look at the zip function in
> >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> >>>> > > > > > > > find an example of building a structure.
> >>>> > > > > > > >
> >>>> > > > > > > > The basic idea is that your output is denoted as
> >>>> > > > > > > >
> >>>> > > > > > > >         @Output
> >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> >>>> > > > > > > >
> >>>> > > > > > > > The pattern for building a list of lists of integers is like
> >>>> > > > > > > > this:
> >>>> > > > > > > >
> >>>> > > > > > > >         writer.setValueCount(n);
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> >>>> > > > > > > >         outer.start(); // [ outer list
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         for (...) { // each inner list
> >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> >>>> > > > > > > >             inner.start();
> >>>> > > > > > > >             for (...) { // each inner list element
> >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> >>>> > > > > > > >             }
> >>>> > > > > > > >             inner.end();   // ] inner list
> >>>> > > > > > > >         }
> >>>> > > > > > > >         outer.end(); // ] outer list
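
The start / list / write / end nesting in the pattern above can be tried out without Drill. The sketch below is only an illustration in plain Java: ListWriterSketch is a hypothetical stand-in that imitates the shape of the calls and is not Drill's BaseWriter.ListWriter. It collects values into ordinary Java lists so the resulting nesting can be printed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a list writer; it only imitates the call pattern.
final class ListWriterSketch {
    private final List<Object> items = new ArrayList<>();
    private boolean open;

    void start() { open = true; }   // "[" opens the list

    void end() { open = false; }    // "]" closes the list

    void writeInt(int value) {
        requireOpen();
        items.add(value);
    }

    // Returns a writer for a nested list that becomes an element of this one.
    ListWriterSketch list() {
        requireOpen();
        ListWriterSketch child = new ListWriterSketch();
        items.add(child);
        return child;
    }

    private void requireOpen() {
        if (!open) {
            throw new IllegalStateException("call start() before writing");
        }
    }

    // Converts the recorded structure into plain nested Java lists.
    List<Object> toList() {
        List<Object> out = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof ListWriterSketch) {
                out.add(((ListWriterSketch) item).toList());
            } else {
                out.add(item);
            }
        }
        return out;
    }

    // Same shape as the list-of-lists pattern quoted above.
    static List<Object> write(int[][] rows) {
        ListWriterSketch outer = new ListWriterSketch();
        outer.start();                         // [ outer list
        for (int[] row : rows) {
            ListWriterSketch inner = outer.list();
            inner.start();                     // [ inner list
            for (int value : row) {
                inner.writeInt(value);
            }
            inner.end();                       // ] inner list
        }
        outer.end();                           // ] outer list
        return outer.toList();
    }

    public static void main(String[] args) {
        System.out.println(write(new int[][] {{1, 2}, {3}})); // prints [[1, 2], [3]]
    }
}
```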
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <jbates@maprtech.com> wrote:
> >>>> > > > > > > >
> >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying
> >>>> > > > > > > > > to document and understand each of the options available in
> >>>> > > > > > > > > a Drill UDF: the different FunctionScopes, the ones that are
> >>>> > > > > > > > > allowed and the ones that are not; the impact of different
> >>>> > > > > > > > > cost categories; and the different steps needed to handle
> >>>> > > > > > > > > any of the supported data types and structures in Drill.
> >>>> > > > > > > > >
> >>>> > > > > > > > > Here are a few of my current roadblocks. Any pointers would
> >>>> > > > > > > > > be greatly appreciated.
> >>>> > > > > > > > >
> >>>> > > > > > > > >
> >>>> > > > > > > > >    1. I've been trying to understand how to correctly use
> >>>> > > > > > > > >    RepeatedHolders of whatever type. For this discussion
> >>>> > > > > > > > >    let's start with a RepeatedBigIntHolder. I'm trying to
> >>>> > > > > > > > >    figure out the best way to create a new one. I have not
> >>>> > > > > > > > >    figured out where in the existing Drill code someone does
> >>>> > > > > > > > >    this. If I use a RepeatedBigIntHolder as a Workspace
> >>>> > > > > > > > >    object, it is null to start with. I created a new one in
> >>>> > > > > > > > >    the startup section of the UDF but the vector was null. I
> >>>> > > > > > > > >    can find no reference on creating a new BigIntVector.
> >>>> > > > > > > > >    There is a way to create a BigIntVector, and I did find
> >>>> > > > > > > > >    an example of creating a new VarCharVector, but I can't
> >>>> > > > > > > > >    do that using the Drill jar files from 1.0. The
> >>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos and the
> >>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos.MinorType
> >>>> > > > > > > > >    classes do not appear to be accessible from the Drill jar
> >>>> > > > > > > > >    files.
> >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event
> >>>> > > > > > > > >    it generates an exception? Are there specific steps one
> >>>> > > > > > > > >    should follow to make a clean exit in a catch block that
> >>>> > > > > > > > >    are beneficial to Drill?
> >>>> > > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > >
> >>>> > > > > >
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>
