drill-dev mailing list archives

From Shadi Khalifa <khal...@cs.queensu.ca>
Subject Re: Passing multiple columns to a UDAF
Date Wed, 01 Apr 2015 11:06:23 GMT
Thanks, Jason, for all this information! Really appreciate it!
Regards,
Shadi Khalifa
PhD Candidate, School of Computing, Queen's University, Canada
I'm just a neuron in the society collective brain

01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101 01100111 01111001 01110000 01110100

Please consider your environmental responsibility before printing this e-mail

 


On Tuesday, March 31, 2015 4:04 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:

 Hi Shadi,

Unfortunately that isn't going to be a good strategy. We recently removed the RecordBatch
entirely from the UDF interfaces to avoid exposing so much internal information to UDFs.
To do something like this, we would want to define a new interface for UDFs.

One shortcoming that I believe is related to what you are trying to do is the inability to
treat the top-level schema of a Drill record the same way we currently treat the complex
map type. Drill currently supports passing non-scalar values into UDFs in the form of maps
and repeated types (these maps and lists can be nested within one another to build nearly
arbitrarily complex data structures). The interface for passing in these structures is the
FieldReader, which works much like an iterator/visitor over a tree structure. The two
functions that use this interface today are convertTo_JSON and kvgen (also called mappify).
Both take a complex object as input: convertTo_JSON produces a VarChar with the JSON
representation, and kvgen applies a transformation that makes the key values in a map
queryable (more information in the wiki link below [1]).
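To make the kvgen transformation concrete, here is a plain-Java sketch of what it conceptually does to a single map. This is illustration only, not Drill's implementation; it assumes string keys and ignores Drill's value-vector machinery entirely:

```java
import java.util.*;

// Illustration only: NOT Drill's kvgen code. A plain-Java sketch of the
// transformation kvgen performs: turn a map into a list of key/value records
// so the keys themselves become queryable data.
public class KvgenSketch {
    // {"a": 1, "b": 2}  ->  [{"key":"a","value":1}, {"key":"b","value":2}]
    static List<Map<String, Object>> kvgen(Map<String, Object> input) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map.Entry<String, Object> e : input.entrySet()) {
            Map<String, Object> pair = new LinkedHashMap<>();
            pair.put("key", e.getKey());
            pair.put("value", e.getValue());
            out.add(pair);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("a", 1);
        m.put("b", 2);
        System.out.println(kvgen(m)); // [{key=a, value=1}, {key=b, value=2}]
    }
}
```

In real Drill queries the same reshaping is done by calling kvgen on a map column; the sketch above is just the shape of the output.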

The important thing to note is that these functions can only be invoked on a particular field
in the schema. It would make sense to allow them to be invoked on the entire root schema,
treating it like a map itself, possibly with syntax like convertTo_JSON(*). (NOTE: this is
not supported right now, and hasn't even been in a design doc; it will not work today.)

For example, these two datasets:

flat schema:
----------------
{
    "a" : 1,
    "b" : 2
}

complex schema:
-----------------------
{
    "data" : {
          "a" : 1,
          "b"
    }
}

The first dataset only allows access to individual data members, with syntax like:
table_name.a

The second, however, can pass multiple fields into a function for processing, because the
data is stored under a map at the root of the schema; for example, producing JSON in a
VarChar using: convertTo_JSON(data)

If you are willing to change the structure of your incoming data, I think this might be a
viable strategy for passing a variable number of arguments into a function. Today this
carries the restriction that any list you use can hold only a single data type. But if there
is a discrete number of possible traits, you can use a map instead of a list, since fields
nested within a map can have different data types. That is, you cannot currently have a
mixed-type array like [1, true, "a string"], but you could put the values in their own
fields, { "a_number" : 1, "a_bool" : true, "a_str" : "a string" }, or keep lists for each
type nested inside the map, { "list_numbers" : [1], "list_bools" : [ true ],
"list_strings" : ["a string"] }.

As long as I've written this much, I should say that this alternate strategy will currently
only work if you change the source data. We do not support re-nesting data within the query.
Say you wanted to use an array to pass a variable number of arguments: if the source data
kept the values in separate fields, we currently *do not* support something like
select field_1 as new_list[0], field_2 as new_list[1]. As before, this hasn't even been
fully discussed, so it will not work today, and it doesn't represent a declaration of how
this may work in Drill in the future; it's just to demonstrate what we don't do today. If
this feature existed, you could use the new list in an outer query and pass it in as the
variable-length argument to your function. To do something like this today, you have to
modify the source data to put it in this form.
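The "modify the source data" step amounts to wrapping each flat record's fields under one map before the data reaches Drill. A minimal sketch, assuming a record is just a Java map and the wrapper field is named "data" as in the earlier example (this runs in your ingest pipeline, not inside Drill):

```java
import java.util.*;

// Sketch of the source-data rewrite described above, done outside Drill:
// wrap every field of a flat record under a single "data" map, so a UDF can
// later receive all of them through one FieldReader argument.
public class NestRecords {
    static Map<String, Object> nest(Map<String, Object> flatRecord) {
        Map<String, Object> wrapped = new LinkedHashMap<>();
        // Copy defensively so later edits to the flat record don't leak in.
        wrapped.put("data", new LinkedHashMap<>(flatRecord));
        return wrapped;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("a", 1);
        flat.put("b", 2);
        System.out.println(nest(flat)); // {data={a=1, b=2}}
    }
}
```

After this rewrite, the whole record is reachable as a single map field, e.g. convertTo_JSON(data) in the example above.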

To see how the FieldReader is used, check out this function definition in the Drill source:
org.apache.drill.exec.expr.fn.impl.Mappify

Documentation on its usage in queries:
[1] https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function

On Tue, Mar 31, 2015 at 10:24 AM, Shadi Khalifa <khalifa@cs.queensu.ca> wrote:

I wonder if I can extract this data from the RecordBatch? Any ideas?
Regards,
Shadi




On Tuesday, March 31, 2015 1:16 PM, Jacques Nadeau <jacques@apache.org> wrote:

It isn't yet supported, but it is something I think a lot of people would find
useful. Depending on how ambitious you are, maybe you could pick it up?

On Tue, Mar 31, 2015 at 10:05 AM, Shadi Khalifa <khalifa@cs.queensu.ca>
wrote:

> Hello everyone,
> I wonder if there is a way to send a variable number (Array) of attributes
> (columns) to a custom user defined aggregate function.
> I want to be able to have something like:
>
>     Select myAggrFn(col1, col2, ..., coln) from mytable;
>
> I wonder if there is something like the following, or anything else that
> can handle this case:
>
>     @FunctionTemplate(name = "myAggrFn", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
>     public static class MyAggrFn implements DrillAggFunc {
>         @Param ObjectHolder[] in;
>
>  I know it's weird to have a function like that, but I'm implementing
> machine learning into Drill and need to pass some columns or maybe the
> whole row to the aggregate function to train and use the model.
> Regards
> Shadi
>
>


  



  