drill-dev mailing list archives

From Paul Rogers <prog...@maprtech.com>
Subject Re: Dynamic UDFs support
Date Tue, 21 Jun 2016 02:01:53 GMT
Good enough, as long as we document the limitation that this feature can’t work with YARN
deployment as users generally do not have access to the temporary “localization” directories
where the Drill code is placed by YARN.

Note that the jar distribution race condition occurs with the proposed design; I believe
I sketched out a scenario in one of the earlier comments. Drillbit A receives the CREATE FUNCTION
command and tells Drillbit B. While A is still informing the other Drillbits, Drillbit B plans and
launches a query that uses the function. Drillbit Z starts executing the query before it learns
from A about the new function. This will be rare, but just rare enough to create bugs that are
very hard to reproduce.

The only reliable solution is to do the work in multiple passes:

Pass 1: Ask each node to load the function, but not make it available to the planner. (It
would be available to the execution engine.)
Pass 2: Await confirmation from each node that this is done.
Pass 3: Alert every node that it is now free to plan queries with the function.
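
As a rough illustration of the three passes above, here is a minimal in-memory sketch in Java.
It is not Drill code and not part of the proposed design: SketchNode, UdfRolloutSketch and all
method names are hypothetical, and the "nodes" are plain objects rather than remote Drillbits.
It only shows the ordering guarantee: no node may plan with the function until every node has
confirmed that it can execute it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a Drillbit; this is not a Drill class. */
class SketchNode {
    final String name;
    private final boolean failsToLoad;                      // simulates a node that cannot load the jar
    private final Set<String> loaded = new HashSet<>();     // usable by the execution engine
    private final Set<String> plannable = new HashSet<>();  // visible to the planner

    SketchNode(String name, boolean failsToLoad) {
        this.name = name;
        this.failsToLoad = failsToLoad;
    }

    /** Pass 1: load the function for execution only. The return value is the confirmation. */
    boolean load(String function) {
        if (failsToLoad) {
            return false;
        }
        loaded.add(function);
        return true;
    }

    /** Pass 3: let the planner use a function that has already been loaded. */
    void enablePlanning(String function) {
        if (!loaded.contains(function)) {
            throw new IllegalStateException(name + " has not loaded " + function);
        }
        plannable.add(function);
    }

    boolean canExecute(String function) { return loaded.contains(function); }

    boolean canPlan(String function) { return plannable.contains(function); }
}

class UdfRolloutSketch {
    /** Returns true only if the function became plannable on every node. */
    static boolean register(List<SketchNode> nodes, String function) {
        // Passes 1 and 2: ask each node to load the function and wait for its confirmation.
        for (SketchNode node : nodes) {
            if (!node.load(function)) {
                // Pass 3 never starts, so no planner can see the function.
                // (This sketch does not roll back nodes that already loaded it.)
                return false;
            }
        }
        // Pass 3: every node can execute the function, so planning is now safe everywhere.
        for (SketchNode node : nodes) {
            node.enablePlanning(function);
        }
        return true;
    }

    public static void main(String[] args) {
        List<SketchNode> nodes = Arrays.asList(
                new SketchNode("A", false), new SketchNode("B", false), new SketchNode("Z", false));
        System.out.println("registered: " + register(nodes, "my_udf"));
        for (SketchNode node : nodes) {
            System.out.println(node.name + " can plan: " + node.canPlan("my_udf"));
        }
    }
}
```

In a real cluster the load and enable calls would be remote and could time out or fail
independently, so the coordinator would also need a retry or rollback policy, which this sketch
leaves out.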

Finally, I wonder if we should design the SQL syntax based on a long-term design, even if
the feature itself is a short-term work-around. Changing the syntax later might break scripts
that users have written in the meantime.

So, the question for the group is this: is the value of a semi-complete feature sufficient to
justify the potential problems?

- Paul

> On Jun 20, 2016, at 6:15 PM, Parth Chandra <pchandra@maprtech.com> wrote:
> 
> Moving discussion to dev.
> 
> I believe the aim is to do a simple implementation without the complexity
> of distributing the UDF. I think the document should make this limitation
> clear.
> 
> Per Paul's point on there being a simpler solution of just having each
> drillbit detect if a UDF is present, I think the problem is if a UDF
> gets deployed to some but not all drillbits. A query can then start
> executing but not run successfully. The intent of the create commands would
> be to ensure that all drillbits have the UDF or none would.
> 
> I think Jacques' point about ownership conflicts is not addressed clearly.
> Also, the unloading is not clear. The delete command should probably remove
> the UDF and unload it.
> 
> 
> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <progers@maprtech.com> wrote:
> 
>> Reviewed the spec; many comments posted. Three primary comments for the
>> community to consider.
>> 
>> 1. The design conflicts with the Drill-on-YARN project. Is this a specific
>> fix for one unique problem, or is it worth expanding the solution to work
>> with Drill-on-YARN deployments? Might be hard to make the two work together
>> later. See comments in docs for details.
>> 
>> 2. Have we, by chance, looked at how other projects handle code
>> distribution? Spark, Storm and others automatically deploy code across the
>> cluster; no manual distribution to each node. The key difference between
>> Drill and others is that, for Storm, say, code is associated with a job
>> (“topology” in Storm terms.) But, in Drill, functions are global and have
>> no obvious life cycle that suggests when the code can be unloaded.
>> 
>> 3. Have we considered the class loader, dependency and name space isolation
>> issues addressed by such products as Tomcat (web apps) or Eclipse
>> (plugins)? Putting user code in the same namespace as Drill code is quick
>> & dirty. It turns out, however, that doing so leads to problems that
>> require long, frustrating debugging sessions to resolve.
>> 
>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3 is a
>> big increase in scope, so I won’t be surprised if we leave those issues for
>> later. (Though, addressing item 2 might be the best way to address item 1.)
>> 
>> If we want a very simple solution that requires minimal change, perhaps we
>> can use an even simpler solution. In the proposed design, the user still
>> must distribute code to all the nodes. The primary change is to tell Drill
>> to load (or unload) that code. Can we accomplish the same result more easily
>> by simply having Drill periodically scan certain directories looking for new
>> (or removed) jars? Still won’t work with YARN, or solve the name space issues,
>> but will work for existing non-YARN Drill users without new SQL syntax.
>> 
>> Thanks,
>> 
>> - Paul
>> 
>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>> 
>>> Two quick thoughts:
>>> 
>>> - (user) In the design document I didn't see any discussion of
>>> ownership/conflicts or unloading. Would be helpful to see the thinking
>>> there.
>>> - (dev) There is a row oriented facade via the
>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good place
>>> to start when trying to implement an alternative interface.
>>> 
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <john@omernik.com> wrote:
>>> 
>>>> Honestly, I don't see it as a priority issue. I think some of the ideas
>>>> around community java UDFs could be a better approach. I'd hate to take
>>>> away from other work to hack in something like this.
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <progers@maprtech.com> wrote:
>>>> 
>>>>> Ted refers to source code transformation. Drill gains its speed from value
>>>>> vectors. However, VVs are a far cry from the row-based interface that most
>>>>> mere mortals are accustomed to using. Since VVs are very type specific,
>>>>> code is typically generated to handle the specifics of each type. Accessing
>>>>> VVs in Jython may be a bit of a challenge because of the "impedance
>>>>> mismatch" between how VVs work and the row-and-column view expected by most
>>>>> (non-Drill) developers.
>>>>> 
>>>>> I wonder if we've considered providing a row-oriented "facade" that can be
>>>>> used by roll-your-own data sources and user-defined row transforms? Might
>>>>> be a hiccup in the fast VV pipeline, but might be handy for users willing
>>>>> to trade a bit of speed for convenience. With such a facade, the Jython row
>>>>> transforms that John mentions could be quite simple.
>>>>> 
>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>> 
>>>>>> Since UDFs use source code transformation, using Jython would be
>>>>>> difficult.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Charles,
>>>>>>> 
>>>>>>> Not that I am aware of. The proposed solution doesn't invent anything new;
>>>>>>> it just adds the possibility to add UDFs without a drillbit restart. But
>>>>>>> contributions are welcome.
>>>>>>> 
>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <cgivre@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Arina,
>>>>>>>> Has there been any discussion about making it possible, via Jython or
>>>>>>>> something similar, for users to write simple UDFs in Python?
>>>>>>>> My ideal would be to have this capability integrated in the web GUI such
>>>>>>>> that a user could write their UDF (in Python) right there, submit it, and
>>>>>>>> have it deployed to Drill if it passes validation tests.
>>>>>>>> —C
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi all!
>>>>>>>>> 
>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill
>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726). There is a link to
>>>>>>>>> the design document in the Jira description.
>>>>>>>>> Comments or suggestions are welcome.
>>>>>>>>> 
>>>>>>>>> Kind regards
>>>>>>>>> Arina
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 

