drill-dev mailing list archives

From yuliya Feldman <yufeld...@yahoo.com.INVALID>
Subject Re: Dynamic UDFs support
Date Tue, 21 Jun 2016 16:15:00 GMT
Just some thoughts:
You can try to reuse the distributed cache and let the Drill AM orchestrate the distribution
of UDF jars.
But I would be inclined to have a common path that does not depend on whether Drill is running
on YARN, as maintaining two separate ways of loading and unloading UDFs will
be painful and error-prone.
One more note (I left a comment in the doc): I am not sure about the authorization model here.
We need to have one.
Just my 2c. Thanks

From: Paul Rogers <progers@maprtech.com>
To: "dev@drill.apache.org" <dev@drill.apache.org>
Sent: Monday, June 20, 2016 7:32 PM
Subject: Re: Dynamic UDFs support

Hi Neeraja,

The proposal calls for the user to copy the jar file to each Drillbit node. The jar would
go into a new $DRILL_HOME/jars/3rdparty/udf directory.

In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to each node (which is
good.) YARN puts that code in a location known only to YARN. Since the location is private
to YARN, the user can’t easily hunt down the location in order to add the udf jar. Even
if the user did find the location, the next Drillbit to start would create a new copy of the
Drill software, without the udf jar.

Second, in DoY we have separated user files from Drill software. This makes it much easier
to distribute the software to each node: we give the Drill distribution tar archive to YARN,
and YARN copies it to each node and untars the Drill files. We make a separate copy of the
(far smaller) set of user config files.

If the udf jar goes into a Drill folder ($DRILL_HOME/jars/3rdparty/udf), then the user would
have to rebuild the Drill tar file each time they add a udf jar. When I tried this myself
when building DoY, I found it to be slow and error-prone.

So, the solution is to place the udf code in the new “site” directory: $DRILL_SITE/jars.
That’s what that is for. Then, let DoY automatically distribute the code to every node.
Perfect! Except that it does not work to dynamically distribute code after Drill starts.

For DoY, the solution requirements are:

1. Distribute code using Drill itself, rather than manually copying jars to (unknown) Drill
directories.
2. Ensure the solution works even if another Drillbit is spun up later, and uses the original
Drill tar file.

I’m thinking we want to leverage DFS: place udf files into a well-known DFS directory. Register
the udf in, say, ZK. When a new Drillbit starts, it looks for new udf jars in ZK, copies
the files to a temporary location, and launches. An existing Drillbit is notified of the change
and goes through the same download process. Clean-up is needed at some point to remove ZK entries
if the udf jar becomes statically available on the next launch. That needs more thought.
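
To make the DFS + ZK idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: UdfRegistry is a hypothetical in-memory stand-in for the ZK entries, a local temporary directory plays the role of the well-known DFS directory, and none of the class or method names are Drill APIs.

```python
import shutil
import tempfile
from pathlib import Path


class UdfRegistry:
    """Hypothetical stand-in for the ZK registry: jar name -> DFS path."""

    def __init__(self):
        self._jars = {}
        self._listeners = []

    def register(self, name, dfs_path):
        self._jars[name] = Path(dfs_path)
        # Notify already-running Drillbits of the change.
        for notify in list(self._listeners):
            notify()

    def unregister(self, name):
        # Clean-up hook for jars that have become statically available.
        self._jars.pop(name, None)

    def jars(self):
        return dict(self._jars)

    def watch(self, callback):
        self._listeners.append(callback)


class Drillbit:
    """Hypothetical Drillbit that pulls registered jars to a temp location."""

    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.local_dir = Path(tempfile.mkdtemp(prefix=f"udf-{name}-"))
        self.loaded = set()
        registry.watch(self.sync)
        self.sync()  # catch up at launch on anything registered earlier

    def sync(self):
        # Copy every registered jar that is not yet present locally.
        for jar_name, dfs_path in self.registry.jars().items():
            if jar_name in self.loaded:
                continue
            shutil.copy(dfs_path, self.local_dir / jar_name)
            self.loaded.add(jar_name)


def demo():
    dfs_dir = Path(tempfile.mkdtemp(prefix="dfs-udf-"))  # stand-in for DFS
    jar = dfs_dir / "my-udf.jar"
    jar.write_bytes(b"fake jar contents")

    registry = UdfRegistry()
    running = Drillbit("a", registry)     # started before the jar exists
    registry.register("my-udf.jar", jar)  # the running Drillbit is notified
    late = Drillbit("b", registry)        # started later, syncs at launch
    return running, late
```

Both the Drillbit that was already running and the one started later end up with the jar, which covers the two requirements above. The sketch does not address ordering between nodes or clean-up timing.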

We’d still need the phases mentioned earlier to ensure consistency.
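
As a rough illustration of those phases (load everywhere, await confirmation, then enable planning), here is a small Python sketch. The Node class and register_function are hypothetical names, not Drill code, and the rollback on a failed load is an assumption added so that either all nodes have the function or none do.

```python
class Node:
    """Hypothetical Drillbit with separate execution and planning views."""

    def __init__(self, name):
        self.name = name
        self.executable = set()  # functions the execution engine can run
        self.plannable = set()   # functions the planner is allowed to use

    def load(self, fn):
        # Pass 1: make the function available to execution only.
        self.executable.add(fn)
        return True  # acknowledgement consumed in pass 2

    def enable_planning(self, fn):
        # Pass 3: only valid once the function has been loaded locally.
        if fn not in self.executable:
            raise RuntimeError(f"{fn} is not loaded on {self.name}")
        self.plannable.add(fn)

    def unload(self, fn):
        self.executable.discard(fn)
        self.plannable.discard(fn)


def register_function(nodes, fn):
    # Pass 1: ask each node to load the function.
    acks = []
    for node in nodes:
        try:
            acks.append(node.load(fn))
        except Exception:
            acks.append(False)

    # Pass 2: require confirmation from every node before going further.
    if not all(acks):
        # Assumed rollback so that no node keeps a partial registration.
        for node in nodes:
            node.unload(fn)
        return False

    # Pass 3: every node can execute the function, so planning is safe.
    for node in nodes:
        node.enable_planning(fn)
    return True
```

Because no node may plan with the function until every node has confirmed the load, a query planned on one Drillbit cannot reach a Drillbit that lacks the function.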

Suggestions anyone as to how to do this super simply & still get it to work with DoY?

Thanks,

- Paul
 
> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala <nrentachintala@maprtech.com> wrote:
> 
> This will need to work with YARN (once Drill is YARN-enabled, I would
> expect a lot of users to use it in conjunction with YARN).
> Paul, I am not clear why this wouldn't work with YARN. Can you elaborate?
> 
> -Neeraja
> 
> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <progers@maprtech.com> wrote:
> 
>> Good enough, as long as we document the limitation that this feature can’t
>> work with a YARN deployment, as users generally do not have access to the
>> temporary “localization” directories where the Drill code is placed by YARN.
>> 
>> Note that the jar distribution race condition occurs with the
>> proposed design: I believe I sketched out a scenario in one of the earlier
>> comments. Drillbit A receives the CREATE FUNCTION command and tells
>> Drillbit B. While A is still informing the other Drillbits, Drillbit B plans
>> and launches a query that uses the function. Drillbit Z starts execution of the
>> query before it learns from A about the new function. This will be rare:
>> just rare enough to create very hard-to-reproduce bugs.
>> 
>> The only reliable solution is to do the work in multiple passes:
>> 
>> Pass 1: Ask each node to load the function, but not make it available to
>> the planner. (It would be available to the execution engine.)
>> Pass 2: Await confirmation from each node that this is done.
>> Pass 3: Alert every node that it is now free to plan queries with the
>> function.
>> 
>> Finally, I wonder if we should design the SQL syntax based on a long-term
>> design, even if the feature itself is a short-term work-around. Changing
>> the syntax later might break scripts that users might write.
>> 
>> So, the question for the group is this: is the value of a semi-complete
>> feature sufficient to justify the potential problems?
>> 
>> - Paul
>> 
>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <pchandra@maprtech.com> wrote:
>>> 
>>> Moving discussion to dev.
>>> 
>>> I believe the aim is to do a simple implementation without the complexity
>>> of distributing the UDF. I think the document should make this limitation
>>> clear.
>>> 
>>> Per Paul's point about there being a simpler solution of just having each
>>> drillbit detect whether a UDF is present, I think the problem arises if a UDF
>>> gets deployed to some but not all drillbits. A query can then start
>>> executing but not run successfully. The intent of the create commands would
>>> be to ensure that either all drillbits have the UDF or none do.
>>>
>>> I think Jacques' point about ownership conflicts is not addressed clearly.
>>> Also, the unloading is not clear. The delete command should probably remove
>>> the UDF and unload it.
>>> 
>>> 
>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <progers@maprtech.com> wrote:
>>> 
>>>> Reviewed the spec; many comments posted. Three primary comments for the
>>>> community to consider.
>>>> 
>>>> 1. The design conflicts with the Drill-on-YARN project. Is this a specific
>>>> fix for one unique problem, or is it worth expanding the solution to work
>>>> with Drill-on-YARN deployments? It might be hard to make the two work
>>>> together later. See comments in docs for details.
>>>> 
>>>> 2. Have we, by chance, looked at how other projects handle code
>>>> distribution? Spark, Storm and others automatically deploy code across the
>>>> cluster; no manual distribution to each node. The key difference between
>>>> Drill and others is that, for Storm, say, code is associated with a job
>>>> (“topology” in Storm terms.) But, in Drill, functions are global and have
>>>> no obvious life cycle that suggests when the code can be unloaded.
>>>> 
>>>> 3. Have we considered the class loader, dependency and name space isolation
>>>> issues addressed by such products as Tomcat (web apps) or Eclipse
>>>> (plugins)? Putting user code in the same namespace as Drill code is quick
>>>> & dirty. It turns out, however, that doing so leads to problems that
>>>> require long, frustrating debugging sessions to resolve.
>>>> 
>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3 would
>>>> be a big increase in scope, so I won’t be surprised if we leave those issues
>>>> for later. (Though, addressing item 2 might be the best way to address item 1.)
>>>> 
>>>> If we want a very simple solution that requires minimal change, perhaps we
>>>> can use an even simpler approach. In the proposed design, the user still
>>>> must distribute code to all the nodes. The primary change is to tell Drill
>>>> to load (or unload) that code. Can we accomplish the same result more easily,
>>>> simply by having Drill periodically scan certain directories looking for new
>>>> (or removed) jars? This still won’t work with YARN, or solve the name space
>>>> issues, but will work for existing non-YARN Drill users without new SQL syntax.
>>>> 
>>>> Thanks,
>>>> 
>>>> - Paul
>>>> 
>>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>>>> 
>>>>> Two quick thoughts:
>>>>> 
>>>>> - (user) In the design document I didn't see any discussion of
>>>>> ownership/conflicts or unloading. It would be helpful to see the thinking
>>>>> there.
>>>>> - (dev) There is a row-oriented facade via the
>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good place
>>>>> to start when trying to implement an alternative interface.
>>>>> 
>>>>> 
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> 
>>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <john@omernik.com> wrote:
>>>>> 
>>>>>> Honestly, I don't see it as a priority issue. I think some of the ideas
>>>>>> around community Java UDFs could be a better approach. I'd hate to take
>>>>>> away from other work to hack in something like this.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <progers@maprtech.com> wrote:
>>>>>> 
>>>>>>> Ted refers to source code transformation. Drill gains its speed from value
>>>>>>> vectors. However, VVs are a far cry from the row-based interface that most
>>>>>>> mere mortals are accustomed to using. Since VVs are very type specific,
>>>>>>> code is typically generated to handle the specifics of each type. Accessing
>>>>>>> VVs in Jython may be a bit of a challenge because of the "impedance
>>>>>>> mismatch" between how VVs work and the row-and-column view expected by most
>>>>>>> (non-Drill) developers.
>>>>>>>
>>>>>>> I wonder if we've considered providing a row-oriented "facade" that can be
>>>>>>> used by roll-your-own data sources and user-defined row transforms? It might
>>>>>>> be a hiccup in the fast VV pipeline, but might be handy for users willing
>>>>>>> to trade a bit of speed for convenience. With such a facade, the Jython row
>>>>>>> transforms that John mentions could be quite simple.
>>>>>>> 
>>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Since UDFs use source code transformation, using Jython would be
>>>>>>>> difficult.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Charles,
>>>>>>>>> 
>>>>>>>>> Not that I am aware of. The proposed solution doesn't invent anything new;
>>>>>>>>> it just adds the possibility to add UDFs without a drillbit restart. But
>>>>>>>>> contributions are welcome.
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <cgivre@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Arina,
>>>>>>>>>> Has there been any discussion about making it possible, via Jython or
>>>>>>>>>> something similar, for users to write simple UDFs in Python?
>>>>>>>>>> My ideal would be to have this capability integrated in the web GUI such
>>>>>>>>>> that a user could write their UDF (in Python) right there, submit it, and
>>>>>>>>>> it would be deployed to Drill if it passes validation tests.
>>>>>>>>>> —C
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all!
>>>>>>>>>>> 
>>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill (
>>>>>>>>>>> https://issues.apache.org/jira/browse/DRILL-4726). There is a link to
>>>>>>>>>>> the design document in the Jira description.
>>>>>>>>>>> Comments or suggestions are welcome.
>>>>>>>>>>> 
>>>>>>>>>>> Kind regards
>>>>>>>>>>> Arina


  