drill-dev mailing list archives

From Arina Yelchiyeva <arina.yelchiy...@gmail.com>
Subject Re: Dynamic UDFs support
Date Fri, 24 Jun 2016 17:00:33 GMT
Hi all!

I have updated the design document.
Main changes:
1. Add the staging and registration DFS locations to Drill’s config.
2. The user is no longer responsible for copying jars to the drillbit nodes.
Now the user needs to copy jars into the staging DFS location, from where
drillbits will copy them to the local fs.
3. During UDF registration, jars will be moved to the DFS registration area.
4. During startup, a drillbit will copy all jars from the registration area,
so a newly added drillbit will have the same UDFs as the others.
5. Security issues - these will probably be added later as an enhancement.
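
A minimal sketch of the staging / registration / local flow described in the list above. Local temporary directories stand in for the DFS locations, and the names `register_jar`, `sync_local` and `demo` are hypothetical helpers for illustration, not part of Drill:

```python
import shutil
import tempfile
from pathlib import Path


def register_jar(staging, registration, jar_name):
    """Move a jar from the staging area into the registration area."""
    source = staging / jar_name
    if not source.is_file():
        raise FileNotFoundError(f"{jar_name} is not in the staging area")
    target = registration / jar_name
    if target.exists():
        raise FileExistsError(f"{jar_name} is already registered")
    shutil.move(str(source), str(target))
    return target


def sync_local(registration, local):
    """Copy every registered jar that is missing locally; return the names copied."""
    copied = []
    for jar in sorted(registration.glob("*.jar")):
        if not (local / jar.name).exists():
            shutil.copy2(jar, local / jar.name)
            copied.append(jar.name)
    return copied


def demo():
    """Run the flow once in a temporary directory and report what happened."""
    with tempfile.TemporaryDirectory() as root:
        dirs = {name: Path(root) / name
                for name in ("staging", "registration", "drillbit_a", "drillbit_b")}
        for directory in dirs.values():
            directory.mkdir()
        (dirs["staging"] / "my_udfs.jar").write_bytes(b"placeholder jar bytes")

        register_jar(dirs["staging"], dirs["registration"], "my_udfs.jar")
        return {
            "staging_left": [p.name for p in dirs["staging"].iterdir()],
            "drillbit_a": sync_local(dirs["registration"], dirs["drillbit_a"]),
            # A drillbit that starts later picks up the same jars.
            "drillbit_b": sync_local(dirs["registration"], dirs["drillbit_b"]),
        }
```

Calling `demo()` shows that the jar leaves the staging area on registration and that a drillbit started later ends up with the same jars as the first one.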

More details in the document:
https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit

Kind regards
Arina

On Fri, Jun 17, 2016 at 1:25 AM Paul Rogers <progers@maprtech.com> wrote:

> Hi All,
>
> To answer Arina on item 3: there is actually no good location on any local
> node to put the UDFs. Reason: DoY allows the admin to start a Drillbit on
> any available node. When it starts, a new, fresh copy of Drill will be
> downloaded, and this can happen after the user issued the CREATE command.
>
> What we need is a shared, secure distributed storage location from which
> Drillbits can download the needed jar files. Something like… DFS! Indeed,
> this is how YARN stores the Drill archive from which it creates the Drill
> install directory on each node. We can’t quite use YARN’s mechanism (YARN
> is aware only of the files uploaded when launching an app), but we can do
> something similar.
>
> So, brainstorming a bit…
>
> 1. Store the UDF jar in a pre-defined DFS location.
>
> 2. The CREATE function 1) uploads the jar to the DFS location, and 2)
> creates some kind of registry entry.
>
> 3. The DELETE function 1) deregisters the jar (and function), but 2) does
> not delete the jar (this allows in-flight queries to complete.)
>
> 4. Drillbits periodically check DFS for changed registrations, downloading
> any needed jars. (YARN, Spark, Storm and others already do something
> similar.)
>
> 5. Registry check is “forced” when processing a query with a function that
> is not currently registered. (Doing so resolves any possible race
> conditions.)
>
> 6. Some process (perhaps time based) removes old, unregistered jar files.
> (Or, we could get fancy and use reference counts. The reference count would
> be required if the user wants to delete, then recreate, the same function
> and jar to avoid conflict with in-flight queries.)
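
The periodic check and the "forced" check on an unknown function can be sketched as follows. This is an in-memory toy under stated assumptions: `SharedRegistry` stands in for whatever lives in DFS, the version counter is an assumed change-detection mechanism, and all class and method names are hypothetical rather than Drill APIs:

```python
class SharedRegistry:
    """Stands in for the registry kept in the shared DFS location."""

    def __init__(self):
        self.version = 0
        self.functions = {}  # function name -> jar name

    def create(self, function, jar):
        self.functions[function] = jar
        self.version += 1

    def delete(self, function):
        # Deregister only; the jar itself is kept for in-flight queries.
        self.functions.pop(function, None)
        self.version += 1


class Drillbit:
    """Keeps a local view of the registry and the jars it has downloaded."""

    def __init__(self, shared):
        self.shared = shared
        self.seen_version = -1
        self.functions = {}
        self.downloaded_jars = set()

    def sync(self):
        """Periodic check: refresh the local view if the registry changed."""
        if self.seen_version == self.shared.version:
            return False
        self.functions = dict(self.shared.functions)
        self.downloaded_jars.update(self.functions.values())
        self.seen_version = self.shared.version
        return True

    def resolve(self, function):
        """Look up a function, forcing a registry check on a miss."""
        if function not in self.functions:
            self.sync()
        if function not in self.functions:
            raise LookupError(f"unknown function: {function}")
        return self.functions[function]
```

A query that names a function registered since the last periodic `sync()` still resolves, because `resolve()` forces one more check before giving up. Cleanup of old jars (time based or reference counted) is left out of this sketch.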
>
> We can build security on this as follows:
>
> 1. Define permissions for who can write to the DFS location. Or, indeed,
> have subdirectories by user and grant each user permission only on their
> own UDF directory.
>
> 2. Provide separate registries for per-user functions (private) and global
> functions (public). Only the admin can add global functions. But, only the
> user that uploads a private function can use it.
>
> 3. Leverage the Java class loader to isolate UDFs in their own name space
> (see Eclipse & Tomcat for examples). That is, Drill can call into a UDF,
> UDFs can call selected Drill code, but UDFs can’t shadow Drill classes
> (accidentally or maliciously.) Plus, my function Foo won’t clash with your
> function Foo if both are private.
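
A minimal sketch of the separate private and public registries from item 2. It models only the visibility rules, not class loader isolation; `FunctionCatalog` and its methods are hypothetical names, and letting a private function take precedence over a public one of the same name is an assumption the thread does not settle:

```python
class FunctionCatalog:
    """Toy catalog with one public registry and one private registry per user."""

    def __init__(self, admins):
        self.admins = set(admins)
        self.public = {}    # function name -> implementation
        self.private = {}   # user -> {function name -> implementation}

    def register_public(self, user, name, impl):
        # Only an admin may add global functions.
        if user not in self.admins:
            raise PermissionError(f"{user} may not add global functions")
        self.public[name] = impl

    def register_private(self, user, name, impl):
        self.private.setdefault(user, {})[name] = impl

    def lookup(self, user, name):
        # Assumption: a user's private function wins over a public one.
        own = self.private.get(user, {})
        if name in own:
            return own[name]
        if name in self.public:
            return self.public[name]
        raise LookupError(f"{name} is not visible to {user}")
```

With this layout two users can each register a private `foo` without clashing, and neither sees the other's version.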
>
> Sorry that this has wandered a bit far from the original simple design,
> but the above may capture much of what folks expect in modern distributed
> big data systems.
>
> I wonder if a good next step might be to review the notes in the design
> doc, in the JIRA, and in this e-mail chain and to prepare a summary of
> technical requirements, and a proposed design. Postpone, at least for now,
> concerns about the amount of work; we can worry about that once folks agree
> on your revised design.
>
> Thanks,
>
> - Paul
>
>
> > On Jun 21, 2016, at 9:48 AM, Arina Yelchiyeva
> > <arina.yelchiyeva@gmail.com> wrote:
> >
> > 4. Authorization model mentioned by Julia and John
> > If the user doesn't have rights to copy jars to the UDF classpath, which
> > can be restricted by the file system, he won't be able to do much harm by
> > running the CREATE command. If UDFs from the jar were already registered,
> > the CREATE statement will fail. CREATE OR REPLACE will just re-register
> > the UDFs.
> > But the DELETE command is not safe. If the user knows the jar name, he can
> > delete all UDFs associated with it, as well as the binary and source jars.
> > That's where we'll probably need to impose restrictions.
> >
> > On Tue, Jun 21, 2016 at 7:34 PM Arina Yelchiyeva
> > <arina.yelchiyeva@gmail.com> wrote:
> >
> >> 1. DELETE command - I forgot to mention it in the document but had it in
> >> mind. When the user issues the DELETE command, all UDFs associated with
> >> the indicated jar are removed from DrillFunctionRegistry. Then the binary
> >> and source files are also deleted from the UDF classpath.
> >>
> >> 2. Distribution race condition described by Paul
> >> The user issues the CREATE command and gets confirmation that the UDFs
> >> are registered only if all drillbits have confirmed that registration
> >> was successful. I don't expect the user to start using UDFs in queries
> >> before the CREATE command returns its success / failure result, which is
> >> possible but strange.
> >>
> >> 3. DoY
> >> @Paul
> >> What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we use
> >> a $DRILL_UDF environment variable that is set during drillbit start
> >> (like $DRILL_LOG_DIR)? The location stored in this variable would be
> >> added to the Drill classpath during start.
> >> Would that ease DoY integration somehow?
> >>
> >> Kind regards
> >> Arina
> >>
> >> On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman
> >> <yufeldman@yahoo.com.invalid> wrote:
> >>
> >>> Just thoughts:
> >>> You can try to reuse the distributed cache. Let the Drill AM do the
> >>> needful in terms of orchestrating UDF jar distribution.
> >>> But
> >>> I would be inclined to have a common path that is independent of
> >>> whether it is Drill on YARN or not, as maintaining two separate ways of
> >>> dealing with loading/unloading UDFs will be painful and error prone.
> >>> One more note (I left a comment in the doc) - not sure about the
> >>> authorization model here - we need to have some.
> >>> Just my 2c
> >>> Thanks
> >>>
> >>>      From: Paul Rogers <progers@maprtech.com>
> >>> To: "dev@drill.apache.org" <dev@drill.apache.org>
> >>> Sent: Monday, June 20, 2016 7:32 PM
> >>> Subject: Re: Dynamic UDFs support
> >>>
> >>> Hi Neeraja,
> >>>
> >>> The proposal calls for the user to copy the jar file to each Drillbit
> >>> node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf
> >>> directory.
> >>>
> >>> In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
> >>> each node (which is good.) YARN puts that code in a location known only
> >>> to YARN. Since the location is private to YARN, the user can’t easily
> >>> hunt down the location in order to add the udf jar. Even if the user did
> >>> find the location, the next Drillbit to start would create a new copy of
> >>> the Drill software, without the udf jar.
> >>>
> >>> Second, in DoY we have separated user files from Drill software. This
> >>> makes it much easier to distribute the software to each node: we give
> >>> the Drill distribution tar archive to YARN, and YARN copies it to each
> >>> node and untars the Drill files. We make a separate copy of the (far
> >>> smaller) set of user config files.
> >>>
> >>> If the udf jar goes into a Drill folder ($DRILL_HOME/jars/3rdparty/udf),
> >>> then the user would have to rebuild the Drill tar file each time they
> >>> add a udf jar. When I tried this myself when building DoY, I found it to
> >>> be slow and error-prone.
> >>>
> >>> So, the solution is to place the udf code in the new “site” directory:
> >>> $DRILL_SITE/jars. That’s what that is for. Then, let DoY automatically
> >>> distribute the code to every node. Perfect! Except that it does not work
> >>> to dynamically distribute code after Drill starts.
> >>>
> >>> For DoY, the solution requirements are:
> >>>
> >>> 1. Distribute code using Drill itself, rather than manually copying jars
> >>> to (unknown) Drill directories.
> >>> 2. Ensure the solution works even if another Drillbit is spun up later,
> >>> and uses the original Drill tar file.
> >>>
> >>> I’m thinking we want to leverage DFS: place udf files into a well-known
> >>> DFS directory. Register the udf into, say, ZK. When a new Drillbit
> >>> starts, it looks for new udf jars in ZK, copies the file to a temporary
> >>> location, and launches. An existing Drill is notified of the change and
> >>> does the same download process. Clean-up is needed at some point to
> >>> remove ZK entries if the udf jar becomes statically available on the
> >>> next launch. That needs more thought.
> >>>
> >>> We’d still need the phases mentioned earlier to ensure consistency.
> >>>
> >>> Suggestions anyone as to how to do this super simply & still get it to
> >>> work with DoY?
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala
> >>>> <nrentachintala@maprtech.com> wrote:
> >>>>
> >>>> This will need to work with YARN (Once Drill is YARN enabled, I would
> >>>> expect a lot of users using it in conjunction with YARN).
> >>>> Paul, I am not clear why this wouldn't work with YARN. Can you
> >>>> elaborate?
> >>>>
> >>>> -Neeraja
> >>>>
> >>>> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <progers@maprtech.com>
> >>>> wrote:
> >>>>
> >>>>> Good enough, as long as we document the limitation that this feature
> >>>>> can’t work with YARN deployment as users generally do not have access
> >>>>> to the temporary “localization” directories where the Drill code is
> >>>>> placed by YARN.
> >>>>>
> >>>>> Note that the jar distribution race condition issue occurs with the
> >>>>> proposed design: I believe I sketched out a scenario in one of the
> >>>>> earlier comments. Drillbit A receives the CREATE FUNCTION command. It
> >>>>> tells Drillbit B. While informing the other Drillbits, Drillbit B
> >>>>> plans and launches a query that uses the function. Drillbit Z starts
> >>>>> execution of the query before it learns from A about the new function.
> >>>>> This will be rare, just rare enough to create very hard to reproduce
> >>>>> bugs.
> >>>>>
> >>>>> The only reliable solution is to do the work in multiple passes:
> >>>>>
> >>>>> Pass 1: Ask each node to load the function, but not make it available
> >>>>> to the planner. (It would be available to the execution engine.)
> >>>>> Pass 2: Await confirmation from each node that this is done.
> >>>>> Pass 3: Alert every node that it is now free to plan queries with the
> >>>>> function.
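
The three passes can be sketched with an in-memory toy. `Node`, `create_function` and the `fail_load` flag are hypothetical names for illustration, and rolling back when a node fails to confirm is an assumption based on the "all drillbits have the UDF or none" intent stated elsewhere in the thread:

```python
class Node:
    """Toy drillbit that tracks what it can execute and what it can plan."""

    def __init__(self, name, fail_load=False):
        self.name = name
        self.fail_load = fail_load
        self.loaded = set()     # usable by the execution engine
        self.plannable = set()  # visible to the planner

    def load(self, function):
        if self.fail_load:
            return False
        self.loaded.add(function)
        return True

    def unload(self, function):
        self.loaded.discard(function)
        self.plannable.discard(function)

    def enable_planning(self, function):
        self.plannable.add(function)


def create_function(nodes, function):
    # Pass 1: every node loads the function for execution only.
    acks = [node.load(function) for node in nodes]
    # Pass 2: require confirmation from every node; roll back otherwise.
    if not all(acks):
        for node in nodes:
            node.unload(function)
        return False
    # Pass 3: only now may any planner use the function.
    for node in nodes:
        node.enable_planning(function)
    return True
```

Because no planner sees the function until every node has loaded it, a query planned on one node cannot reach a node that is unable to execute it.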
> >>>>>
> >>>>> Finally, I wonder if we should design the SQL syntax based on a
> >>>>> long-term design, even if the feature itself is a short-term
> >>>>> work-around. Changing the syntax later might break scripts that users
> >>>>> might write.
> >>>>>
> >>>>> So, the question for the group is this: is the value of a
> >>>>> semi-complete feature sufficient to justify the potential problems?
> >>>>>
> >>>>> - Paul
> >>>>>
> >>>>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <pchandra@maprtech.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Moving discussion to dev.
> >>>>>>
> >>>>>> I believe the aim is to do a simple implementation without the
> >>>>>> complexity of distributing the UDF. I think the document should make
> >>>>>> this limitation clear.
> >>>>>>
> >>>>>> Per Paul's point on there being a simpler solution of just having
> >>>>>> each drillbit detect if a UDF is present, I think the problem is if a
> >>>>>> UDF gets deployed to some but not all drillbits. A query can then
> >>>>>> start executing but not run successfully. The intent of the create
> >>>>>> commands would be to ensure that all drillbits have the UDF or none
> >>>>>> would.
> >>>>>>
> >>>>>> I think Jacques' point about ownership conflicts is not addressed
> >>>>>> clearly. Also, the unloading is not clear. The delete command should
> >>>>>> probably remove the UDF and unload it.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <progers@maprtech.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Reviewed the spec; many comments posted. Three primary comments for
> >>>>>>> the community to consider.
> >>>>>>>
> >>>>>>> 1. The design conflicts with the Drill-on-YARN project. Is this a
> >>>>>>> specific fix for one unique problem, or is it worth expanding the
> >>>>>>> solution to work with Drill-on-YARN deployments? Might be hard to
> >>>>>>> make the two work together later. See comments in docs for details.
> >>>>>>>
> >>>>>>> 2. Have we, by chance, looked at how other projects handle code
> >>>>>>> distribution? Spark, Storm and others automatically deploy code
> >>>>>>> across the cluster; no manual distribution to each node. The key
> >>>>>>> difference between Drill and others is that, for Storm, say, code is
> >>>>>>> associated with a job (“topology” in Storm terms.) But, in Drill,
> >>>>>>> functions are global and have no obvious life cycle that suggests
> >>>>>>> when the code can be unloaded.
> >>>>>>>
> >>>>>>> 3. Have we considered the class loader, dependency and name space
> >>>>>>> isolation issues addressed by such products as Tomcat (web apps) or
> >>>>>>> Eclipse (plugins)? Putting user code in the same namespace as Drill
> >>>>>>> code is quick & dirty. It turns out, however, that doing so leads to
> >>>>>>> problems that require long, frustrating debugging sessions to
> >>>>>>> resolve.
> >>>>>>>
> >>>>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3
> >>>>>>> is a big increase in scope, so I won’t be surprised if we leave
> >>>>>>> those issues for later. (Though, addressing item 2 might be the best
> >>>>>>> way to address item 1.)
> >>>>>>>
> >>>>>>> If we want a very simple solution that requires minimal change,
> >>>>>>> perhaps we can use an even simpler solution. In the proposed design,
> >>>>>>> the user still must distribute code to all the nodes. The primary
> >>>>>>> change is to tell Drill to load (or unload) that code. Can we
> >>>>>>> accomplish the same result more easily, simply by having Drill
> >>>>>>> periodically scan certain directories looking for new (or removed)
> >>>>>>> jars? Still won’t work with YARN, or solve the name space issues,
> >>>>>>> but will work for existing non-YARN Drill users without new SQL
> >>>>>>> syntax.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> - Paul
> >>>>>>>
> >>>>>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <jacques@dremio.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Two quick thoughts:
> >>>>>>>>
> >>>>>>>> - (user) In the design document I didn't see any discussion of
> >>>>>>>> ownership/conflicts or unloading. Would be helpful to see the
> >>>>>>>> thinking there
> >>>>>>>> - (dev) There is a row oriented facade via the
> >>>>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good
> >>>>>>>> place to start when trying to implement an alternative interface.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jacques Nadeau
> >>>>>>>> CTO and Co-Founder, Dremio
> >>>>>>>>
> >>>>>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <john@omernik.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Honestly, I don't see it as a priority issue. I think some of the
> >>>>>>>>> ideas around community java UDFs could be a better approach. I'd
> >>>>>>>>> hate to take away from other work to hack in something like this.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <progers@maprtech.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Ted refers to source code transformation. Drill gains its speed
> >>>>>>>>>> from value vectors. However, VVs are a far cry from the row-based
> >>>>>>>>>> interface that most mere mortals are accustomed to using. Since
> >>>>>>>>>> VVs are very type specific, code is typically generated to handle
> >>>>>>>>>> the specifics of each type. Accessing VVs in Jython may be a bit
> >>>>>>>>>> of a challenge because of the "impedance mismatch" between how VVs
> >>>>>>>>>> work and the row-and-column view expected by most (non-Drill)
> >>>>>>>>>> developers.
> >>>>>>>>>>
> >>>>>>>>>> I wonder if we've considered providing a row-oriented "facade"
> >>>>>>>>>> that can be used by roll-your-own data sources and user-defined
> >>>>>>>>>> row transforms? Might be a hiccup in the fast VV pipeline, but
> >>>>>>>>>> might be handy for users willing to trade a bit of speed for
> >>>>>>>>>> convenience. With such a facade, the Jython row transforms that
> >>>>>>>>>> John mentions could be quite simple.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning
> >>>>>>>>>> <ted.dunning@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Since UDFs use source code transformation, using Jython would be
> >>>>>>>>>>> difficult.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva
> >>>>>>>>>>> <arina.yelchiyeva@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Charles,
> >>>>>>>>>>>>
> >>>>>>>>>>>> not that I am aware of. The proposed solution doesn't invent
> >>>>>>>>>>>> anything new, it just adds the possibility to add UDFs without
> >>>>>>>>>>>> a drillbit restart. But contributions are welcome.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <cgivre@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Arina,
> >>>>>>>>>>>>> Has there been any discussion about making it possible via
> >>>>>>>>>>>>> Jython or something for users to write simple UDFs in Python?
> >>>>>>>>>>>>> My ideal would be to have this capability integrated in the web
> >>>>>>>>>>>>> GUI such that a user could write their UDF (in Python) right
> >>>>>>>>>>>>> there, submit it and it would be deployed to Drill if it passes
> >>>>>>>>>>>>> validation tests.
> >>>>>>>>>>>>> —C
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva
> >>>>>>>>>>>>>> <arina.yelchiyeva@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill
> >>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726). There is
> >>>>>>>>>>>>>> a link to the design document in the Jira description.
> >>>>>>>>>>>>>> Comments or suggestions are welcome.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Kind regards
> >>>>>>>>>>>>>> Arina
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >>
> >>
>
>
