drill-dev mailing list archives

From Parth Chandra <pchan...@maprtech.com>
Subject Re: Dynamic UDFs support
Date Wed, 20 Jul 2016 21:37:24 GMT
My notes from the hangout with Arina and Paul -

Notes -

There are two invariants for the registration process -
1) There is a registration/validated directory in the DFS that contains
UDFs that have been validated by the registering foreman. All drillbits
will have access to this directory and, on startup and/or UDF registration,
the jars in this directory are synced to a local UDF directory.
2) During the process of registration, the registering foreman creates a
Zookeeper node that indicates that one or more drillbits have not yet
registered the UDF.

The basic workflow is that UDF jars are copied from the staging directory
to the registration directory and validated. Once they are validated, the
available drillbits are told to register the UDF. Registering the UDF
consists of copying the jar to a local UDF directory and updating the
local (in-memory) UDF registry. A sentinel node in Zookeeper is used to
track when all the drillbits have registered the UDF.
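
As a rough illustration of the sync step (not Drill code), here is a minimal
Python sketch. It assumes plain local directories stand in for the DFS
registration directory and the drillbit's local UDF directory; the function
name sync_udf_jars is hypothetical.

```python
import shutil
from pathlib import Path


def sync_udf_jars(registration_dir, local_dir):
    """Copy jars that exist in the registration directory but not locally.

    Both arguments are pathlib.Path objects. Returns the sorted names of
    the jars that were copied by this call.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(registration_dir.glob("*.jar")):
        target = local_dir / jar.name
        if not target.exists():
            # Only validated jars live in the registration directory,
            # so a plain copy is enough in this sketch.
            shutil.copy2(jar, target)
            copied.append(jar.name)
    return copied
```

A drillbit could run the same function at startup and again when told to
register a UDF; a second call copies nothing if the directories already match.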

There were two main suggestions: immediate registration and lazy
registration.

Immediate registration -
   Foreman tells all drillbits to register. Creates a Zookeeper node to
   track.
   Every drillbit makes a local copy and updates the Zookeeper node to
   show it is done.
   Foreman checks the Zookeeper node and, when all available drillbits
   have acknowledged, sends a message to all drillbits to complete
   registration.
   Foreman removes the ZK node.
   All drillbits update their local UDF registry.
   Drillbit startup will block if there is a ZK node indicating
   registration is in progress.
   This approach needs to be validated to see if any race conditions exist.
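
The immediate-registration steps can be sketched roughly as follows. This is
a single-process Python toy, not Drill or ZooKeeper code: a plain dict stands
in for ZooKeeper, and the names Drillbit, register_immediately,
startup_blocked, the "responsive" flag and the "/udf/pending/" path are all
hypothetical. Leaving the tracking node in place when a drillbit does not
acknowledge is an assumption of the sketch, not something decided in the
hangout.

```python
class Drillbit:
    """Toy drillbit: a set of local jars plus an in-memory UDF registry."""

    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive
        self.local_jars = set()
        self.registry = set()

    def copy_jar(self, jar):
        """Make a local copy; return True if the copy was made."""
        if not self.responsive:
            return False
        self.local_jars.add(jar)
        return True

    def complete_registration(self, jar):
        """Add the jar's UDFs to the in-memory registry."""
        if jar not in self.local_jars:
            raise RuntimeError(f"{self.name} has no local copy of {jar}")
        self.registry.add(jar)


def register_immediately(jar, drillbits, zk):
    """Run the two phases; zk maps a node path to the set of pending names."""
    path = "/udf/pending/" + jar
    zk[path] = {d.name for d in drillbits}   # foreman creates tracking node
    for d in drillbits:                      # phase 1: copy, then acknowledge
        if d.copy_jar(jar):
            zk[path].discard(d.name)
    if zk[path]:                             # not everyone acknowledged
        return False
    del zk[path]                             # foreman removes the node
    for d in drillbits:                      # phase 2: update registries
        d.complete_registration(jar)
    return True


def startup_blocked(zk):
    """A starting drillbit waits while any registration is in progress."""
    return any(p.startswith("/udf/pending/") for p in zk)
```

Because everything runs sequentially here, the sketch says nothing about the
race conditions mentioned above; it only fixes the order of the steps.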

Lazy registration -
   Once a UDF is copied to the registration folder, the UDF is essentially
   registered. On first use, a drillbit may hit a ClassNotFoundException,
   in which case it will look for the UDF in the registration directory.
   If found, it will copy the jar to the local directory and add the UDF
   to its local registry.
   This approach should be investigated to see if it fits in with the
   current UDF execution code.
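
For comparison, a minimal Python sketch of the lazy path. It is an
illustration under stated assumptions, not Drill's function registry: a dict
stands in for the registration directory, a failed local lookup stands in for
the class-not-found case, and the class name LazyRegistry is hypothetical.

```python
class LazyRegistry:
    """Local registry that falls back to the shared registration area."""

    def __init__(self, registration_dir):
        # registration_dir: dict of function name -> callable (stand-in
        # for validated jars in the DFS registration directory).
        self.registration_dir = registration_dir
        self.local = {}

    def lookup(self, name):
        """Return the function, pulling it in locally on first use."""
        if name in self.local:
            return self.local[name]
        # Local miss: the stand-in for hitting a class-not-found error.
        if name not in self.registration_dir:
            raise LookupError(f"function {name!r} is not registered")
        # "Copy to the local directory" and add to the local registry.
        self.local[name] = self.registration_dir[name]
        return self.local[name]
```

The first lookup of a registered function pays the copy cost; later lookups
are served from the local registry, and unknown functions still fail.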


On Mon, Jul 18, 2016 at 3:36 PM, Parth Chandra <pchandra@maprtech.com>
wrote:

> +1 on simplifying the design and postponing the items Paul has suggested.
>
> Arina, Paul, I think we need to work out some of the design related to
> registering the UDF. Are you guys open for a quick hangout at 10 a.m. PDT
> tomorrow?
>
>
>
> On Thu, Jul 14, 2016 at 1:46 PM, Paul Rogers <progers@maprtech.com> wrote:
>
>> Hi All,
>>
>> We’ve had quite a lively debate in the “comments” section of Arina’s
>> wonderful design doc. Zelaine made a great suggestion: summarize the user
>> experience as a way of making sense of the wealth of detailed comments.
>>
>> IMHO, the most important user experience goals are:
>>
>> 1. When a user submits a CREATE FUNCTION command, the command returns
>> quickly (within a few seconds at most.)
>> 2. If the above user then issues a query using that function (to the same
>> Foreman), that query is guaranteed to successfully use the new function on
>> all nodes.
>> 3. Other users, connecting to any Foreman will see a very clean behavior
>> when submitting a query with the new function. Before some point in time
>> (can be different for each Foreman), a query with the function fails in
>> planning. After that point, queries are guaranteed to successfully use the
>> new function on all nodes.
>>
>> Basically, this says that CREATE FUNCTION can’t (potentially) take a long
>> time. Use of functions can’t result in random failures during the time that
>> the function is propagated across Drillbits.
>>
>> The goals we can perhaps postpone are:
>>
>> 1. Class name space isolation. (Allows two data scientists to define the
>> same class without collisions.)
>> 2. Function name spaces. (Allows me to define “paul.foo” and you to
>> define “bob.foo” without collisions. Needed if many people develop
>> functions independently. Else, we need a global name space.)
>> 3. Dynamic DROP FUNCTION operation. (The issues here are messy, and it
>> requires unloading classes and name space cleanup.) (Just let the cleanup
>> happen offline.)
>> 4. Dependency jars (e.g. third party libraries, etc.) (We require those
>> to be statically added to the class path before Drill starts.)
>>
>> We are not creating per-user name spaces, or allowing people to use
>> production clusters to try/revise functions. We’re just sampling deployment
>> of simple functions.
>>
>> That’s my suggestion, what do others suggest?
>>
>> Thanks,
>>
>> - Paul
>>
>> > On Jul 7, 2016, at 12:32 PM, Arina Yelchiyeva <
>> arina.yelchiyeva@gmail.com> wrote:
>> >
>> > I also agree on using Zookeeper. I have re-worked dynamic UDF support
>> > document taking into account Zookeeper usage.
>> >
>> > Link to the document -
>> >
>> https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit
>> >
>> > Kind regards
>> > Arina
>> >
>> > On Tue, Jun 28, 2016 at 12:55 AM Paul Rogers <progers@maprtech.com>
>> wrote:
>> >
>> >> Great idea! We already use ZK to track storage plugins. ZK is perhaps
>> >> better suited to register each jar and/or function than using files in
>> >> DFS. Still need to work out the proper sequencing. But you are right,
>> >> this is the kind of thing that ZK is supposed to solve.
>> >>
>> >> - Paul
>> >>
>> >>
>> >>> On Jun 27, 2016, at 2:01 PM, Parth Chandra <parthc@apache.org> wrote:
>> >>>
>> >>> Reading thru some of Paul's comments on maintaining a consistent state
>> >>> for the registration of the UDF, it looks like we need a consensus
>> >>> protocol for determining that all the Drillbits have the UDF deployed.
>> >>> I believe Zookeeper can provide a stronger guarantee than a 2 phase
>> >>> approach. Should we look into that?
>> >>>
>> >>> On Fri, Jun 24, 2016 at 10:00 AM, Arina Yelchiyeva <
>> >>> arina.yelchiyeva@gmail.com> wrote:
>> >>>
>> >>>> Hi all!
>> >>>>
>> >>>> I have updated the design document.
>> >>>> Main changes:
>> >>>> 1. Add the staging and registration DFS locations to Drill’s config.
>> >>>> 2. User is no longer responsible for copying jars onto drillbit nodes.
>> >>>> Now the user needs to copy jars into the staging DFS location, from
>> >>>> where drillbits will copy them to the local fs.
>> >>>> 3. During UDF registration, jars will be moved to the DFS registration
>> >>>> area.
>> >>>> 4. During startup, a drillbit will copy all jars from the registration
>> >>>> area, so a newly added drillbit will have the same UDFs as the others.
>> >>>> 5. Security issues - probably they will be added later as an
>> >>>> enhancement.
>> >>>>
>> >>>> More details in the document:
>> >>>>
>> >>>>
>> >>
>> https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit
>> >>>>
>> >>>> Kind regards
>> >>>> Arina
>> >>>>
>> >>>> On Fri, Jun 17, 2016 at 1:25 AM Paul Rogers <progers@maprtech.com>
>> >> wrote:
>> >>>>
>> >>>>> Hi All,
>> >>>>>
>> >>>>> To answer Arina on item 3: there is actually no good location on any
>> >>>>> local node to put the UDFs. Reason: DoY allows the admin to start a
>> >>>>> Drillbit on any available node. When it starts, a new, fresh copy of
>> >>>>> Drill will be downloaded, and this can happen after the user issued
>> >>>>> the CREATE command.
>> >>>>>
>> >>>>> What we need is a shared, secure distributed storage location from
>> >>>>> which Drillbits can download the needed jar files. Something like…
>> >>>>> DFS! Indeed, this is how YARN stores the Drill archive from which it
>> >>>>> creates the Drill install directory on each node. We can’t quite use
>> >>>>> YARN’s mechanism (YARN is aware only of the files uploaded when
>> >>>>> launching an app), but we can do something similar.
>> >>>>>
>> >>>>> So, brainstorming a bit…
>> >>>>>
>> >>>>> 1. Store the UDF jar in a pre-defined DFS location.
>> >>>>>
>> >>>>> 2. The CREATE function 1) uploads the jar to the DFS location, and 2)
>> >>>>> creates some kind of registry entry.
>> >>>>>
>> >>>>> 3. The DELETE function 1) deregisters the jar (and function), but 2)
>> >>>>> does not delete the jar (this allows in-flight queries to complete.)
>> >>>>>
>> >>>>> 4. Drillbits periodically check DFS for changed registrations,
>> >>>>> downloading any needed jars. (YARN, Spark, Storm and others already do
>> >>>>> something similar.)
>> >>>>>
>> >>>>> 5. Registry check is “forced” when processing a query with a function
>> >>>>> that is not currently registered. (Doing so resolves any possible race
>> >>>>> conditions.)
>> >>>>>
>> >>>>> 6. Some process (perhaps time based) removes old, unregistered jar
>> >>>>> files. (Or, we could get fancy and use reference counts. The reference
>> >>>>> count would be required if the user wants to delete, then recreate,
>> >>>>> the same function and jar to avoid conflict with in-flight queries.)
>> >>>>>
>> >>>>> We can build security on this as follows:
>> >>>>>
>> >>>>> 1. Define permissions for who can write to the DFS location. Or,
>> >>>>> indeed, have subdirectories by user and grant each user permission
>> >>>>> only on their own UDF directory.
>> >>>>>
>> >>>>> 2. Provide separate registries for per-user functions (private) and
>> >>>>> global functions (public). Only the admin can add global functions.
>> >>>>> But, only the user that uploads a private function can use it.
>> >>>>>
>> >>>>> 3. Leverage the Java class loader to isolate UDFs in their own name
>> >>>>> space (see Eclipse & Tomcat for examples). That is, Drill can call
>> >>>>> into a UDF, UDFs can call selected Drill code, but UDFs can’t shadow
>> >>>>> Drill classes (accidentally or maliciously.) Plus, my function Foo
>> >>>>> won’t clash with your function Foo if both are private.
>> >>>>>
>> >>>>> Sorry that this has wandered a bit far from the original simple
>> >>>>> design, but the above may capture much of what folks expect in modern
>> >>>>> distributed big data systems.
>> >>>>>
>> >>>>> I wonder if a good next step might be to review the notes in the
>> >>>>> design doc, in the JIRA, and in this e-mail chain and to prepare a
>> >>>>> summary of technical requirements, and a proposed design. Postpone, at
>> >>>>> least for now, concerns about the amount of work; we can worry about
>> >>>>> that once folks agree on your revised design.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> - Paul
>> >>>>>
>> >>>>>
>> >>>>>> On Jun 21, 2016, at 9:48 AM, Arina Yelchiyeva <
>> >>>>> arina.yelchiyeva@gmail.com> wrote:
>> >>>>>>
>> >>>>>> 4. Authorization model mentioned by Julia and John
>> >>>>>> If the user doesn't have rights to copy jars to the UDF classpath,
>> >>>>>> which can be restricted by the file system, he won't be able to do
>> >>>>>> much harm by running the CREATE command. If UDFs from the jar were
>> >>>>>> already registered, the CREATE statement will fail. CREATE OR REPLACE
>> >>>>>> will just re-register UDFs.
>> >>>>>> But the DELETE command is not safe. If the user knows the jar name, he
>> >>>>>> can delete all UDFs associated with it, as well as the binary and
>> >>>>>> source jars. That's where we'll probably need to impose restrictions.
>> >>>>>>
>> >>>>>> On Tue, Jun 21, 2016 at 7:34 PM Arina Yelchiyeva <
>> >>>>> arina.yelchiyeva@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> 1. DELETE command - I missed indicating it in the document but had it
>> >>>>>>> in mind. When the user issues the DELETE command, all UDFs associated
>> >>>>>>> with the indicated jar are removed from DrillFunctionRegistry. Then
>> >>>>>>> the binary and source files are also deleted from the UDF classpath.
>> >>>>>>>
>> >>>>>>> 2. Distribution race condition described by Paul
>> >>>>>>> The user issues the CREATE command and gets confirmation that the UDFs
>> >>>>>>> are registered only if all drillbits have confirmed that registration
>> >>>>>>> was successful. I don't expect the user to start using UDFs in queries
>> >>>>>>> before the CREATE command reports success or failure, which is
>> >>>>>>> possible but strange.
>> >>>>>>>
>> >>>>>>> 3. DoY
>> >>>>>>> @Paul
>> >>>>>>> What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we
>> >>>>>>> use a $DRILL_UDF environment variable which will be set during
>> >>>>>>> drillbit start (like $DRILL_LOG_DIR)? The location stored in this
>> >>>>>>> variable will be added to the Drill classpath during start.
>> >>>>>>> Will it ease DoY integration somehow?
>> >>>>>>>
>> >>>>>>> Kind regards
>> >>>>>>> Arina
>> >>>>>>>
>> >>>>>>> On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman
>> >>>>> <yufeldman@yahoo.com.invalid>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Just thoughts:
>> >>>>>>>> You can try to reuse distributed cache. Let Drill AM do the needful in
>> >>>>>>>> terms of orchestrating UDF jars distribution.
>> >>>>>>>> But I would be inclined to have a common path that is independent of
>> >>>>>>>> the fact that it is Drill on YARN or not, as maintaining two separate
>> >>>>>>>> ways of dealing with loading/unloading UDFs will be painful and error
>> >>>>>>>> prone.
>> >>>>>>>> One more note (I left a comment in the doc) - not sure about
>> >>>>>>>> authorization model here - we need to have some.
>> >>>>>>>> Just my 2c
>> >>>>>>>> Thanks
>> >>>>>>>>
>> >>>>>>>>    From: Paul Rogers <progers@maprtech.com>
>> >>>>>>>> To: "dev@drill.apache.org" <dev@drill.apache.org>
>> >>>>>>>> Sent: Monday, June 20, 2016 7:32 PM
>> >>>>>>>> Subject: Re: Dynamic UDFs support
>> >>>>>>>>
>> >>>>>>>> Hi Neeraja,
>> >>>>>>>>
>> >>>>>>>> The proposal calls for the user to copy the jar file to each Drillbit
>> >>>>>>>> node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf
>> >>>>>>>> directory.
>> >>>>>>>>
>> >>>>>>>> In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
>> >>>>>>>> each node (which is good.) YARN puts that code in a location known
>> >>>>>>>> only to YARN. Since the location is private to YARN, the user can’t
>> >>>>>>>> easily hunt down the location in order to add the udf jar. Even if the
>> >>>>>>>> user did find the location, the next Drillbit to start would create a
>> >>>>>>>> new copy of the Drill software, without the udf jar.
>> >>>>>>>>
>> >>>>>>>> Second, in DoY we have separated user files from Drill software. This
>> >>>>>>>> makes it much easier to distribute the software to each node: we give
>> >>>>>>>> the Drill distribution tar archive to YARN, and YARN copies it to each
>> >>>>>>>> node and untars the Drill files. We make a separate copy of the (far
>> >>>>>>>> smaller) set of user config files.
>> >>>>>>>>
>> >>>>>>>> If the udf jar goes into a Drill folder
>> >>>>>>>> ($DRILL_HOME/jars/3rdparty/udf), then the user would have to rebuild
>> >>>>>>>> the Drill tar file each time they add a udf jar. When I tried this
>> >>>>>>>> myself when building DoY, I found it to be slow and error-prone.
>> >>>>>>>>
>> >>>>>>>> So, the solution is to place the udf code in the new “site” directory:
>> >>>>>>>> $DRILL_SITE/jars. That’s what that is for. Then, let DoY automatically
>> >>>>>>>> distribute the code to every node. Perfect! Except that it does not
>> >>>>>>>> work to dynamically distribute code after Drill starts.
>> >>>>>>>>
>> >>>>>>>> For DoY, the solution requirements are:
>> >>>>>>>>
>> >>>>>>>> 1. Distribute code using Drill itself, rather than manually copying
>> >>>>>>>> jars to (unknown) Drill directories.
>> >>>>>>>> 2. Ensure the solution works even if another Drillbit is spun up
>> >>>>>>>> later, and uses the original Drill tar file.
>> >>>>>>>>
>> >>>>>>>> I’m thinking we want to leverage DFS: place udf files into a
>> >>>>>>>> well-known DFS directory. Register the udf into, say, ZK. When a new
>> >>>>>>>> Drillbit starts, it looks for new udf jars in ZK, copies the file to a
>> >>>>>>>> temporary location, and launches. An existing Drill is notified of the
>> >>>>>>>> change and does the same download process. Clean-up is needed at some
>> >>>>>>>> point to remove ZK entries if the udf jar becomes statically available
>> >>>>>>>> on the next launch. That needs more thought.
>> >>>>>>>>
>> >>>>>>>> We’d still need the phases mentioned earlier to ensure consistency.
>> >>>>>>>>
>> >>>>>>>> Suggestions anyone as to how to do this super simply & still get it to
>> >>>>>>>> work with DoY?
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>>
>> >>>>>>>> - Paul
>> >>>>>>>>
>> >>>>>>>>> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala
<
>> >>>>>>>> nrentachintala@maprtech.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> This will need to work with YARN (Once Drill is YARN enabled, I would
>> >>>>>>>>> expect a lot of users using it in conjunction with YARN).
>> >>>>>>>>> Paul, I am not clear why this wouldn't work with YARN. Can you
>> >>>>>>>>> elaborate?
>> >>>>>>>>>
>> >>>>>>>>> -Neeraja
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers
<
>> progers@maprtech.com
>> >>>
>> >>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Good enough, as long as we document the limitation that this feature
>> >>>>>>>>>> can’t work with YARN deployment as users generally do not have access
>> >>>>>>>>>> to the temporary “localization” directories where the Drill code is
>> >>>>>>>>>> placed by YARN.
>> >>>>>>>>>>
>> >>>>>>>>>> Note that the jar distribution race condition issue occurs with the
>> >>>>>>>>>> proposed design: I believe I sketched out a scenario in one of the
>> >>>>>>>>>> earlier comments. Drillbit A receives the CREATE FUNCTION command. It
>> >>>>>>>>>> tells Drillbit B. While informing the other Drillbits, Drillbit B
>> >>>>>>>>>> plans and launches a query that uses the function. Drillbit Z starts
>> >>>>>>>>>> execution of the query before it learns from A about the new
>> >>>>>>>>>> function. This will be rare - just rare enough to create very hard to
>> >>>>>>>>>> reproduce bugs.
>> >>>>>>>>>>
>> >>>>>>>>>> The only reliable solution is to do the work in multiple passes:
>> >>>>>>>>>>
>> >>>>>>>>>> Pass 1: Ask each node to load the function, but not make it available
>> >>>>>>>>>> to the planner. (It would be available to the execution engine.)
>> >>>>>>>>>> Pass 2: Await confirmation from each node that this is done.
>> >>>>>>>>>> Pass 3: Alert every node that it is now free to plan queries with the
>> >>>>>>>>>> function.
>> >>>>>>>>>>
>> >>>>>>>>>> Finally, I wonder if we should design the SQL syntax based on a
>> >>>>>>>>>> long-term design, even if the feature itself is a short-term
>> >>>>>>>>>> work-around. Changing the syntax later might break scripts that users
>> >>>>>>>>>> might write.
>> >>>>>>>>>>
>> >>>>>>>>>> So, the question for the group is this: is the value of a
>> >>>>>>>>>> semi-complete feature sufficient to justify the potential problems?
>> >>>>>>>>>>
>> >>>>>>>>>> - Paul
>> >>>>>>>>>>
>> >>>>>>>>>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra
<
>> >> pchandra@maprtech.com
>> >>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Moving discussion to dev.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I believe the aim is to do a simple implementation without the
>> >>>>>>>>>>> complexity of distributing the UDF. I think the document should make
>> >>>>>>>>>>> this limitation clear.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Per Paul's point on there being a simpler solution of just having
>> >>>>>>>>>>> each drillbit detect if a UDF is present, I think the problem is if a
>> >>>>>>>>>>> UDF gets deployed to some but not all drillbits. A query can then
>> >>>>>>>>>>> start executing but not run successfully. The intent of the create
>> >>>>>>>>>>> commands would be to ensure that all drillbits have the UDF or none
>> >>>>>>>>>>> would.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I think Jacques' point about ownership conflicts is not addressed
>> >>>>>>>>>>> clearly. Also, the unloading is not clear. The delete command should
>> >>>>>>>>>>> probably remove the UDF and unload it.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul
Rogers <
>> >>>> progers@maprtech.com
>> >>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Reviewed the spec; many comments posted. Three primary comments for
>> >>>>>>>>>>>> the community to consider.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 1. The design conflicts with the Drill-on-YARN project. Is this a
>> >>>>>>>>>>>> specific fix for one unique problem, or is it worth expanding the
>> >>>>>>>>>>>> solution to work with Drill-on-YARN deployments? Might be hard to
>> >>>>>>>>>>>> make the two work together later. See comments in docs for details.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 2. Have we, by chance, looked at how other projects handle code
>> >>>>>>>>>>>> distribution? Spark, Storm and others automatically deploy code
>> >>>>>>>>>>>> across the cluster; no manual distribution to each node. The key
>> >>>>>>>>>>>> difference between Drill and others is that, for Storm, say, code is
>> >>>>>>>>>>>> associated with a job (“topology” in Storm terms.) But, in Drill,
>> >>>>>>>>>>>> functions are global and have no obvious life cycle that suggests
>> >>>>>>>>>>>> when the code can be unloaded.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 3. Have we considered the class loader, dependency and name space
>> >>>>>>>>>>>> isolation issues addressed by such products as Tomcat (web apps) or
>> >>>>>>>>>>>> Eclipse (plugins)? Putting user code in the same namespace as Drill
>> >>>>>>>>>>>> code is quick & dirty. It turns out, however, that doing so leads to
>> >>>>>>>>>>>> problems that require long, frustrating debugging sessions to
>> >>>>>>>>>>>> resolve.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3
>> >>>>>>>>>>>> is a big increase in scope, so I won’t be surprised if we leave
>> >>>>>>>>>>>> those issues for later. (Though, addressing item 2 might be the best
>> >>>>>>>>>>>> way to address item 1.)
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> If we want a very simple solution that requires minimal change,
>> >>>>>>>>>>>> perhaps we can use an even simpler solution. In the proposed design,
>> >>>>>>>>>>>> the user still must distribute code to all the nodes. The primary
>> >>>>>>>>>>>> change is to tell Drill to load (or unload) that code. Can we
>> >>>>>>>>>>>> accomplish the same result more easily, simply by having Drill
>> >>>>>>>>>>>> periodically scan certain directories looking for new (or removed)
>> >>>>>>>>>>>> jars? Still won’t work with YARN, or solve the name space issues,
>> >>>>>>>>>>>> but will work for existing non-YARN Drill users without new SQL
>> >>>>>>>>>>>> syntax.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Paul
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> On Jun 16, 2016, at 2:07 PM,
Jacques Nadeau <
>> >> jacques@dremio.com
>> >>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Two quick thoughts:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> - (user) In the design document I didn't see any discussion of
>> >>>>>>>>>>>>> ownership/conflicts or unloading. Would be helpful to see the
>> >>>>>>>>>>>>> thinking there
>> >>>>>>>>>>>>> - (dev) There is a row oriented facade via the
>> >>>>>>>>>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good
>> >>>>>>>>>>>>> place to start when trying to implement an alternative interface.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> --
>> >>>>>>>>>>>>> Jacques Nadeau
>> >>>>>>>>>>>>> CTO and Co-Founder, Dremio
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Jun 16, 2016 at 11:32
AM, John Omernik <
>> >>>> john@omernik.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Honestly, I don't see it as a priority issue. I think some of the
>> >>>>>>>>>>>>>> ideas around community java UDFs could be a better approach. I'd
>> >>>>>>>>>>>>>> hate to take away from other work to hack in something like this.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Thu, Jun 16, 2016 at
1:19 PM, Paul Rogers <
>> >>>>> progers@maprtech.com
>> >>>>>>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Ted refers to source code transformation. Drill gains its speed
>> >>>>>>>>>>>>>>> from value vectors. However, VVs are a far cry from the row-based
>> >>>>>>>>>>>>>>> interface that most mere mortals are accustomed to using. Since
>> >>>>>>>>>>>>>>> VVs are very type specific, code is typically generated to handle
>> >>>>>>>>>>>>>>> the specifics of each type. Accessing VVs in Jython may be a bit
>> >>>>>>>>>>>>>>> of a challenge because of the "impedance mismatch" between how
>> >>>>>>>>>>>>>>> VVs work and the row-and-column view expected by most (non-Drill)
>> >>>>>>>>>>>>>>> developers.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I wonder if we've considered providing a row-oriented "facade"
>> >>>>>>>>>>>>>>> that can be used by roll-your-own data sources and user-defined
>> >>>>>>>>>>>>>>> row transforms? Might be a hiccup in the fast VV pipeline, but
>> >>>>>>>>>>>>>>> might be handy for users willing to trade a bit of speed for
>> >>>>>>>>>>>>>>> convenience. With such a facade, the Jython row transforms that
>> >>>>>>>>>>>>>>> John mentions could be quite simple.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Thu, Jun 16, 2016
at 10:36 AM, Ted Dunning <
>> >>>>>>>> ted.dunning@gmail.com
>> >>>>>>>>>>>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Since UDFs use source code transformation, using Jython would be
>> >>>>>>>>>>>>>>>> difficult.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Thu, Jun 16,
2016 at 9:42 AM, Arina Yelchiyeva <
>> >>>>>>>>>>>>>>>> arina.yelchiyeva@gmail.com>
wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi Charles,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> not that I am aware of. The proposed solution doesn't invent
>> >>>>>>>>>>>>>>>>> anything new, it just adds the possibility to add UDFs without a
>> >>>>>>>>>>>>>>>>> drillbit restart. But contributions are welcome.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Thu, Jun
16, 2016 at 4:52 PM Charles Givre <
>> >>>>> cgivre@gmail.com
>> >>>>>>>>>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Arina,
>> >>>>>>>>>>>>>>>>>> Has there been any discussion about making it possible via Jython
>> >>>>>>>>>>>>>>>>>> or something for users to write simple UDFs in Python?
>> >>>>>>>>>>>>>>>>>> My ideal would be to have this capability integrated in the web
>> >>>>>>>>>>>>>>>>>> GUI such that a user could write their UDF (in Python) right
>> >>>>>>>>>>>>>>>>>> there, submit it and it would be deployed to Drill if it passes
>> >>>>>>>>>>>>>>>>>> validation tests.
>> >>>>>>>>>>>>>>>>>> —C
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Jun
16, 2016, at 09:34, Arina Yelchiyeva <
>> >>>>>>>>>>>>>>>>> arina.yelchiyeva@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Hi all!
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I have created a Jira to allow dynamic UDFs support in Drill
>> >>>>>>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726). There is a
>> >>>>>>>>>>>>>>>>>>> link to the design document in the Jira description.
>> >>>>>>>>>>>>>>>>>>> Comments or suggestions are welcome.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Kind regards
>> >>>>>>>>>>>>>>>>>>> Arina
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>
>> >>
>>
>>
>
