drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: [DISCUSS] Most storage plug-ins don't belong in the Drill repo
Date Tue, 25 Aug 2015 04:30:08 GMT
It seems like you have a couple of requirements:

- Support multiple versions of a plugin against a particular system (e.g.
Hive, HBase, etc)
- Support loading these multiple versions in the same Drillbit

I'm entirely in support of these goals and requirements.  We've been
talking about adding a classloading containerization system to better
encapsulate individual plugins so that we no longer have to choose only one
version.  If you want to put together some proposals around this, I think
that would be great for the community.
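
To make the classloading idea concrete, here is a minimal sketch of one way
to give each plugin version its own class loader. It is an illustration
under stated assumptions, not Drill's actual design: the class name
PluginLoaderRegistry, its methods, and the plugin keys are all hypothetical.
Only java.net.URLClassLoader and Class.forName are real JDK APIs.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: one class loader per plugin key (e.g. "hive-0.13").
// Not part of Drill; names are made up for illustration.
class PluginLoaderRegistry {
    private final Map<String, URLClassLoader> loaders = new ConcurrentHashMap<>();
    private final ClassLoader parent;

    PluginLoaderRegistry(ClassLoader parent) {
        this.parent = parent;
    }

    // Returns the loader for this key, creating it from the given jar URLs
    // on first use. Later calls with the same key reuse the first loader.
    ClassLoader register(String pluginKey, URL[] jars) {
        return loaders.computeIfAbsent(pluginKey, k -> new URLClassLoader(jars, parent));
    }

    // Resolves a class through the loader registered for the given plugin key.
    Class<?> loadClass(String pluginKey, String className) {
        ClassLoader loader = loaders.get(pluginKey);
        if (loader == null) {
            throw new IllegalArgumentException("unknown plugin: " + pluginKey);
        }
        try {
            // initialize=false: resolve the class without running static initializers
            return Class.forName(className, false, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that URLClassLoader delegates to its parent first, so any class already
visible to the parent wins over a copy in the plugin's jars. Real isolation
of conflicting library versions would likely need a child-first loader or a
parent that does not expose those libraries.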

On the flipside, I see the idea of taking plugins out of the Drill repo as
completely orthogonal to the issues/requirements above.  In fact, it would
be a mistake to separate the code at this point.  It wouldn't provide new
value to end users and would make Drill harder to use.  It would also lower
the quality of the product.

As someone who has worked on all of the current storage plugins, I can say
that the interface is still maturing.  As we integrate new types of data
sources, the model around optimization continues to develop.  For example,
I'm still working through the enhancements required to support the JDBC
interface properly, including control over which query planning phases
certain rules are injected into.  This is a set of core storage plugin
enhancements. By
having the code all in one place, I'll make the fixes to any other storage
plugins as necessary since I know the meaning of these (to be documented)
enhancements.  If we had these as disconnected modules, coordinating this
type of change would be very difficult. There is also substantial value
from the other side: by including the storage plugins in the general build,
we can also ensure that a core change doesn't have an unintended
consequence to those plugins.

If storage plugins start to become a huge burden *and* we have found the
storage plugin API to be extremely stable, this might make sense with
tertiary plugins.  However, for now, I strongly recommend we focus on the
items at the top of this email and don't start slicing up the codebase.

TL;DR. I'm -1 on the statement "Most storage plug-ins don't belong in the
Drill repo".

I think that's exactly where they belong.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Aug 24, 2015 at 2:09 PM, Chris Westin <chriswestin42@gmail.com>
wrote:

> I'd like to propose that we move most of the storage plug-ins out of the
> main Drill codebase/repo.
>
> The storage plug-ins don't belong with the Drill code. They can live
> anywhere, and can have independent releases. This allows them to be more
> aligned with the storage systems they represent, and removes the need for
> review by Drill committers who may not necessarily have any expertise with
> the target storage systems.
>
> Here's an example. We have a customer who wants to use current Drill with
> an older version of Hive (0.13). My understanding is that the current Hive
> plug-in can only work with newer versions of Hive (1.0+) because of API
> incompatibilities with Hive. However, unless the storage plug-in interface
> has changed in that time, there's no reason why they shouldn't be able to
> use the old 0.13 plug-in with current Drill. It's just not built and
> packaged that way. If the plug-in source were separate, then it would be
> easier to just use the old plug-in.
>
> Another example, again involving Hive. We have a customer who has two Hive
> clusters of different versions (because they belong to different
> departments). They want to use current Drill to join data between the two.
> Given the Hive API incompatibilities, I've suggested that we find a way to
> use both versions of the Hive plugin (configured with different workspace
> prefixes) at the same time. Assuming the storage plug-in interface hasn't
> changed in that time, it seems like this should work. (The Hive folks have
> mentioned that there may be library dependency incompatibilities between
> the two plug-in implementations, but it seems like we should be able to
> handle that with some adjustment to use separate class loaders for the
> storage plug-ins, if that happens.)
>
> I would suggest that the Drill source only keep a few basic plug-ins, such
> as the text/csv, text/json ones, and possibly the parquet one. These don't
> depend on anything other than the file system, and are useful for immediate
> testing. Other plug-ins (e.g., Hive, MongoDB, Cassandra, etc.) should live
> somewhere else, and have their own independent existence. The other
> plug-ins can then be released on their own schedule, possibly coinciding
> with significant changes to the storage systems they provide access to.
>
> Assuming we do this, there are some logistics questions:
> (*) Where do we put the source?
> Does the Apache process have a provision for plug-in architectures like
> this? Or would each plug-in have to go through the whole incubation
> process? That seems pretty heavyweight for these, so is there something
> else? Or should people just put them on GitHub (or their own favorite
> public repo)?
>
> (*) Versioning the storage plug-in interface.
> It's possible that the storage plug-in interface will change over time. (I
> think there are already plans to make it possible to get more metadata
> and/or statistics from a storage system, if it supports them, in order to
> do a better job of optimizing queries.) The interface is really the only
> thing whose version matters here. We need to take some steps to handle
> that. We might add an annotation to it, or require that we use a new name,
> or add a digit suffix to the name. Other ideas? Ideally we'd have adapters
> that allow the use of older plug-ins (with less capable interfaces) with
> newer Drill so that users aren't held back from updating Drill if there
> aren't newer plug-ins for their storage of choice.
>
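
The versioning ideas in the quoted message (an annotation on the interface,
plus adapters for older plug-ins) could look roughly like the sketch below.
Every name here is hypothetical (PluginApiVersion, StoragePluginV1,
StoragePluginV2, V1ToV2Adapter, PluginApiVersions); none of them is Drill's
real storage plug-in API. The statistics method stands in for the "more
metadata and/or statistics" mentioned above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.OptionalLong;

// Hypothetical marker recording which revision of the plug-in API a type is.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginApiVersion {
    int value();
}

// Hypothetical older interface: no statistics support.
@PluginApiVersion(1)
interface StoragePluginV1 {
    String name();
}

// Hypothetical newer interface: adds an optional row-count estimate.
@PluginApiVersion(2)
interface StoragePluginV2 {
    String name();
    OptionalLong estimatedRowCount(String table);
}

// Lets a V1 plug-in be used where V2 is expected by reporting "no statistics".
final class V1ToV2Adapter implements StoragePluginV2 {
    private final StoragePluginV1 delegate;

    V1ToV2Adapter(StoragePluginV1 delegate) {
        this.delegate = delegate;
    }

    @Override
    public String name() {
        return delegate.name();
    }

    @Override
    public OptionalLong estimatedRowCount(String table) {
        return OptionalLong.empty(); // older plug-ins cannot supply statistics
    }
}

final class PluginApiVersions {
    private PluginApiVersions() {}

    // Reads the version from the interface itself; annotations on an
    // interface are not inherited by classes that implement it.
    static int of(Class<?> pluginInterface) {
        PluginApiVersion v = pluginInterface.getAnnotation(PluginApiVersion.class);
        if (v == null) {
            throw new IllegalArgumentException("no @PluginApiVersion on " + pluginInterface.getName());
        }
        return v.value();
    }
}
```

With something like this, the planner would treat an empty estimate as
"unknown" and fall back to its defaults, so an older plug-in keeps working
with a newer Drill at the cost of less informed optimization.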
