drill-dev mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: [DISCUSS] Most storage plug-ins don't belong in the Drill repo
Date Tue, 25 Aug 2015 15:14:35 GMT
I think Jacques has a good point about the code belonging together, but we
should then talk about how to solve this problem. Users are going to have
different versions of these data sources that they have deployed, and I
don't think we should consider it an afterthought as to how they get Drill
to work with them.

Part of our test suite should automate building against multiple versions
of each data source and testing against each one. While this might require
some creativity where the APIs change between versions, I think this is a
problem that needs to be solved directly, with targeted effort to make
maintaining these dependencies easy.
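
To make the idea concrete, here is a minimal sketch of a harness that runs
the same compatibility check once per version and records every outcome.
The class name VersionMatrix and the shape of the check are assumptions for
illustration only; nothing like this exists in Drill today.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical harness, not part of Drill: runs one check per version. */
final class VersionMatrix {
    private VersionMatrix() {}

    /**
     * Runs the same compatibility check against every listed version and
     * records the outcome, so one failing version does not hide the others.
     */
    static Map<String, String> run(List<String> versions, Predicate<String> check) {
        // LinkedHashMap keeps results in the order the versions were listed.
        Map<String, String> results = new LinkedHashMap<>();
        for (String version : versions) {
            try {
                results.put(version, check.test(version) ? "PASS" : "FAIL");
            } catch (RuntimeException e) {
                // A check that throws counts as a failure for that version only.
                results.put(version, "FAIL: " + e.getMessage());
            }
        }
        return results;
    }
}
```

In a real setup the check would build the plugin against the given version
and run its tests; here it is just a predicate so the sketch stays
self-contained.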

Although this general goal is not trivial, the last time we upgraded Hive
it was only a pom.xml file change that was needed [1]. So for starters we
can simply work on making the build capable of interacting with several
versions of Hive simultaneously. I am pretty sure our last HBase upgrade
did require changes to how Drill interacted with the HBase plugin. I don't
think it is strictly necessary, but I think it would be good to look at how
we can maintain code that stretches across API versions.
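
One way to keep code working across API versions is a small shim layer: the
plugin talks to a version-neutral interface, and a per-version
implementation is chosen at runtime. The sketch below is only an
illustration under that assumption. MetastoreShim, ShimLoader, and
describeTable are hypothetical names, not existing Drill or Hive APIs.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical version-neutral view of the calls the plugin needs. */
interface MetastoreShim {
    String describeTable(String name);
}

/** Hypothetical registry that picks a shim for the deployed version. */
final class ShimLoader {
    // Shims keyed by the lowest version they support (major * 1000 + minor).
    private final TreeMap<Integer, MetastoreShim> shimsByMinVersion = new TreeMap<>();

    void register(String minVersion, MetastoreShim shim) {
        shimsByMinVersion.put(key(minVersion), shim);
    }

    /** Returns the newest shim whose minimum version is <= the deployed one. */
    MetastoreShim forVersion(String deployedVersion) {
        Map.Entry<Integer, MetastoreShim> entry =
            shimsByMinVersion.floorEntry(key(deployedVersion));
        if (entry == null) {
            throw new IllegalArgumentException(
                "No shim supports version " + deployedVersion);
        }
        return entry.getValue();
    }

    // Only major and minor are compared; patch levels are ignored.
    private static int key(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major * 1000 + minor;
    }
}
```

With shims registered for "0.13" and "1.0", a deployed "0.14.0" would get
the 0.13 shim and "1.2.1" would get the 1.0 shim, while anything older than
0.13 is rejected up front.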

[1]
https://github.com/apache/drill/commit/93533835bdcaff018a6b6ee6ea5999f3c5659d70

On Mon, Aug 24, 2015 at 9:30 PM, Jacques Nadeau <jacques@dremio.com> wrote:

> It seems like you have a couple of requirements:
>
> - Support multiple versions of a plugin against a particular system (e.g.
> Hive, HBase, etc)
> - Support loading these multiple versions in the same Drillbit
>
> I'm entirely in support of these goals and requirements.  We've been
> talking about adding a classloading containerization system to better
> encapsulate individual plugins so that we no longer have to choose only one
> version.  If you want to put together some proposals around this, I think
> that would be great for the community.
>
> On the flipside, I see the idea of taking plugins out of the Drill repo as
> completely orthogonal to the issues/requirements above.  In fact, it would
> be a mistake to separate the code at this point.  It wouldn't provide new
> value to end users and would make Drill harder to use.  It would also lower
> the quality of the product.
>
> As someone who has worked on all of the current storage plugins, I can
> say the interface is still maturing.  As we integrate new types of data sources,
> the model around optimization continues to develop.  For example, I'm still
> working through the enhancements required to support the JDBC interface in
> the right way to control which phases certain rules are injected into the
> query planning.  This is a set of core storage plugin enhancements. By
> having the code all in one place, I'll make the fixes to any other storage
> plugins as necessary since I know the meaning of these (to be documented)
> enhancements.  If we had these as disconnected modules, coordinating this
> type of change would be very difficult. There is also substantial value
> from the other side: by including the storage plugins in the general build,
> we can also ensure that a core change doesn't have an unintended
> consequence to those plugins.
>
> If storage plugins start to become a huge burden *and* we have found the
> storage plugin API to be extremely stable, this might make sense with
> tertiary plugins.  However, for now, I strongly recommend we focus on the
> items at the top of this email and don't start slicing up the codebase.
>
> TL;DR. I'm -1 on the statement "Most storage plug-ins don't belong in the
> Drill repo".
>
> I think that's exactly where they belong.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Aug 24, 2015 at 2:09 PM, Chris Westin <chriswestin42@gmail.com>
> wrote:
>
> > I'd like to propose that we move most of the storage plug-ins out of the
> > main drill codebase/repo.
> >
> > The storage plug-ins don't belong with the drill code. They can live
> > anywhere, and can have independent releases. This allows them to be more
> > aligned with the storage systems they represent, and removes the need for
> > review by Drill committers who may not necessarily have any expertise
> > with the target storage systems.
> >
> > Here's an example. We have a customer who wants to use current Drill with
> > an older version of Hive (0.13). My understanding is that the current
> > Hive plug-in can only work with newer versions of Hive (1.0+) because of
> > API incompatibilities with Hive. However, unless the storage plug-in
> > interface has changed in that time, there's no reason why they shouldn't
> > be able to use the old 0.13 plug-in with current drill. It's just not
> > built and packaged that way. If the plug-in source were separate, then it
> > would be easier to just use the old plug-in.
> >
> > Another example, again involving Hive. We have a customer who has two
> > Hive clusters of different versions (because they belong to different
> > departments). They want to use current drill to join data between the
> > two. Given the Hive API incompatibilities, I've suggested that we find a
> > way to use both versions of the Hive plugin (configured with different
> > workspace prefixes) at the same time. Assuming the storage plug-in
> > interface hasn't changed in that time, it seems like this should work.
> > (The Hive folks have mentioned that there may be library dependency
> > incompatibilities between the two plug-in implementations, but it seems
> > like we should be able to handle that with some adjustment to use
> > separate class loaders for the storage plug-ins, if that happens).
> >
> > I would suggest that the Drill source only keep a few basic plug-ins,
> > such as the text/csv, text/json ones, and possibly the parquet one. These
> > don't depend on anything other than the file system, and are useful for
> > immediate testing. Other plug-ins (e.g., Hive, MongoDB, Cassandra, etc)
> > should live somewhere else, and have their own independent existence. The
> > other plug-ins can then be released on their own schedule, possibly
> > coinciding with significant changes to the storage systems they provide
> > access to.
> >
> > Assuming we do this, there are some logistics questions:
> > (*) Where do we put the source?
> > Does the Apache process have a provision for plug-in architectures like
> > this? Or would each plug-in have to go through the whole incubation
> > process? That seems pretty heavyweight for these, so is there something
> > else? Or should people just put them on GitHub (or their own favorite
> > public repo)?
> >
> > (*) Versioning the storage plug-in interface.
> > It's possible that the storage plug-in interface will change over time.
> > (I think there are already plans to make it possible to get more metadata
> > and/or statistics from a storage system, if it supports them, in order to
> > do a better job of optimizing queries.) The interface is really the only
> > thing whose version matters here. We need to take some steps to handle
> > that. We might add an annotation to it, or require that we use a new
> > name, or add a digit suffix to the name. Other ideas? Ideally we'd have
> > adapters that allow the use of older plug-ins (with less capable
> > interfaces) with newer Drill so that users aren't held back from updating
> > drill if there aren't newer plug-ins for their storage of choice.
> >
>
