drill-dev mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: [DISCUSS] Most storage plug-ins don't belong in the Drill repo
Date Tue, 25 Aug 2015 15:45:47 GMT
I don't think it needs to be really complex. I was just making the case that
we should go ahead and do it, so that we can make sure future releases
are compatible with whatever versions we are supporting.

Historically we have supported a single version in a given release. This
was a reasonable approach in the earlier stages of the project, but we
will run into more cases like this as the user base expands, and we
should prioritize making the Drill experience easy for everyone.

On Tue, Aug 25, 2015 at 8:30 AM, Jacques Nadeau <jacques@dremio.com> wrote:

> I don't think we need to have any complex build infrastructure to support
> building multiple versions.  Most storage plugins depend on Drill core
> rather than the other way around.  As such, you could have 7 different hive
> modules that all depend on Drill core and conflict with each other.  As
> long as they never source their peers, the testing should work fine.  Each
> storage plugin module should be responsible for how to compose itself based
> on its needs.
>
> Focusing on Hive specifically: it seems like minor pom variations could be
> captured using profiles and then the hive module should manage those and
> how to rerun surefire tests with each active profile.  For major changes, I
> think you need to have separate modules.  (For example, HBase 0.94 versus
> 0.98+.)
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Aug 25, 2015 at 8:14 AM, Jason Altekruse <altekrusejason@gmail.com>
> wrote:
>
> > I think Jacques has a good point about the code belonging together, but we
> > should then talk about how to solve this problem. Users are going to have
> > different versions of these datasources that they have deployed, and I
> > don't think we should consider it an afterthought as to how they get Drill
> > to work with them.
> >
> > We should make part of our testing suite automate the tasks of building
> > against multiple different versions and testing each. While this might
> > require some creativity if we are working with changing APIs, I think this
> > is a problem that needs to be solved directly with targeted effort to make
> > maintaining these dependencies easy.
> >
> > Although this general goal is not trivial, the last time we upgraded Hive
> > it was only a pom.xml file change that was needed [1]. So for starters we
> > can simply work on making the build capable of interacting with several
> > versions of Hive simultaneously. I am pretty sure our last HBase upgrade
> > did require changes to how Drill interacted with the HBase plugin. I don't
> > think it is strictly necessary, but I think it would be good to look at how
> > we can maintain code that stretches across API versions.
> >
> > [1]
> > https://github.com/apache/drill/commit/93533835bdcaff018a6b6ee6ea5999f3c5659d70
> >
> > On Mon, Aug 24, 2015 at 9:30 PM, Jacques Nadeau <jacques@dremio.com>
> > wrote:
> >
> > > It seems like you have a couple of requirements:
> > >
> > > - Support multiple versions of a plugin against a particular system
> > >   (e.g. Hive, HBase, etc.)
> > > - Support loading these multiple versions in the same Drillbit
> > >
> > > I'm entirely in support of these goals and requirements.  We've been
> > > talking about adding a classloading containerization system to better
> > > encapsulate individual plugins so that we no longer have to choose only
> > > one version.  If you want to put together some proposals around this, I
> > > think that would be great for the community.
> > >
> > > On the flipside, I see the idea of taking plugins out of the Drill repo
> > > as completely orthogonal to the issues/requirements above.  In fact, it
> > > would be a mistake to separate the code at this point.  It wouldn't
> > > provide new value to end users and would make Drill harder to use.  It
> > > would also lower the quality of the product.
> > >
> > > As someone who has worked on all of the current storage plugins, I can
> > > say the interface is still maturing.  As we integrate new types of data
> > > sources, the model around optimization continues to develop.  For
> > > example, I'm still working through the enhancements required to support
> > > the JDBC interface in the right way to control which phases certain rules
> > > are injected into the query planning.  This is a set of core storage
> > > plugin enhancements. By having the code all in one place, I'll make the
> > > fixes to any other storage plugins as necessary since I know the meaning
> > > of these (to be documented) enhancements.  If we had these as
> > > disconnected modules, coordinating this type of change would be very
> > > difficult. There is also substantial value from the other side: by
> > > including the storage plugins in the general build, we can also ensure
> > > that a core change doesn't have an unintended consequence for those
> > > plugins.
> > >
> > > If storage plugins start to become a huge burden *and* we have found the
> > > storage plugin API to be extremely stable, this might make sense with
> > > tertiary plugins.  However, for now, I strongly recommend we focus on the
> > > items at the top of this email and don't start slicing up the codebase.
> > >
> > > TL;DR. I'm -1 on the statement "Most storage plug-ins don't belong in the
> > > Drill repo".
> > >
> > > I think that's exactly where they belong.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Mon, Aug 24, 2015 at 2:09 PM, Chris Westin <chriswestin42@gmail.com>
> > > wrote:
> > >
> > > > I'd like to propose that we move most of the storage plug-ins out of
> > > > the main Drill codebase/repo.
> > > >
> > > > The storage plug-ins don't belong with the Drill code. They can live
> > > > anywhere, and can have independent releases. This allows them to be
> > > > more aligned with the storage systems they represent, and removes the
> > > > need for review by Drill committers who may not necessarily have any
> > > > expertise with the target storage systems.
> > > >
> > > > Here's an example. We have a customer who wants to use current Drill
> > > > with an older version of Hive (0.13). My understanding is that the
> > > > current Hive plug-in can only work with newer versions of Hive (1.0+)
> > > > because of API incompatibilities with Hive. However, unless the storage
> > > > plug-in interface has changed in that time, there's no reason why they
> > > > shouldn't be able to use the old 0.13 plug-in with current Drill. It's
> > > > just not built and packaged that way. If the plug-in source were
> > > > separate, then it would be easier to just use the old plug-in.
> > > >
> > > > Another example, again involving Hive. We have a customer who has two
> > > > Hive clusters of different versions (because they belong to different
> > > > departments). They want to use current Drill to join data between the
> > > > two. Given the Hive API incompatibilities, I've suggested that we find
> > > > a way to use both versions of the Hive plugin (configured with
> > > > different workspace prefixes) at the same time. Assuming the storage
> > > > plug-in interface hasn't changed in that time, it seems like this
> > > > should work. (The Hive folks have mentioned that there may be library
> > > > dependency incompatibilities between the two plug-in implementations,
> > > > but it seems like we should be able to handle that with some adjustment
> > > > to use separate class loaders for the storage plug-ins, if that
> > > > happens.)
> > > >
> > > > I would suggest that the Drill source only keep a few basic plug-ins,
> > > > such as the text/csv, text/json ones, and possibly the parquet one.
> > > > These don't depend on anything other than the file system, and are
> > > > useful for immediate testing. Other plug-ins (e.g., Hive, MongoDB,
> > > > Cassandra, etc.) should live somewhere else, and have their own
> > > > independent existence. The other plug-ins can then be released on their
> > > > own schedule, possibly coinciding with significant changes to the
> > > > storage systems they provide access to.
> > > >
> > > > Assuming we do this, there are some logistics questions:
> > > > (*) Where do we put the source?
> > > > Does the Apache process have a provision for plug-in architectures like
> > > > this? Or would each plug-in have to go through the whole incubation
> > > > process? That seems pretty heavyweight for these, so is there something
> > > > else? Or should people just put them on GitHub (or their own favorite
> > > > public repo)?
> > > >
> > > > (*) Versioning the storage plug-in interface.
> > > > It's possible that the storage plug-in interface will change over time.
> > > > (I think there are already plans to make it possible to get more
> > > > metadata and/or statistics from a storage system, if it supports them,
> > > > in order to do a better job of optimizing queries.) The interface is
> > > > really the only thing whose version matters here. We need to take some
> > > > steps to handle that. We might add an annotation to it, or require that
> > > > we use a new name, or add a digit suffix to the name. Other ideas?
> > > > Ideally we'd have adapters that allow the use of older plug-ins (with
> > > > less capable interfaces) with newer Drill so that users aren't held
> > > > back from updating Drill if there aren't newer plug-ins for their
> > > > storage of choice.
> > > >
> > >
> >
>
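
The thread mentions using separate class loaders so that two versions of a
storage plug-in can coexist in one Drillbit. A minimal sketch of that idea
follows, assuming each plug-in version ships as its own set of jars. The class
name PluginLoaders and the plug-in ids are hypothetical and are not part of
Drill; only java.net.URLClassLoader is a real JDK class.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry: one class loader per plug-in id (e.g. "hive-0.13"). */
final class PluginLoaders {
    private final Map<String, URLClassLoader> loaders = new ConcurrentHashMap<>();
    private final ClassLoader parent;

    /** The parent loader is assumed to hold the shared core classes. */
    PluginLoaders(ClassLoader parent) {
        this.parent = parent;
    }

    /**
     * Returns the loader for the given plug-in id, creating it on first use.
     * Later calls with the same id reuse the existing loader and ignore jars.
     */
    URLClassLoader loaderFor(String pluginId, URL[] jars) {
        return loaders.computeIfAbsent(
            pluginId, id -> new URLClassLoader(jars.clone(), parent));
    }
}
```

Classes loaded through two different loaders are distinct even when they share
a name, which is what lets conflicting library versions sit side by side. Note
that URLClassLoader asks its parent first, so any library already visible to
the parent would still win; fuller isolation would need a child-first loader,
which this sketch does not attempt.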

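The original proposal suggests versioning the storage plug-in interface with
an annotation and using adapters so older plug-ins keep working with newer
Drill. One way that could look is sketched below. Every name here
(PluginApiVersion, StoragePluginV1, StoragePluginV2, V1Adapter, PluginVersions,
LegacyHivePlugin) is hypothetical and chosen for illustration; none of them is
Drill's actual storage plug-in API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical marker recording which interface version a plug-in targets. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginApiVersion {
    int value();
}

/** Hypothetical older interface: table discovery only. */
interface StoragePluginV1 {
    List<String> tableNames();
}

/** Hypothetical newer interface: adds optional statistics for the planner. */
interface StoragePluginV2 extends StoragePluginV1 {
    Optional<Long> estimatedRowCount(String table);
}

/** Wraps a V1 plug-in so newer core code can treat it as V2. */
final class V1Adapter implements StoragePluginV2 {
    private final StoragePluginV1 delegate;

    V1Adapter(StoragePluginV1 delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> tableNames() {
        return delegate.tableNames();
    }

    /** Older plug-ins have no statistics, so report "unknown". */
    @Override
    public Optional<Long> estimatedRowCount(String table) {
        return Optional.empty();
    }
}

final class PluginVersions {
    private PluginVersions() {}

    /** Reads the declared version; unannotated classes are assumed to be 1. */
    static int declaredVersion(Class<?> pluginClass) {
        PluginApiVersion v = pluginClass.getAnnotation(PluginApiVersion.class);
        return v == null ? 1 : v.value();
    }

    /** Returns the plug-in unchanged if it is already V2, else wraps it. */
    static StoragePluginV2 adapt(StoragePluginV1 plugin) {
        if (plugin instanceof StoragePluginV2) {
            return (StoragePluginV2) plugin;
        }
        return new V1Adapter(plugin);
    }
}

/** Example older plug-in written against the V1 interface. */
@PluginApiVersion(1)
final class LegacyHivePlugin implements StoragePluginV1 {
    @Override
    public List<String> tableNames() {
        return Arrays.asList("orders", "customers");
    }
}
```

With this shape, core code would only ever call StoragePluginV2 and would pass
every loaded plug-in through PluginVersions.adapt first. An older plug-in then
simply reports no statistics instead of blocking an upgrade.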