Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
MIME-Version: 1.0
Sender: mdrob@cloudera.com
In-Reply-To: <CAN5cbe7kBaQ3zKd6r3UkLNFQiNmyuZ178HyS=sR8JKw6kb9HuQ@mail.gmail.com>
References: <CAN5cbe4QV3GmTHiKAFYoceCye4PmAWgW+jduzXZGS5tZFtXC0A@mail.gmail.com>
 <CADcMMgEvsVdFnYBAm+nSyd4ZX-Ssr7==RA+sJW3GUAEgXJFcmA@mail.gmail.com>
 <CA+RK=_DBRRWbAHU7U3HFmvmKEx=dp-F8w9M=znU8ksnEQPhX7w@mail.gmail.com>
 <CADcMMgHQHbAkz3bONpgpfr1XUjqZw0WOiAyMG5gp9y8x19_Upg@mail.gmail.com> <CAN5cbe7kBaQ3zKd6r3UkLNFQiNmyuZ178HyS=sR8JKw6kb9HuQ@mail.gmail.com>
From: Mike Drob <mdrob@apache.org>
Date: Thu, 22 Jun 2017 10:00:44 -0500
Message-ID: <CAFAMeYJqpbC7yk86Qa4aqyd2DHkvVmGReTh-hms=QTBf+XQzNQ@mail.gmail.com>
Subject: Re: [DISCUSS] status of and plans for our hbase-spark integration
To: dev <dev@hbase.apache.org>
Content-Type: multipart/alternative; boundary="f4030435cf04fe846c05528dc066"
archived-at: Thu, 22 Jun 2017 15:01:14 -0000

--f4030435cf04fe846c05528dc066
Content-Type: text/plain; charset="UTF-8"

That's a lot of ground you're trying to cover, Sean. Thanks for putting
this together.

> 1) Branch-1 releases
> Is there anything else we ought to be tracking here?

We currently have code in the o.a.spark namespace. I don't think there is a
JIRA for it yet, but this seems like cross-project trouble waiting to
happen.
https://github.com/apache/hbase/tree/master/hbase-spark/src/main/scala/org/apache/spark

> The way I see it, the options are a) ship both 1.6 and 2.y support, b)
> ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
> branch-2. Does anyone have preferences here?

I think I prefer option B here as well. It sounds like Spark 2.2 will be
out Very Soon, so we should almost certainly have a story for that. If
there are no compatibility issues, then we can support >= 2.0 or 2.1;
otherwise there's no reason to try to hit a moving target, and we can
focus on supporting the newest release. Like you said earlier, there's been
no official release of this module yet, so I have to imagine that the
current consumers are knowingly on the bleeding edge and can handle an
upgrade or recompile on their own.

> 4) Packaging all this probably will be a pain no matter what we do

Do we have to package this in our assembly at all? Currently, we include
the hbase-spark module in the branch-2 and master assembly, but I'm not
convinced this needs to be the case. Is it too much to ask users to build a
jar with dependencies (which I think we already do) and include the
appropriate spark/scala/hbase jars in it (pulled from maven)? I think this
problem can be better solved through docs and client tooling rather than
going through awkward gymnastics to package m*n versions in our tarball
_and_ making sure that we get all the classpaths right.

> 5) Do we have the right collection of Spark API(s):

Agree with Yi Liang here: release what we have, then worry about adding
things later.


On Thu, Jun 22, 2017 at 8:26 AM, Sean Busbey <busbey@apache.org> wrote:

> On Wed, Jun 21, 2017 at 10:37 PM, Stack <stack@duboce.net> wrote:
> > On Wed, Jun 21, 2017 at 5:26 PM, Andrew Purtell <apurtell@apache.org> wrote:
> >
> >> I seem to recall that what eventually was committed to master as
> >> hbase-spark was first shopped to the Spark project, who felt the same,
> >> that it should be hosted elsewhere.
> >
> >
> > I have the same remembrance.
> >
> >
> >> ....  I would draw an analogy with
> >> mapreduce: we had what we called 'first class' mapreduce integration,
> >> spark is the alleged successor to mapreduce, we should evolve that
> >> support as such. I'd like to know if that reasoning, or other
> >> rationale, is sufficient at this time.
> >>
> >>
> > Spark should be first-class on equal footing with MR if not more so (our
> > MR integration is too tightly bound up with our internals badly in need
> > of untangling).
> >
> > Reading over the scope of work Sean outlines -- the variants, pom
> > profiles, the module profusion, and the uncertainties -- makes me queasy
> > pulling it all in.
> >
> > I'm working on a little mini-hbase project at the mo to shade guava,
> > etc., and it is easy going. Made me think we could do a mini-project to
> > host spark so we could contain it should it go up in flames.
> >
> > S
>
> I think the current approach of keeping all the spark related stuff in
> a set of modules that we don't depend on for our other bits
> sufficiently isolates us from the risk of things blowing up. For
> example, when we're ready to build some of our admin tools on the
> spark integration instead of MR we can update them to use Java
> Services API or some similar runtime loading method to avoid having a
> dependency directly on the Spark artifacts.
>
> It's true that we could put this into a different repo with its own
> release cycle, but I suspect that will lead to even more build pain.
> Especially given that it's likely to remain under active development
> for the foreseeable future and we'll want to package some version of
> it in our convenience binary assembly. Contrast with our third party
> dependencies, which tend to remain the same over relatively large
> timespans (e.g. a major version). If we end up voting on releases that
> cover a version from both this hypothetical hbase-spark repo and the
> main repo, what would we have really gained by splitting the two up?
>

--f4030435cf04fe846c05528dc066--