hbase-dev mailing list archives

From Sean Busbey <bus...@apache.org>
Subject Re: [DISCUSS] status of and plans for our hbase-spark integration
Date Thu, 22 Jun 2017 13:26:06 GMT
On Wed, Jun 21, 2017 at 10:37 PM, Stack <stack@duboce.net> wrote:
> On Wed, Jun 21, 2017 at 5:26 PM, Andrew Purtell <apurtell@apache.org> wrote:
>
>> I seem to recall that what eventually was committed to master as
>> hbase-spark was first shopped to the Spark project, who felt the same, that
>> it should be hosted elsewhere.
>
>
> I have the same remembrance.
>
>
>> ....  I would draw an analogy with
>> mapreduce: we had what we called 'first class' mapreduce integration, spark
>> is the alleged successor to mapreduce, we should evolve that support as
>> such. I'd like to know if that reasoning, or other rationale, is sufficient
>> at this time.
>>
>>
> Spark should be first-class on equal footing with MR if not more so (our MR
> integration is too tightly bound up with our internals badly in need of
> untangling).
>
> Reading over the scope of work Sean outlines -- the variants, pom profiles,
> the module profusion, and the uncertainties -- makes me queasy pulling it
> all in.
>
> I'm working on a little mini-hbase project at the mo to shade guava, etc.,
> and it is easy going. Made me think we could do a mini-project to host
> spark so we could contain it should it go up in flames.
>
> S

I think the current approach of keeping all the Spark-related stuff in
a set of modules that none of our other modules depend on
sufficiently isolates us from the risk of things blowing up. For
example, when we're ready to build some of our admin tools on the
Spark integration instead of MR, we can update them to use the Java
service provider mechanism (java.util.ServiceLoader) or some similar
runtime loading method to avoid having a direct dependency on the
Spark artifacts.
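
A minimal sketch of that runtime-loading idea, assuming a hypothetical
JobRunner interface that an admin tool would code against. None of these
names exist in HBase; AdminToolLauncher, JobRunner, and LocalRunner are
made up for illustration. A Spark-backed implementation would live in the
Spark module and register itself through a META-INF/services file, so the
tool never references Spark classes at compile time.

```java
import java.util.ServiceLoader;

public class AdminToolLauncher {

    /** Hypothetical SPI; the tool compiles against this interface only. */
    public interface JobRunner {
        String name();
        int run(String[] args);
    }

    /** Fallback used when no provider is registered on the classpath. */
    static final class LocalRunner implements JobRunner {
        @Override
        public String name() {
            return "local";
        }

        @Override
        public int run(String[] args) {
            // A real tool would do the work here; the sketch just reports success.
            return 0;
        }
    }

    /**
     * Returns the first JobRunner that ServiceLoader discovers. Providers are
     * listed in a META-INF/services file named after the interface's binary
     * name, shipped inside the provider's own jar (for example, the jar of a
     * Spark integration module). With no provider present, fall back locally.
     */
    static JobRunner pickRunner() {
        for (JobRunner runner : ServiceLoader.load(JobRunner.class)) {
            return runner;
        }
        return new LocalRunner();
    }

    public static void main(String[] args) {
        JobRunner runner = pickRunner();
        System.out.println("Using runner: " + runner.name());
        int exitCode = runner.run(args);
        System.out.println("Exit code: " + exitCode);
    }
}
```

With this shape, dropping the Spark module's jar onto the classpath would
be enough to switch implementations, and leaving it off costs nothing more
than the fallback path.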

It's true that we could put this into a different repo with its own
release cycle, but I suspect that will lead to even more build pain,
especially given that it's likely to remain under active development
for the foreseeable future and we'll want to package some version of
it in our convenience binary assembly. Contrast that with our
third-party dependencies, which tend to remain the same over
relatively large
timespans (e.g. a major version). If we end up voting on releases that
cover a version from both this hypothetical hbase-spark repo and the
main repo, what would we have really gained by splitting the two up?
