hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: [DISCUSS] status of and plans for our hbase-spark integration
Date Thu, 22 Jun 2017 00:26:11 GMT
I seem to recall that what eventually was committed to master as
hbase-spark was first shopped to the Spark project, who felt the same:
that it should be hosted elsewhere. On JIRA and in the dev@ archives I'd
imagine we can recall the reasons why we accepted it. I would draw an
analogy with mapreduce: we had what we called 'first class' mapreduce
integration, and since spark is the alleged successor to mapreduce, we
should evolve that support in the same way. I'd like to know if that
reasoning, or other rationale, is sufficient at this time.


On Wed, Jun 21, 2017 at 5:13 PM, Stack <stack@duboce.net> wrote:

> Great writeup.
>
> At first blush, this effort looks like it should be a separate project, not
> in hbase core at all.
>
> St.Ack
>
> On Wed, Jun 21, 2017 at 9:31 AM, Sean Busbey <busbey@apache.org> wrote:
>
> > Hi Folks!
> >
> > We've had integration with Apache Spark lingering in trunk for quite
> > some time, and I'd like us to push towards firming it up. I'm going to
> > try to cover a lot of ground below, so feel free to respond to just
> > pieces and I'll write up a doc on things afterwards.
> >
> > For background, the hbase-spark module currently exists in trunk,
> > branch-2, and the 2.0.0-alpha-1 release. Importantly, it has not yet
> > been in any "ready to use" release. It's been in master for ~2 years and
> > has had a total of nearly 70 incremental changes. Right now it shows
> > up in Stack’s excellent state of 2.0 doc ( https://s.apache.org/1mB4 )
> > as a nice-to-have. I’d like to get some consensus on either getting it
> > into release trains or officially moving it out of scope for 2.0.
> >
> > ----
> >
> > 1) Branch-1 releases
> >
> > In July 2015 we started tracking what kind of polish was needed for
> > this code to make it into our downstream facing release lines in
> > HBASE-14160. Personally, I think if the module isn't ready for a
> > branch-1 release then it shouldn't be in a branch-2 release either.
> >
> > The only things still tracked as required are some form of published
> > API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi
> > Liang has been working on both of these, and I think we have a good
> > start on them.
> >
> > Is there anything else we ought to be tracking here? I notice the
> > umbrella "make the connector better" issue (HBASE-14789) has only
> > composite row key support still open (HBASE-15335). It looks like that
> > work stalled out last summer after an admirable effort by our Zhan
> > Zhang. Can this wait for a future minor release?
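[For readers unfamiliar with what HBASE-15335 is after: a composite row key packs several typed fields into one byte[] whose unsigned byte order matches the logical field order, which is what would let the SQL layer push range predicates on leading fields down to HBase scans. A minimal illustrative sketch in plain Java; this is not the actual patch, and the class and field choices here are hypothetical:]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration (not the HBASE-15335 work itself): a composite
// row key concatenates fixed-width encodings of each field so that
// lexicographic byte comparison matches field-by-field comparison.
public class CompositeKey {
    // Encode a (String, long) pair. The string is zero-padded to `width`
    // bytes so that every key has the long in the same position.
    static byte[] encode(String region, long ts, int width) {
        byte[] s = region.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(width + Long.BYTES);
        buf.put(s);
        for (int i = s.length; i < width; i++) buf.put((byte) 0);
        // Flip the sign bit so negative longs sort before positive ones
        // under the unsigned byte comparison HBase uses for row keys.
        buf.putLong(ts ^ Long.MIN_VALUE);
        return buf.array();
    }

    // Recover the long field from its fixed offset.
    static long decodeTs(byte[] key, int width) {
        return ByteBuffer.wrap(key, width, Long.BYTES).getLong() ^ Long.MIN_VALUE;
    }
}
```

[The sign-bit flip is the detail that usually trips people up: without it, negative timestamps would sort after positive ones.]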
> >
> > Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and
> > then our existing support backported to branch-1 in time for whenever
> > we get HBase 1.4 started.
> >
> > 2) What Spark version(s) do we care about?
> >
> > The hbase-spark module originally started with support for Spark 1.3.
> > It currently sits at supporting just 1.6. Our Ted Yu has been
> > dutifully trying to find consensus on how we handle Spark 2.0 over in
> > HBASE-16179 for nearly a year.
> >
> > AFAICT the Spark community has no more notion of what version(s) their
> > downstream users are relying on than we do. It appears that Spark 1.6
> > will be their last 1.y release and at least the dev community is
> > largely moving on to 2.y releases now.
> >
> > What version(s) do we want to handle and thus encourage our downstream
> > folks to use?
> >
> > Just as a point of reference, Spark 1.6 doesn't have any proper
> > handling of delegation tokens and our current do-it-ourselves
> > workaround breaks in the presence of the support introduced in Spark
> > 2.
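[For reference, the Spark 2 support in question is the pluggable credential-provider mechanism in Spark-on-YARN; if I'm reading their docs right, Spark 2.x fetches HBase tokens itself when the per-service flag is on. A config sketch, key name as of Spark 2.1:]

```properties
# Spark 2.x on YARN obtains HBase delegation tokens via its own
# credential provider when enabled; it finds the cluster through the
# hbase-site.xml on the driver classpath. This built-in path is what our
# do-it-ourselves workaround ends up fighting with.
spark.yarn.security.credentials.hbase.enabled  true
```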
> >
> > The way I see it, the options are a) ship both 1.6 and 2.y support, b)
> > ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
> > branch-2. Does anyone have preferences here?
> >
> > Personally, I think I favor option b for simplicity, though I don't
> > care for more possible delay in getting stuff out in branch-1.
> > Probably option a would be best for our downstreamers.
> >
> > Related, while we've been going around on HBASE-16179 the Apache Spark
> > community started shipping 2.1 releases and is now in the process of
> > finalizing 2.2. Do we need to do anything different for these
> > versions?
> >
> > Spark’s versioning policy suggests “not unless we want to support
> > newer APIs or used alpha stuff”. But I don’t have practical experience
> > with how this plays out in practice yet.
> >
> > http://spark.apache.org/versioning-policy.html
> >
> >
> > 3) What scala version(s) do we care about?
> >
> > For those who aren't aware, Scala compatibility is a nightmare. Since
> > Scala is still the primary language for implementation of Spark jobs,
> > we have to care more about this than I'd like. (the only way out, I
> > think, would be to implement our integration entirely in some other
> > JVM language)
> >
> > The short version is that each minor version of scala (we care about)
> > is mutually incompatible with all others. Right now both Spark 1.6 and
> > Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of
> > adding support for Scala 2.12, but it will not happen until after
> > Spark 2.2.
> >
> > (for those looking for a thread on Scala versions in Spark, I think
> > this is the most recent: https://s.apache.org/IW4D )
> >
> > Personally, I think we serve our downstreamers best when we ship
> > artifacts that work with each of the scala versions a given version of
> > Spark supports. It's painful to have to do something like upgrade your
> > scala version just because the storage layer you want to use requires
> > a particular version. It's also painful to have to rebuild artifacts
> > because that layer only offers support for the scala version you like
> > as a DIY option.
> >
> > The happy part of this situation is that the problem, as exposed to
> > us, is at a byte code level and not a source issue. So probably we can
> > support multiple scala versions just by rebuilding the same source
> > against different library versions.
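[Concretely, the wider Scala ecosystem handles this by rebuilding one source tree per Scala binary version and baking the version into the artifactId (the `_2.10` / `_2.11` suffix convention). A hypothetical pom fragment, not a committed layout:]

```xml
<!-- Hypothetical sketch: the same hbase-spark sources, built once per
     Scala binary version, published under suffixed coordinates. -->
<artifactId>hbase-spark_${scala.binary.version}</artifactId>
<properties>
  <!-- flipped to 2.10 for the second pass of the build -->
  <scala.binary.version>2.11</scala.binary.version>
</properties>
```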
> >
> > 4) Packaging all this probably will be a pain no matter what we do
> >
> > One of the key points of contention on HBASE-16179 is around module
> > layout given X versions of Spark and Y versions of Scala.
> >
> > As things are in master and branch-2 now, we support exactly Spark 1.6
> > on Scala 2.10. It would certainly be easiest to continue to just pick
> > one Spark X and one Scala Y. Ted can correct me, but I believe the
> > most recent state of HBASE-16179 does the full enumeration but only
> > places a single artifact in the assembly (thus making that combination
> > the blessed default). Now that we have precedent for client-specific
> > libraries in the assembly (i.e. the jruby libs are kept off to the
> > side and only included in classpaths that need them like the shell), I
> > think we could do a better job of making sure libraries are deployed
> > regardless of which spark and scala combination is present on a
> > cluster.
> >
> > As a downstream user, I would want to make sure I can add a dependency
> > to my maven project that will work for my particular spark/scala
> > choice. I definitely don’t want to have to run my own nexus instance
> > so that I can build my own hbase-spark client module reliably.
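[To make the downstream story concrete: if we publish one artifact per Scala binary version, a downstream pom just picks the coordinates matching its stack. Hypothetical coordinates and version, purely illustrative:]

```xml
<!-- Hypothetical: the artifactId suffix selects the Scala 2.11 build. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
```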
> >
> > As a release manager, I don’t want to have to run O(X * Y) builds just
> > so we get the right set of maven artifacts.
> >
> > All of these personal opinions stated, what do others think?
> >
> >
> > 5) Do we have the right collection of Spark API(s)?
> >
> > Spark has a large variety of APIs for interacting with data. Here are
> > pointers to the big ones.
> >
> > RDDs (essentially in-memory tabular data):
> > https://spark.apache.org/docs/latest/programming-guide.html
> >
> > Streaming (essentially a series of the above over time) :
> > https://spark.apache.org/docs/latest/streaming-programming-guide.html
> >
> > Datasets/Dataframes (sql-oriented structured data processing that
> > exposes computation info to the storage layer):
> > https://spark.apache.org/docs/latest/sql-programming-guide.html
> >
> > Structured Streaming (essentially a series of the above over time):
> > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> >
> > Right now we have support for the first three, more or less.
> > Structured Streaming is alpha as of Spark 2.1 and is expected to be GA
> > for Spark 2.2.
> >
> > Going forward, do we want our plan to be robust support for all of
> > these APIs? Would we be better off focusing solely on the newer bits
> > like dataframes?
> >
> > 6) What about the SHC project?
> >
> > In case you didn’t see the excellent talk at HBaseCon from Weiqing
> > Yang, she’s been maintaining a high quality integration library
> > between HBase and Spark.
> >
> >   HBaseCon West 2017 slides: https://s.apache.org/IQMA
> >   Blog: https://s.apache.org/m1bc
> >   Repo: https://github.com/hortonworks-spark/shc
> >
> > I’d love to see us encourage the SHC devs to fold their work into
> > participation in our wider community. Before approaching them about
> > that, I think we need to make sure we share goals and can give them
> > reasonable expectations about release cadence (which probably means
> > making it into branch-1).
> >
> > Right now, I’d only consider the things that have made it to our docs
> > to be “done”. Here’s the relevant section of the ref guide:
> >
> > http://hbase.apache.org/book.html#spark
> >
> > Comparing our current offering with the above, I’d say the big gaps
> > between it and the SHC project are:
> >
> >   * Avro serialization (we have this implemented but documentation is
> > limited to an example in the section on SparkSQL support)
> >   * Composite keys (as mentioned above, we have a start to this)
> >   * More robust handling of delegation tokens, i.e. in presence of
> > multiple secure clusters
> >   * Handling of Phoenix encoded data
> >
> > Are these all things we’d want available to our downstream folks?
> >
> > Personally, I think we’d serve our downstream folks well closing all
> > of these gaps. I don’t think they ought to be blockers on getting our
> > integration into releases; at first glance none of them look like
> > they’d present compatibility issues.
> >
> > We’d need to figure out what to do about the phoenix encoding bit,
> > dependency-wise. Ideally we’d get the phoenix folks to isolate their
> > data encoding into a standalone artifact. I’m not sure how much effort
> > that will be, but I’d be happy to take the suggestion over to them.
> >
> > ---
> >
> > Thanks to everyone who made it all the way down here.  That’s the end
> > of what I could think of after reflecting on this for a couple of days
> > (thanks to our Mike Drob for bearing the brunt of my in progress
> > ramblings).
> >
> > I know this is a wide variety of things; again feel free to just
> > respond in pieces to the sections that strike your fancy. I’ll make
> > sure we have a doc with a good summary of whatever consensus we reach
> > and post it here, the website, and/or JIRA once we’ve had awhile for
> > folks to contribute.
> >
> > -busbey
> >
>



-- 
Best regards,

   - Andy

If you are given a choice, you believe you have acted freely. - Raymond
Teller (via Peter Watts)
