hbase-dev mailing list archives

From Sean Busbey <bus...@apache.org>
Subject [DISCUSS] status of and plans for our hbase-spark integration
Date Wed, 21 Jun 2017 16:31:09 GMT
Hi Folks!

We've had integration with Apache Spark lingering in trunk for quite
some time, and I'd like us to push towards firming it up. I'm going to
try to cover a lot of ground below, so feel free to respond to just
pieces and I'll write up a doc on things afterwards.

For background, the hbase-spark module currently exists in trunk,
branch-2, and the 2.0.0-alpha-1 release. Importantly, it has not yet
shipped in any "ready to use" release. It's been in master for ~2 years and
has had a total of nearly 70 incremental changes. Right now it shows
up in Stack’s excellent state of 2.0 doc ( https://s.apache.org/1mB4 )
as a nice-to-have. I’d like to get some consensus on either getting it
into release trains or officially moving it out of scope for 2.0.

----

1) Branch-1 releases

In July 2015 we started tracking what kind of polish was needed for
this code to make it into our downstream facing release lines in
HBASE-14160. Personally, I think if the module isn't ready for a
branch-1 release, then it shouldn't be in a branch-2 release either.

The only things still tracked as required are some form of published
API docs (HBASE-17766) and an integration test (IT) that we can run
(HBASE-18175). Our Yi
Liang has been working on both of these, and I think we have a good
start on them.

Is there anything else we ought to be tracking here? I notice the
umbrella "make the connector better" issue (HBASE-14789) has only
composite row key support still open (HBASE-15335). It looks like that
work stalled out last summer after an admirable effort by our Zhan
Zhang. Can this wait for a future minor release?

Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and
then our existing support backported to branch-1 in time for whenever
we get HBase 1.4 started.

2) What Spark version(s) do we care about?

The hbase-spark module originally started with support for Spark 1.3.
It currently sits at supporting just 1.6. Our Ted Yu has been
dutifully trying to find consensus on how we handle Spark 2.0 over in
HBASE-16179 for nearly a year.

AFAICT the Spark community has no more notion of what version(s) their
downstream users are relying on than we do. It appears that Spark 1.6
will be their last 1.y release and at least the dev community is
largely moving on to 2.y releases now.

What version(s) do we want to handle and thus encourage our downstream
folks to use?

Just as a point of reference, Spark 1.6 doesn't have any proper
handling of delegation tokens and our current do-it-ourselves
workaround breaks in the presence of the support introduced in Spark
2.
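
For those who haven't been near that code: the workaround boils down
to obtaining an HBase delegation token on the driver and attaching it
to the job's credentials by hand. A rough Scala sketch of the idea
(not the module's actual implementation; it assumes a kerberized login
on the driver):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.ConnectionFactory
  import org.apache.hadoop.hbase.security.token.TokenUtil
  import org.apache.hadoop.security.UserGroupInformation

  // Ask HBase for a delegation token using the driver's Kerberos login,
  // then attach it to the current user so it ships out with the job.
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  try {
    val token = TokenUtil.obtainToken(connection)
    UserGroupInformation.getCurrentUser.addToken(token)
  } finally {
    connection.close()
  }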

The way I see it, the options are a) ship both 1.6 and 2.y support, b)
ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
branch-2. Does anyone have preferences here?

Personally, I think I favor option b for simplicity, though I don't
care for the additional delay that could mean for getting something
out in branch-1. Option a would probably be best for our downstreamers.

Related, while we've been going around on HBASE-16179 the Apache Spark
community started shipping 2.1 releases and is now in the process of
finalizing 2.2. Do we need to do anything different for these
versions?

Spark’s versioning policy suggests “not unless we want to support
newer APIs or use alpha stuff”. But I don’t have practical experience
with how this plays out yet.

http://spark.apache.org/versioning-policy.html


3) What scala version(s) do we care about?

For those who aren't aware, Scala compatibility is a nightmare. Since
Scala is still the primary language for implementation of Spark jobs,
we have to care more about this than I'd like. (the only way out, I
think, would be to implement our integration entirely in some other
JVM language)

The short version is that each minor version of Scala (that we care
about) is binary-incompatible with all the others. Right now both Spark 1.6 and
Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of
adding support for Scala 2.12, but it will not happen until after
Spark 2.2.

(for those looking for a thread on Scala versions in Spark, I think
this is the most recent: https://s.apache.org/IW4D )

Personally, I think we serve our downstreamers best when we ship
artifacts that work with each of the Scala versions a given version of
Spark supports. It's painful to have to upgrade your Scala version
just because the storage layer you want to use requires a particular
one. It's also painful when that layer only supports your preferred
Scala version as a rebuild-it-yourself option.

The happy part of this situation is that the problem, as exposed to
us, is a byte code compatibility issue rather than a source one. So we
can probably support multiple Scala versions just by rebuilding the
same source against different library versions.
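
To make that concrete: HBase itself builds with Maven, but sbt's
cross-building shows the idea compactly (the version numbers below are
placeholders, not a proposal):

  // build.sbt, illustrative only
  name := "hbase-spark"
  scalaVersion := "2.11.8"
  crossScalaVersions := Seq("2.10.6", "2.11.8")
  // `sbt +package` then builds hbase-spark_2.10 and hbase-spark_2.11
  // jars from the one source tree.

The Maven equivalent is clunkier (profiles or repeated builds with the
Scala version as a property), which feeds directly into the packaging
question below.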

4) Packaging all this probably will be a pain no matter what we do

One of the key points of contention on HBASE-16179 is around module
layout given X versions of Spark and Y versions of Scala.

As things are in master and branch-2 now, we support exactly Spark 1.6
on Scala 2.10. It would certainly be easiest to continue to just pick
one Spark X and one Scala Y. Ted can correct me, but I believe the
most recent state of HBASE-16179 does the full enumeration but only
places a single artifact in the assembly (thus making that combination
the blessed default). Now that we have precedent for client-specific
libraries in the assembly (e.g. the JRuby libs are kept off to the
side and only included in the classpaths that need them, like the
shell), I think we could do a better job of making sure libraries are
deployed regardless of which Spark and Scala combination is present on
a cluster.

As a downstream user, I would want to make sure I can add a dependency
to my Maven project that works for my particular Spark/Scala choice. I
definitely don’t want to have to run my own Nexus instance just so I
can reliably build my own hbase-spark client module.
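
Here's a sketch of what "just works" should feel like from the
consumer side, with sbt shown for brevity (a Maven <dependency> with
an explicit _2.10/_2.11 artifactId would be the equivalent). The
hbase-spark coordinates and version are hypothetical, since we don't
publish Scala-suffixed artifacts today:

  libraryDependencies ++= Seq(
    // %% appends the project's Scala binary version, so this resolves
    // to hbase-spark_2.10 or hbase-spark_2.11 to match the build.
    "org.apache.hbase" %% "hbase-spark" % "2.0.0-alpha-1",
    "org.apache.spark" %% "spark-sql"   % "1.6.3" % "provided"
  )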

As a release manager, I don’t want to have to run O(X * Y) builds just
so we get the right set of Maven artifacts.

All of these personal opinions stated, what do others think?


5) Do we have the right collection of Spark API(s)?

Spark has a large variety of APIs for interacting with data. Here are
pointers to the big ones.

RDDs (essentially distributed in-memory collections):
https://spark.apache.org/docs/latest/programming-guide.html

Streaming (essentially a series of the above over time):
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Datasets/Dataframes (SQL-oriented structured data processing that
exposes computation info to the storage layer):
https://spark.apache.org/docs/latest/sql-programming-guide.html

Structured Streaming (essentially a series of the above over time):
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Right now we have support for the first three, more or less.
Structured Streaming is alpha as of Spark 2.1 and is expected to be GA
for Spark 2.2.
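
To make the difference between the first and third of those concrete,
here's a small self-contained sketch against plain Spark 1.6-era APIs,
with no HBase involved and made-up data:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object ApiFamilies {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("api-families").setMaster("local[2]"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // RDD: an opaque distributed collection; Spark sees only our closures.
      val rdd = sc.parallelize(Seq(("row1", 1L), ("row2", 7L)))
      val total = rdd.map(_._2).reduce(_ + _)

      // DataFrame: named columns, so the optimizer can see the filter and
      // a storage layer that exposes the right hooks can have it pushed
      // down. That push-down is the point of the SparkSQL integration.
      val df = rdd.toDF("rowkey", "count")
      df.filter($"count" > 5).show()

      println(s"total = $total")
      sc.stop()
    }
  }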

Going forward, do we want our plan to be robust support for all of
these APIs? Would we be better off focusing solely on the newer bits
like dataframes?

6) What about the SHC project?

In case you didn’t see the excellent talk at HBaseCon from Weiqing
Yang, she’s been maintaining a high-quality integration library
between HBase and Spark.

  HBaseCon West 2017 slides: https://s.apache.org/IQMA
  Blog: https://s.apache.org/m1bc
  Repo: https://github.com/hortonworks-spark/shc

I’d love to see us encourage the SHC devs to fold their work into
participation in our wider community. Before approaching them about
that, I think we need to make sure we share goals and can give them
reasonable expectations about release cadence (which probably means
making it into branch-1).

Right now, I’d only consider the things that have made it to our docs
to be “done”. Here’s the relevant section of the ref guide:

http://hbase.apache.org/book.html#spark

Comparing the two, I’d say the big gaps between our current offering
and the SHC project are:

  * Avro serialization (we have this implemented but documentation is
limited to an example in the section on SparkSQL support)
  * Composite keys (as mentioned above, we have a start on this)
  * More robust handling of delegation tokens, e.g. in the presence of
multiple secure clusters
  * Handling of Phoenix encoded data

Are these all things we’d want available to our downstream folks?

Personally, I think we’d serve our downstream folks well by closing
all of these gaps. I don’t think they ought to be blockers on getting our
integration into releases; at first glance none of them look like
they’d present compatibility issues.
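
To put the composite key gap in concrete terms, this is roughly what a
downstream job has to do by hand today. The fixed region-plus-timestamp
layout is purely my illustrative assumption, not what HBASE-15335
proposes:

  import org.apache.hadoop.hbase.util.Bytes

  // Build a row key of the form <region:String><epochMillis:8-byte Long>.
  def compositeKey(region: String, epochMillis: Long): Array[Byte] =
    Bytes.add(Bytes.toBytes(region), Bytes.toBytes(epochMillis))

  // Taking it apart requires knowing the layout out-of-band, which is
  // exactly the metadata a catalog-level composite key feature would
  // carry for us.
  def splitKey(key: Array[Byte], regionLength: Int): (String, Long) =
    (Bytes.toString(key, 0, regionLength), Bytes.toLong(key, regionLength))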

We’d need to figure out what to do about the Phoenix encoding bit,
dependency-wise. Ideally we’d get the Phoenix folks to isolate their
data encoding into a standalone artifact. I’m not sure how much effort
that will be, but I’d be happy to take the suggestion over to them.

---

Thanks to everyone who made it all the way down here.  That’s the end
of what I could think of after reflecting on this for a couple of days
(thanks to our Mike Drob for bearing the brunt of my in-progress
ramblings).

I know this is a wide variety of things; again feel free to just
respond in pieces to the sections that strike your fancy. I’ll make
sure we have a doc with a good summary of whatever consensus we reach
and post it here, on the website, and/or on JIRA once we’ve had a
while for folks to contribute.

-busbey
