From: Yi Liang
Date: Wed, 21 Jun 2017 13:56:30 -0700 (PDT)
To: dev@hbase.apache.org
Subject: Re: [DISCUSS] status of and plans for our hbase-spark integration

(1) Spark version: I also prefer your option b.

(2) Scala version: we can support both Scala 2.10 and 2.11, since there is
no source-level incompatibility, only a byte-code-level difference. We can
use a profile in Maven, like -PscalaVersion=xxx.xxx (see the sketch below
these points).

(3) Do we have the right collection of Spark API(s): I think for the HBase
2.0 release we can support just the first three, and add Structured
Streaming support in a future release.

(4) Composite keys: the patch in that JIRA is almost finished but may be
stale; we need to update and test it, and hopefully get it into HBase 2.0.
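A rough sketch of the profile idea in (2) -- the profile ids and property
names here are illustrative, not from an actual patch:

    <!-- pom.xml: one profile per supported Scala line; the same
         source is then compiled once per profile. -->
    <profiles>
      <profile>
        <id>scala-2.10</id>
        <properties>
          <scala.version>2.10.6</scala.version>
          <scala.binary.version>2.10</scala.binary.version>
        </properties>
      </profile>
      <profile>
        <id>scala-2.11</id>
        <properties>
          <scala.version>2.11.8</scala.version>
          <scala.binary.version>2.11</scala.binary.version>
        </properties>
      </profile>
    </profiles>

The build would then run once per profile, e.g. mvn clean install
-Pscala-2.11, to produce byte code for each Scala version.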
Thanks,
Yi

On Wed, Jun 21, 2017 at 1:18 PM, Andrew Purtell wrote:

> Phoenix has its own spark connector, which interfaces via the JDBC driver
> ( https://phoenix.apache.org/phoenix_spark.html ). Do they need to do
> anything? As of 4.10 Phoenix supports Spark >= 2.0 (PHOENIX-3333). I'm not
> sure what they do to handle Scala version combinations.
>
> On Wed, Jun 21, 2017 at 9:31 AM, Sean Busbey wrote:
>
> > Hi Folks!
> >
> > We've had integration with Apache Spark lingering in trunk for quite
> > some time, and I'd like us to push towards firming it up. I'm going to
> > try to cover a lot of ground below, so feel free to respond to just
> > pieces and I'll write up a doc on things afterwards.
> >
> > For background, the hbase-spark module currently exists in trunk,
> > branch-2, and the 2.0.0-alpha-1 release. Importantly, it has been in
> > no "ready to use" release so far. It's been in master for ~2 years and
> > has had a total of nearly 70 incremental changes. Right now it shows
> > up in Stack's excellent state of 2.0 doc ( https://s.apache.org/1mB4 )
> > as a nice-to-have. I'd like to get some consensus on either getting it
> > into release trains or officially moving it out of scope for 2.0.
> >
> > ----
> >
> > 1) Branch-1 releases
> >
> > In July 2015 we started tracking what kind of polish was needed for
> > this code to make it into our downstream-facing release lines in
> > HBASE-14160. Personally, I think if the module isn't ready for a
> > branch-1 release then it shouldn't be in a branch-2 release either.
> >
> > The only things still tracked as required are some form of published
> > API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi
> > Liang has been working on both of these, and I think we have a good
> > start on them.
> >
> > Is there anything else we ought to be tracking here? I notice the
> > umbrella "make the connector better" issue (HBASE-14789) has only
> > composite row key support still open (HBASE-15335). It looks like that
> > work stalled out last summer after an admirable effort by our Zhan
> > Zhang. Can this wait for a future minor release?
> >
> > Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and
> > then our existing support backported to branch-1 in time for whenever
> > we get HBase 1.4 started.
> >
> > 2) What Spark version(s) do we care about?
> >
> > The hbase-spark module originally started with support for Spark 1.3.
> > It currently sits at supporting just 1.6. Our Ted Yu has been
> > dutifully trying to find consensus on how we handle Spark 2.0 over in
> > HBASE-16179 for nearly a year.
> >
> > AFAICT the Spark community has no more notion of what version(s) their
> > downstream users are relying on than we do. It appears that Spark 1.6
> > will be their last 1.y release, and at least the dev community is
> > largely moving on to 2.y releases now.
> >
> > What version(s) do we want to handle and thus encourage our downstream
> > folks to use?
> >
> > Just as a point of reference, Spark 1.6 doesn't have any proper
> > handling of delegation tokens, and our current do-it-ourselves
> > workaround breaks in the presence of the support introduced in
> > Spark 2.
> >
> > The way I see it, the options are a) ship both 1.6 and 2.y support, b)
> > ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
> > branch-2. Does anyone have preferences here?
> >
> > Personally, I think I favor option b for simplicity, though I don't
> > care for more possible delay in getting stuff out in branch-1.
> > Probably option a would be best for our downstreamers.
> >
> > Related, while we've been going around on HBASE-16179 the Apache Spark
> > community started shipping 2.1 releases and is now in the process of
> > finalizing 2.2. Do we need to do anything different for these
> > versions?
> >
> > Spark's versioning policy suggests "not unless we want to support
> > newer APIs or use alpha stuff". But I don't have practical experience
> > with how this plays out yet.
> >
> > http://spark.apache.org/versioning-policy.html
> >
> > 3) What Scala version(s) do we care about?
> >
> > For those who aren't aware, Scala compatibility is a nightmare. Since
> > Scala is still the primary language for implementation of Spark jobs,
> > we have to care more about this than I'd like. (The only way out, I
> > think, would be to implement our integration entirely in some other
> > JVM language.)
> >
> > The short version is that each minor version of Scala (we care about)
> > is mutually incompatible with all others. Right now both Spark 1.6 and
> > Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of
> > adding support for Scala 2.12, but it will not happen until after
> > Spark 2.2.
> >
> > (For those looking for a thread on Scala versions in Spark, I think
> > this is the most recent: https://s.apache.org/IW4D )
> >
> > Personally, I think we serve our downstreamers best when we ship
> > artifacts that work with each of the Scala versions a given version of
> > Spark supports. It's painful to have to do something like upgrade your
> > Scala version just because the storage layer you want to use requires
> > a particular version. It's also painful to have to rebuild artifacts
> > because that layer only offers support for the Scala version you like
> > as a DIY option.
> >
> > The happy part of this situation is that the problem, as exposed to
> > us, is at a byte code level and not a source issue. So probably we can
> > support multiple Scala versions just by rebuilding the same source
> > against different library versions.
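Concretely, "rebuilding the same source against different library
versions" would mean publishing per-Scala-version artifacts the way Spark
itself does (spark-core_2.10 vs. spark-core_2.11), so a downstream pom can
pick the matching one. A hypothetical dependency, with the suffixed
artifact id purely illustrative -- nothing like this is settled:

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-spark_2.11</artifactId>
      <version>2.0.0-alpha-1</version>
    </dependency>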
> >
> > 4) Packaging all this probably will be a pain no matter what we do
> >
> > One of the key points of contention on HBASE-16179 is around module
> > layout given X versions of Spark and Y versions of Scala.
> >
> > As things are in master and branch-2 now, we support exactly Spark 1.6
> > on Scala 2.10. It would certainly be easiest to continue to just pick
> > one Spark X and one Scala Y. Ted can correct me, but I believe the
> > most recent state of HBASE-16179 does the full enumeration but only
> > places a single artifact in the assembly (thus making that combination
> > the blessed default). Now that we have precedent for client-specific
> > libraries in the assembly (i.e. the jruby libs are kept off to the
> > side and only included in classpaths that need them, like the shell),
> > I think we could do a better job of making sure libraries are deployed
> > regardless of which Spark and Scala combination is present on a
> > cluster.
> >
> > As a downstream user, I would want to make sure I can add a dependency
> > to my Maven project that will work for my particular Spark/Scala
> > choice. I definitely don't want to have to run my own Nexus instance
> > so that I can build my own hbase-spark client module reliably.
> >
> > As a release manager, I don't want to have to run O(X * Y) builds just
> > so we get the right set of Maven artifacts.
> >
> > All of these personal opinions stated, what do others think?
> >
> > 5) Do we have the right collection of Spark API(s)?
> >
> > Spark has a large variety of APIs for interacting with data. Here are
> > pointers to the big ones.
> >
> > RDDs (essentially in-memory tabular data):
> > https://spark.apache.org/docs/latest/programming-guide.html
> >
> > Streaming (essentially a series of the above over time):
> > https://spark.apache.org/docs/latest/streaming-programming-guide.html
> >
> > Datasets/Dataframes (SQL-oriented structured data processing that
> > exposes computation info to the storage layer):
> > https://spark.apache.org/docs/latest/sql-programming-guide.html
> >
> > Structured Streaming (essentially a series of the above over time):
> > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> >
> > Right now we have support for the first three, more or less.
> > Structured Streaming is alpha as of Spark 2.1 and is expected to be GA
> > for Spark 2.2.
> >
> > Going forward, do we want our plan to be robust support for all of
> > these APIs? Would we be better off focusing solely on the newer bits
> > like dataframes?
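For anyone following along who has not tried the module: a condensed
sketch of the RDD-level entry point as it sits in master today, adapted
from the ref guide's example (table and column family names are
illustrative; the Dataframe side additionally maps tables through a
catalog, per the SparkSQL section of the book):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "hbase-spark-sketch")
    // HBaseContext is the module's core handle; it distributes the
    // HBase configuration and connections to the executors.
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // Write an RDD of (rowKey, value) pairs into column family "cf".
    val rdd = sc.parallelize(Seq(
      (Bytes.toBytes("row1"), Bytes.toBytes("value1")),
      (Bytes.toBytes("row2"), Bytes.toBytes("value2"))))

    hbaseContext.bulkPut[(Array[Byte], Array[Byte])](
      rdd,
      TableName.valueOf("test_table"),
      record => new Put(record._1)
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"), record._2))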
> >
> > 6) What about the SHC project?
> >
> > In case you didn't see the excellent talk at HBaseCon from Weiqing
> > Yang, she's been maintaining a high-quality integration library
> > between HBase and Spark.
> >
> > HBaseCon West 2017 slides: https://s.apache.org/IQMA
> > Blog: https://s.apache.org/m1bc
> > Repo: https://github.com/hortonworks-spark/shc
> >
> > I'd love to see us encourage the SHC devs to fold their work into
> > participation in our wider community. Before approaching them about
> > that, I think we need to make sure we share goals and can give them
> > reasonable expectations about release cadence (which probably means
> > making it into branch-1).
> >
> > Right now, I'd only consider the things that have made it to our docs
> > to be "done". Here's the relevant section of the ref guide:
> >
> > http://hbase.apache.org/book.html#spark
> >
> > Comparing our current offering and the above, I'd say the big gaps
> > between our offering and the SHC project are:
> >
> > * Avro serialization (we have this implemented, but documentation is
> > limited to an example in the section on SparkSQL support)
> > * Composite keys (as mentioned above, we have a start on this)
> > * More robust handling of delegation tokens, i.e. in the presence of
> > multiple secure clusters
> > * Handling of Phoenix-encoded data
> >
> > Are these all things we'd want available to our downstream folks?
> >
> > Personally, I think we'd serve our downstream folks well by closing
> > all of these gaps. I don't think they ought to be blockers on getting
> > our integration into releases; at first glance none of them look like
> > they'd present compatibility issues.
> >
> > We'd need to figure out what to do about the Phoenix encoding bit,
> > dependency-wise. Ideally we'd get the Phoenix folks to isolate their
> > data encoding into a standalone artifact. I'm not sure how much effort
> > that will be, but I'd be happy to take the suggestion over to them.
> >
> > ---
> >
> > Thanks to everyone who made it all the way down here. That's the end
> > of what I could think of after reflecting on this for a couple of days
> > (thanks to our Mike Drob for bearing the brunt of my in-progress
> > ramblings).
> >
> > I know this is a wide variety of things; again, feel free to just
> > respond in pieces to the sections that strike your fancy. I'll make
> > sure we have a doc with a good summary of whatever consensus we reach
> > and post it here, on the website, and/or in JIRA once we've had a
> > while for folks to contribute.
> >
> > -busbey
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk