hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: [DISCUSS] status of and plans for our hbase-spark integration
Date Fri, 23 Jun 2017 17:06:07 GMT
On Wed, Jun 21, 2017 at 9:31 AM, Sean Busbey <busbey@apache.org> wrote:

> ....
> 1) Branch-1 releases


> Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and
> then our existing support backported to branch-1 in time for whenever
> we get HBase 1.4 started.

Sounds good.

> 2) What Spark version(s) do we care about?
> ...
> What version(s) do we want to handle and thus encourage our downstream
> folks to use?

> Personally, I think I favor option b for simplicity, though I don't
> care for more possible delay in getting stuff out in branch-1.
> Probably option a would be best for our downstreamers.
Let's do option b) well. If there is demand and there are contributions,
let's consider adding 1.6.


> 4) Packaging all this probably will be a pain no matter what we do
> One of the key points of contention on HBASE-16179 is around module
> layout given X versions of Spark and Y versions of Scala.
> As things are in master and branch-2 now, we support exactly Spark 1.6
> on Scala 2.10. It would certainly be easiest to continue to just pick
> one Spark X and one Scala Y. Ted can correct me, but I believe the
> most recent state of HBASE-16179 does the full enumeration but only
> places a single artifact in the assembly (thus making that combination
> the blessed default).

Can we do better than this?

The default will be the only option tested.

Will users bother probing to see if it is even possible to build with
another Scala (and will they trust it)?

> Now that we have precedent for client-specific
> libraries in the assembly (i.e. the jruby libs are kept off to the
> side and only included in classpaths that need them like the shell), I
> think we could do a better job of making sure libraries are deployed
> regardless of which spark and scala combination is present on a
> cluster.
Sounds good.

> As a downstream user, I would want to make sure I can add a dependency
> to my maven project that will work for my particular spark/scala
> choice. I definitely don’t want to have to run my own nexus instance
> so that I can build my own hbase-spark client module reliably.
> As a release manager, I don’t want to have to run O(X * Y) builds just
> so we get the right set of maven artifacts.
> All of these personal opinions stated, what do others think?
I think it is reasonable that the project take on these personal opinions as
build objectives; anything else seems to put an unfair burden on downstreamers.
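
To make the O(X * Y) concern concrete, here is a minimal sketch that enumerates one artifact per Spark/Scala pair. The naming scheme `hbase-spark-<spark>_<scala>` and the version lists are assumptions for illustration only; they are not the layout proposed on HBASE-16179.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArtifactMatrix {
    // Hypothetical naming scheme: hbase-spark-<sparkVersion>_<scalaVersion>.
    static List<String> artifactIds(List<String> sparkVersions, List<String> scalaVersions) {
        List<String> ids = new ArrayList<>();
        for (String spark : sparkVersions) {
            for (String scala : scalaVersions) {
                // One artifact (and one build) per combination.
                ids.add("hbase-spark-" + spark + "_" + scala);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Illustrative versions only; the supported set is what this thread is deciding.
        List<String> ids = artifactIds(Arrays.asList("1.6", "2.1"), Arrays.asList("2.10", "2.11"));
        for (String id : ids) {
            System.out.println(id);
        }
    }
}
```

Every version added to either list multiplies the number of builds a release manager has to run, which is why picking a small blessed set matters.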

> 5) Do we have the right collection of Spark API(s)?
> Spark has a large variety of APIs for interacting with data. Here are
> pointers to the big ones.
> RDDs (essentially in-memory tabular data):
> https://spark.apache.org/docs/latest/programming-guide.html
> Streaming (essentially a series of the above over time) :
> https://spark.apache.org/docs/latest/streaming-programming-guide.html
> Datasets/Dataframes (sql-oriented structured data processing that
> exposes computation info to the storage layer):
> https://spark.apache.org/docs/latest/sql-programming-guide.html
> Structured Streaming (essentially a series of the above over time):
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> Right now we have support for the first three, more or less.
> Structured Streaming is alpha as of Spark 2.1 and is expected to be GA
> for Spark 2.2.
> Going forward, do we want our plan to be robust support for all of
> these APIs? Would we be better off focusing solely on the newer bits
> like dataframes?

Probably need to do all 3.


> 6) What about the SHC project?
> In case you didn’t see the excellent talk at HBaseCon from Weiqing
> Yang, she’s been maintaining a high quality integration library
> between HBase and Spark.
>   HBaseCon West 2017 slides: https://s.apache.org/IQMA
>   Blog: https://s.apache.org/m1bc
>   Repo: https://github.com/hortonworks-spark/shc
> I’d love to see us encourage the SHC devs to fold their work into
> participation in our wider community. Before approaching them about
> that, I think we need to make sure we share goals and can give them
> reasonable expectations about release cadence (which probably means
> making it into branch-1).

I pinged Weiqing; my guess is she has an opinion on your swath here.

> Right now, I’d only consider the things that have made it to our docs
> to be “done”. Here’s the relevant section of the ref guide:
> http://hbase.apache.org/book.html#spark
> Comparing our current offering and the above, I’d say the big gaps
> between our offering and the SHC project are:
>   * Avro serialization (we have this implemented but documentation is
> limited to an example in the section on SparkSQL support)
>   * Composite keys (as mentioned above, we have a start to this)
>   * More robust handling of delegation tokens, i.e. in presence of
> multiple secure clusters
>   * Handling of Phoenix encoded data
> Are these all things we’d want available to our downstream folks?
I don't know enough about the integration, but is the 'handling of Phoenix
encoded data' about mapping Spark types to a serialization in HBase? If
not, where is the need for seamless transforms between Spark types and a
natural HBase serialization listed? We need this IIRC.

> Personally, I think we’d serve our downstream folks well closing all
> of these gaps. I don’t think they ought to be blockers on getting our
> integration into releases; at first glance none of them look like
> they’d present compatibility issues.

> We’d need to figure out what to do about the phoenix encoding bit,
> dependency-wise. Ideally we’d get the phoenix folks to isolate their
> data encoding into a standalone artifact. I’m not sure how much effort
> that will be, but I’d be happy to take the suggestion over to them.
Or use the hbase 'types' in hbase-common. Serialization is pluggable, so we
could do both native and Phoenix (<= WORK!).
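
A rough sketch of what "pluggable" could mean here, assuming a hypothetical `IntCodec` plug-in point with two interchangeable encodings. Neither class is HBase's or Phoenix's actual code: the first writes plain big-endian bytes, the second flips the sign bit so that byte-wise comparison follows numeric order, which is the kind of difference a Phoenix-compatible codec would have to account for.

```java
import java.nio.ByteBuffer;

public class PluggableCodecSketch {
    /** Hypothetical plug-in point: one implementation per on-disk encoding. */
    interface IntCodec {
        byte[] encode(int value);
        int decode(byte[] bytes);
    }

    /** Plain big-endian two's complement (ByteBuffer's default byte order). */
    static final class RawIntCodec implements IntCodec {
        public byte[] encode(int value) {
            return ByteBuffer.allocate(4).putInt(value).array();
        }
        public int decode(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getInt();
        }
    }

    /** Sign bit flipped so unsigned byte-wise order matches numeric order. */
    static final class OrderPreservingIntCodec implements IntCodec {
        public byte[] encode(int value) {
            return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
        }
        public int decode(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
        }
    }

    /** Lexicographic comparison treating each byte as unsigned, as row keys are sorted. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        IntCodec[] codecs = { new RawIntCodec(), new OrderPreservingIntCodec() };
        for (IntCodec codec : codecs) {
            int order = compareUnsigned(codec.encode(-1), codec.encode(1));
            System.out.println(codec.getClass().getSimpleName()
                + ": -1 sorts " + (order < 0 ? "before" : "after") + " 1");
        }
    }
}
```

The caller only sees `IntCodec`, so the choice of encoding could be made per table or per column without touching the read/write path.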

Great writeup.


> ---
> Thanks to everyone who made it all the way down here.  That’s the end
> of what I could think of after reflecting on this for a couple of days
> (thanks to our Mike Drob for bearing the brunt of my in progress
> ramblings).
> I know this is a wide variety of things; again feel free to just
> respond in pieces to the sections that strike your fancy. I’ll make
> sure we have a doc with a good summary of whatever consensus we reach
> and post it here, the website, and/or JIRA once we’ve had awhile for
> folks to contribute.
> -busbey
