From: Sean Busbey
Date: Wed, 21 Jun 2017 11:31:09 -0500
Subject: [DISCUSS] status of and plans for our hbase-spark integration
To: dev@hbase.apache.org

Hi Folks!

We've had integration with Apache Spark lingering in trunk for quite some time, and I'd like us to push towards firming it up. I'm going to try to cover a lot of ground below, so feel free to respond to just pieces and I'll write up a doc on things afterwards.

For background, the hbase-spark module currently exists in trunk, branch-2, and the 2.0.0-alpha-1 release. Importantly, it has not been in any "ready to use" release so far. It's been in master for ~2 years and has had a total of nearly 70 incremental changes.

Right now it shows up in Stack's excellent state of 2.0 doc ( https://s.apache.org/1mB4 ) as a nice-to-have. I'd like to get some consensus on either getting it into release trains or officially moving it out of scope for 2.0.

----

1) Branch-1 releases

In July 2015 we started tracking what kind of polish was needed for this code to make it into our downstream-facing release lines in HBASE-14160. Personally, I think if the module isn't ready for a branch-1 release then it shouldn't be in a branch-2 release either.

The only things still tracked as required are some form of published API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi Liang has been working on both of these, and I think we have a good start on them. Is there anything else we ought to be tracking here?
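(For concreteness, the kind of round trip I'd expect that IT to exercise is roughly the sketch below. This is purely illustrative; it uses the HBaseContext bulkPut / hbaseRDD calls from the current module as I remember them, the table and column names are made up, and a real IT would run against a distributed cluster rather than a local SparkContext.)

    // Illustrative write-then-read round trip; method names from memory, verify before use.
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{Put, Scan}
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkRoundTripSketch {
      def main(args: Array[String]): Unit = {
        // Local master only so the sketch is self-contained; an IT would submit to a cluster.
        val sc = new SparkContext(new SparkConf().setAppName("hbase-spark-it-sketch").setMaster("local[2]"))
        val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
        val table = TableName.valueOf("it_table")  // hypothetical table name

        // Write a handful of rows through the connector...
        val rows = sc.parallelize(1 to 100).map(i => (Bytes.toBytes(s"row$i"), Bytes.toBytes(i)))
        hbaseContext.bulkPut[(Array[Byte], Array[Byte])](rows, table,
          r => new Put(r._1).addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), r._2))

        // ...then scan them back and check the count matches what we wrote.
        val readBack = hbaseContext.hbaseRDD(table, new Scan()).count()
        assert(readBack == 100, s"expected 100 rows, got $readBack")
        sc.stop()
      }
    }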
I notice the umbrella "make the connector better" issue (HBASE-14789) has only composite row key support still open (HBASE-15335). It looks like that work stalled out last summer after an admirable effort by our Zhan Zhang. Can this wait for a future minor release?

Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and then our existing support backported to branch-1 in time for whenever we get HBase 1.4 started.

2) What Spark version(s) do we care about?

The hbase-spark module originally started with support for Spark 1.3. It currently sits at supporting just Spark 1.6. Our Ted Yu has been dutifully trying to find consensus on how we handle Spark 2.0 over in HBASE-16179 for nearly a year.

AFAICT the Spark community has no more notion of what version(s) their downstream users are relying on than we do. It appears that Spark 1.6 will be their last 1.y release, and at least the dev community is largely moving on to 2.y releases now. What version(s) do we want to handle and thus encourage our downstream folks to use?

Just as a point of reference, Spark 1.6 doesn't have any proper handling of delegation tokens, and our current do-it-ourselves workaround breaks in the presence of the support introduced in Spark 2.

The way I see it, the options are a) ship both 1.6 and 2.y support, b) ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in branch-2. Does anyone have preferences here? Personally, I think I favor option b for simplicity, though I don't care for the possible additional delay in getting stuff out in branch-1. Probably option a would be best for our downstreamers.

Related, while we've been going around on HBASE-16179 the Apache Spark community started shipping 2.1 releases and is now in the process of finalizing 2.2. Do we need to do anything different for these versions? Spark's versioning policy suggests "not unless we want to support newer APIs or use alpha stuff". But I don't have practical experience with how this plays out yet.

http://spark.apache.org/versioning-policy.html

3) What scala version(s) do we care about?

For those who aren't aware, Scala compatibility is a nightmare. Since Scala is still the primary language for implementing Spark jobs, we have to care more about this than I'd like. (The only way out, I think, would be to implement our integration entirely in some other JVM language.)

The short version is that each minor version of Scala (that we care about) is mutually incompatible with all the others. Right now both Spark 1.6 and Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of adding support for Scala 2.12, but it will not happen until after Spark 2.2. (For those looking for a thread on Scala versions in Spark, I think this is the most recent: https://s.apache.org/IW4D )

Personally, I think we serve our downstreamers best when we ship artifacts that work with each of the Scala versions a given version of Spark supports. It's painful to have to do something like upgrade your Scala version just because the storage layer you want to use requires a particular version. It's also painful to have to rebuild artifacts because that layer only offers support for the Scala version you like as a DIY option.

The happy part of this situation is that the problem, as exposed to us, is at the byte-code level and not a source-level issue. So we can probably support multiple Scala versions just by rebuilding the same source against different library versions.
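(To make that byte-code-vs-source point concrete, here's what a cross-build looks like in sbt terms. This is purely illustrative: our module builds with Maven, where we'd likely do the equivalent with profiles that swap the scala.binary.version property, and the version numbers below are just examples.)

    // Illustrative only: sbt compiles the same source once per entry in crossScalaVersions
    // and appends the Scala binary version to the artifact name (hbase-spark_2.10, _2.11).
    name := "hbase-spark"

    crossScalaVersions := Seq("2.10.6", "2.11.8")

    // "%%" picks the spark-core artifact matching whichever Scala version is being built.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"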
4) Packaging all this will probably be a pain no matter what we do

One of the key points of contention on HBASE-16179 is around module layout given X versions of Spark and Y versions of Scala.

As things are in master and branch-2 now, we support exactly Spark 1.6 on Scala 2.10. It would certainly be easiest to continue to just pick one Spark X and one Scala Y. Ted can correct me, but I believe the most recent state of HBASE-16179 does the full enumeration but only places a single artifact in the assembly (thus making that combination the blessed default).

Now that we have precedent for client-specific libraries in the assembly (i.e. the jruby libs are kept off to the side and only included in classpaths that need them, like the shell), I think we could do a better job of making sure libraries are deployed regardless of which Spark and Scala combination is present on a cluster.

As a downstream user, I would want to make sure I can add a dependency to my maven project that will work for my particular Spark/Scala choice. I definitely don't want to have to run my own nexus instance so that I can build my own hbase-spark client module reliably.

As a release manager, I don't want to have to run O(X * Y) builds just so we get the right set of maven artifacts.

All of these personal opinions stated, what do others think?

5) Do we have the right collection of Spark API(s)?

Spark has a large variety of APIs for interacting with data. Here are pointers to the big ones.

RDDs (essentially in-memory tabular data):
https://spark.apache.org/docs/latest/programming-guide.html

Streaming (essentially a series of the above over time):
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Datasets/DataFrames (SQL-oriented structured data processing that exposes computation info to the storage layer):
https://spark.apache.org/docs/latest/sql-programming-guide.html

Structured Streaming (essentially a series of the above over time):
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Right now we have support for the first three, more or less. Structured Streaming is alpha as of Spark 2.1 and is expected to be GA for Spark 2.2.

Going forward, do we want our plan to be robust support for all of these APIs? Would we be better off focusing solely on the newer bits like DataFrames?

6) What about the SHC project?

In case you didn't see the excellent talk at HBaseCon from Weiqing Yang, she's been maintaining a high-quality integration library between HBase and Spark.

HBaseCon West 2017 slides: https://s.apache.org/IQMA
Blog: https://s.apache.org/m1bc
Repo: https://github.com/hortonworks-spark/shc

I'd love to see us encourage the SHC devs to fold their work into participation in our wider community. Before approaching them about that, I think we need to make sure we share goals and can give them reasonable expectations about release cadence (which probably means making it into branch-1).

Right now, I'd only consider the things that have made it to our docs to be "done".
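(For anyone who hasn't looked recently, the documented SparkSQL/DataFrames path is roughly the spark-shell style sketch below. I'm writing it from memory, so the exact package, class names like HBaseTableCatalog, and the format string should be double-checked against the ref guide section linked next; the table and column names are made up.)

    // Sketch of the catalog-driven DataFrame read; names from memory, verify against the docs.
    import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog
    import org.apache.spark.sql.{DataFrame, SQLContext}

    // The catalog JSON maps DataFrame columns onto the row key and column family/qualifier pairs.
    val catalog = s"""{
        |"table":{"namespace":"default", "name":"person"},
        |"rowkey":"key",
        |"columns":{
          |"id":{"cf":"rowkey", "col":"key", "type":"string"},
          |"name":{"cf":"info", "col":"name", "type":"string"}
        |}
      |}""".stripMargin

    def readPeople(sqlContext: SQLContext): DataFrame =
      sqlContext.read
        .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
        .format("org.apache.hadoop.hbase.spark")
        .load()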
Here's the relevant section of the ref guide:

http://hbase.apache.org/book.html#spark

Comparing our current offering to the above, I'd say the big gaps between us and the SHC project are:

* Avro serialization (we have this implemented, but documentation is limited to an example in the section on SparkSQL support)
* Composite keys (as mentioned above, we have a start on this)
* More robust handling of delegation tokens, i.e. in the presence of multiple secure clusters
* Handling of Phoenix-encoded data

Are these all things we'd want available to our downstream folks?

Personally, I think we'd serve our downstream folks well by closing all of these gaps. I don't think they ought to be blockers on getting our integration into releases; at first glance none of them look like they'd present compatibility issues.

We'd need to figure out what to do about the Phoenix encoding bit, dependency-wise. Ideally we'd get the Phoenix folks to isolate their data encoding into a standalone artifact. I'm not sure how much effort that will be, but I'd be happy to take the suggestion over to them.

---

Thanks to everyone who made it all the way down here. That's the end of what I could think of after reflecting on this for a couple of days (thanks to our Mike Drob for bearing the brunt of my in-progress ramblings).

I know this is a wide variety of things; again, feel free to just respond in pieces to the sections that strike your fancy. I'll make sure we have a doc with a good summary of whatever consensus we reach and post it here, the website, and/or JIRA once we've had a while for folks to contribute.

-busbey