From: Yi Liang
Date: Wed, 21 Jun 2017 13:56:30 -0700 (PDT)
To: dev@hbase.apache.org
Subject: Re: [DISCUSS] status of and plans for our hbase-spark integration

(1) Spark version: I also prefer your option b.

(2) Scala version: we can support both Scala 2.10 and 2.11, since there is
no source-level incompatibility, only a byte-code-level difference. We can
use a profile in Maven, like -PscalaVersion=xxx.xxx (see the sketch below
these points).

(3) Do we have the right collection of Spark API(s): I think for the HBase
2.0 release we can support just the first three, and add Structured
Streaming support in a future release.

(4) Composite keys: the patch in that JIRA is almost finished but may be
stale; we need to update and test it, and hopefully get it into HBase 2.0.
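A rough sketch of the profile idea in (2) -- the profile ids and property
names here are illustrative, not from an actual patch:

    <!-- pom.xml: one profile per supported Scala line; the same
         source is then compiled once per profile. -->
    <profiles>
      <profile>
        <id>scala-2.10</id>
        <properties>
          <scala.version>2.10.6</scala.version>
          <scala.binary.version>2.10</scala.binary.version>
        </properties>
      </profile>
      <profile>
        <id>scala-2.11</id>
        <properties>
          <scala.version>2.11.8</scala.version>
          <scala.binary.version>2.11</scala.binary.version>
        </properties>
      </profile>
    </profiles>

The build would then run once per profile, e.g. mvn clean install
-Pscala-2.11, to produce byte code for each Scala version.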
Thanks,
Yi

On Wed, Jun 21, 2017 at 1:18 PM, Andrew Purtell wrote:

> Phoenix has its own spark connector, which interfaces via the JDBC driver
> ( https://phoenix.apache.org/phoenix_spark.html ). Do they need to do
> anything? As of 4.10 Phoenix supports Spark >= 2.0 (PHOENIX-3333). I'm not
> sure what they do to handle Scala version combinations.
>
> On Wed, Jun 21, 2017 at 9:31 AM, Sean Busbey wrote:
>
> > Hi Folks!
> >
> > We've had integration with Apache Spark lingering in trunk for quite
> > some time, and I'd like us to push towards firming it up. I'm going to
> > try to cover a lot of ground below, so feel free to respond to just
> > pieces and I'll write up a doc on things afterwards.
> >
> > For background, the hbase-spark module currently exists in trunk,
> > branch-2, and the 2.0.0-alpha-1 release. Importantly, it has been in
> > no "ready to use" release so far. It's been in master for ~2 years and
> > has had a total of nearly 70 incremental changes. Right now it shows
> > up in Stack's excellent state of 2.0 doc ( https://s.apache.org/1mB4 )
> > as a nice-to-have. I'd like to get some consensus on either getting it
> > into release trains or officially moving it out of scope for 2.0.
> >
> > ----
> >
> > 1) Branch-1 releases
> >
> > In July 2015 we started tracking what kind of polish was needed for
> > this code to make it into our downstream-facing release lines in
> > HBASE-14160. Personally, I think if the module isn't ready for a
> > branch-1 release then it shouldn't be in a branch-2 release either.
> >
> > The only things still tracked as required are some form of published
> > API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi
> > Liang has been working on both of these, and I think we have a good
> > start on them.
> >
> > Is there anything else we ought to be tracking here? I notice the
> > umbrella "make the connector better" issue (HBASE-14789) has only
> > composite row key support still open (HBASE-15335). It looks like that
> > work stalled out last summer after an admirable effort by our Zhan
> > Zhang. Can this wait for a future minor release?
> >
> > Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and
> > then our existing support backported to branch-1 in time for whenever
> > we get HBase 1.4 started.
> >
> > 2) What Spark version(s) do we care about?
> >
> > The hbase-spark module originally started with support for Spark 1.3.
> > It currently sits at supporting just 1.6. Our Ted Yu has been
> > dutifully trying to find consensus on how we handle Spark 2.0 over in
> > HBASE-16179 for nearly a year.
> >
> > AFAICT the Spark community has no more notion of what version(s) their
> > downstream users are relying on than we do. It appears that Spark 1.6
> > will be their last 1.y release, and at least the dev community is
> > largely moving on to 2.y releases now.
> >
> > What version(s) do we want to handle and thus encourage our downstream
> > folks to use?
> >
> > Just as a point of reference, Spark 1.6 doesn't have any proper
> > handling of delegation tokens, and our current do-it-ourselves
> > workaround breaks in the presence of the support introduced in
> > Spark 2.
> >
> > The way I see it, the options are a) ship both 1.6 and 2.y support, b)
> > ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
> > branch-2. Does anyone have preferences here?
> >
> > Personally, I think I favor option b for simplicity, though I don't
> > care for more possible delay in getting stuff out in branch-1.
> > Probably option a would be best for our downstreamers.
> >
> > Related, while we've been going around on HBASE-16179 the Apache Spark
> > community started shipping 2.1 releases and is now in the process of
> > finalizing 2.2. Do we need to do anything different for these
> > versions?
> >
> > Spark's versioning policy suggests "not unless we want to support
> > newer APIs or use alpha stuff". But I don't have practical experience
> > with how this plays out yet.
> >
> > http://spark.apache.org/versioning-policy.html
> >
> > 3) What Scala version(s) do we care about?
> >
> > For those who aren't aware, Scala compatibility is a nightmare. Since
> > Scala is still the primary language for implementation of Spark jobs,
> > we have to care more about this than I'd like. (The only way out, I
> > think, would be to implement our integration entirely in some other
> > JVM language.)
> >
> > The short version is that each minor version of Scala (we care about)
> > is mutually incompatible with all others. Right now both Spark 1.6 and
> > Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of
> > adding support for Scala 2.12, but it will not happen until after
> > Spark 2.2.
> >
> > (For those looking for a thread on Scala versions in Spark, I think
> > this is the most recent: https://s.apache.org/IW4D )
> >
> > Personally, I think we serve our downstreamers best when we ship
> > artifacts that work with each of the Scala versions a given version of
> > Spark supports. It's painful to have to do something like upgrade your
> > Scala version just because the storage layer you want to use requires
> > a particular version. It's also painful to have to rebuild artifacts
> > because that layer only offers support for the Scala version you like
> > as a DIY option.
> >
> > The happy part of this situation is that the problem, as exposed to
> > us, is at a byte code level and not a source issue. So probably we can
> > support multiple Scala versions just by rebuilding the same source
> > against different library versions.
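Concretely, "rebuilding the same source against different library
versions" would mean publishing per-Scala-version artifacts the way Spark
itself does (spark-core_2.10 vs. spark-core_2.11), so a downstream pom can
pick the matching one. A hypothetical dependency, with the suffixed
artifact id purely illustrative -- nothing like this is settled:

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-spark_2.11</artifactId>
      <version>2.0.0-alpha-1</version>
    </dependency>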
> >
> > 4) Packaging all this probably will be a pain no matter what we do
> >
> > One of the key points of contention on HBASE-16179 is around module
> > layout given X versions of Spark and Y versions of Scala.
> >
> > As things are in master and branch-2 now, we support exactly Spark 1.6
> > on Scala 2.10. It would certainly be easiest to continue to just pick
> > one Spark X and one Scala Y. Ted can correct me, but I believe the
> > most recent state of HBASE-16179 does the full enumeration but only
> > places a single artifact in the assembly (thus making that combination
> > the blessed default). Now that we have precedent for client-specific
> > libraries in the assembly (i.e. the jruby libs are kept off to the
> > side and only included in classpaths that need them, like the shell),
> > I think we could do a better job of making sure libraries are deployed
> > regardless of which Spark and Scala combination is present on a
> > cluster.
> >
> > As a downstream user, I would want to make sure I can add a dependency
> > to my Maven project that will work for my particular Spark/Scala
> > choice. I definitely don't want to have to run my own Nexus instance
> > so that I can build my own hbase-spark client module reliably.
> >
> > As a release manager, I don't want to have to run O(X * Y) builds just
> > so we get the right set of Maven artifacts.
> >
> > All of these personal opinions stated, what do others think?
> >
> > 5) Do we have the right collection of Spark API(s)?
> >
> > Spark has a large variety of APIs for interacting with data. Here are
> > pointers to the big ones.
> >
> > RDDs (essentially in-memory tabular data):
> > https://spark.apache.org/docs/latest/programming-guide.html
> >
> > Streaming (essentially a series of the above over time):
> > https://spark.apache.org/docs/latest/streaming-programming-guide.html
> >
> > Datasets/Dataframes (SQL-oriented structured data processing that
> > exposes computation info to the storage layer):
> > https://spark.apache.org/docs/latest/sql-programming-guide.html
> >
> > Structured Streaming (essentially a series of the above over time):
> > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> >
> > Right now we have support for the first three, more or less.
> > Structured Streaming is alpha as of Spark 2.1 and is expected to be GA
> > for Spark 2.2.
> >
> > Going forward, do we want our plan to be robust support for all of
> > these APIs? Would we be better off focusing solely on the newer bits
> > like dataframes?
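For anyone following along who has not tried the module: a condensed
sketch of the RDD-level entry point as it sits in master today, adapted
from the ref guide's example (table and column family names are
illustrative; the Dataframe side additionally maps tables through a
catalog, per the SparkSQL section of the book):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "hbase-spark-sketch")
    // HBaseContext is the module's core handle; it distributes the
    // HBase configuration and connections to the executors.
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // Write an RDD of (rowKey, value) pairs into column family "cf".
    val rdd = sc.parallelize(Seq(
      (Bytes.toBytes("row1"), Bytes.toBytes("value1")),
      (Bytes.toBytes("row2"), Bytes.toBytes("value2"))))

    hbaseContext.bulkPut[(Array[Byte], Array[Byte])](
      rdd,
      TableName.valueOf("test_table"),
      record => new Put(record._1)
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"), record._2))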
> >
> > 6) What about the SHC project?
> >
> > In case you didn't see the excellent talk at HBaseCon from Weiqing
> > Yang, she's been maintaining a high-quality integration library
> > between HBase and Spark.
> >
> > HBaseCon West 2017 slides: https://s.apache.org/IQMA
> > Blog: https://s.apache.org/m1bc
> > Repo: https://github.com/hortonworks-spark/shc
> >
> > I'd love to see us encourage the SHC devs to fold their work into
> > participation in our wider community. Before approaching them about
> > that, I think we need to make sure we share goals and can give them
> > reasonable expectations about release cadence (which probably means
> > making it into branch-1).
> >
> > Right now, I'd only consider the things that have made it to our docs
> > to be "done". Here's the relevant section of the ref guide:
> >
> > http://hbase.apache.org/book.html#spark
> >
> > Comparing our current offering and the above, I'd say the big gaps
> > between our offering and the SHC project are:
> >
> > * Avro serialization (we have this implemented, but documentation is
> > limited to an example in the section on SparkSQL support)
> > * Composite keys (as mentioned above, we have a start on this)
> > * More robust handling of delegation tokens, i.e. in the presence of
> > multiple secure clusters
> > * Handling of Phoenix-encoded data
> >
> > Are these all things we'd want available to our downstream folks?
> >
> > Personally, I think we'd serve our downstream folks well by closing
> > all of these gaps. I don't think they ought to be blockers on getting
> > our integration into releases; at first glance none of them look like
> > they'd present compatibility issues.
> >
> > We'd need to figure out what to do about the Phoenix encoding bit,
> > dependency-wise. Ideally we'd get the Phoenix folks to isolate their
> > data encoding into a standalone artifact. I'm not sure how much effort
> > that will be, but I'd be happy to take the suggestion over to them.
> >
> > ---
> >
> > Thanks to everyone who made it all the way down here. That's the end
> > of what I could think of after reflecting on this for a couple of days
> > (thanks to our Mike Drob for bearing the brunt of my in-progress
> > ramblings).
> >
> > I know this is a wide variety of things; again, feel free to just
> > respond in pieces to the sections that strike your fancy. I'll make
> > sure we have a doc with a good summary of whatever consensus we reach
> > and post it here, on the website, and/or in JIRA once we've had a
> > while for folks to contribute.
> >
> > -busbey
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk