Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Reply-To: dev@hbase.apache.org
MIME-Version: 1.0
Sender: mdrob@cloudera.com
From: Mike Drob
Date: Thu, 22 Jun 2017 10:00:44 -0500
Subject: Re: [DISCUSS] status of and plans for our hbase-spark integration
To: dev
Content-Type: multipart/alternative; boundary="f4030435cf04fe846c05528dc066"
archived-at: Thu, 22 Jun 2017 15:01:14 -0000

--f4030435cf04fe846c05528dc066
Content-Type: text/plain; charset="UTF-8"

That's a lot of ground you're trying to cover, Sean, thanks for putting
this together.

> 1) Branch-1 releases
> Is there anything else we ought to be tracking here?

We currently have code in the o.a.spark namespace. I don't think there is
a JIRA for it yet, but this seems like cross-project trouble waiting to
happen.

https://github.com/apache/hbase/tree/master/hbase-spark/src/main/scala/org/apache/spark

> The way I see it, the options are a) ship both 1.6 and 2.y support, b)
> ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in
> branch-2. Does anyone have preferences here?

I think I prefer option B here as well. It sounds like Spark 2.2 will be
out Very Soon, so we should almost certainly have a story for that. If
there are no compatibility issues, then we can support >= 2.0 or 2.1,
otherwise there's no reason to try and hit the moving target and we can
focus on supporting the newest release.
Like you said earlier, there's been no official release of this module
yet, so I have to imagine that the current consumers are knowingly
bleeding edge and can handle an upgrade or recompile on their own.

> 4) Packaging all this probably will be a pain no matter what we do

Do we have to package this in our assembly at all? Currently, we include
the hbase-spark module in the branch-2 and master assembly, but I'm not
convinced this needs to be the case. Is it too much to ask users to build
a jar with dependencies (which I think we already do) and include the
appropriate spark/scala/hbase jars in it (pulled from maven)? I think this
problem can be better solved through docs and client tooling rather than
going through awkward gymnastics to package m*n versions in our tarball
_and_ making sure that we get all the classpaths right.

> 5) Do we have the right collection of Spark API(s):

Agree with Yi Liang here, release what we have then worry about adding
things later.

On Thu, Jun 22, 2017 at 8:26 AM, Sean Busbey wrote:

> On Wed, Jun 21, 2017 at 10:37 PM, Stack wrote:
> > On Wed, Jun 21, 2017 at 5:26 PM, Andrew Purtell wrote:
> >
> >> I seem to recall that what eventually was committed to master as
> >> hbase-spark was first shopped to the Spark project, who felt the same,
> >> that it should be hosted elsewhere.
> >
> > I have the same remembrance.
> >
> >> .... I would draw an analogy with
> >> mapreduce: we had what we called 'first class' mapreduce integration,
> >> spark is the alleged successor to mapreduce, we should evolve that
> >> support as such. I'd like to know if that reasoning, or other rationale,
> >> is sufficient at this time.
> >
> > Spark should be first-class on equal footing with MR if not more so (our
> > MR integration is too tightly bound up with our internals badly in need of
> > untangling).
> >
> > Reading over the scope of work Sean outlines -- the variants, pom
> > profiles, the module profusion, and the uncertainties -- makes me queasy
> > pulling it all in.
> >
> > I'm working on a little mini-hbase project at the mo to shade guava,
> > etc., and it is easy going. Made me think we could do a mini-project to
> > host spark so we could contain it should it go up in flames.
> >
> > S
>
> I think the current approach of keeping all the spark related stuff in
> a set of modules that we don't depend on for our other bits
> sufficiently isolates us from the risk of things blowing up. For
> example, when we're ready to build some of our admin tools on the
> spark integration instead of MR we can update them to use Java
> Services API or some similar runtime loading method to avoid having a
> dependency directly on the Spark artifacts.
>
> It's true that we could put this into a different repo with its own
> release cycle, but I suspect that will lead to even more build pain.
> Especially given that it's likely to remain under active development
> for the foreseeable future and we'll want to package some version of
> it in our convenience binary assembly. Contrast with our third party
> dependencies, which tend to remain the same over relatively large
> timespans (e.g. a major version). If we end up voting on releases that
> cover a version from both this hypothetical hbase-spark repo and the
> main repo, what would we have really gained by splitting the two up?
>

--f4030435cf04fe846c05528dc066--