Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Reply-To: user@hive.apache.org
References: <201505201338147389944@yahoo.com.hk>
Date: Thu, 21 May 2015 22:31:57 -0700
Subject: Re: Hive on Spark VS Spark SQL
From: Cheolsoo Park
To: user@hive.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=bcaec51969418d2e290516a4fa38

--bcaec51969418d2e290516a4fa38
Content-Type: text/plain; charset=UTF-8

Hi Xuefu,

Thanks for the good comparison. I agree with most points, but #1 isn't true.

SparkSQL has its own parser (implemented with the Scala parser combinator
library), analyzer, and optimizer, although they're not as mature as Hive's.
What it depends on Hive for is the Metastore, CliDriver, DDL parser, etc.

Cheolsoo

On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang wrote:

> I have been working on Hive on Spark, and know a little about SparkSQL.
> Here are a few factors to be considered:
>
> 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
> front end (parser and semantic analyzer) and metastore, and injects in
> between a layer where Hive's operator tree is reinterpreted in Spark's
> constructs (transformations and actions). Thus, it's tied to a specific
> version of Hive, which is always behind official Hive releases.
> 2. Because of the reinterpretation, many features (window functions,
> lateral views, etc.) from Hive need to be reimplemented in the Spark
> world. If an implementation hasn't been done, you see a gap. That's why
> you would expect functional disparity, not to mention future Hive
> features.
> 3. SparkSQL is far from production ready.
> 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
> features and growing with Hive. Hive's operators are honored without
> reinterpretation. The integration is done at the execution layer, where
> Spark is nothing but an advanced MapReduce engine.
> 5. Hive is aiming at enterprise use cases, where concerns such as
> security matter more than purely whether it works or runs fast. Hive on
> Spark certainly makes queries run faster, but still keeps the same
> enterprise-readiness.
> 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
> needs to run some SQL, or if you're a casual SQL user who likes to try
> something new.
> 7. If you haven't touched either Spark or Hive, I'd suggest you start
> with Hive, especially for an enterprise.
> 8. If you're an existing Hive user and are considering taking advantage
> of Spark, consider Hive on Spark.
> 9. Mixing Hive and SparkSQL in your deployment is strongly discouraged.
> SparkSQL includes a version of Hive, which is very likely different from
> the version of Hive that you have (even if you don't use Hive on Spark).
> Library conflicts can put you in a nightmare.
> 10. I haven't benchmarked SparkSQL myself, but I have heard several
> reports that SparkSQL, when tried at scale, either runs fast or fails
> your queries.
>
> Hope this helps.
>
> Thanks,
>
> On Tue, May 19, 2015 at 10:38 PM, guoqing0629@yahoo.com.hk <
> guoqing0629@yahoo.com.hk> wrote:
>
>> Which is better, Hive on Spark or SparkSQL, and what are the key
>> characteristics, advantages, and disadvantages of each?
>>
>> ------------------------------
>> guoqing0629@yahoo.com.hk
>>
>

--bcaec51969418d2e290516a4fa38
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--bcaec51969418d2e290516a4fa38--