Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
References: <1892683924.2956880.1456868332338.JavaMail.yahoo@mail.yahoo.com> <316A6CEC-B530-460B-97BF-43F3BF3A738A@gmail.com>
Mime-Version: 1.0 (1.0)
Content-Type: multipart/alternative; boundary=Apple-Mail-2A97F2F7-A0FC-47A9-90F5-EA7843368A10
Content-Transfer-Encoding: 7bit
Cc: Ashok Kumar
X-Mailer: iPhone Mail (11D257)
From: Dayong
Subject: Re: Hive and Impala
Date: Wed, 2 Mar 2016 13:14:27 -0500
To: "user@hive.apache.org"

--Apple-Mail-2A97F2F7-A0FC-47A9-90F5-EA7843368A10
Content-Type: text/plain; charset=us-ascii

Tez is kind of outdated, and ORC is tightly tied to Hive. In addition, the Hive metastore can be decoupled from Hive as well. In reality, we do suffer from Hive's performance, even for ETL jobs. As a result, we'll switch to Impala + Spark/Flink.

Thanks,
Dayong

> On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>
> I forgot: besides LLAP, you are also going to have Hive Hybrid Procedural SQL On Hadoop (HPL/SQL), which is going to add another dimension to Hive.
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>> On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> SQL plays an increasingly important role on Hadoop. As of today Hive, IMO, provides the best and most robust solution for anything resembling a Data Warehouse "solution" on Hadoop, chiefly by means of its powerful metastore, which can be hosted on a variety of mission-critical databases, plus Hive's ever-increasing support for a variety of file types on HDFS, from humble textfile to ORC. The remaining tools are little more than query tools that crucially rely on the Hive Metastore for their needs. Take away the Hive component and they are more or less lame ducks.
>>
>> Hive on MR was perceived to be slow, but what the heck, we are talking about a Data Warehouse here, which for the most part should be batch oriented and not user-facing. In Hive 0.14 and 2.0 you can use Spark and Tez as the execution engine, and if you are well into functional programming, you can deploy Spark on Hive. If you look around, from Impala to Spark, the architecture is essentially a query tool.
>>
>> Dr Mich Talebzadeh
>>
>>> On 2 March 2016 at 13:52, Dayong <willddy@gmail.com> wrote:
>>> As I remember, a few weeks ago in the Hadoop Weekly news feed, Cloudera had a benchmark showing Impala is a little better than Spark SQL and Hive with Tez. You can check that. From my experience, Hive is still the leading tool for regular ETL jobs since it is stable. The other tools are better for ad hoc and interactive query use cases. Cloudera is betting on Impala, especially with its new Kudu project.
>>>
>>> Thanks,
>>> Dayong
>>>
>>>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>>>
>>>> My knocks on Impala (not intended to be a post knocking Impala).
>>>>
>>>> Impala really has not delivered on the complex types that Hive has (after promising it for quite a while); also, it only works with the 'blessed' input formats: Parquet, Avro, text.
>>>>
>>>> It is very annoying to work with Impala. In my version, if you create a partition in Hive, Impala does not see it. You have to run "refresh".
>>>>
>>>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>>>>
>>>> Impala is fast. Many data-analyst / data-scientist types can't wait 10 seconds for a query, so when I need to produce something for them I make sure the data has no complex types and uses a table type that Impala understands.
>>>>
>>>> But for my work I still work primarily in Hive, because I do not want to deal with all the things that Impala does not have/might have, and when I need something special like my own UDFs it is easier to whip up the solution in Hive.
>>>>
>>>> Having worked with M$ SQL Server and Vertica, Impala is on par with them, but I don't think of it like I think of Hive. To me it just feels like a Vertica that I can cheat loading sometimes because it is backed by HDFS.
>>>>
>>>> Hive is something different: I am making pipelines, I am transforming data, doing streaming, writing custom UDFs, querying JSON directly. It's != Impala.
>>>>
>>>> ::random message of the day::
>>>>
>>>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34668@yahoo.com> wrote:
>>>>>
>>>>> Dr Mich,
>>>>>
>>>>> My two cents here.
>>>>>
>>>>> I don't have direct experience of Impala, but in my humble opinion I share your view that Hive provides the best metastore of all Big Data systems. Looking around, almost every product in one form or another uses Hive code somewhere. My colleagues inform me that Hive is one of the most stable Big Data products.
>>>>>
>>>>> With the capabilities of Spark on Hive and Hive on Spark or Tez, plus of course MR, there is really little need for many other products in the same space. It is good to keep things simple.
>>>>>
>>>>> Warmest
>>>>>
>>>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> I have not heard of Impala anymore. I saw an article on LinkedIn titled
>>>>>
>>>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>>>
>>>>> "We can access all objects from Hive data warehouse with HiveQL which leverages the map-reduce architecture in background for data retrieval and transformation and this results in latency."
>>>>>
>>>>> My response was
>>>>>
>>>>> This statement is no longer valid, as you now have a choice of three engines: MR, Spark and Tez. I have not used Impala myself, as I don't think there is a need for it, with Hive on Spark or Spark using the Hive metastore providing whatever is needed. Hive is for Data Warehouse and does what it says on the tin. Please also bear in mind that Hive offers ORC storage files that provide storage index capabilities, further optimizing queries with additional stats at file, stripe and row group levels.
>>>>>
>>>>> Anyway, the question is: with Hive on Spark or Spark using the Hive metastore, what can we not achieve that we can achieve with Impala?
>>>>>
>>>>> Dr Mich Talebzadeh
>

--Apple-Mail-2A97F2F7-A0FC-47A9-90F5-EA7843368A10
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
--Apple-Mail-2A97F2F7-A0FC-47A9-90F5-EA7843368A10--