Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Reply-To: user@hive.apache.org
MIME-Version: 1.0
References: <1892683924.2956880.1456868332338.JavaMail.yahoo@mail.yahoo.com> <316A6CEC-B530-460B-97BF-43F3BF3A738A@gmail.com>
Date: Wed, 2 Mar 2016 15:35:35 +0000
Subject: Re: Hive and Impala
From: Mich Talebzadeh
To: user@hive.apache.org
Cc: Ashok Kumar
Content-Type: multipart/alternative; boundary=001a11440dfa1a62c3052d12a2c3

--001a11440dfa1a62c3052d12a2c3
Content-Type: text/plain;
charset=UTF-8

I forgot: besides LLAP you are going to have Hive Hybrid Procedural SQL On Hadoop (HPL/SQL), which is going to add another dimension to Hive.

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 2 March 2016 at 15:30, Mich Talebzadeh wrote:

> SQL plays an increasingly important role on Hadoop. As of today Hive, IMO, provides the best and most robust solution for anything resembling a Data Warehouse "solution" on Hadoop, chiefly by means of its powerful metastore, which can be hosted on a variety of mission-critical databases, plus Hive's ever-increasing support for a variety of file types on HDFS, from the humble textfile to ORC. The remaining tools are little more than query tools that crucially rely on the Hive Metastore for their needs. Take away the Hive component and they are more or less lame ducks.
>
> Hive on MR was perceived to be slow, but what the heck, we are talking about a Data Warehouse here, which for the most part should be batch oriented and not user-facing. In Hive 0.14 and 2.0 you can use Spark and Tez as the execution engine, and if you are well into functional programming you can deploy Spark on Hive. If you look around, from Impala to Spark the architecture is essentially a query tool.
>
> Dr Mich Talebzadeh
>
> On 2 March 2016 at 13:52, Dayong wrote:
>
>> As I remember from the Hadoop weekly news feed a few weeks ago, Cloudera has a benchmark showing Impala is a little better than Spark SQL and Hive with Tez. You can check that. From my experience, Hive is still the leading tool for regular ETL jobs since it is stable. The other tools are better for ad hoc and interactive query use cases. Cloudera bets on Impala, especially with its new Kudu project.
>>
>> Thanks,
>> Dayong
>>
>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo wrote:
>>
>> My knocks on Impala. (Not intended to be a post knocking Impala.)
>>
>> Impala really has not delivered on the complex types that Hive has (after promising them for quite a while); also, it only works with the 'blessed' input formats: Parquet, Avro, text.
>>
>> It is very annoying to work with Impala. In my version, if you create a partition in Hive, Impala does not see it. You have to run "refresh".
>>
>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>>
>> Impala is fast. Many data-analyst / data-scientist types can't wait 10 seconds for a query, so when I need to produce something for them I make sure the data has no complex types and uses a table type that Impala understands.
>>
>> But for my work I still work primarily in Hive, because I do not want to deal with all the things that Impala does not have/might have, and when I need something special like my own UDFs it is easier to whip up the solution in Hive.
>>
>> Having worked with M$ SQL Server and Vertica, Impala is on par with them, but I don't think of it like I think of Hive. To me it just feels like a Vertica that I can cheat loading sometimes because it is backed by HDFS.
>>
>> Hive is something different: I am making pipelines, I am transforming data, doing streaming, writing custom UDFs, querying JSON directly. It's != Impala.
>>
>> ::random message of the day::
>>
>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote:
>>
>>> Dr Mich,
>>>
>>> My two cents here.
>>>
>>> I don't have direct experience of Impala, but in my humble opinion I share your view that Hive provides the best metastore of all Big Data systems. Looking around, almost every product in one form or another uses Hive code somewhere. My colleagues inform me that Hive is one of the most stable Big Data products.
>>>
>>> With the capabilities of Spark on Hive and Hive on Spark or Tez, plus of course MR, there is really little need for many other products in the same space. It is good to keep things simple.
>>>
>>> Warmest
>>>
>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>
>>> I have not heard much about Impala lately. I saw an article on LinkedIn titled
>>>
>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>
>>> "We can access all objects from Hive data warehouse with HiveQL which leverages the map-reduce architecture in background for data retrieval and transformation and this results in latency."
>>>
>>> My response was
>>>
>>> This statement is no longer valid, as you now have a choice of three engines: MR, Spark and Tez. I have not used Impala myself, as I don't think there is a need for it with Hive on Spark, or Spark using the Hive metastore, providing whatever is needed. Hive is for Data Warehousing and does what it says on the tin. Please also bear in mind that Hive offers ORC storage files that provide storage index capabilities, further optimizing queries with additional stats at file, stripe and row group levels.
>>>
>>> Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore, what can we not achieve that we can achieve with Impala?
>>>
>>> Dr Mich Talebzadeh
>>>
>>
>
--001a11440dfa1a62c3052d12a2c3
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11440dfa1a62c3052d12a2c3--