Date: Wed, 2 Mar 2016 18:26:00 +0000
Subject: Re: Hive and Impala
From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
To: user@hive.apache.org
Cc: Ashok Kumar <ashok34668@yahoo.com>

OK, two questions here please:

  1. Which version of Hive are you running?
  2. Have you tried Hive on Spark, which does both DAG and in-memory calculation? (A configuration sketch follows this list.) A Hive on Spark run reports its job stages like this:

       Query Hive on Spark job[1] stages:
       INFO  : 2
       INFO  : 3
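
For reference, a minimal sketch of what switching to Hive on Spark looks like, assuming a Hive release built with Spark support (Hive 1.1 or later); the database and table names below are made up:

    -- From beeline or the Hive CLI: switch the session's execution engine
    SET hive.execution.engine=spark;   -- default is mr; tez is the other option
    SET spark.master=yarn;             -- let YARN host the Spark executors

    -- Subsequent queries are planned as a Spark DAG rather than chained MR jobs
    SELECT country, COUNT(*) AS txn_count
    FROM   sales_db.transactions       -- hypothetical table
    GROUP BY country;
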
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 2 March 2016 at 18:14, Dayong <willddy@gmail.com> wrote:
Tez is kind of outdated and ORC is so dedicated to Hive. In addition, the Hive metadata store can be decoupled from Hive as well. In reality, we do suffer from Hive's performance even for ETL jobs. As a result, we'll switch to Impala + Spark/Flink.

Thanks,
Dayong

On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

I forgot: besides LLAP, you are going to have Hive Hybrid Procedural SQL On Hadoop (HPL/SQL), which is going to add another dimension to Hive.
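
For a flavour of that dimension, a minimal HPL/SQL sketch (assuming the hplsql tool bundled with Hive 2.0; the procedure and its use are made up):

    CREATE PROCEDURE greet(name STRING)
    BEGIN
      PRINT 'Hello, ' || name || '!';
    END;

    CALL greet('Hive');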


On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
SQL plays an increasingly important role on Hadoop. As of today, Hive IMO provides the best and most robust solution to anything resembling a data warehouse "solution" on Hadoop, chiefly by means of its powerful metastore, which can be hosted on a variety of mission-critical databases, plus Hive's ever-increasing support for a variety of file types on HDFS, from the humble text file to ORC. The remaining tools are little more than query tools that crucially rely on the Hive metastore for their needs. Take away the Hive component and they are more or less lame ducks.
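
A small illustration of that dependency (a sketch; the table is hypothetical): a table declared once through Hive is resolvable by any engine pointed at the same metastore.

    -- Declared once via HiveQL...
    CREATE TABLE web_logs (ts TIMESTAMP, url STRING, status INT)
    STORED AS PARQUET;

    -- ...then queryable unchanged from Spark SQL or Impala, because both
    -- resolve the table definition through the shared Hive metastore.
    SELECT status, COUNT(*) FROM web_logs GROUP BY status;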

Hive on MR was perceived to be slow, but what the heck, we are talking about a data warehouse here, which for the most part should be batch oriented and not user-facing. In Hive 0.14 and 2.0 you can use Spark and Tez as the execution engine, and if you are well into functional programming, you can deploy Spark on Hive. If you look around, everything from Impala to Spark is architecturally essentially a query tool.




On 2 March 2016 at 13:52, D= ayong <willddy@gmail.com> wrote:
As I remember from the Hadoop Weekly news feed a few weeks ago, Cloudera has a benchmark showing Impala is a little better than Spark SQL and Hive with Tez. You can check that. From my experience, Hive is still the leading tool for regular ETL jobs since it is stable. The other tools are better for ad hoc and interactive query use cases. Cloudera bets on Impala, especially with its new Kudu project.

Thanks,
Dayong

On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

My knocks on Impala (not intended to be a post knocking Impala):

Impala really has not delivered on the complex types that Hive has (after promising them for quite a while); also, it only works with the 'blessed' input formats: Parquet, Avro, text.

It is very annoying to work with Impala: in my version, if you create a partition in Hive, Impala does not see it. You have to run "refresh".
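
For anyone hitting the same thing, the round trip looks roughly like this (a sketch; the table and partition are hypothetical):

    -- In Hive: add a new partition
    ALTER TABLE sales ADD PARTITION (ds='2016-03-01');

    -- In impala-shell: Impala's cached catalog does not pick it up until you refresh
    REFRESH sales;                 -- reload metadata for an existing table
    -- INVALIDATE METADATA sales;  -- heavier option, needed for a brand-new table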

In Impala I do not have all the UDFs that Hive has, like percentile, etc.
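
For example, Hive's built-in percentile UDAFs (a sketch; the table and columns are hypothetical):

    SELECT percentile(response_ms, 0.95),            -- exact, integer columns only
           percentile_approx(response_sec, 0.95)     -- approximate, works on doubles
    FROM   web_requests;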

Impala is fast. Many data-analyst / data-scientist types can't wait 10 seconds for a query, so when I need to produce something for them I make sure the data has no complex types and uses a table type that Impala understands.
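
That preparation step can be as simple as a CTAS in Hive that flattens the complex columns into a Parquet table Impala can read (a sketch; the table layout is made up):

    -- events.items is assumed to be an array<struct<sku:string, qty:int>>
    CREATE TABLE events_flat STORED AS PARQUET AS
    SELECT e.event_id,
           item.sku AS item_sku,
           item.qty AS item_qty
    FROM   events e
    LATERAL VIEW explode(e.items) t AS item;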

But for my work I still work primarily in Hive, because I do not want to deal with all the things that Impala does not have (or might have), and when I need something special, like my own UDFs, it is easier to whip up the solution in Hive.

Having worked with M$ SQL Server and Vertica, Impala is on par with them, but I don't think of it like I think of Hive. To me it just feels like a Vertica that I can sometimes cheat at loading because it is backed by HDFS.

Hive is something different: I am making pipelines, transforming data, doing streaming, writing custom UDFs, querying JSON directly. It's just not the same thing as Impala.
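
The JSON point, for instance, needs nothing beyond a built-in UDF (a sketch; the table and JSON layout are hypothetical):

    SELECT get_json_object(raw_json, '$.user.id')    AS user_id,
           get_json_object(raw_json, '$.event.type') AS event_type
    FROM   raw_events;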

::random message of the day::


On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34668@yahoo.com> wrote:

Dr Mitch,

My two cents here.

I don't have direct experience of Impala, but in my humble opinion I share your views that Hive provides the best metastore of all Big Data systems. Looking around, almost every product in one form or shape uses Hive code somewhere. My colleagues inform me that Hive is one of the most stable Big Data products.
With the capabilities of Spark on Hive, and Hive on Spark or Tez, plus of course MR, there is really little need for many other products in the same space. It is good to keep things simple.
Warmest


On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:


I have not heard much about Impala lately. I saw an article on LinkedIn titled

"Apache Hive Or Cloudera Impala? What is Best for me?"

"We can access all objects from Hive data warehouse with HiveQL which leverages the map-reduce architecture in background for data retrieval and transformation and this results in latency."

My response was:

This statement is no longer valid, as you now have a choice of three engines: MR, Spark and Tez. I have not used Impala myself, as I don't think there is a need for it, with Hive on Spark or Spark using the Hive metastore providing whatever is needed. Hive is for data warehousing and does what it says on the tin. Please also bear in mind that Hive offers ORC storage files that provide built-in index capabilities, further optimizing queries with additional stats at file, stripe and row-group level.
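
To make the ORC point concrete, a minimal sketch (the table, columns and properties are chosen for illustration; column bloom filters need Hive 1.2 or later):

    -- ORC keeps min/max statistics at file, stripe and row-group level,
    -- which readers use to skip data during predicate pushdown.
    CREATE TABLE sales_orc (
      customer_id BIGINT,
      amount      DECIMAL(10,2),
      sale_date   DATE
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='SNAPPY',
                   'orc.bloom.filter.columns'='customer_id');

    SET hive.optimize.index.filter=true;   -- use ORC indexes / row-group stats for filtering
    SELECT SUM(amount) FROM sales_orc WHERE customer_id = 42;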

Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore, what can we not achieve that we can achieve with Impala?






