hive-user mailing list archives

From Dayong <will...@gmail.com>
Subject Re: Hive and Impala
Date Wed, 02 Mar 2016 18:14:27 GMT
Tez is somewhat outdated, and ORC is tightly tied to Hive. In addition, the Hive metastore can
be decoupled from Hive as well. In reality, we do suffer from Hive's performance even for
ETL jobs. As a result, we'll switch to Impala + Spark/Flink.

Thanks,
Dayong

> On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> I forgot: besides LLAP, you are going to have Hive Hybrid Procedural SQL On Hadoop (HPL/SQL),
which is going to add another dimension to Hive.
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> SQL plays an increasingly important role on Hadoop. As of today, Hive IMO provides the
best and most robust solution to anything resembling a Data Warehouse "solution" on Hadoop,
chiefly by means of its powerful metastore, which can be hosted on a variety of mission-critical
databases, plus Hive's ever-increasing support for a variety of file types on HDFS, from humble
textfile to ORC. The remaining tools are little more than query tools that crucially rely
on the Hive Metastore for their needs. Take away the Hive component and they are more or less lame
ducks.
>> 
>> Hive on MR was perceived to be slow, but what the heck, we are talking about a
Data Warehouse here, which for the most part should be batch oriented and not user-facing.
In Hive 0.14 and 2.0 you can use Spark and Tez as the execution engine, and if you
are well into functional programming, you can deploy Spark on Hive. If you look around, from
Impala to Spark, the architecture is essentially a query tool.
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 2 March 2016 at 13:52, Dayong <willddy@gmail.com> wrote:
>>> As I remember, a few weeks ago in the Hadoop weekly news feed, Cloudera had a
benchmark showing Impala is a little better than Spark SQL and Hive with Tez. You can check
that. From my experience, Hive is still the leading tool for regular ETL jobs since it is stable.
The other tools are better for ad hoc and interactive query use cases. Cloudera is betting on Impala,
especially with its new Kudu project.
>>> 
>>> Thanks,
>>> Dayong
>>> 
>>>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:
>>>> 
>>>> My knocks on Impala (not intended to be a post knocking Impala).
>>>> 
>>>> Impala really has not delivered on the complex types that hive has (after
promising it for quite a while), also it only works with the 'blessed' input formats, parquet,
avro, text.
>>>> 
>>>> It is very annoying to work with Impala. In my version, if you create a partition
in Hive, Impala does not see it. You have to run "refresh".
>>>> 
>>>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>>>> 
>>>> Impala is fast. Many data-analyst / data-scientist types can't wait
10 seconds for a query, so when I need to produce something for them I make sure the data
has no complex types and uses a table type that Impala understands.
>>>> 
>>>> But for my work I still work primarily in hive, because I do not want to
deal with all the things that impala does not have/might have/ and when I need something special
like my own UDFs it is easier to whip up the solution in hive. 
>>>> 
>>>> Having worked with M$ SQL Server and Vertica, Impala is on par with them,
but I don't think of it like I think of Hive. To me it just feels like a Vertica that I can
cheat loading sometimes because it is backed by HDFS.
>>>> 
>>>> Hive is something different. I am making pipelines, I am transforming data,
doing streaming, writing custom UDFs, querying JSON directly. It's != Impala.
>>>> 
>>>> ::random message of the day::
>>>> 
>>>> 
>>>>  
>>>> 
>>>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34668@yahoo.com>
wrote:
>>>>> 
>>>>> Dr Mich,
>>>>> 
>>>>> My two cents here.
>>>>> 
>>>>> I don't have direct experience of Impala, but in my humble opinion I share
your view that Hive provides the best metastore of all Big Data systems. Looking around, almost
every product in one form or another uses Hive code somewhere. My colleagues inform me that
Hive is one of the most stable Big Data products.
>>>>> 
>>>>> With the capabilities of Spark on Hive and Hive on Spark or Tez plus
of course MR, there is really little need for many other products in the same space. It is
good to keep things simple.
>>>>> 
>>>>> Warmest
>>>>> 
>>>>> 
>>>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>>>> 
>>>>> 
>>>>> I have not heard much about Impala lately. I saw an article on LinkedIn titled
>>>>> 
>>>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>>> 
>>>>> "We can access all objects from Hive data warehouse with HiveQL which
leverages the map-reduce architecture in background for data retrieval and transformation
and this results in latency."
>>>>> 
>>>>> My response was
>>>>> 
>>>>> This statement is no longer valid, as you now have a choice of three engines:
MR, Spark and Tez. I have not used Impala myself, as I don't think there is a need
for it with Hive on Spark, or Spark using the Hive metastore, providing whatever is needed. Hive is
for Data Warehouse and provides what it says on the tin. Please also bear in mind that Hive
offers ORC storage files that provide storage index capabilities, further optimizing queries
with additional stats at file, stripe and row group levels.
>>>>> 
>>>>> Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore,
what can we not achieve that we can achieve with Impala?
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
> 
