hive-user mailing list archives

From Mich Talebzadeh <>
Subject Re: Hive footprint
Date Mon, 25 Apr 2016 22:39:09 GMT
Hi Naveen,

Thank you for your detailed explanation.

Please allow me to explain my points, if I may.

I think a viable big data stack will (again, this is my view) combine Spark
with Hive, HDFS and YARN as the winning combination. Hadoop encompasses HDFS,
and it is almost impossible to sidestep it without finding a viable
alternative for persistent storage. YARN is the bedrock for resource
management, Spark is a great query tool (including Spark Streaming), and Hive
is the real Data Warehouse in the big data space, providing the metadata for
all the tools.

You will forgive me for setting aside Impala, as I don't hear much about it
anymore (please feel free to agree to differ). My prime interest is to see
Hive improved as it should be, i.e. into a proper Data Warehouse with a
proper indexing strategy. I am not really convinced by ORC storage indexes:
in my experience they have not contributed to the Hive CBO (cost-based
optimizer) as expected. They provide some improvement over what is already
available (stats-wise), but unless you bucket your table (i.e. have an
effective numeric column with high cardinality that can be used to hash
partition the table), you cannot make effective use of the storage index.
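
To illustrate why data layout matters for a min/max storage index, here is a
minimal sketch in Python. It is a toy model, not ORC's actual reader: the
function names, the stripe size of 1,000 rows and the synthetic keys are all
assumptions made for illustration, and sorting stands in for any layout that
keeps similar key values together. It counts how many stripes a reader would
still have to open for an equality predicate when only per-stripe min/max
statistics are available.

```python
def stripe_stats(values, stripe_size):
    """Split values into fixed-size stripes and record each stripe's (min, max)."""
    stats = []
    for start in range(0, len(values), stripe_size):
        stripe = values[start:start + stripe_size]
        stats.append((min(stripe), max(stripe)))
    return stats


def stripes_to_read(values, stripe_size, wanted):
    """Count stripes whose [min, max] range could contain `wanted`."""
    return sum(lo <= wanted <= hi for lo, hi in stripe_stats(values, stripe_size))


# Hypothetical data: 10,000 distinct keys written in a scattered order.
scattered = [(i * 7919) % 10_000 for i in range(10_000)]
# The same keys written so that similar values land in the same stripe.
clustered = sorted(scattered)

print("scattered layout:", stripes_to_read(scattered, 1_000, 4242), "stripes read")
print("clustered layout:", stripes_to_read(clustered, 1_000, 4242), "stripes read")
```

With the clustered layout only the single stripe covering the key has to be
read. With the scattered layout each stripe's min/max range spans most of the
key space, so the statistics rule out far fewer stripes.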

Now back to Hive and its external indexes. Currently the infrastructure is
there but not the functionality. I don’t know what it would take to make
the CBO take Hive indexes into account. We should aim to consolidate the
Hadoop ecosystem by investing in the existing tools rather than fragmenting
it further. There seems to be little effort in this area, for reasons of
which I may not be aware. However, I am more than happy to contribute.

Kind regards,


Dr Mich Talebzadeh

On 25 April 2016 at 19:28, Naveen Gangam <> wrote:

> Hi Mich,
> I am a developer at Cloudera and contribute to Apache Hive.
> Hive and MPP query engine projects like Impala have settled into their
> respective positions so there is less confusion between these projects.
> For example, across Cloudera's customer base the majority of customers use
> Impala to enable them to perform BI and SQL analytics directly on Hadoop.
> Most Impala users are using Hive for the data preparation of the data sets
> they're serving up via Impala. As such Impala typically competes with
> traditional analytic databases where customers decide between:
>     * Using Hadoop and Hive for data processing that feeds into another
> database or BI layer for the analytics
>     * Unified architecture where they directly serve some sets of BI and
> analytics from Hadoop via Impala while typically using Hive, Spark,
> MapReduce, etc for their data preparation
> You can see nearly all Hadoop distributions provide users with Hive for
> core data processing plus an MPP query engine for BI and SQL analytics like
> Impala, Drill, BigSQL, etc. Even Facebook, which created and still heavily
> uses Hive, also uses Presto internally as its MPP query engine for BI.
> For more details you can see Cloudera's SQL-on-Hadoop webinar that talks
> about when to use Hive, Impala, and Spark (SQL).
> Support for local variables and stored procedures in Hive is included in
> HPL/SQL module of Hive 2.0. However, this is an experimental feature. We
> will evaluate it for production-readiness before including it in CDH Hive.
> Finally, HBase is typically not the best storage manager for migrations
> from commercial DWs to Big Data. Most commercial DW migrations use HDFS
> rather than HBase as the storage manager.
> Hope this helps.
> Thank you
> Naveen
> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh wrote:
>> Hi,
>> I notice that Impala is rarely mentioned these days. I may be missing
>> something. However, I gather it is coming to an end now, as I don't recall
>> many use cases for it (or customers asking for it). In contrast, Hive has
>> held its ground with the addition of Spark and Tez as execution engines,
>> support for ACID and ORC, and the new features in Hive 2. In addition,
>> provided a good choice is made for its metastore, it scales well.
>> If Hive had native (organic) support for local variables and stored
>> procedures, it would be a top-notch Data Warehouse. Given its
>> metastore, I don't see any technical reason why it cannot support these
>> constructs.
>> I was recently asked to comment on migration from commercial DWs to Big
>> Data (primarily for TCO reasons) and really could not recall any better
>> candidate than Hive. Is HBase a viable alternative? Obviously, whatever one
>> decides, there is still HDFS, a good engine for Hive (it sounds like many
>> prefer Tez, although I am a Spark fan) and the ubiquitous YARN.
>> Let me know your thoughts.
>> Dr Mich Talebzadeh