hive-user mailing list archives

From: Alan Gates
Subject: Re: Indexes in Hive
Date: Wed, 06 Jan 2016 18:19:22 GMT
The issue with this is that HDFS lacks the ability to co-locate blocks.  
So if you break your columns into one file per column (the more 
traditional column route) you end up in a situation where 2/3 of the 
time only one of your columns is being locally read, which results in a 
significant performance penalty.  That's why ORC and Parquet and RCFile 
all use one file for their "columnar" stores.
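
To illustrate the locality argument above, here is a toy simulation, offered as a rough sketch rather than a model of HDFS itself. It assumes each file's replicas land on uniformly random distinct nodes (real HDFS placement is rack-aware) and that the task runs on a node holding a replica of the first column. The function name, parameters, and cluster size are all hypothetical.

```python
import random


def local_fraction(num_nodes, num_columns, colocated,
                   replication=3, trials=10_000, seed=0):
    """Estimate the fraction of column reads served from the task's own node.

    Toy model (an assumption, not HDFS's real policy): every file gets
    `replication` replicas on uniformly random distinct nodes.
    """
    rng = random.Random(seed)
    nodes = range(num_nodes)
    local_reads = 0
    for _ in range(trials):
        if colocated:
            # One file holds every column, so all columns share one replica set.
            shared = set(rng.sample(nodes, replication))
            placements = [shared] * num_columns
        else:
            # One file per column: each column gets an independent replica set.
            placements = [set(rng.sample(nodes, replication))
                          for _ in range(num_columns)]
        # Schedule the task on a node that holds the first column locally.
        task_node = rng.choice(sorted(placements[0]))
        local_reads += sum(task_node in p for p in placements)
    return local_reads / (trials * num_columns)


single_file = local_fraction(num_nodes=20, num_columns=3, colocated=True)
file_per_column = local_fraction(num_nodes=20, num_columns=3, colocated=False)
print(f"single file: {single_file:.2f}, file per column: {file_per_column:.2f}")
```

Under these assumptions the single-file layout always reads every column locally, while the file-per-column layout reads most columns remotely. The exact fraction depends on cluster size and replication factor.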


> Mich Talebzadeh
> January 5, 2016 at 22:24
> Hi,
> Thinking out loud.
> Ideally we should consider a totally columnar storage offering in which each
> column of a table is stored as compressed values (I disregard for now how
> ORC actually does this, but obviously it is not exactly columnar storage).
> So each table can be considered a loose federation of columnar stores,
> and each column is effectively an index?
> As columns are far narrower than tables, each index block will have a much
> higher density, and operations like aggregates can be done directly on the
> index rather than the table.
> This type of table offering would be in the true nature of data warehouse
> storage. Of course row operations (get me all rows for this table) will be
> slower, but that is the trade-off we need to consider.
> Expecting users to write their own IndexHandler may be technically
> interesting but is not commercially viable, as Hive needs to be a product on
> its own merit, not a development base. Writing your own storage attributes etc.
> requires skills that will put people off seeing Hive as an attractive
> proposition (requiring considerable investment in skill sets in order to
> maintain Hive).
> Thus my thinking on this is to offer true columnar storage in Hive so that it
> is a proper data warehouse. In addition, the development tools can be made
> available for those interested in tailoring their own specific Hive
> solutions.
> Dr Mich Talebzadeh
> -----Original Message-----
> From: Gopal Vijayaraghavan On Behalf Of Gopal Vijayaraghavan
> Sent: 05 January 2016 23:55
> To:
> Subject: Re: Is Hive Index officially not recommended?
> now?
> The builtin indexes (those that write data as smaller tables) are only
> useful in a pre-columnar world, where the indexes offer a huge reduction in
> IO.
> Part #1 of using Hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
> And then write an IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
> Not coincidentally, we're adding an "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index.
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
> And that for a workhorse data warehouse, this has to survive even if there's
> a non-stop stream of updates into it.
> Cheers,
> Gopal
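
The idea in Gopal's message of mapping a lookup into a range condition and then including or excluding whole partitions off a consolidated index can be sketched conceptually as below. This is a minimal illustration in plain Python and does not use Hive's HiveIndexHandler or IndexPredicateAnalyzer APIs. The names `PartitionStats`, `lookup_to_range`, and `prune_partitions`, and the per-partition min/max layout, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical consolidated index entry: min/max of one column per partition."""
    name: str
    min_value: int
    max_value: int


def lookup_to_range(value):
    """Rewrite a point lookup (col = value) as the closed range [value, value]."""
    return value, value


def prune_partitions(stats, lo, hi):
    """Keep only partitions whose [min, max] interval overlaps [lo, hi]."""
    return [s.name for s in stats if s.max_value >= lo and s.min_value <= hi]


# Illustrative data only: dates encoded as yyyymmdd integers.
consolidated_index = [
    PartitionStats("p2015_q3", 20150701, 20150930),
    PartitionStats("p2015_q4", 20151001, 20151231),
    PartitionStats("p2016_q1", 20160101, 20160331),
]

# A range predicate touches only the partitions it overlaps.
range_hits = prune_partitions(consolidated_index, 20151015, 20151120)

# A point lookup goes through the same path once it is mapped to a range.
point_hits = prune_partitions(consolidated_index, *lookup_to_range(20150815))

print(range_hits, point_hits)
```

The design point is that the pruning decision needs only the small consolidated metadata, so finding the right "slice" does not require opening any data files.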
