From: "Mich Talebzadeh"
To: user@hive.apache.org
Subject: Indexes in Hive
Date: Wed, 6 Jan 2016 06:24:21 -0000

Hi,

Thinking aloud.

Ideally we should consider a totally columnar storage offering, in which each column of a table is stored as compressed values. (I disregard for now how ORC actually does this, but obviously it is not exactly columnar storage.) Each table can then be considered a loose federation of columnar storage, and each column is effectively an index. Because columns are far narrower than tables, each index block will be of much higher density, and operations such as aggregates can be done directly on the index rather than on the table. This type of table offering would be in the true nature of data warehouse storage. Of course, row operations ("get me all rows of this table") will be slower, but that is the trade-off we need to consider.
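As a rough illustration, the nearest thing today is an ORC table with per-column compression. A minimal HiveQL sketch (the table and columns are invented for the example):

    CREATE TABLE trades_orc (
      trade_date  DATE,
      symbol      STRING,
      price       DECIMAL(18,4),
      quantity    BIGINT
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB');

    -- An aggregate like this only has to read the compressed 'symbol'
    -- and 'quantity' column streams, never the full rows.
    SELECT symbol, SUM(quantity)
    FROM trades_orc
    GROUP BY symbol;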
Expecting users to write their own IndexHandler may be technically interesting, but it is commercially not viable: Hive needs to be a product on its own merit, not a development base. Writing your own storage attributes etc. requires skills that will put people off seeing Hive as an attractive proposition (it requires considerable investment in skill sets just to maintain Hive). Thus my thinking is to offer true columnar storage in Hive so that it becomes a proper data warehouse. In addition, the development tools can be made available for those interested in tailoring their own specific Hive solutions.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7, and co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4.

Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?

>So in a nutshell, if "external" indexes in Hive are not used for improving
>query response, what value do they add, and can we forget them for now?

The builtin indexes - those that write data as smaller tables - are only useful in a pre-columnar world, where the indexes offer a huge reduction in IO.

Part #1 of using Hive indexes effectively is to write your own HiveIndexHandler, with usesIndexTable=false, and then write an IndexPredicateAnalyzer, which lets you map arbitrary lookups into other range conditions.

Not coincidentally, we're adding an "ANALYZE TABLE ... CACHE METADATA" statement which consolidates the "internal" index into an external store (HBase). Some of the index data now lives in the HBase metastore, so that the inclusion/exclusion of whole partitions can be done off the consolidated index.

https://issues.apache.org/jira/browse/HIVE-11676

The experience from BI workloads run by customers is that, in general, the lookup to the right "slice" of data is more of a problem than the actual aggregate - and that, for a workhorse data warehouse, this has to survive even a non-stop stream of updates.

Cheers,
Gopal
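For reference, the predicate-decomposition step Gopal describes looks roughly like the following inside a custom handler. This is a minimal sketch against the org.apache.hadoop.hive.ql.index classes as of Hive 1.x - the handler plumbing around it is omitted, the indexed column name is invented, and exact signatures may differ between releases:

    import java.util.List;

    import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
    import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPGreaterThan;

    public class ExamplePredicateDecomposer {

      // Splits the planner's predicate into conditions the index can
      // serve (collected into searchConditions) and a residual
      // expression that Hive must still evaluate itself.
      public static ExprNodeDesc decompose(ExprNodeDesc predicate,
          List<IndexSearchCondition> searchConditions) {
        IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();

        // Comparison operators the index understands ...
        analyzer.addComparisonOp(GenericUDFOPEqual.class.getName());
        analyzer.addComparisonOp(GenericUDFOPGreaterThan.class.getName());

        // ... and the (hypothetical) column it covers.
        analyzer.allowColumnName("trade_date");

        // Whatever the analyzer cannot map to a search condition
        // comes back as the residual predicate.
        return analyzer.analyzePredicate(predicate, searchConditions);
      }
    }

A custom HiveIndexHandler would run something like this at planning time, turn each IndexSearchCondition into a range lookup against its own index structure, and hand the residual predicate back to Hive.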