Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id BB84618FC3 for ; Wed, 6 Jan 2016 07:49:46 +0000 (UTC) Received: (qmail 59233 invoked by uid 500); 6 Jan 2016 07:49:45 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 59153 invoked by uid 500); 6 Jan 2016 07:49:44 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 59143 invoked by uid 99); 6 Jan 2016 07:49:44 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2016 07:49:44 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 7761618048D for ; Wed, 6 Jan 2016 07:49:44 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.099 X-Spam-Level: X-Spam-Status: No, score=-0.099 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id Jnxm702RbDbt for ; Wed, 6 Jan 2016 07:49:31 +0000 (UTC) Received: from mail-wm0-f43.google.com (mail-wm0-f43.google.com [74.125.82.43]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id C3A29439DB for ; Wed, 6 Jan 2016 07:49:30 +0000 (UTC) Received: by mail-wm0-f43.google.com with SMTP id b14so63330005wmb.1 for ; Tue, 05 Jan 2016 23:49:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:content-type:content-transfer-encoding:mime-version:subject :message-id:date:references:in-reply-to:to; bh=LhGiWMa/ndL9OmD2HiaOh1lUQIlP5G0UuwMrVkLjxoA=; b=xJfVnhUX0+7c7ksOzq9ZqmrUBJRVKWnrVtmig2ZrlukeFccsJlT5DujZ5Qpm3H0v8Z OGFXhimEJV3+R1j2LcvFfVvAgcGIXX4D3Xh4Fc0tBHuwwdn3hmps+aa6SOCLYZyPK6tQ xYSEFeqcSmQcz+i5i5277nrF/3WP36mZ8uLDI5oiw9yG2fpOIEgvCUjWJrud1iyTrZcJ VuiQHJUJxhQMPiA9iOHPG515mm1NImznVB5CgVFTVV5Kmwwme+15Pteqgv8jj1/1v0IN +4mGTLfnZ3Uo5wO/f/5EMDWnXXdkDkNviWjaOVfxnHJV7gX+oGpaUKwvgnfuTP0TiOja X6fA== X-Received: by 10.194.116.97 with SMTP id jv1mr105754969wjb.38.1452066564620; Tue, 05 Jan 2016 23:49:24 -0800 (PST) Received: from [10.219.217.132] ([176.4.103.217]) by smtp.gmail.com with ESMTPSA id av8sm19967022wjc.13.2016.01.05.23.49.17 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 05 Jan 2016 23:49:23 -0800 (PST) From: =?utf-8?Q?J=C3=B6rn_Franke?= Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (1.0) Subject: Re: Indexes in Hive Message-Id: <3943EE66-ACE8-4D0B-8C26-48ED44ADC838@gmail.com> Date: Wed, 6 Jan 2016 08:49:00 +0100 References: <020301d1484a$e1ab8870$a5029950$@peridale.co.uk> In-Reply-To: <020301d1484a$e1ab8870$a5029950$@peridale.co.uk> To: user@hive.apache.org X-Mailer: iPhone Mail (13C75) If I understand you correctly this could be just another Hive storage format= . > On 06 Jan 2016, at 07:24, Mich Talebzadeh wrote: >=20 > Hi, >=20 > Thinking loudly. >=20 > Ideally we should consider a totally columnar storage offering in which ea= ch > column of table is stored as compressed value (I disregard for now how > actually ORC does this but obviously it is not exactly a columnar storage)= . >=20 > So each table can be considered as a loose federation of columnar storage > and each column is effectively an index? >=20 > As columns are far narrower than tables, each index block will be very > higher density and all operations like aggregates can be done directly on > index rather than table.=20 >=20 > This type of table offering will be in true nature of data warehouse > storage. Of course row operations (get me all rows for this table) will be= > slower but that is the trade-off that we need to consider. >=20 > Expecting users to write their own IndexHandler may be technically > interesting but commercially not viable as Hive needs to be a product on i= ts > own merit not a development base. Writing your own storage attributes etc.= > requires skills that will put off people seeing Hive as an attractive > proposition (requiring considerable investment in skill sets in order to > maintain Hive). >=20 > Thus my thinking on this is to offer true columnar storage in Hive to be a= > proper data warehouse. In addition, the development tools cab ne made > available for those interested in tailoring their own specific Hive > solutions. >=20 >=20 > HTH >=20 >=20 >=20 > Dr Mich Talebzadeh >=20 > LinkedIn > https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCdO= ABUr > V8Pw >=20 > Sybase ASE 15 Gold Medal Award 2008 > A Winning Strategy: Running the most Critical Financial Data on ASE 15 > http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-09190= 8. > pdf > Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15"= , > ISBN 978-0-9563693-0-7.=20 > co-author "Sybase Transact SQL Guidelines Best Practices", ISBN > 978-0-9759693-0-4 > Publications due shortly: > Complex Event Processing in Heterogeneous Environments, ISBN: > 978-0-9563693-3-8 > Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume= > one out shortly >=20 > http://talebzadehmich.wordpress.com >=20 > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this > message shall not be understood as given or endorsed by Peridale Technolog= y > Ltd, its subsidiaries or their employees, unless expressly so stated. It i= s > the responsibility of the recipient to ensure that this email is virus fre= e, > therefore neither Peridale Ltd, its subsidiaries nor their employees accep= t > any responsibility. >=20 >=20 > -----Original Message----- > From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gop= al > Vijayaraghavan > Sent: 05 January 2016 23:55 > To: user@hive.apache.org > Subject: Re: Is Hive Index officially not recommended? >=20 >=20 >> So in a nutshell in Hive if "external" indexes are not used for=20 >> improving query response, what value they add and can we forget them for > now? >=20 > The builtin indexes - those that write data as smaller tables are only > useful in a pre-columnar world, where the indexes offer a huge reduction i= n > IO. >=20 > Part #1 of using hive indexes effectively is to write your own > HiveIndexHandler, with usesIndexTable=3Dfalse; >=20 > And then write a IndexPredicateAnalyzer, which lets you map arbitrary > lookups into other range conditions. >=20 > Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA" > which consolidates the "internal" index into an external store (HBase). >=20 > Some of the index data now lives in the HBase metastore, so that the > inclusion/exclusion of whole partitions can be done off the consolidated > index.=20 >=20 > https://issues.apache.org/jira/browse/HIVE-11676 >=20 >=20 > The experience from BI workloads run by customers is that in general, the > lookup to the right "slice" of data is more of a problem than the actual > aggregate. >=20 > And that for a workhorse data warehouse, this has to survive even if there= 's > a non-stop stream of updates into it. >=20 > Cheers, > Gopal >=20