Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A181817C13 for ; Tue, 16 Feb 2016 07:52:53 +0000 (UTC) Received: (qmail 85634 invoked by uid 500); 16 Feb 2016 07:52:52 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 85567 invoked by uid 500); 16 Feb 2016 07:52:52 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 85557 invoked by uid 99); 16 Feb 2016 07:52:52 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Feb 2016 07:52:52 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id A3ED41804DA for ; Tue, 16 Feb 2016 07:52:51 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 5.619 X-Spam-Level: ***** X-Spam-Status: No, score=5.619 tagged_above=-999 required=6.31 tests=[HTML_FONT_LOW_CONTRAST=0.001, HTML_MESSAGE=2, KAM_COUK=1.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_BL=0.01, RCVD_IN_MSPIKE_L3=2.499, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id pAmLfaH5kmZH for ; Tue, 16 Feb 2016 07:52:48 +0000 (UTC) Received: from p3nlsmtpcp01-03.prod.phx3.gdg (p3nlsmtpcp01-03.prod.phx3.secureserver.net [184.168.200.142]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 061845FAF3 for ; Tue, 16 Feb 2016 07:52:46 +0000 (UTC) Received: from p3plcpnl0356.prod.phx3.secureserver.net ([50.62.161.123]) by : HOSTING RELAY : with SMTP id VaR5a29roNw4PVaR5adp61; Tue, 16 Feb 2016 00:52:39 -0700 Received: from localhost ([127.0.0.1]:53987 helo=p3plcpnl0356.prod.phx3.secureserver.net) by p3plcpnl0356.prod.phx3.secureserver.net with esmtpa (Exim 4.85) (envelope-from ) id 1aVaR4-000749-Vy; Tue, 16 Feb 2016 00:52:39 -0700 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_39391bfc9bc35cc95c49411ca6ccbc64" Date: Tue, 16 Feb 2016 07:52:33 +0000 From: Mich Talebzadeh To: user@hive.apache.org Cc: =?UTF-8?Q?=E4=B8=87=E4=BF=AE=E8=BF=9C?= Subject: Re: The index for query in hive 1.2.1 does not work. Organization: Cloud Technology Partners Ltd In-Reply-To: <201602161117543179006@sina.com> References: <20160215170154333300106@sina.com>, <201602161044144700403@sina.com> <201602161117543179006@sina.com> Message-ID: <841d1fc3eb16abb3ec896f7a7d0f696f@cloudtechnologypartners.co.uk> X-Sender: mich.talebzadeh@cloudtechnologypartners.co.uk User-Agent: Roundcube Webmail/1.0.6 X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - p3plcpnl0356.prod.phx3.secureserver.net X-AntiAbuse: Original Domain - hive.apache.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - cloudtechnologypartners.co.uk X-Get-Message-Sender-Via: p3plcpnl0356.prod.phx3.secureserver.net: authenticated_id: mich.talebzadeh@cloudtechnologypartners.co.uk X-Source: X-Source-Args: X-Source-Dir: X-CMAE-Envelope: MS4wfPxVFTv+heCY8v10uaACDgkz1S78NNETfObPClre/M2hKuG7MzaJRwTqcGNDXzq9K0I7oOp28GO7j6W6GPvDrsSrL+gvptFey3br/KTd2uLNhDat6eXG TUJoHFlByjVEQxrbT/NFVfrVcRiQ/nMY5m2KfZXVUdgizjhGC/uPG9YBUyTAg5YoIX4V/qgme2xCmCINW7w0yrG2sq+VUnGzyzIkUuidyu/G26a7zixGhhUe ii+9nNsU3046HizWxzzT+A== --=_39391bfc9bc35cc95c49411ca6ccbc64 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi, "Traditional" Indexes are not currently used in Hive. You can create them but they are not used by the optimizer. You can create storage indexes in Hive using ORC file format that provides three levels of granularity * ORC File itself * Multiple stripes within the ORC file * Multiple row groups (row batches) within each stripe Effectively: * Chunks of data making up ORC file stored as storage index. _Storage index_ is the term used for the combined Index and statistics. * Each Storage Index has statistics of min, max, count, and sum for each column in the grouping of rows in batches of 10,000 called _row group_. Row group both _has row data_ and _index data_ * Crucially, it needs the location of the start of each row group, so that the query could jump straight to the beginning of the row group so narrowing down the search path. * The query should perform a SARG pushdown that limits which rows are required for the query and can avoid reading an entire file, or at least sections of the file which is by and large what a conventional RDBMS B-tree index does. * Support for new ACID features in Hive (insert, update and delete). HTH. Mich On 16/02/2016 03:17, 万修远 wrote: > Hello, > > WHEN I USE INDEX IN HIVE 1.2.1, I FIND THE INDEX DOES NOT WORK. THE DETAILS ARE AS FOLLOWS: > > 1. After using index, the query speed does not improve. If I use manual use of indexes, the query speed improve obviously, but when switch to automatic use of indexes, the speed makes no difference relative to not use index. > > 2. After rebuild index, I add a new text file which includes one record matching my query filter in the table directory. Then, the query results will show the record included in the new text file. (The case that append new record in the same file but in different block is the same.) > > 3.When debug the hive source code I find that the function generateIndexQuery of class CompactIndexHandler is't called. Finally I find that the function compile in class TaskCompiler returns early at the follow statements: > if (pCtx.getFetchTask() != null) { > return; > }this will result in index not working for query. But I do't know why to set FetchTask because I know little about hive. > > -------------------------------------------------------------------------------------------------------- > SO, MY QUESTION IS :1. Does hive 1.2.1 support index normally? IF it supports index completely, what's my issue?2. I want to know how indexes are used to optimize queries, where can I find some references? > > -------------------------------------------------------------------------------------------------------- > APPENDIX: HOW DO I USE INDEX IN HIVE 1.2.1 > > 1.create table and load data: > > create table table01( id int, name string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY 't'; > load data local inpath '/home/hadoop/data/dual.txt' overwrite into table table01; > > 2.create and rebuild index: > > create index table01_index on table table01(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild; > alter index table01_index on table01 rebuild; > > 3.set properties: > > set hive.optimize.index.filter.compact.minsize=0; > set hive.optimize.index.filter.compact.maxsize=-1; > set hive.index.compact.query.max.size=-1; > set hive.index.compact.query.max.entries=-1; > set Hive.optimize.index.groupby=false; > set hive.optimize.index.filter=true; > set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; > > 4.execute query statement: > > select * from table01 where id =500000; > > Thanks! > ------------------------- > > Jason -- Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Cloud Technology partners Ltd, its subsidiaries nor their employees accept any responsibility. --=_39391bfc9bc35cc95c49411ca6ccbc64 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8

Hi,

 

"Traditional" Indexes are not currently used in Hive. You can create the= m but they are not used by the optimizer.

You can create storage indexes in Hive using ORC file format that provid= es three levels of granularity

  1. ORC File itself
  2. Multiple stripes within the ORC file
  3. Multiple row groups (row batches) within each stripe

Effectively:

  • Chunks of data making up ORC file stored as storage index. Storage = index is the term used for the combined Index and statistics.
  • Each Storage Index has statistics of min, max, count, and sum for each = column in the grouping of rows in batches of 10,000 called row group. Row group both has row data and index data
  • Crucially, it needs the location of the start of each row group, so tha= t the query could jump straight to the beginning of the row group so narrow= ing down the search path.
  • The query should perform a SARG pushdown that limits which rows are req= uired for the query and can avoid reading an entire file, or at least secti= ons of the file which is by and large what a conventional RDBMS B-tree inde= x does.
  • Support for new ACID features in Hive (insert, update and delete).

 

HTH.

 

Mich

 

On 16/02/2016 03:17, =E4=B8=87=E4=BF=AE=E8=BF=9C wrote:

Hello,
 
When I use index in hive 1.2.1, I find the index does not work= =2E  The details are as follows:
 
1. After using index, the query speed does not improve.  If I use= manual use of indexes, the query speed improve obviously, but when switch = to automatic use of indexes, the speed makes no difference relative to not = use index.
 
2. After rebuild index, I add a new text file which includes one recor= d matching my query filter in the table directory. Then,  the query re= sults will show the record included in the new text file. (The case that ap= pend new record in the same file but in different block is the same.)
 
3.When debug the hive source code I find that the function  generateIndexQuery of class Comp= actIndexHandler is= 't called. Finally I find that the function compile in class TaskCompiler = ;returns early at the follow statements:
if = (pCtx.getFetchTask() !=3D null) {
return;
}
this will result in in= dex not working for query. But I do't know why to set FetchTask because I k= now little about hive.
----------------------------------------------------------------------=
----------------------------------
= So, My question is :1. Does hive 1.2.1 support index normally? IF= it supports index completely, what's my issue?2. I want to know  how indexes are used to optimize queries, where can I fi= nd some references?
----------------------------------------------------------------------=
----------------------------------
= Appendix: How do I use index in hive 1.2.1
1.create table and load data:
cr=
eate table table01( id int, name string) =
; 
ROW FORMAT DELIMITED  
FIELDS TE= RMINATED BY '\t';  
load data local = ;inpath '/home/hadoop/data/dual.txt' overwrite into tab= le table01;
2=
=2Ecreate and rebuild index:
cr=
eate index table01_index on table table01(id) =
;as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' =
;with deferred rebuild;
alter index table01_index&= nbsp;on table01 rebuild;
3=
=2Eset properties:
se=
t hive.optimize.index.filter.compact.minsize=3D0;
set hive= =2Eoptimize.index.filter.compact.maxsize=3D-1;
set hive.index.com= pact.query.max.size=3D-1;
set hive.index.compact.query.max.entrie= s=3D-1;
set Hive.optimize.index.groupby=3Dfalse;
set hi= ve.optimize.index.filter=3Dtrue;
set hive.input.format=3Dorg.apac= he.hadoop.hive.ql.io.HiveInputFormat;
4=
=2Eexecute query statement:
se=
lect * from table01 where id =3D500000;
Thanks!

Jason

 

 

--
Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6z=
P6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This m=
essage is for the designated recipient only, if you are not the intended re=
cipient, you should destroy it immediately. Any information in this message=
 shall not be understood as given or endorsed by Cloud Technology Partners =
Ltd, its subsidiaries or their employees, unless expressly so stated. It is=
 the responsibility of the recipient to ensure that this email is virus fre=
e, therefore neither Cloud Technology partners Ltd, its subsidiaries nor th=
eir employees accept any responsibility.

--=_39391bfc9bc35cc95c49411ca6ccbc64--