Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 32A72200B2D for ; Thu, 16 Jun 2016 23:55:47 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 3124D160A51; Thu, 16 Jun 2016 21:55:47 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 2C6471602C5 for ; Thu, 16 Jun 2016 23:55:46 +0200 (CEST) Received: (qmail 70751 invoked by uid 500); 16 Jun 2016 21:55:44 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 70740 invoked by uid 99); 16 Jun 2016 21:55:44 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Jun 2016 21:55:44 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 73FA6C0E03 for ; Thu, 16 Jun 2016 21:55:44 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.198 X-Spam-Level: * X-Spam-Status: No, score=1.198 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx2-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id F_Be7FP6f1Vy for ; Thu, 16 Jun 2016 21:55:42 +0000 (UTC) Received: from mail-qk0-f172.google.com (mail-qk0-f172.google.com [209.85.220.172]) by mx2-lw-eu.apache.org (ASF Mail Server at mx2-lw-eu.apache.org) with ESMTPS id D2AA65F1B3 for ; Thu, 16 Jun 2016 21:55:41 +0000 (UTC) Received: by mail-qk0-f172.google.com with SMTP id c73so67701869qkg.2 for ; Thu, 16 Jun 2016 14:55:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to; bh=+LCuzbzSnLXVcAHBlH5C4VeQf8irRS+e9MDxXc0WoAQ=; b=EmZeI+qSXafiZ43WbMLNRW4nW+6Rzyn+0qLcAb3tHBps8ndRI98Ze/ePVd+yQOWL5a KymG6hhWiYKwhrptaA9sQC3sZT/DDIjcyT2Mjxy7Y8bPp0X7xtSwXC4PD5noaj2JEKP0 yooIxQdmrnsNukgb4P+7Xdt6Vsf5cdVUOv2ES6jhR7Qk5hUA3J82MA01isPW1Di4GZiB Z950rxsQxxYc8wzMLjohiMegYmFdv/bzUaiT1HlpR8d0oSmC6nKe3bBsMTnSBK0WuXxd xHhoj75MjemSfmW0oaiWiw5a1LmAHbM4KiKP6gYETH2imQxJWu1CcTq0Di4K8VeXS28t UL7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to; bh=+LCuzbzSnLXVcAHBlH5C4VeQf8irRS+e9MDxXc0WoAQ=; b=kdWiLXgem0gfmyA0GPjhy/AztLLU1iKI5QsvjbsBYOu5ec5w/shuG8Y1YzpVB9PFxA XvOTzjFgD0WmkkfPZ1QDDSNBwRCyrqDUPYIYXDki8JOnuA3az21imPKOptc8wNDKPiuL qMMBexJxZBNkhc5FgVr2O55tz13/9twmO+8bbUC9tEH6aNFKG3NEKJNpK8CWUtwSKA85 dhuhPqkaVBKZb71wdWdDDFDziw4GmiIg/tWPp+JiN14uDsEzOw3dHG5tVMELjNpSKnLP dOpHzi7O4Avy41A8tJ7mTsvzw6i11JJlajzGsRFh5XmSQGtpPG1+18nd5qqadO0x0oNY XXkA== X-Gm-Message-State: ALyK8tLEuumGIsBA88i7XiZmxUPBacEcZY4+4+00IbSQWi3raCzhKhWJEm0RtgOfdpeTen/U7pZePbAgp2NEjg== MIME-Version: 1.0 X-Received: by 10.200.38.244 with SMTP id 49mr7839310qtp.32.1466114135120; Thu, 16 Jun 2016 14:55:35 -0700 (PDT) Received: by 10.233.239.216 with HTTP; Thu, 16 Jun 2016 14:55:35 -0700 (PDT) Received: by 10.233.239.216 with HTTP; Thu, 16 Jun 2016 14:55:35 -0700 (PDT) In-Reply-To: References: <174C4970-3B08-44FB-9B8B-775527E73527@gmail.com> Date: Fri, 17 Jun 2016 00:55:35 +0300 Message-ID: Subject: Re: Hive indexes without improvement of performance From: Vadim Dedkov To: user@hive.apache.org Content-Type: multipart/alternative; boundary=001a114064ee39622205356c4c93 archived-at: Thu, 16 Jun 2016 21:55:47 -0000 --001a114064ee39622205356c4c93 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable I explain. I can get result for count(*) with index table help, but I can't realise how I can get result for *-select with index table help 17 =D0=B8=D1=8E=D0=BD=D1=8F 2016 =D0=B3. 0:50 =D0=BF=D0=BE=D0=BB=D1=8C=D0= =B7=D0=BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C "Vadim Dedkov" =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB: > >>If the optimizer does not pick up the index then you can query the inde= x > directly > Could you explain me, how I can do this for query like > > *select * from my_schema_name.doc_t WHERE id =3D '3723445235879';* > > ? > Thank you > 17 =D0=B8=D1=8E=D0=BD=D1=8F 2016 =D0=B3. 0:03 =D0=BF=D0=BE=D0=BB=D1=8C=D0= =B7=D0=BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C "J=C3=B6rn Franke" > =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB: > >> The indexes are based on HDFS blocksize, which is usually around 128 mb. >> This means for hitting a single row you must always load the full block.= In >> traditional databases this blocksize it is much faster. If the optimizer >> does not pick up the index then you can query the index directly (it is >> just a table!). Keep in mind that you should use for the index also an >> adequate storage format, such as Orc or parquet. >> >> You should not use the traditional indexes, but use Hive+Tez and the Orc >> format with storage indexes and bloom filters (i.e. Min Hive 1.2). It is= of >> key importance that you insert the data sorted on the columns that you u= se >> in the where clause. You should compress the table with snappy. >> Additionally partitions make sense. Finally please use the right data ty= pes >> . Storage indexes work best with ints etc. for text fields you can try >> bloom filters. >> >> That being said, also in other relational databases such as Oracle >> Exadata, the use of traditional indexes is discouraged for warehouse >> scenarios, but storage indexes and columnar formats including compressio= n >> will bring the most performance. >> >> On 16 Jun 2016, at 22:50, Vadim Dedkov wrote: >> >> Hello! >> >> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support. >> >> My index creation: >> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT' >> WITH DEFERRED REBUILD;* >> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;* >> >> Then I set configs: >> *set hive.optimize.autoindex=3Dtrue;* >> *set hive.optimize.index.filter=3Dtrue;* >> *set hive.optimize.index.filter.compact.minsize=3D0;* >> *set hive.index.compact.query.max.size=3D-1;* >> *set hive.index.compact.query.max.entries=3D-1; * >> >> And my query is: >> *select count(*) from my_schema_name.doc_t WHERE id =3D '3723445235879';= * >> >> Sometimes I have improvement of performance, but most of cases - not. >> >> In cases when I have improvement: >> 1. my query is >> *select count(*) from my_schema_name.doc_t WHERE id =3D '3723445235879';= * >> give me NullPointerException (in logs I see that Hive doesn't find my >> index table) >> 2. then I write: >> *USE my_schema_name;* >> *select count(*) from doc_t WHERE id =3D '3723445235879';* >> and have result with improvement >> (172 sec) >> >> In case when I don't have improvement, I can use either >> *select count(*) from my_schema_name.doc_t WHERE id =3D '3723445235879';= * >> without exception, either >> *USE my_schema_name;* >> *select count(*) from doc_t WHERE id =3D '3723445235879';* >> and have result >> (1153 sec) >> >> My table is about 6 billion rows. >> I tried various combinations on index configs, including only these two: >> *set hive.optimize.index.filter=3Dtrue;* >> *set hive.optimize.index.filter.compact.minsize=3D0;* >> My hadoop version is 2.6.0-cdh5.5.0 >> >> What I do wrong? >> >> Thank you. >> >> -- >> _______________ _______________ >> Best regards, =D0=A1 =D1=83=D0=B2=D0=B0=D0=B6=D0=B5= =D0=BD=D0=B8=D0=B5=D0=BC >> Vadim Dedkov. =D0=92=D0=B0=D0=B4=D0=B8=D0=BC =D0=94=D0= =B5=D0=B4=D0=BA=D0=BE=D0=B2. >> >> --001a114064ee39622205356c4c93 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

I explain. I can get result for count(*) with index table he= lp, but I can't realise how I can get result for *-select with index ta= ble help

17 =D0=B8=D1=8E=D0=BD=D1=8F 2016 =D0=B3. 0:50 = =D0=BF=D0=BE=D0=BB=D1=8C=D0=B7=D0=BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C &q= uot;Vadim Dedkov" <dedkovva@g= mail.com> =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB:

>>If the opti= mizer does not pick up the index then you can query the index directly
Could you explain me, how I can do this for query like

select * from my_schema_name.doc_t WHERE id = =3D '3723445235879';

?
Thank you

17 =D0=B8=D1=8E=D0=BD=D1=8F 2016 =D0=B3. 0:03 = =D0=BF=D0=BE=D0=BB=D1=8C=D0=B7=D0=BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C &q= uot;J=C3=B6rn Franke" <jornfranke@gmail.com> =D0=BD=D0=B0=D0=BF=D0=B8=D1=81= =D0=B0=D0=BB:
The indexes are based on HDFS blocksize, which is usually = around 128 mb. This means for hitting a single row you must always load the= full block. In traditional databases this blocksize it is much faster. If = the optimizer does not pick up the index then you can query the index direc= tly (it is just a table!). Keep in mind that you should use for the index a= lso an adequate storage format, such as Orc or parquet.

You should not use the traditional indexes, but use Hive+Tez and the = Orc format with storage indexes and bloom filters (i.e. Min Hive 1.2). It i= s of key importance that you insert the data sorted on the columns that you= use in the where clause. You should compress the table with snappy. Additi= onally partitions make sense. Finally please use the right data types . Sto= rage indexes work best with ints etc. for text fields you can try bloom fil= ters.

That being said, also in other relational da= tabases such as Oracle Exadata, the use of traditional indexes is discourag= ed for warehouse scenarios, but storage indexes and columnar formats includ= ing compression will bring the most performance.

On 16 Jun 20= 16, at 22:50, Vadim Dedkov <dedkovva@gmail.com> wrote:

Hello!

I use = Hive 1.1.0-cdh5.5.0 and try to use indexes support.
=C2=A0
<= div>My index creation:
CREATE INDEX doc_id_idx on TABLE my_sch= ema_name.doc_t (id) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;

Then I set configs:
set hive.optimize.autoinde= x=3Dtrue;
set hive.optimize.index.filter=3Dtrue;
=
set hive.optimize.index.filter.compact.minsize=3D0;
<= i>set hive.index.compact.query.max.size=3D-1;
set hive.ind= ex.compact.query.max.entries=3D-1;=C2=A0

And m= y query is:
select count(*) from my_schema_name.doc_t WHERE id= =3D '3723445235879';

Sometimes I have= improvement of performance, but most of cases - not.

<= div>In cases when I have improvement:
1. my query is
select count(*) from my_schema_name.doc_t WHERE id =3D '3723445235879&= #39;;
give me NullPointerException (in logs I see that Hive d= oesn't find my index table)
2. then I write:
USE= my_schema_name;
select count(*) from doc_t WHERE id =3D &= #39;3723445235879';
and have result with improvement
(172 sec)

In case when I don't have impr= ovement, I can use either
select count(*) from my_schema_name.= doc_t WHERE id =3D '3723445235879';
without exception= , either
USE my_schema_name;
select count(*) = from doc_t WHERE id =3D '3723445235879';
and have res= ult
(1153 sec)

My table is about 6 billi= on rows.
I tried various combinations on index configs, including= only these two:=C2=A0
set hive.optimize.index.filter=3Dtrue;<= /i>
set hive.optimize.index.filter.compact.minsize=3D0;
My hadoop version is 2.6.0-cdh5.5.0

What I= do wrong?

Thank you.

--
<= div data-smartmail=3D"gmail_signature">
______________= _ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 _______________
B= est regards, =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0=D0=A1 =D1=83=D0=B2=D0=B0=D0=B6=D0=B5=D0=BD=D0=B8=D0=B5=D0=BC
Va= dim Dedkov. =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =D0=92=D0=B0=D0=B4=D0=B8=D0=BC =D0=94=D0=B5=D0=B4=D0=BA=D0=BE=D0=B2.
<= /div>
--001a114064ee39622205356c4c93--