Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
In-Reply-To: <CAKf8W7cm275mkfjSQuQXk=TZ98obe2HwETmso-DSfdvvg-2egQ@mail.gmail.com>
References: <CAKf8W7dUnYWtTKsbsrc6rneq_YCgT5D2hCL-o9Jbh-iBG_-ttQ@mail.gmail.com>
	<174C4970-3B08-44FB-9B8B-775527E73527@gmail.com>
	<CAKf8W7cm275mkfjSQuQXk=TZ98obe2HwETmso-DSfdvvg-2egQ@mail.gmail.com>
Date: Fri, 17 Jun 2016 00:55:35 +0300
Message-ID: <CAKf8W7f51zrGXTY9rqvmLuzJpzVY0fChOGrbF8hMfad3bWBMdQ@mail.gmail.com>
Subject: Re: Hive indexes without improvement of performance
From: Vadim Dedkov <dedkovva@gmail.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=001a114064ee39622205356c4c93
archived-at: Thu, 16 Jun 2016 21:55:47 -0000

--001a114064ee39622205356c4c93
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Let me clarify. I can get the result for count(*) with the help of the index
table, but I can't work out how to get the result of a select * query with the
help of the index table.
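
For reference, the manual route with a compact index is usually a two-step
lookup, sketched below. This is an untested sketch, not a confirmed fix: it
assumes the index table got the default name
my_schema_name__doc_t_doc_id_idx__ (SHOW FORMATTED INDEX ON doc_t FROM
my_schema_name should show the real one), and /tmp/doc_id_idx_result is a
made-up scratch directory.

```sql
-- Step 1: ask the index table which files and offsets contain the id.
-- The index table name is an assumption (default naming scheme).
INSERT OVERWRITE DIRECTORY '/tmp/doc_id_idx_result'
SELECT `_bucketname`, `_offsets`
FROM my_schema_name.my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';

-- Step 2: tell Hive to read only the blocks listed in that result.
SET hive.index.compact.file=/tmp/doc_id_idx_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

-- Step 3: run the original query against the base table.
SELECT * FROM my_schema_name.doc_t WHERE id = '3723445235879';

-- Step 4: restore the input format (assumed here to be the default one).
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```

Step 3 still reads every block that the index lists for the id, so the gain
depends on how few blocks contain it.
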
On 17 June 2016 at 0:50, "Vadim Dedkov" <dedkovva@gmail.com> wrote:

> >> If the optimizer does not pick up the index then you can query the index
> >> directly
> Could you explain how I can do this for a query like
>
> select * from my_schema_name.doc_t WHERE id = '3723445235879';
>
> ?
> Thank you
> On 17 June 2016 at 0:03, "Jörn Franke" <jornfranke@gmail.com> wrote:
>
>> The indexes are based on the HDFS block size, which is usually around 128 MB.
>> This means that to hit a single row you must always load the full block. In
>> traditional databases the block size is much smaller, so such lookups are
>> much faster. If the optimizer does not pick up the index then you can query
>> the index directly (it is just a table!). Keep in mind that you should also
>> use an adequate storage format for the index, such as ORC or Parquet.
>>
>> You should not use the traditional indexes, but use Hive+Tez and the ORC
>> format with storage indexes and bloom filters (i.e. at least Hive 1.2). It
>> is of key importance that you insert the data sorted on the columns that you
>> use in the where clause. You should compress the table with Snappy.
>> Additionally, partitions make sense. Finally, please use the right data
>> types. Storage indexes work best with ints etc.; for text fields you can try
>> bloom filters.
>>
>> That being said, in other relational databases such as Oracle Exadata the
>> use of traditional indexes is also discouraged for warehouse scenarios;
>> storage indexes and columnar formats with compression bring the most
>> performance.
>>
>> On 16 Jun 2016, at 22:50, Vadim Dedkov <dedkovva@gmail.com> wrote:
>>
>> Hello!
>>
>> I use Hive 1.1.0-cdh5.5.0 and am trying to use its index support.
>>
>> My index creation:
>> CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS 'COMPACT'
>> WITH DEFERRED REBUILD;
>> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>>
>> Then I set these configs:
>> set hive.optimize.autoindex=true;
>> set hive.optimize.index.filter=true;
>> set hive.optimize.index.filter.compact.minsize=0;
>> set hive.index.compact.query.max.size=-1;
>> set hive.index.compact.query.max.entries=-1;
>>
>> And my query is:
>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>
>> Sometimes performance improves, but in most cases it does not.
>>
>> In the cases where I get an improvement:
>> 1. my query
>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>> gives me a NullPointerException (in the logs I see that Hive doesn't find my
>> index table)
>> 2. then I write:
>> USE my_schema_name;
>> select count(*) from doc_t WHERE id = '3723445235879';
>> and get the result with an improvement
>> (172 sec)
>>
>> In the cases where I don't get an improvement, I can use either
>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>> without an exception, or
>> USE my_schema_name;
>> select count(*) from doc_t WHERE id = '3723445235879';
>> and get the result
>> (1153 sec)
>>
>> My table has about 6 billion rows.
>> I tried various combinations of index configs, including only these two:
>> set hive.optimize.index.filter=true;
>> set hive.optimize.index.filter.compact.minsize=0;
>> My Hadoop version is 2.6.0-cdh5.5.0
>>
>> What am I doing wrong?
>>
>> Thank you.
>>
>> --
>> _______________             _______________
>> Best regards,               С уважением
>> Vadim Dedkov.               Вадим Дедков.
>>
>>
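
The ORC advice in the quoted reply could look roughly like the sketch below.
It is only an illustration under assumptions: doc_t_orc is a hypothetical
table name, id is taken to be the only filter column, and the bloom filter
property needs at least Hive 1.2 (so not the 1.1.0-cdh5.5.0 used here).

```sql
-- Copy the data into an ORC table, sorted on the filter column so that the
-- min/max storage indexes can skip most stripes and row groups.
CREATE TABLE my_schema_name.doc_t_orc        -- hypothetical table name
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',                 -- Snappy compression
  'orc.create.index' = 'true',               -- min/max storage indexes
  'orc.bloom.filter.columns' = 'id'          -- bloom filter on the text id
)
AS
SELECT * FROM my_schema_name.doc_t
SORT BY id;

-- Let Hive push the predicate down to the ORC reader.
SET hive.optimize.index.filter=true;

SELECT * FROM my_schema_name.doc_t_orc WHERE id = '3723445235879';
```

SORT BY orders rows within each reducer's output file only, which is usually
enough for the per-file min/max statistics to be selective.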


--001a114064ee39622205356c4c93--