hive-user mailing list archives

From Vadim Dedkov <dedko...@gmail.com>
Subject Re: Hive indexes without improvement of performance
Date Thu, 16 Jun 2016 22:00:14 GMT
Ok, thank you. I tried Hive with Tez for my index problem, but saw no
performance improvement.
On 17 June 2016 at 0:22, "Mich Talebzadeh" <
mich.talebzadeh@gmail.com> wrote:

>
> Well I guess I have to agree to differ on this with Jorn as before.
>
> Vadim,
>
> Please go ahead and try what Jorn suggests. Report back if you see any
> improvement.
>
> Couple of points if I may:
>
> Using Hive on Tez is not going to improve the optimiser's performance. That
> is just the execution engine, and BTW I would rather use Hive on Spark. Both
> Tez and Spark will be a better fit than the usual map-reduce engine.
>
> Actually my suggestion would be to use Hive as the storage layer only and
> use Spark as the query tool. In that case you don't need to worry about
> indexes etc. in Hive. Spark, with its DAG and in-memory computing, will do
> a much better job.
>
> So
>
>
>    1. Use Hive with its metadata to store data on HDFS
>    2. Use Spark SQL to query that Data. Orders of magnitude faster.
>
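> For example, assuming Spark is built with Hive support and points at the
> same Hive metastore, the very query from this thread runs unchanged from
> the spark-sql shell (names taken from Vadim's mail):
>
> *spark-sql> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>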
>
> However, I am all for you trying what Jorn suggested.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 22:02, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> The indexes are based on the HDFS blocksize, which is usually around 128
>> MB. This means that to hit a single row you must always load the full
>> block. In traditional databases the blocksize is much smaller, so such
>> lookups are much faster. If the optimizer does not pick up the index, then
>> you can query the index directly (it is just a table!). Keep in mind that
>> you should also use an adequate storage format for the index, such as ORC
>> or Parquet.
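>> For instance, the compact index is materialized as an ordinary table whose
>> generated name follows Hive's default <db>__<table>_<index>__ pattern
>> (verify the actual name with SHOW TABLES before relying on it); it maps
>> each indexed value to HDFS file names and offsets:
>>
>> *SHOW TABLES IN my_schema_name;*
>> *SELECT `_bucketname`, `_offsets` FROM my_schema_name.my_schema_name__doc_t_doc_id_idx__ WHERE id = '3723445235879';*
>>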
>>
>> You should not use the traditional indexes; instead use Hive+Tez and the
>> ORC format with storage indexes and bloom filters (i.e. Hive 1.2 minimum).
>> It is of key importance that you insert the data sorted on the columns
>> that you use in the where clause. You should compress the table with
>> Snappy. Additionally, partitions make sense. Finally, please use the right
>> data types: storage indexes work best with ints etc.; for text fields you
>> can try bloom filters.
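>> A minimal sketch of that setup, assuming for illustration a single-column
>> table named doc_t_orc (the TBLPROPERTIES keys are standard ORC options;
>> adjust the column list to the real schema):
>>
>> *CREATE TABLE my_schema_name.doc_t_orc (id STRING) STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY', 'orc.bloom.filter.columns'='id', 'orc.create.index'='true');*
>> *INSERT OVERWRITE TABLE my_schema_name.doc_t_orc SELECT id FROM my_schema_name.doc_t SORT BY id;*
>>
>> SORT BY here keeps the data sorted within each output file, which is what
>> the ORC min/max storage index needs to skip stripes.
>>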
>>
>> That being said, even in other relational databases such as Oracle
>> Exadata, the use of traditional indexes is discouraged for warehouse
>> scenarios; storage indexes and columnar formats, including compression,
>> bring the most performance.
>>
>> On 16 Jun 2016, at 22:50, Vadim Dedkov <dedkovva@gmail.com> wrote:
>>
>> Hello!
>>
>> I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.
>>
>> My index creation:
>> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
>> WITH DEFERRED REBUILD;*
>> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>>
>> Then I set configs:
>> *set hive.optimize.autoindex=true;*
>> *set hive.optimize.index.filter=true;*
>> *set hive.optimize.index.filter.compact.minsize=0;*
>> *set hive.index.compact.query.max.size=-1;*
>> *set hive.index.compact.query.max.entries=-1; *
>>
>> And my query is:
>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>
>> Sometimes I see a performance improvement, but in most cases I don't.
>>
>> In the cases where I see an improvement:
>> 1. my query
>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>> gives me a NullPointerException (in the logs I see that Hive doesn't find
>> my index table)
>> 2. then I write:
>> *USE my_schema_name;*
>> *select count(*) from doc_t WHERE id = '3723445235879';*
>> and get the result with the improvement
>> (172 sec)
>>
>> In the cases where I don't see an improvement, I can run either
>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>> without an exception, or
>> *USE my_schema_name;*
>> *select count(*) from doc_t WHERE id = '3723445235879';*
>> and get the result
>> (1153 sec)
>>
>> My table is about 6 billion rows.
>> I tried various combinations of the index configs, including only these two:
>> *set hive.optimize.index.filter=true;*
>> *set hive.optimize.index.filter.compact.minsize=0;*
>> My hadoop version is 2.6.0-cdh5.5.0
>>
>> What am I doing wrong?
>>
>> Thank you.
>>
>> --
>> Best regards,
>> Vadim Dedkov.
>>
>>
>
