hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Is Hive Index officially not recommended?
Date Tue, 05 Jan 2016 20:29:34 GMT
Hi,

 

You point below:

 

The "traditional" indexes can still make sense for data not in Orc or parquet format.

 

Kindly consider below please

 

A traditional index in an RDBMs is normally a B-tree index with a value for that column and
pointer (Row ID)to the row in the data block that keeps the data.

 

 

In RRDBMS I create a unique index on column OBJECT_ID on table ‘t’ below and do a simple
query that can be covered by the index without touching the base table

 

1> select count(1) from t where OBJECT_ID < 100

2> go

 

QUERY PLAN FOR STATEMENT 1 (at line 1).

 

 

    STEP 1

        The type of query is EXECUTE.

        Executing a newly cached statement (SSQL_ID = 312036659).

 

Total estimated I/O cost for statement 1 (at line 1): 0.

 

 

QUERY PLAN FOR STATEMENT 1 (at line 0).

 

 

    STEP 1

        The type of query is DECLARE.

 

Total estimated I/O cost for statement 1 (at line 0): 0.

 

 

QUERY PLAN FOR STATEMENT 2 (at line 1).

Optimized using Parallel Mode

 

 

    STEP 1

        The type of query is SELECT.

 

        3 operator(s) under root

 

       |ROOT:EMIT Operator (VA = 3)

       |

       |   |SCALAR AGGREGATE Operator (VA = 2)

       |   |  Evaluate Ungrouped COUNT AGGREGATE.

       |   |

       |   |   |RESTRICT Operator (VA = 1)(3)(0)(0)(0)(0)

       |   |   |

       |   |   |   |SCAN Operator (VA = 0)

       |   |   |   |  FROM TABLE

       |   |   |   |  t

       |   |   |   |  Using Clustered Index.

       |   |   |   |  Index : t_ui

       |   |   |   |  Forward Scan.

       |   |   |   |  Positioning by key.

       |   |   |   |  Index contains all needed columns. Base table will not be read.

       |   |   |   |  Keys are:

       |   |   |   |    OBJECT_ID ASC

       |   |   |   |  Using I/O Size 64 Kbytes for index leaf pages.

       |   |   |   |  With LRU Buffer Replacement Strategy for index leaf pages.

 

 

Total estimated I/O cost for statement 2 (at line 1): 322792.

 

 

 

OK so no base table is touched

 

Let us do similar thing by creating an index on OBJECT_ID in that ‘t’ table imported from
the said table and creaed in Hive

 

 

create index t_ui on table t (object_id) as 'COMPACT' WITH DEFERRED REBUILD;

alter index t_ui on t rebuild;

analyze table t compute statistics;

 

 

I am going to run the same query in Hive. However, I only see a table scan below and no mention
of that index. May be I am missing something here?

 

0: jdbc:hive2://rhes564:10010/default> explain select count(1) from t where OBJECT_ID <
100;

+------------------------------------------------------------------------------------------------------------------+--+

|                                                     Explain                            
                         |

+------------------------------------------------------------------------------------------------------------------+--+

| STAGE DEPENDENCIES:                                                                    
                         |

|   Stage-1 is a root stage                                                              
                         |

|   Stage-0 depends on stages: Stage-1                                                   
                         |

|                                                                                        
                         |

| STAGE PLANS:                                                                           
                         |

|   Stage: Stage-1                                                                       
                         |

|     Spark                                                                              
                         |

|       Edges:                                                                           
                         |

|         Reducer 2 <- Map 1 (GROUP, 1)                                               
                            |

|       DagName: hduser_20160105203204_8d987e9a-415a-476a-8bad-b9a5010e36bf:54           
                         |

|       Vertices:                                                                        
                         |

|         Map 1                                                                          
                         |

|             Map Operator Tree:                                                         
                         |

|                 TableScan                                                              
                         |

|                   alias: t                                                             
                         |

|                   Statistics: Num rows: 2074897 Data size: 64438212 Basic stats: COMPLETE
Column stats: NONE     |

|                   Filter Operator                                                      
                         |

|                     predicate: (object_id < 100) (type: boolean)                    
                            |

|                     Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE
Column stats: NONE    |

|                     Select Operator                                                    
                         |

|                       Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE
Column stats: NONE  |

|                       Group By Operator                                                
                         |

|                         aggregations: count(1)                                         
                         |

|                         mode: hash                                                     
                         |

|                         outputColumnNames: _col0                                       
                         |

|                         Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE            |

|                         Reduce Output Operator                                         
                         |

|                           sort order:                                                  
                         |

|                           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE          |

|                           value expressions: _col0 (type: bigint)                      
                         |

|         Reducer 2                                                                      
                         |

|             Reduce Operator Tree:                                                      
                         |

|               Group By Operator                                                        
                         |

|                 aggregations: count(VALUE._col0)                                       
                         |

|                 mode: mergepartial                                                     
                         |

|                 outputColumnNames: _col0                                               
                         |

|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE                    |

|                 File Output Operator                                                   
                         |

|                   compressed: false                                                    
                         |

|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE                  |

|                   table:                                                               
                         |

|                       input format: org.apache.hadoop.mapred.TextInputFormat           
                         |

|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                 |

|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        
                         |

|                                                                                        
                         |

|   Stage: Stage-0                                                                       
                         |

|     Fetch Operator                                                                     
                         |

|       limit: -1                                                                        
                         |

|       Processor Tree:                                                                  
                         |

|         ListSink                                                                       
                         |

|                                                                                        
                         |

+------------------------------------------------------------------------------------------------------------------+--+

 

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.coHi, <https://www.linkedin.coHi,%0d%0dYour%20pointm/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>


 <https://www.linkedin.coHi,%0d%0dYour%20pointm/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
 

Your pointm/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.coHi,%0d%0dYour%20pointm/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>


 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.


co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Jörn Franke [mailto:jornfranke@gmail.com] 
Sent: 05 January 2016 19:59
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?

 

Btw this is not Hive specific, but also for other relational database systems, such as Oracle
Exadata.


On 05 Jan 2016, at 20:57, Jörn Franke <jornfranke@gmail.com <mailto:jornfranke@gmail.com>
> wrote:

You can still use execution Engine mr for maintaining the index. Indeed with the ORC or parquet
format there are min/max indexes and bloom filters, but you need to sort your data appropriately
to benefit from performance. Alternatively you can create redundant tables sorted in different
order.

The "traditional" indexes can still make sense for data not in Orc or parquet format.

Keep in mind that for warehouse scenarios there are many other optimization methods in Hive.






Mime
View raw message