hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject Hive optimiser seems not to use for a covered query
Date Fri, 27 Mar 2015 08:33:39 GMT
Hi,

 

I am very new to hive optimiser

 

Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In this table
object_id column is unique. Oracle table has primary key constraint on object_id column which
is basically a unique B-tree index.

 

I do a very simple query to see how many unique values are for object_id in table. The answer
is they are as many as number of rows.

 

So query like below in Oracle

 

SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;

 

Should return 1. Now Oracle optimiser only needs to read the index key and work it out WITHOUT
touching the underlying table and it does that

 

----------------------------------------------------------

Plan hash value: 1988751498

 

-----------------------------------------------------------------------------------

| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |

-----------------------------------------------------------------------------------

|   0 | SELECT STATEMENT       |          |     1 |    26 | 13952   (1)| 00:02:48 |

|   1 |  SORT AGGREGATE        |          |     1 |    26 |            |          |

|   2 |   VIEW                 | VW_DAG_0 |  4000K|    99M| 13952   (1)| 00:02:48 |

|   3 |    SORT GROUP BY NOSORT|          |  4000K|    22M| 13952   (1)| 00:02:48 |

|   4 |     INDEX FULL SCAN    | TDASH_PK |  4000K|    22M| 13952   (1)| 00:02:48 |

-----------------------------------------------------------------------------------

 

Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the table itself
is not touched as expected

 

Now I have the same table “tdash” in Hive with a compact index on object_id. I have analysed
stats for table with “analyze table tdash compute statistics”. Now I do explain as below

 

hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1

 

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: tdash

            Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column
stats: NONE

            Select Operator

              expressions: object_id (type: double)

              outputColumnNames: object_id

              Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column
stats: NONE

              Group By Operator

                aggregations: count(DISTINCT object_id), count(object_id)

                keys: object_id (type: double)

                mode: hash

                outputColumnNames: _col0, _col1, _col2

                Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE
Column stats: NONE

                Reduce Output Operator

                  key expressions: _col0 (type: double)

                  sort order: +

                  Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE
Column stats: NONE

                  value expressions: _col2 (type: bigint)

      Reduce Operator Tree:

        Group By Operator

          aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)

          mode: mergepartial

          outputColumnNames: _col0, _col1

          Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE

          Select Operator

            expressions: (_col0 / _col1) (type: double)

            outputColumnNames: _col0

            Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE

            File Output Operator

              compressed: false

              Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink

 

Time taken: 0.691 seconds, Fetched: 50 row(s)

 

Trying to understand above does keys: object_id (type: double) refers to use of index here?
I dropped that index and the same plan was produced! How Hive optimiser flag the index usage
in the plan. Do I need to update column statists as well as table

 

Many thanks

 

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only, if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Ltd,
its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd,
its subsidiaries nor their employees accept any responsibility.

 

From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com] 
Sent: 26 March 2015 17:27
To: user@hive.apache.org
Subject: Understanding Hive's execution plan

 

Hi,

Can anyone direct me to a good explanation on understanding Hive's execution plan?

 

Thanks,

Daniel


Mime
View raw message