hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasanth Jayachandran <pjayachand...@hortonworks.com>
Subject Re: Query Using Stats
Date Fri, 16 May 2014 23:52:57 GMT
Bryan,

The flag you are looking for is hive.compute.query.using.stats. By default this optimization
is disabled. You might need to enable it to use it. Also the min/max/sum metadata are not
looked up from the file but instead from metastore. Although file formats like ORC contains
stats, they are not used to answer metadata only queries. Hive considers metastore as the
only source of truth for answering such queries. You can look at this jira for further details
https://issues.apache.org/jira/browse/HIVE-5483

Thanks
Prasanth Jayachandran

On May 16, 2014, at 5:35 AM, Bryan Jeffrey <bryan.jeffrey@gmail.com> wrote:

> All,
> 
> I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
> 
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n root --hiveconf
hive.compute.query.using.stats=true -e "select min(seconds), max(seconds), range from data
where range > 1400204700 group by range"
> 
> 'range' above is our partition.  I would expect that this would provide a reasonably
fast response time by simply looking at the metadata for each file in a given partition (maybe
one mapper per range).  Instead we're seeing 140+ mappers, and the query takes a long time.
> 
> Here is the explain plan:
> 
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n root --hiveconf
hive.compute.query.using.stats=true -e "explain select min(seconds), max(seconds), range from
data where range > 1400204700 group by range"
> scan complete in 4ms
> Connecting to jdbc:hive2://server:10002/database
> Connected to: Apache Hive (version 0.13.0)
> Driver: Hive JDBC (version 0.13.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                       Explain                     
                                 |
> +---------------------------------------------------------------------------------------------------------------------+
> | STAGE DEPENDENCIES:                                                               
                                 |
> |   Stage-1 is a root stage                                                         
                                 |
> |   Stage-0 is a root stage                                                         
                                 |
> |                                                                                   
                                 |
> | STAGE PLANS:                                                                      
                                 |
> |   Stage: Stage-1                                                                  
                                 |
> |     Map Reduce                                                                    
                                 |
> |       Map Operator Tree:                                                          
                                 |
> |           TableScan                                                               
                                 |
> |             alias: data                                                           
                                |
> |             Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL
Column stats: NONE         |
> |             Select Operator                                                       
                                 |
> |               expressions: range (type: int), seconds (type: bigint)              
                        |
> |               outputColumnNames: range, seconds                                   
                        |
> |               Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats:
PARTIAL Column stats: NONE       |
> |               Group By Operator                                                   
                                 |
> |                 aggregations: min(seconds), max(end_time_seconds)                 
                        |
> |                 keys: range (type: int)                                           
                                 |
> |                 mode: hash                                                        
                                 |
> |                 outputColumnNames: _col0, _col1, _col2                            
                                 |
> |                 Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats:
PARTIAL Column stats: NONE     |
> |                 Reduce Output Operator                                            
                                 |
> |                   key expressions: _col0 (type: int)                              
                                 |
> |                   sort order: +                                                   
                                 |
> |                   Map-reduce partition columns: _col0 (type: int)                 
                                 |
> |                   Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats:
COMPLETE Column stats: NONE  |
> |                   value expressions: _col1 (type: bigint), _col2 (type: bigint)   
                                 |
> |       Reduce Operator Tree:                                                       
                                 |
> |         Group By Operator                                                         
                                 |
> |           aggregations: min(VALUE._col0), max(VALUE._col1)                        
                                 |
> |           keys: KEY._col0 (type: int)                                             
                                 |
> |           mode: mergepartial                                                      
                                 |
> |           outputColumnNames: _col0, _col1, _col2                                  
                                 |
> |           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE
Column stats: NONE          |
> |           Select Operator                                                         
                                 |
> |             expressions: _col1 (type: bigint), _col2 (type: bigint), _col0 (type: int)
                             |
> |             outputColumnNames: _col0, _col1, _col2                                
                                 |
> |             Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE
Column stats: NONE        |
> |             File Output Operator                                                  
                                 |
> |               compressed: false                                                   
                                 |
> |               Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats:
COMPLETE Column stats: NONE      |
> |               table:                                                              
                                 |
> |                   input format: org.apache.hadoop.mapred.TextInputFormat          
                                 |
> |                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        |
> |                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
                                 |
> |                                                                                   
                                 |
> |   Stage: Stage-0                                                                  
                                 |
> |     Fetch Operator                                                                
                                 |
> |       limit: -1                                                                   
                                 |
> |                                                                                   
                                 |
> +---------------------------------------------------------------------------------------------------------------------+
> 50 rows selected (0.417 seconds)
> Beeline version 0.13.0 by Apache Hive
> Closing: 0: jdbc:hive2://viper:10002/intrepid
> 
> Can anyone enlighten me as to how this could be optimized?
> 
> Regards,
> 
> Bryan Jeffrey
> 
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Mime
View raw message