hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gunther Hagleitner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16937) INFORMATION_SCHEMA usability: everything is currently a string
Date Fri, 23 Jun 2017 18:53:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061376#comment-16061376
] 

Gunther Hagleitner commented on HIVE-16937:
-------------------------------------------

Interesting. Turns out that the JDBC handler was happily ignoring all types, returning string
for everything. [~jdere] you looked at the handler before, could you review the fix? cc [~thejas].
With the patch you get this:

{noformat}POSTHOOK: query: describe sys.tab_col_stats                                    
                                                                                         
                                  
POSTHOOK: type: DESCTABLE                                                                
                                                                                         
                        
POSTHOOK: Input: sys@tab_col_stats                                                       
                                                                                         
                        
cs_id                 bigint                from deserializer                            
                                                                                         
                        
db_name               string                from deserializer                            
                                                                                         
                        
table_name            string                from deserializer                            
                                                                                         
                        
column_name           string                from deserializer                            
                                                                                         
                        
column_type           string                from deserializer                            
                                                                                         
                        
tbl_id                bigint                from deserializer                            
                                                                                         
                        
long_low_value        bigint                from deserializer                            
                                                                                         
                        
long_high_value       bigint                from deserializer                            
                                                                                         
                        
double_high_value     double                from deserializer                            
                                                                                         
                        
double_low_value      double                from deserializer                            
                                                                                         
                        
big_decimal_low_value string                from deserializer                            
                                                                                         
                        
big_decimal_high_value  string                from deserializer                          
                                                                                         
                        
num_nulls             bigint                from deserializer                            
                                                                                         
                        
num_distincts         bigint                from deserializer                            
                                                                                         
                        
avg_col_len           double                from deserializer                            
                                                                                         
                        
max_col_len           bigint                from deserializer                            
                                                                                         
                        
num_trues             bigint                from deserializer                            
                                                                                         
                        
num_falses            bigint                from deserializer                            
                                                                                         
                        
last_analyzed         bigint                from deserializer                
{noformat}

> INFORMATION_SCHEMA usability: everything is currently a string
> --------------------------------------------------------------
>
>                 Key: HIVE-16937
>                 URL: https://issues.apache.org/jira/browse/HIVE-16937
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Carter Shanklin
>            Assignee: Gunther Hagleitner
>         Attachments: HIVE-16937.1.patch
>
>
> HIVE-1010 adds an information schema to Hive, also taking the opportunity to expose some
non-standard but valuable things like statistics in a SYS table.
> A challenge I have noted with the SYS table is that all statistic counts are exposed
as string types rather than numerics.
> {code}
> hive> show create table sys.tab_col_stats;
> OK
> CREATE TABLE `sys.tab_col_stats`(
>   `cs_id` string COMMENT 'from deserializer',
>   `db_name` string COMMENT 'from deserializer',
>   `table_name` string COMMENT 'from deserializer',
>   `column_name` string COMMENT 'from deserializer',
>   `column_type` string COMMENT 'from deserializer',
>   `tbl_id` string COMMENT 'from deserializer',
>   `long_low_value` string COMMENT 'from deserializer',
>   `long_high_value` string COMMENT 'from deserializer',
>   `double_high_value` string COMMENT 'from deserializer',
>   `double_low_value` string COMMENT 'from deserializer',
>   `big_decimal_low_value` string COMMENT 'from deserializer',
>   `big_decimal_high_value` string COMMENT 'from deserializer',
>   `num_nulls` string COMMENT 'from deserializer',
>   `num_distincts` string COMMENT 'from deserializer',
>   `avg_col_len` string COMMENT 'from deserializer',
>   `max_col_len` string COMMENT 'from deserializer',
>   `num_trues` string COMMENT 'from deserializer',
>   `num_falses` string COMMENT 'from deserializer',
>   `last_analyzed` string COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hive.storage.jdbc.JdbcSerDe'
> STORED BY
>   'org.apache.hive.storage.jdbc.JdbcStorageHandler'
> {code}
> So you might run this query to try and find the column(s) which have the most distinct
values.
> {code}
> select
>   db_name, table_name, column_name
> from
>   sys.tab_col_stats
> where
>   num_distincts = ( select max(num_distincts) from sys.tab_col_stats );
> {code}
> Unfortunately this maximum is based on string sorting so it's not likely what you really
want.
> It would be better to use numeric types where appropriate such as all the numbers in
tab_col_stats, and most likely bigints should be used for stats like # rows, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message