hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carter Shanklin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16937) INFORMATION_SCHEMA usability: everything is currently a string
Date Thu, 22 Jun 2017 03:38:00 GMT
Carter Shanklin created HIVE-16937:
--------------------------------------

             Summary: INFORMATION_SCHEMA usability: everything is currently a string
                 Key: HIVE-16937
                 URL: https://issues.apache.org/jira/browse/HIVE-16937
             Project: Hive
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Carter Shanklin


HIVE-1010 adds an information schema to Hive, also taking the opportunity to expose some non-standard
but valuable things like statistics in a SYS table.

A challenge I have noted with the SYS table is that all statistic counts are exposed as string
types rather than numerics.

{code}
hive> show create table sys.tab_col_stats;
OK
CREATE TABLE `sys.tab_col_stats`(
  `cs_id` string COMMENT 'from deserializer',
  `db_name` string COMMENT 'from deserializer',
  `table_name` string COMMENT 'from deserializer',
  `column_name` string COMMENT 'from deserializer',
  `column_type` string COMMENT 'from deserializer',
  `tbl_id` string COMMENT 'from deserializer',
  `long_low_value` string COMMENT 'from deserializer',
  `long_high_value` string COMMENT 'from deserializer',
  `double_high_value` string COMMENT 'from deserializer',
  `double_low_value` string COMMENT 'from deserializer',
  `big_decimal_low_value` string COMMENT 'from deserializer',
  `big_decimal_high_value` string COMMENT 'from deserializer',
  `num_nulls` string COMMENT 'from deserializer',
  `num_distincts` string COMMENT 'from deserializer',
  `avg_col_len` string COMMENT 'from deserializer',
  `max_col_len` string COMMENT 'from deserializer',
  `num_trues` string COMMENT 'from deserializer',
  `num_falses` string COMMENT 'from deserializer',
  `last_analyzed` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hive.storage.jdbc.JdbcSerDe'
STORED BY
  'org.apache.hive.storage.jdbc.JdbcStorageHandler'
{code}

So you might run this query to try and find the column(s) which have the most distinct values.
{code}
select
  db_name, table_name, column_name
from
  sys.tab_col_stats
where
  num_distincts = ( select max(num_distincts) from sys.tab_col_stats );
{code}

Unfortunately this maximum is based on string sorting so it's not likely what you really want.

It would be better to use numeric types where appropriate such as all the numbers in tab_col_stats,
and most likely bigints should be used for stats like # rows, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message