hive-dev mailing list archives

From "David Mollitor (Jira)" <j...@apache.org>
Subject [jira] [Created] (HIVE-23054) Capture Total Byte Size in Column Statistics
Date Thu, 19 Mar 2020 13:47:00 GMT
David Mollitor created HIVE-23054:
-------------------------------------

             Summary: Capture Total Byte Size in Column Statistics
                 Key: HIVE-23054
                 URL: https://issues.apache.org/jira/browse/HIVE-23054
             Project: Hive
          Issue Type: Improvement
          Components: CBO, Statistics
            Reporter: David Mollitor


Store a counter in HMS column statistics for the total number of bytes (raw) in each column.

Right now, there is no good way to merge the average column length when performing an INSERT
statement into a table.  The code just selects the maximum value.  However, if a single
record with a long length (128 bytes) is inserted into a table that has millions of strings
with an average length of 4, the average length for the entire data set gets boosted to 128.

{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), newData.getAvgColLen()));
{code}

https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

Store the total raw size of all the data in each column.  Given the total raw size and
the average length (their ratio is the number of values), one can compute the real average
length when merging the existing data and the newly inserted data.
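
A minimal sketch of how such a merge might work, assuming the total raw byte size is
available alongside the average length for both the existing and the new statistics.
The class and method names (AvgColLenMergeSketch, mergeByMax, mergeByTotalBytes) are
hypothetical and are not part of the Hive metastore API; the value count is derived as
total bytes divided by average length.

```java
/** Hypothetical sketch only; these names do not exist in Hive. */
final class AvgColLenMergeSketch {

    private AvgColLenMergeSketch() {}

    /** Behaviour described in this issue: keep the larger of the two averages. */
    public static double mergeByMax(double existingAvg, double newAvg) {
        return Math.max(existingAvg, newAvg);
    }

    /**
     * Assumed alternative: derive each side's value count from
     * totalBytes / avgLen, then divide the combined bytes by the combined count.
     */
    public static double mergeByTotalBytes(long existingTotalBytes, double existingAvg,
                                           long newTotalBytes, double newAvg) {
        // Guard against empty statistics (average of 0) to avoid dividing by zero.
        double existingCount = existingAvg > 0 ? existingTotalBytes / existingAvg : 0;
        double newCount = newAvg > 0 ? newTotalBytes / newAvg : 0;
        double totalCount = existingCount + newCount;
        if (totalCount <= 0) {
            return 0;
        }
        return (existingTotalBytes + newTotalBytes) / totalCount;
    }

    public static void main(String[] args) {
        // 4,000,000 strings averaging 4 bytes (16,000,000 bytes total),
        // plus one newly inserted 128-byte record.
        System.out.println(mergeByMax(4.0, 128.0)); // prints 128.0
        // Stays just above 4 instead of jumping to 128.
        System.out.println(mergeByTotalBytes(16_000_000L, 4.0, 128L, 128.0));
    }
}
```

With the example from this issue, the maximum-based merge reports 128, while the
byte-weighted merge stays very close to 4.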



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
