hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "BELUGA BEHR (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16887) Parquet rawDataSize Under Reported
Date Mon, 12 Jun 2017 21:52:00 GMT
BELUGA BEHR created HIVE-16887:
----------------------------------

             Summary: Parquet rawDataSize Under Reported
                 Key: HIVE-16887
                 URL: https://issues.apache.org/jira/browse/HIVE-16887
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2, Query Planning
    Affects Versions: 2.1.1
            Reporter: BELUGA BEHR
            Priority: Minor


{code:sql}
CREATE TABLE `test_stats`(
	  `a` int, 
	  `b` int, 
	  `c` string, 
	  `d` bigint) stored as parquet;


INSERT INTO test_stats VALUES (1,31,"This is a test", 4567890);
ANALYZE TABLE test_stats COMPUTE STATISTICS FOR COLUMNS;
EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;
{code}

{code}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_stats
            filterExpr: c is null (type: boolean)
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
            Filter Operator
              predicate: c is null (type: boolean)
              Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
              Select Operator
                expressions: a (type: int), b (type: int), null (type: string), d (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats:
COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats:
COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

For Parquet tables, when Hive looks at Table stats, it sees one size for the table, when it
looks at column stats, it sees another:

Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE

The row stats are much more accurate though I would expect them to be the same.

Total Table Size = (Num Rows) * (Average Row Size)

The rawDataSize is reported as 4 bytes. This is way under reporting the size of the data.
Perhaps it is 4 bytes in Parquet format, but when this data loads into a Spark HashTable Sink
or a Cache, this one row is going to be at least (4+4+8+14) 30 bytes.

In cases where we set {{hive.auto.convert.join.noconditionaltask.size}} to be 10mb and it
is based off of the rawDataSize, in Spark when stats are enabled, we will require 114 bytes
instead of the 4 bytes as reported in table stats (28.5x).  You can imagine a case where the
rawDataSize is reported as 10MB but the real amount of required memory to cache is  285MB!
 This may break an executor with limited memory or where multiple tables are being cached.

The Parquet SerDe should be reporting the total size of the data (uncompressed) so that we
know exactly how much data we will be loading into memory, much like they do it here:

https://github.com/apache/hive/blob/6b6a00ffb0dae651ef407a99bab00d5e74f0d6aa/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L544



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message