hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
Date Wed, 20 Feb 2019 14:09:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773046#comment-16773046 ]

Sahil Takiar commented on HIVE-20079:
-------------------------------------

FYI, I don't think {{block.getTotalByteSize}} returns the size of the data once it is loaded into
memory. After talking to a few Parquet folks, it appears that no method for getting the raw data
size exists. If we want to implement this patch, we will have to do something similar to what ORC does: https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L601
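
For illustration only, here is a rough sketch of the general idea of estimating the raw (in-memory) data size from per-column statistics rather than from on-disk byte counts. Everything in it is an assumption made for the example: the class, the {{ColumnStats}} type, and the per-type sizes (4 bytes per int, 8 per long, one byte per string character) are hypothetical and are not Hive, Parquet, or ORC APIs. A real implementation would read the statistics from the file footer and would likely use a more careful size model.

```java
// Hypothetical sketch: estimate raw data size from column statistics.
// All names and per-type sizes below are illustrative assumptions.
class RawDataSizeSketch {

    enum Kind { INT, LONG, STRING }

    // Minimal stand-in for the statistics a file footer might carry per column.
    static final class ColumnStats {
        final Kind kind;
        final long numValues;         // number of non-null values in the column
        final long totalStringLength; // sum of string lengths; 0 for non-string columns

        ColumnStats(Kind kind, long numValues, long totalStringLength) {
            this.kind = kind;
            this.numValues = numValues;
            this.totalStringLength = totalStringLength;
        }
    }

    // Estimate the size of one column: fixed width times value count for
    // primitives, summed character count for strings.
    static long estimateColumn(ColumnStats c) {
        switch (c.kind) {
            case INT:
                return c.numValues * 4L;
            case LONG:
                return c.numValues * 8L;
            case STRING:
                return c.totalStringLength;
            default:
                throw new IllegalArgumentException("Unsupported kind: " + c.kind);
        }
    }

    // The table-level estimate is the sum over all columns.
    static long estimate(ColumnStats[] columns) {
        long total = 0L;
        for (ColumnStats c : columns) {
            total += estimateColumn(c);
        }
        return total;
    }
}
```

Under these assumptions, the two rows from the issue description (ids 0 and 1, strings of 16 and 8 characters) would come out as 2 * 4 + 24 = 32 instead of 4.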

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Antal Sinkovits
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, HIVE-20079.3.patch
>
>
> Run the following queries and you will see that the rawDataSize reported for the table is 4
> (that is, the number of fields), which is incorrect. We need to populate the correct data size
> so that data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
> 	COLUMN_STATS_ACCURATE	true
> 	numFiles            	1
> 	numRows             	2
> 	rawDataSize         	4
> 	totalSize           	373
> 	transient_lastDdlTime	1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
