hive-user mailing list archives

From: Prasanth Jayachandran <>
Subject: Re: Hive/Tez ORC tables -- rawDataSize value
Date: Thu, 23 Jun 2016 23:14:01 GMT
Please find answers inline.

On Jun 23, 2016, at 3:49 PM, Lalitha MV <> wrote:


I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.

I created a Hive table from a text file of ~141 MB.
show tblproperties output for this table (textfile):
numFiles        1
numRows         1000000
rawDataSize     141869803
totalSize       142869803
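
For reference, a rough sketch of how such a table might be created and inspected; the table name, schema, and file path below are hypothetical, not from the original thread:

    -- Hypothetical text-backed table; names and schema are assumptions.
    CREATE TABLE sales_text (
      id      BIGINT,
      item    STRING,
      amount  DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- LOAD DATA does not gather stats, so compute them explicitly.
    LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales_text;
    ANALYZE TABLE sales_text COMPUTE STATISTICS;

    -- Prints numFiles, numRows, rawDataSize, totalSize among others.
    SHOW TBLPROPERTIES sales_text;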

I then created a Hive table with ORC compression from the above table. The compressed file
size is ~50 MB.

show tblproperties output for the new table (orc):

numFiles        1
numRows         1000000
rawDataSize     471000000
totalSize       50444668
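
A sketch of the conversion step, again with hypothetical names; CTAS into an ORC-backed table is one common way to do it:

    -- Hypothetical CTAS; sales_text is the assumed source table from above.
    CREATE TABLE sales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB')
    AS SELECT * FROM sales_text;

    SHOW TBLPROPERTIES sales_orc;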

I have two sets of questions regarding this:

1. Why is the rawDataSize so high in the case of the ORC table (3.3 times the text file size)?
How is the rawDataSize calculated in this case? (Is it the sum of the datatype sizes of the
columns, multiplied by numRows?)

Yes, that is correct. Raw data size = (sum of the column datatype sizes) * numRows.
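
As a quick check against the numbers above:

    rawDataSize / numRows = 471000000 / 1000000 = 471 bytes per row

That is, the estimated in-memory width of one row summed across all columns. For variable-width types such as STRING, this estimate is based on the in-memory Java representation, which is typically larger than the UTF-8 bytes on disk, so the per-column sizes should be read as estimates rather than exact on-disk measurements.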

2. In Hive query plans, the estimated data sizes of the tables in each phase (map and reduce)
are equal to the rawDataSize. The number of reducers is calculated from this size (at least
in Tez, though not in MR). Isn't this wrong; shouldn't it pick the totalSize instead?
Is there a way to force Hive/Tez to pick the totalSize in query plans, or at least while calculating
the number of reducers?

Unlike some lazy text formats, the row/column vectors returned by ORC are eagerly deserialized.
ORC also compresses the data by default, so the on-disk representation (totalSize) is not a direct
reflection of how the data is processed in memory. Because of encoding and compression, the on-disk
representation is much smaller than the in-memory representation used by the operator pipeline.
That is why raw data size is a better metric
for reducer estimation than on-disk file size. If the raw data size does not exist,
the optimizer will fall back to totalSize.
Using totalSize for reducer estimation may severely underestimate the number of reducers required
for compressed tables.
On the other hand, using raw data size may overestimate the number of reducers, but Tez offsets
this with its auto reducer parallelism feature, which can
downsize the number of reducers at runtime based on the bytes actually emitted by the previous map stage.
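
For completeness, a minimal sketch of the relevant knobs (these are standard Hive/Tez session properties; the values shown are illustrative, roughly matching common defaults):

    -- Reducer count is roughly estimated as data size / bytes-per-reducer,
    -- capped by hive.exec.reducers.max.
    set hive.exec.reducers.bytes.per.reducer=256000000;
    set hive.exec.reducers.max=1009;

    -- Let Tez downsize reducers at runtime based on the bytes actually
    -- emitted by the upstream stage.
    set hive.tez.auto.reducer.parallelism=true;
    set hive.tez.min.partition.factor=0.25;
    set hive.tez.max.partition.factor=2.0;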

Thanks in advance.
