hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Hive/Tez ORC tables -- rawDataSize value
Date Thu, 23 Jun 2016 23:02:30 GMT
Hi,

Can you please send the output of

DESC FORMATTED <TABLE_NAME>

after running (if you have not done so already)

ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS

for both tables?


HTH,



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 June 2016 at 23:49, Lalitha MV <lalithamv92@gmail.com> wrote:

> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of about 141 MB.
> SHOW TBLPROPERTIES output for this table (textfile):
> numFiles        1
> numRows         1000000
> rawDataSize     141869803
> totalSize       142869803
>
> I then created an ORC-compressed Hive table from the above table.
> The compressed file size is about 50 MB.
>
> show tblproperties for new table (orc):
>
> numFiles        1
> numRows         1000000
> rawDataSize     471000000
> totalSize       50444668
>
> I had two sets of questions regarding this:
>
> 1. Why is the rawDataSize so high for the ORC table (3.3 times the text
> file size)?
> How is the rawDataSize calculated in this case? (Is it the sum of the
> datatype sizes of the columns, multiplied by numRows?)
>
> 2. In Hive query plans, the estimated data size of the tables in each
> phase (map and reduce) is equal to the rawDataSize. The number of
> reducers is calculated from this size (at least in Tez, though not in MR).
> Isn't this wrong? Shouldn't it pick the totalSize instead? Is there
> a way to force Hive/Tez to use the totalSize in query plans, or at least
> when calculating the number of reducers?
>
> Thanks in advance.
>
> Cheers,
> Lalitha
>
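
For reference, the arithmetic behind the two questions can be checked with a
short Python sketch using only the figures quoted above. The reducer estimate
is a simplified assumption (input size divided by a bytes-per-reducer setting,
capped at a maximum), not the exact logic Hive on Tez applies; the function
names are hypothetical, and the default values are my understanding of
hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max, so please
check them against your own configuration.

```python
import math

# Figures copied from the SHOW TBLPROPERTIES output quoted above.
TEXT = {"numRows": 1_000_000, "rawDataSize": 141_869_803, "totalSize": 142_869_803}
ORC = {"numRows": 1_000_000, "rawDataSize": 471_000_000, "totalSize": 50_444_668}


def bytes_per_row(stats):
    """Average uncompressed bytes per row implied by the table statistics."""
    return stats["rawDataSize"] / stats["numRows"]


def estimated_reducers(data_size, bytes_per_reducer=256_000_000, max_reducers=1009):
    """Simplified estimate: ceil(size / bytes per reducer), at least 1, capped.

    ASSUMPTION: the defaults mirror what I believe are the usual values of
    hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max.
    Tez auto reducer parallelism is deliberately ignored here.
    """
    return max(1, min(max_reducers, math.ceil(data_size / bytes_per_reducer)))


if __name__ == "__main__":
    print("ORC bytes per row: ", bytes_per_row(ORC))
    print("text bytes per row:", bytes_per_row(TEXT))
    print("rawDataSize ratio: ", round(ORC["rawDataSize"] / TEXT["rawDataSize"], 2))
    print("reducers from rawDataSize:", estimated_reducers(ORC["rawDataSize"]))
    print("reducers from totalSize:  ", estimated_reducers(ORC["totalSize"]))
```

The ORC rawDataSize works out to exactly 471 bytes per row, which is
consistent with a fixed per-row estimate derived from column types, although
that alone does not confirm how Hive computes it. Under the assumed defaults,
sizing from rawDataSize rather than totalSize changes the estimate from one
reducer to two for this table.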
